Just out of curiosity:

Have you tried increasing the cache, as already suggested?

I ran into a similar problem while playing with an artificial
test database with approx. 10 million records and creating an index.
Without drastically increasing the cache size, SQLite appears
unable to create an index on a certain field - it did not
finish within, say, 2 hours.
That sounds quite disappointing given how fast SQLite usually
operates, but it becomes clear when we consider that SQLite
is told to use only a few MB of memory by default. Also, the
quicksort mentioned below will be very slow if it needs to
reload data from disk all the time.
So in my case it helped to tell SQLite to use approx. 500 MB of memory
via PRAGMA cache_size.
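
For reference, a minimal sketch of how I set it from C (the database
file name is a placeholder; with the default 1 KB page size,
cache_size counts pages, so 500000 pages is roughly 500 MB):

#include <stdio.h>
#include <sqlite3.h>

int main(void)
{
    sqlite3 *db;

    if (sqlite3_open("test.db", &db) != SQLITE_OK) {  /* placeholder name */
        fprintf(stderr, "open failed: %s\n", sqlite3_errmsg(db));
        return 1;
    }
    /* cache_size is measured in pages; at the default 1 KB page size,
     * 500000 pages give the pager roughly 500 MB of cache. */
    sqlite3_exec(db, "PRAGMA cache_size=500000;", 0, 0, 0);
    /* The index build can now sort mostly in memory instead of
     * constantly rereading pages from disk. */
    sqlite3_exec(db, "CREATE INDEX idx_TestTable ON TestTable(Text);", 0, 0, 0);
    sqlite3_close(db);
    return 0;
}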

Please note that if you create an index on a text field, SQLite
will basically make a copy of the whole table in your case.

Depending on how you use that Text field in filter statements,
you may consider adding an integer hash column (e.g. CRC32) to your
table, creating an index on that, and slightly changing your queries
like:
SELECT * FROM TestTable WHERE TextHash=12312 AND Text='Text to search';
Unfortunately that only works for simple "=" comparisons like this one.
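
A sketch of what I mean, assuming zlib's crc32() for the hash and a
TextHash INTEGER column added to the table with an index on it (all
names here are placeholders, not from your schema):

#include <string.h>
#include <zlib.h>
#include <sqlite3.h>

/* Look a row up via the integer hash index first; the Text comparison
 * then only has to weed out the rare CRC32 collisions. */
static int find_text(sqlite3 *db, const char *text)
{
    sqlite3_stmt *stmt;
    uLong hash = crc32(0L, (const Bytef *)text, (uInt)strlen(text));
    int found = 0;

    sqlite3_prepare_v2(db,
        "SELECT Id FROM TestTable WHERE TextHash=?1 AND Text=?2;",
        -1, &stmt, 0);
    sqlite3_bind_int64(stmt, 1, (sqlite3_int64)hash);
    sqlite3_bind_text(stmt, 2, text, -1, SQLITE_STATIC);
    if (sqlite3_step(stmt) == SQLITE_ROW)
        found = 1;
    sqlite3_finalize(stmt);
    return found;
}

The same hash has to be computed and stored on every insert, of
course, since SQLite has no built-in CRC32 function.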

Marcus

> Jerome,
>
> It's an interesting challenge, thanks for the post.
> I researched it a bit and ran some tests.
> My test database contains a table with 10,000,000 records, each with a
> Text field 100 chars in length.
>
> CREATE TABLE [TestTable] (
> [Id] INTEGER PRIMARY KEY AUTOINCREMENT,
> [Text] TEXT
> )
>
> I suppose your data is different, but at least this one has more than 1M
> records and a not-so-small record size. My final database was about 1 GB
> in size.
>
> The default index creation
>
> CREATE INDEX [idx_TestTable] ON [TestTable] ([Text] )
>
> took a very long time (about two hours or so).
> Creating the index before inserting the data didn't change anything: the
> first records appended at a good speed (about 10,000 records/sec), slowing
> significantly once the number of records exceeded 1-2M.
>
> So there was no way to avoid some external-files approach, and I tried one:
> I filled a memory-mapped file with the contents of the Text field while
> also filling an in-memory array that saves the offset and length of each
> string. After that, a quicksort of the offset array took about 5 minutes,
> and inserting the texts in sorted order into the SQLite database took
> another 5 minutes, so about 10 minutes in total. The first 5 minutes were
> possible because we exchange only offsets, not data, and the other 5
> because inserting sorted data into a B-tree is a really fast operation.
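>
> A rough sketch of the sorting part (simplified: a plain buffer stands in
> for the memory-mapped file, and the insert loop is only described in a
> comment; all names are made up):
>
> #include <stdlib.h>
> #include <string.h>
>
> /* Base of the buffer holding all Text values back to back. */
> static const char *g_buf;
>
> struct span { size_t off; size_t len; };   /* one entry per record */
>
> static int cmp_span(const void *a, const void *b)
> {
>     const struct span *x = a, *y = b;
>     size_t n = x->len < y->len ? x->len : y->len;
>     int c = memcmp(g_buf + x->off, g_buf + y->off, n);
>     if (c) return c;
>     return (x->len > y->len) - (x->len < y->len);
> }
>
> /* Only the small span structs move during the sort, never the text
>  * itself - that is what keeps the 5-minute figure possible. */
> void sort_spans(struct span *spans, size_t n, const char *buf)
> {
>     g_buf = buf;
>     qsort(spans, n, sizeof spans[0], cmp_span);
>     /* Then walk spans[] in order and bind buf+off/len into
>      * "INSERT INTO TestTable(Text) VALUES (?1);" inside a single
>      * transaction, so the B-tree is filled in key order. */
> }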
>
> Although real-life data can be different, the things that worked here
> might be the same. So anyone can use this method, occupying no more than
> the size of the SQLite file itself for temporary storage and ending up
> with the data in the necessary order inside the SQLite database. I know
> there are many things to take into account, like the memory size and the
> size of the actual data, but it's just a proof of concept.
>
> Also, I think SQLite could use the same approach internally for creating
> an index over existing data. The db is probably already exclusively locked
> while CREATE INDEX is in progress, so keeping a temporary array that
> accesses and stores, for example, the file offsets of particular records
> should not be a problem.
>
> Max
>
> On Sat, Feb 13, 2010 at 5:00 PM, Jérôme Magnin
> <jerome.mag...@bluewin.ch> wrote:
>
>> Hi,
>>
>> This post is a question directed to D. Richard Hipp :
>>
>> I have been using SQLite for 3 years in a record linkage software
>> package I have developed. My organization recently had to use the
>> package to perform daily linkage of large administrative governmental
>> registers (up to 20 million records each). During the linkage process,
>> auxiliary tables containing record "fingerprints" must be created, and
>> two columns in them must be indexed.
>>
>> SQLite provides decent indexing times for such tables with up to 1M
>> rows, but beyond this size the (already well-discussed) critical slowing
>> down of indexing performance due to disk nonlocality kicks in. The only
>> workaround I could imagine to ease the problem would be to duplicate the
>> auxiliary table and load pre-sorted rows into it, with the sort key being
>> the column I intend to index on. This is unfortunately too costly in terms
>> of disk space used.
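>>
>> Concretely, the workaround would be something like the following (a
>> minimal sketch; "aux" and "fp" are placeholder names, and the sorted
>> copy is exactly what doubles the disk usage):
>>
>> /* assumes an open sqlite3 *db handle */
>> sqlite3_exec(db,
>>     "CREATE TABLE aux_sorted AS SELECT * FROM aux ORDER BY fp;"
>>     "CREATE INDEX idx_aux_sorted_fp ON aux_sorted(fp);",
>>     0, 0, 0);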
>>
>> I therefore had to develop an alternate datasource type (based on flat
>> files) in order for my system to be able to efficiently handle big
>> files. Which is a pity since SQLite provides great features I still
>> would like to be able to rely upon when dealing with large files.
>>
>> Now my question: in the "To do" section of SQLite's wiki, you mention
>> "Develop a new sort implementation that does much less disk seeking. Use
>> to improve indexing performance on large tables.". I have been seeing
>> this entry for 3 years, but nothing concrete seems to have happened on
>> this front in the meantime. Do you have any idea whether (and when) you
>> will work on this in the future? Can I reasonably hope that the
>> situation will improve in this respect within the next 2 years? This
>> really has become a critical factor in deciding on my future
>> development strategy for this product.
>>
>> Thanks in advance for any useful information.
>>
>> Jerome
>>
>>