> Marcus,
>
> although increasing cache size is a good method, it may sometimes give
> unpredictable results (in terms of performance).

hm... why ?

> I looked at the vdbe code (EXPLAIN CREATE INDEX ... ) of the index
> creation
> and it seems like there is no special sorting algorithm (CMIIW please).

I guess you are right; it doesn't make sense to assume any sorting. Sorry.
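
Side note: that vdbe listing can be reproduced by simply prefixing the
DDL statement with EXPLAIN, e.g.

  EXPLAIN CREATE INDEX idx_TestTable ON TestTable (Text);

which prints the vdbe program instead of executing the statement.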

> Excluding all the "run once" setup lines from the vdbe code leaves these
> loop lines:
>
> "19"    "Rowid"
> "20"    "Column"
> "21"    "MakeRecord"
> "22"    "IdxInsert"
> "23"    "Next"  .. 19
>
> which leads to the conclusion that either a single "indexed" insert or a
> step of index creation is actually just an insert of data into the B-tree
> without special preprocessing.

But to insert that index data sqlite needs to at least partly
compare existing values with the one it is about to insert.
Here, I guess, is where the cache comes into play: when sqlite is not
able to hold the index it is building in memory and/or the entries to
be inserted are unsorted, it has to free and reload pages randomly.
Anyway, I'd better stop guessing and leave this to the people
who are more into the sqlite internal code... :-)
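
For completeness: what helped in my case was simply to raise the page
cache before building the index, roughly like this (with the default
1KB page size, 500,000 pages is about 500MB of cache):

  PRAGMA cache_size = 500000;
  CREATE INDEX idx_TestTable ON TestTable (Text);

Note the pragma only affects the current connection, so it has to be
issued on the same connection that runs the CREATE INDEX.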

>
> So we just consider a mass insert into a B-tree. As I suppose, the cost
> of such an insert is sometimes low, sometimes high, depending on
> different factors including the occasional need for rebalancing. As
> long as all operations are in memory and all pages are "dirty" (not yet
> saved to disk) the results look very promising. I tried to change the
> cache size to 500,000 (x1024) for the same base (10M records, 100 bytes
> record size)

This is approx. a 500MB cache, so why not try a cache size of 2,000,000? :-)

> and adding from scratch was very fast until about 3M records (taskman
> showing about 800MB of memory used at that moment) and dropped
> significantly after that. Whether this breaking point sits at 3M or not
> depends on the size of the data used in the index. So as long as any
> search inside the tree is a walk through cached pages (either sqlite's
> cache or the underlying file system's) and any page touched by new tree
> operations is a dirty page that will be written later, there is no
> problem. Otherwise there is. The other problem is that when the total
> data size exceeds the 32-bit address space, the cache stops giving a
> significant improvement.

Yes, I was assuming that the index would more or less
fit into RAM. In my experiments that appeared to be the case, but I had
an index on the hash value, not on the text column itself, thus reducing
the memory required.
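
What I had in mind is roughly the layout below (just a sketch, the names
are examples; the hash itself has to be computed by the application,
since sqlite has no built-in CRC32 function):

  CREATE TABLE TestTable (
    Id       INTEGER PRIMARY KEY AUTOINCREMENT,
    Text     TEXT,
    TextHash INTEGER   -- e.g. CRC32 of Text, filled in by the application
  );
  CREATE INDEX idx_TestTable_hash ON TestTable (TextHash);

  SELECT * FROM TestTable WHERE TextHash = 12312 AND Text = 'Text to search';

The extra equality check on Text is what catches hash collisions.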

>
> I know that the quicksorting I suggested has disadvantages, but in some
> ways its results are more predictable. For example, we postpone any
> sqlite write operation until the data is ready. Unfortunately, for an
> index data size bigger than the 32-bit address space the file mapping
> doesn't work, but it can be changed to simple seek access, and at least
> the file system can do a decent job keeping frequently accessed data in
> its cache.
>
> Marcus, I didn't understand your comment about frequent reloading. As I
> read from the initial post, he has a large chunk of unsorted data, 20M
> records, that needs to be accessed on a daily basis. So in his case any
> approach that ends with the data placed inside an sqlite db, properly
> indexed, is ok. The problem is that a couple of hours on a daily basis
> is a big price to pay. Jerome can correct me, but he hasn't added
> anything new to this discussion yet; I hope he will.

His original question was about the todo list found at the wiki.
Not sure if any of the core developers will answer, but I would
be interested as well...

Anyway, thanks for discussing.

best

Marcus

>
> Nice approach with the hash and the collision resolving inside the
> query, I will keep it in mind for future use :)
>
> Max
>
> On Sun, Feb 14, 2010 at 4:03 PM, Marcus Grimm
> <mgr...@medcom-online.de> wrote:
>
>> Just for my curiosity:
>>
>> Have you tried to increase the cache as already suggested ?
>>
>> I ran into a similar problem while playing with an artificial
>> test database with approx. 10 million records and creating an index.
>> Without drastically increasing the cache size sqlite appears
>> not to be able to create an index on a certain field - it
>> never finishes within, say, 2 hours.
>> That sounds quite disappointing keeping in mind how fast sqlite usually
>> operates, but it becomes clear when we consider that sqlite
>> is told to use only a few MB of memory by default. Also, your
>> quicksort mentioned below will be very slow if it needs to
>> reload data from disk all the time.
>> So in my case it helps to tell sqlite to use approx. 500MB of memory
>> via pragma cache_size.
>>
>> Please note that if you create an index on a text field, sqlite
>> will basically make a copy of the whole table in your case.
>>
>> Depending on how you are using that Text field in a filter statement,
>> you may consider adding an integer hash (e.g. CRC32) column to your
>> table, creating an index on that, and slightly changing your queries
>> like:
>> SELECT * FROM TestTable WHERE TextHash=12312 AND Text='Text to search';
>> Unfortunately that works only for that simple form of "=" statement.
>>
>> Marcus
>>
>> > Jerome,
>> >
>> > It's an interesting challenge, thanks for the post.
>> > I tried to research more and did some tests.
>> > My test database contains a table with 10,000,000 records of text
>> > 100 chars in length:
>> >
>> > CREATE TABLE [TestTable] (
>> > [Id] INTEGER PRIMARY KEY AUTOINCREMENT,
>> > [Text] TEXT
>> > )
>> >
>> > I suppose your data is different, but at least this one has more
>> > than 1M records and not such a small record size. My final database
>> > was about 1GB in size.
>> >
>> > The default index creation
>> >
>> > CREATE INDEX [idx_TestTable] ON [TestTable] ([Text])
>> >
>> > took a very long time (about two hours or so).
>> > Having the index in place before the data insert didn't change
>> > anything; the first records had a good append speed (about 10,000
>> > records/sec), slowing significantly when the number of records
>> > exceeded 1-2M.
>> >
>> > So there was no way to avoid some external-file approach, and I did
>> > it by filling a memory-mapped file with the contents of the Text
>> > field, while also filling an in-memory array with the offsets and
>> > lengths of the strings. After that, a quicksort of that offset array
>> > took about 5 minutes, and inserting the texts in sorted order into
>> > the sqlite database took another 5 minutes, so about 10 minutes in
>> > total. The first 5 minutes were possible because we exchange only
>> > offsets, not data, and the other 5 minutes because inserting sorted
>> > data into a B-tree is a really fast operation.
>> >
>> > Although real-life data can be different, the things that worked
>> > might be the same. So anyone can use this method, occupying not more
>> > than the size of the sqlite file itself for temporary storage and
>> > ending up with the data in the necessary order inside the sqlite
>> > database afterwards. I know that there are many things to take into
>> > account, like the memory size and the size of the actual data, but
>> > it's just a proof of concept.
>> >
>> > Also, I think sqlite could use the same approach internally when
>> > creating an index over existing data. The db is probably already
>> > exclusively locked while CREATE INDEX is in progress, so having a
>> > temporary array that accesses and stores, for example, the file
>> > offsets of particular records should not be a problem.
>> >
>> > Max
>> >
>> > On Sat, Feb 13, 2010 at 5:00 PM, Jérôme Magnin
>> > <jerome.mag...@bluewin.ch> wrote:
>> >
>> >> Hi,
>> >>
>> >> This post is a question directed to D. Richard Hipp :
>> >>
>> >> I have been using SQLite for 3 years in a records linkage software
>> >> package I have developed. My organization recently had to use the
>> >> package to perform daily linkage of large administrative governmental
>> >> registers (up to 20 million records each). During the linkage
>> process,
>> >> auxiliary tables containing record "fingerprints" must be created,
>> >> and two columns indexed in them.
>> >>
>> >> SQLite provides decent indexing times for such tables with up to 1M
>> >> rows, but beyond this size the (already well-discussed) critical
>> >> slowing down of indexing performance due to disk nonlocality kicks
>> >> in. The only workaround I could imagine to ease the problem would be
>> >> to duplicate the auxiliary table and load pre-sorted rows into it,
>> >> with the sort key being the column I intend to index on. This is
>> >> unfortunately too costly in terms of disk space used.
>> >>
>> >> I therefore had to develop an alternate datasource type (based on
>> >> flat files) in order for my system to be able to efficiently handle
>> >> big files, which is a pity since SQLite provides great features I
>> >> would still like to be able to rely upon when dealing with large
>> >> files.
>> >>
>> >> Now my question: in the "To do" section of SQLite's wiki, you
>> >> mention "Develop a new sort implementation that does much less disk
>> >> seeking. Use to improve indexing performance on large tables.". I
>> >> have been seeing this entry for 3 years, but nothing concrete seems
>> >> to have happened on this front in the meantime. Do you have any idea
>> >> whether (and when) you will work on this in the future? Can I
>> >> nourish reasonable hopes that the situation will improve on this
>> >> aspect within the next 2 years? This really has become a critical
>> >> factor for me to decide on my future development strategy with this
>> >> product.
>> >>
>> >> Thanks in advance for any useful information.
>> >>
>> >> Jerome
>> >>
>> >>
>>
>>
>

