Marcus, although increasing the cache size is a good method, it can sometimes give unpredictable performance results. I looked at the VDBE code of index creation (EXPLAIN CREATE INDEX ...) and it seems there is no special sorting algorithm involved (correct me if I'm wrong). Excluding all run-once lines from the VDBE output leaves this loop:
"19" "Rowid" "20" "Column" "21" "MakeRecord" "22" "IdxInsert" "23" "Next" .. 19 that leads to conlusion that either single "indexed" insert or a step from index creation is actually just an insert of data inside B-tree without special preprocessing. So we just consider mass insert into B-tree. As I suppose the cost of such insert is sometimes low, sometimes high, depends on different factors including occasional need for rebalancing. As long as all operations are in memory and all pages are "dirty" (not saved to disk) the results looks very promising. I tried to change cache size to 500,000 (x1024) for the same base (10M records, 100 bytes record size) and adding from scrach was very fast until about 3M records (taskman showing about 800Mb of memory used at the moment) and dropped significally after that. Will this bad point be 3M or not 3M depends on the size of data used in index. So if any search inside the tree is a parsing through the cached pages (either by sqlite or underlying file system) and any page write operation involved in new tree operations are dirty pages that will be write later, there is no problem. Otherwise there is. The other problem is when total datasize exceeds 32bit address space, the cache stops giving significant improvement. I know that quicksorting I suggested have disadvantages, but in some way its results is more predictable. For example, we postpone any sqlite writing operation until the data is ready. Unfortunately, for index data size bigger than 32bit address space the file-mapping doesn't work, but it can be changed to simple Seek acces and at least the file system can do a decent job keeping frequentely access data (currently) in cache. Marcus, I didn't understand your comment about frequent reloading. As I read from the initial post, he has large data chunk of unsorted data, 20M records that needed to be accessed on a daily basis. So in his case any way that lead to data placed inside sqlite db indexed properly is ok. 
The problem is that a couple of hours on a daily basis is a big price to pay. Jerome can correct me, but he hasn't added anything new to this discussion yet; I hope he will. Nice approach with the hash and collision resolution inside the query, I will keep it in mind for future use )

Max

On Sun, Feb 14, 2010 at 4:03 PM, Marcus Grimm <mgr...@medcom-online.de> wrote:
> Just for my curiosity:
>
> Have you tried to increase the cache as already suggested?
>
> I ran into a similar problem while playing with an artificial
> test database with appx. 10 Mio records and creating an index.
> Without drastically increasing the cache size sqlite appears
> not to be able to create an index on a certain field - it
> never stops within, say, 2 hours.
> Sounds quite disappointing keeping in mind how fast sqlite usually
> operates, but it becomes clear when we consider that sqlite
> is told to use only a few MB of memory per default. Also your
> quicksort mentioned below will be very slow if it needs to
> reload data from disk all the time.
> So in my case it helps to tell sqlite to use appx. 500MB of memory
> via pragma cache_size.
>
> Please note that if you create an index on a text field sqlite
> will basically make a copy of the whole table in your case.
>
> Depending on how you are using that text field in a filter statement
> you may consider adding an integer hash (e.g. CRC32) entry in your
> table, creating an index on that, and slightly changing your queries
> like:
> SELECT * FROM TestTable WHERE TextHash=12312 AND Text='Text to search';
> Unfortunately that works only for that simple form of "=" statements.
>
> Marcus
>
> > Jerome,
> >
> > It's an interesting challenge, thanks for the post.
> > I tried to research more and did some tests.
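Marcus's hash idea above can be sketched like this. A minimal Python sketch, assuming a CRC32 from the standard zlib module and a hypothetical TextHash column added next to Text; the extra Text comparison in the WHERE clause is what resolves CRC32 collisions:

```python
import sqlite3
import zlib

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE TestTable "
    "(Id INTEGER PRIMARY KEY AUTOINCREMENT, Text TEXT, TextHash INTEGER)")
# Index only the small integers, not the wide text values.
conn.execute("CREATE INDEX idx_TestTable_hash ON TestTable (TextHash)")

def insert(text):
    # Store the CRC32 of the text alongside the text itself.
    conn.execute("INSERT INTO TestTable (Text, TextHash) VALUES (?, ?)",
                 (text, zlib.crc32(text.encode())))

for t in ("Text to search", "some other row", "yet another row"):
    insert(t)

# The equality filter narrows by the indexed integer first; the Text
# comparison then weeds out any hash collisions.
target = "Text to search"
rows = conn.execute(
    "SELECT Id, Text FROM TestTable WHERE TextHash = ? AND Text = ?",
    (zlib.crc32(target.encode()), target)).fetchall()
print(rows)  # [(1, 'Text to search')]
```

As Marcus notes, this only helps for plain "=" filters; range or LIKE queries cannot use the hash index.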
> > My test database contains a table with 10,000,000 records of text 100
> > chars in length:
> >
> > CREATE TABLE [TestTable] (
> >   [Id] INTEGER PRIMARY KEY AUTOINCREMENT,
> >   [Text] TEXT
> > )
> >
> > I suppose your data is different, but at least this one has more than 1M
> > records and a not-so-small record size. My final database was about 1G in size.
> >
> > The default index creation
> >
> > CREATE INDEX [idx_TestTable] ON [TestTable] ([Text])
> >
> > took a very long time (about two hours or so).
> > Having the index in place before the data insert didn't change anything:
> > the first records had a good append speed (about 10,000 records/sec),
> > slowing significantly once the number of records exceeded 1-2M.
> >
> > So there was no way to avoid some external-files approach, and I did it by
> > filling a memory-mapped file with the contents of the Text field, while also
> > filling an in-memory array with the offsets and lengths of the strings.
> > After that, quicksorting that offset array took about 5 minutes, and
> > inserting the texts in sorted order into the sqlite database another
> > 5 minutes, so about 10 minutes in total. The first 5 minutes were possible
> > because we exchange only offsets, not data, and the other 5 minutes because
> > inserting sorted data into a B-tree is a really fast operation.
> >
> > Although real-life data can be different, the things that worked might be
> > the same. So anyone can use this method, occupying no more than the sqlite
> > file itself for temporary storage and ending up with the data in the
> > necessary order inside the sqlite database after that. I know there are
> > many things to take into account, like the memory size and the size of the
> > actual data, but it's just a proof of concept.
> >
> > Also I think sqlite could use the same approach internally for creating an
> > index on existing data.
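The presort-then-insert flow described above can be condensed into a toy Python sketch. It uses an in-memory list in place of the memory-mapped file and offset array, but it shows the same idea: sort first outside the database, then feed sqlite the rows in index order so the index B-tree is filled append-only:

```python
import sqlite3

# Unsorted input; in the real experiment this lived in a
# memory-mapped file addressed through an offset/length array.
rows = ["delta", "alpha", "charlie", "bravo"]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE TestTable "
    "(Id INTEGER PRIMARY KEY AUTOINCREMENT, Text TEXT)")
conn.execute("CREATE INDEX idx_TestTable ON TestTable (Text)")

# Sort before touching sqlite, then insert in one transaction:
# each new key lands at the right edge of the index B-tree.
with conn:
    conn.executemany("INSERT INTO TestTable (Text) VALUES (?)",
                     ((t,) for t in sorted(rows)))

print([r[0] for r in conn.execute("SELECT Text FROM TestTable ORDER BY Id")])
# ['alpha', 'bravo', 'charlie', 'delta']
```

Insertion order and index order coincide afterwards, which is exactly what makes the second 5-minute phase fast.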
> > The db is probably already exclusively locked while
> > CREATE INDEX is in progress, so having a temporary array accessing and
> > storing, for example, file offsets of particular records should not be
> > a problem.
> >
> > Max
> >
> > On Sat, Feb 13, 2010 at 5:00 PM, Jérôme Magnin
> > <jerome.mag...@bluewin.ch> wrote:
> >
> >> Hi,
> >>
> >> This post is a question directed to D. Richard Hipp:
> >>
> >> I have been using SQLite for 3 years in a records linkage software
> >> package I have developed. My organization recently had to use the
> >> package to perform daily linkage of large administrative governmental
> >> registers (up to 20 million records each). During the linkage process,
> >> auxiliary tables containing record "fingerprints" must be created, and
> >> two columns indexed in them.
> >>
> >> SQLite provides decent indexing times for such tables with up to 1M
> >> rows, but beyond this size the (already well-discussed) critical slowing
> >> down of indexing performance due to disk nonlocality kicks in. The only
> >> workaround I could imagine to ease the problem would be to duplicate the
> >> auxiliary table and load pre-sorted rows into it, with the sort key being
> >> the column I intend to index on. This is unfortunately too costly in terms
> >> of disk space used.
> >>
> >> I therefore had to develop an alternate datasource type (based on flat
> >> files) in order for my system to be able to efficiently handle big
> >> files. Which is a pity, since SQLite provides great features I still
> >> would like to be able to rely upon when dealing with large files.
> >>
> >> Now my question: in the "To do" section of SQLite's wiki, you mention
> >> "Develop a new sort implementation that does much less disk seeking. Use
> >> to improve indexing performance on large tables.". I have been seeing
> >> this entry for 3 years but nothing concrete seems to have happened on
> >> this front in the meantime.
> >> Do you have any idea whether (and when) you
> >> will work on this in the future? Can I nourish reasonable hopes that
> >> the situation will improve on this aspect within the next 2 years? This
> >> really has become a critical factor for me to decide on my future
> >> development strategy for this product.
> >>
> >> Thanks in advance for any useful information.
> >>
> >> Jerome
> >>
> >> _______________________________________________
> >> sqlite-users mailing list
> >> sqlite-users@sqlite.org
> >> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users