Marcus, although increasing the cache size is a good method, it can sometimes give unpredictable performance results. I looked at the VDBE code of index creation (EXPLAIN CREATE INDEX ...) and it seems there is no special sorting algorithm involved (correct me if I'm wrong). Excluding all run-once lines from the VDBE output leaves this loop:
"19" "Rowid" "20" "Column" "21" "MakeRecord" "22" "IdxInsert" "23" "Next" .. 19 that leads to conlusion that either single "indexed" insert or a step from index creation is actually just an insert of data inside B-tree without special preprocessing. So we just consider mass insert into B-tree. As I suppose the cost of such insert is sometimes low, sometimes high, depends on different factors including occasional need for rebalancing. As long as all operations are in memory and all pages are "dirty" (not saved to disk) the results looks very promising. I tried to change cache size to 500,000 (x1024) for the same base (10M records, 100 bytes record size) and adding from scrach was very fast until about 3M records (taskman showing about 800Mb of memory used at the moment) and dropped significally after that. Will this bad point be 3M or not 3M depends on the size of data used in index. So if any search inside the tree is a parsing through the cached pages (either by sqlite or underlying file system) and any page write operation involved in new tree operations are dirty pages that will be write later, there is no problem. Otherwise there is. The other problem is when total datasize exceeds 32bit address space, the cache stops giving significant improvement. I know that quicksorting I suggested have disadvantages, but in some way its results is more predictable. For example, we postpone any sqlite writing operation until the data is ready. Unfortunately, for index data size bigger than 32bit address space the file-mapping doesn't work, but it can be changed to simple Seek acces and at least the file system can do a decent job keeping frequentely access data (currently) in cache. Marcus, I didn't understand your comment about frequent reloading. As I read from the initial post, he has large data chunk of unsorted data, 20M records that needed to be accessed on a daily basis. So in his case any way that lead to data placed inside sqlite db indexed properly is ok. 
The problem is that a couple of hours on a daily basis is a big price to pay. Jerome can correct me, but he hasn't added anything new to this discussion yet; I hope he will. Nice approach with the hash and collision resolution inside the query, I will keep it in mind for future use )

Max

On Sun, Feb 14, 2010 at 4:03 PM, Marcus Grimm <mgr...@medcom-online.de> wrote:
> Just for my curiosity:
>
> Have you tried to increase the cache as already suggested?
>
> I ran into a similar problem while playing with an artificial
> test database with appx. 10 Mio records and creating an index.
> Without drastically increasing the cache size sqlite appears
> not to be able to create an index on a certain field - it
> never stops within, say, 2 hours.
> Sounds quite disappointing keeping in mind how fast sqlite usually
> operates, but it becomes clear when we consider that sqlite
> is told to use only a few MB of memory per default. Also your
> quicksort mentioned below will be very slow if it needs to
> reload data from disk all the time.
> So in my case it helps to tell sqlite to use appx. 500MB of memory
> via pragma cache_size.
>
> Please note that if you create an index on a text field sqlite
> will basically make a copy of the whole table in your case.
>
> Depending on how you are using that text field in a filter statement
> you may consider adding an integer hash (e.g. CRC32) entry in your
> table, creating an index on that, and slightly changing your queries
> like:
> SELECT * FROM TestTable WHERE TextHash=12312 AND Text='Text to search';
> Unfortunately that works only for that simple form of "=" statements.
>
> Marcus
>
> > Jerome,
> >
> > It's an interesting challenge, thanks for the post.
> > I tried to research more and did some tests.
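Marcus's hash idea above can be sketched like this. A minimal Python sketch, assuming a CRC32 from the standard zlib module and a hypothetical TextHash column added next to Text; the extra Text comparison in the WHERE clause is what resolves CRC32 collisions:

```python
import sqlite3
import zlib

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE TestTable "
    "(Id INTEGER PRIMARY KEY AUTOINCREMENT, Text TEXT, TextHash INTEGER)")
# Index only the small integers, not the wide text values.
conn.execute("CREATE INDEX idx_TestTable_hash ON TestTable (TextHash)")

def insert(text):
    # Store the CRC32 of the text alongside the text itself.
    conn.execute("INSERT INTO TestTable (Text, TextHash) VALUES (?, ?)",
                 (text, zlib.crc32(text.encode())))

for t in ("Text to search", "some other row", "yet another row"):
    insert(t)

# The equality filter narrows by the indexed integer first; the Text
# comparison then weeds out any hash collisions.
target = "Text to search"
rows = conn.execute(
    "SELECT Id, Text FROM TestTable WHERE TextHash = ? AND Text = ?",
    (zlib.crc32(target.encode()), target)).fetchall()
print(rows)  # [(1, 'Text to search')]
```

As Marcus notes, this only helps for plain "=" filters; range or LIKE queries cannot use the hash index.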
> > My test database contains a table with 10,000,000 records of text 100
> > chars in length:
> >
> > CREATE TABLE [TestTable] (
> >   [Id] INTEGER PRIMARY KEY AUTOINCREMENT,
> >   [Text] TEXT
> > )
> >
> > I suppose your data is different, but at least this one has more than 1M
> > records and a not-so-small record size. My final database was about 1G in size.
> >
> > The default index creation
> >
> > CREATE INDEX [idx_TestTable] ON [TestTable] ([Text])
> >
> > took a very long time (about two hours or so).
> > Having the index in place before the data insert didn't change anything:
> > the first records had a good append speed (about 10,000 records/sec),
> > slowing significantly once the number of records exceeded 1-2M.
> >
> > So there was no way to avoid some external-files approach, and I did it by
> > filling a memory-mapped file with the contents of the Text field, while also
> > filling an in-memory array with the offsets and lengths of the strings.
> > After that, quicksorting that offset array took about 5 minutes, and
> > inserting the texts in sorted order into the sqlite database another
> > 5 minutes, so about 10 minutes in total. The first 5 minutes were possible
> > because we exchange only offsets, not data, and the other 5 minutes because
> > inserting sorted data into a B-tree is a really fast operation.
> >
> > Although real-life data can be different, the things that worked might be
> > the same. So anyone can use this method, occupying no more than the sqlite
> > file itself for temporary storage and ending up with the data in the
> > necessary order inside the sqlite database after that. I know there are
> > many things to take into account, like the memory size and the size of the
> > actual data, but it's just a proof of concept.
> >
> > Also I think sqlite could use the same approach internally for creating an
> > index on existing data.
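The presort-then-insert flow described above can be condensed into a toy Python sketch. It uses an in-memory list in place of the memory-mapped file and offset array, but it shows the same idea: sort first outside the database, then feed sqlite the rows in index order so the index B-tree is filled append-only:

```python
import sqlite3

# Unsorted input; in the real experiment this lived in a
# memory-mapped file addressed through an offset/length array.
rows = ["delta", "alpha", "charlie", "bravo"]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE TestTable "
    "(Id INTEGER PRIMARY KEY AUTOINCREMENT, Text TEXT)")
conn.execute("CREATE INDEX idx_TestTable ON TestTable (Text)")

# Sort before touching sqlite, then insert in one transaction:
# each new key lands at the right edge of the index B-tree.
with conn:
    conn.executemany("INSERT INTO TestTable (Text) VALUES (?)",
                     ((t,) for t in sorted(rows)))

print([r[0] for r in conn.execute("SELECT Text FROM TestTable ORDER BY Id")])
# ['alpha', 'bravo', 'charlie', 'delta']
```

Insertion order and index order coincide afterwards, which is exactly what makes the second 5-minute phase fast.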
> > The db is probably already exclusively locked while
> > CREATE INDEX is in progress, so having a temporary array accessing and
> > storing, for example, file offsets of particular records should not be
> > a problem.
> >
> > Max
> >
> > On Sat, Feb 13, 2010 at 5:00 PM, Jérôme Magnin
> > <jerome.mag...@bluewin.ch> wrote:
> >
> >> Hi,
> >>
> >> This post is a question directed to D. Richard Hipp:
> >>
> >> I have been using SQLite for 3 years in a records linkage software
> >> package I have developed. My organization recently had to use the
> >> package to perform daily linkage of large administrative governmental
> >> registers (up to 20 million records each). During the linkage process,
> >> auxiliary tables containing record "fingerprints" must be created, and
> >> two columns indexed in them.
> >>
> >> SQLite provides decent indexing times for such tables with up to 1M
> >> rows, but beyond this size the (already well-discussed) critical slowing
> >> down of indexing performance due to disk nonlocality kicks in. The only
> >> workaround I could imagine to ease the problem would be to duplicate the
> >> auxiliary table and load pre-sorted rows into it, with the sort key being
> >> the column I intend to index on. This is unfortunately too costly in terms
> >> of disk space used.
> >>
> >> I therefore had to develop an alternate datasource type (based on flat
> >> files) in order for my system to be able to efficiently handle big
> >> files. Which is a pity, since SQLite provides great features I still
> >> would like to be able to rely upon when dealing with large files.
> >>
> >> Now my question: in the "To do" section of SQLite's wiki, you mention
> >> "Develop a new sort implementation that does much less disk seeking. Use
> >> to improve indexing performance on large tables.". I have been seeing
> >> this entry for 3 years but nothing concrete seems to have happened on
> >> this front in the meantime.
> >> Do you have any idea whether (and when) you
> >> will work on this in the future? Can I nourish reasonable hopes that
> >> the situation will improve on this aspect within the next 2 years? This
> >> really has become a critical factor for me to decide on my future
> >> development strategy for this product.
> >>
> >> Thanks in advance for any useful information.
> >>
> >> Jerome
> >>
> >> _______________________________________________
> >> sqlite-users mailing list
> >> sqlite-users@sqlite.org
> >> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users