> Marcus,
>
> although increasing cache size is a good method, it may sometimes give
> unpredictable results (in terms of performance).
hm... why?

> I looked at the vdbe code (EXPLAIN CREATE INDEX ... ) of the index
> creation and it seems like there is no special sorting algorithm
> (CMIIW please).

I guess you are right, it doesn't make sense to assume a sorting. Sorry.

> Excluding all "make once" lines from the vdbe code leaves these loop
> lines:
>
> "19" "Rowid"
> "20" "Column"
> "21" "MakeRecord"
> "22" "IdxInsert"
> "23" "Next" .. 19
>
> which leads to the conclusion that either a single "indexed" insert or
> a step of index creation is actually just an insert of data into the
> B-tree without special preprocessing.

But to insert that index data sqlite needs to at least partly compare
existing values with the one that it is about to insert. Here, I guess,
is where the cache comes into play: when sqlite is not able to hold the
running index in memory and/or the entries to be inserted are unsorted,
it has to free and reload pages randomly.
Anyway, I should better stop guessing and leave this to people who are
more into the sqlite internals... :-)

> So we just consider mass insert into a B-tree. As I suppose, the cost
> of such an insert is sometimes low, sometimes high, depending on
> different factors including the occasional need for rebalancing. As
> long as all operations are in memory and all pages are "dirty" (not
> saved to disk) the results look very promising. I tried to change the
> cache size to 500,000 (x1024) for the same

This is appx. a 500MB cache, why not try with a 2,000,000 cache size? :-)

> base (10M records, 100 bytes record size) and adding from scratch was
> very fast until about 3M records (taskman showing about 800MB of
> memory used at that moment) and dropped significantly after that.
> Whether this breaking point is at 3M or not depends on the size of the
> data used in the index. So as long as any search inside the tree only
> parses cached pages (either sqlite's or the underlying file system's)
> and any page write involved in new tree operations stays a dirty page
> to be written later, there is no problem. Otherwise there is. The
> other problem is that when the total data size exceeds the 32bit
> address space, the cache stops giving significant improvement.

Yes, I was assuming that the index will more or less fit into RAM. In my
experiments that appeared to be the case, but I had an index on the hash
value, not on the text column itself, thus reducing the memory required.

> I know the quicksorting I suggested has disadvantages, but in some
> ways its results are more predictable. For example, we postpone any
> sqlite writing operation until the data is ready. Unfortunately, for
> index data sizes bigger than the 32bit address space the file-mapping
> doesn't work, but it can be changed to simple seek access, and at
> least the file system can do a decent job keeping the (currently)
> frequently accessed data in cache.
>
> Marcus, I didn't understand your comment about frequent reloading. As
> I read the initial post, he has a large chunk of unsorted data, 20M
> records, that needs to be processed on a daily basis. So in his case
> any way that leads to the data being placed inside an sqlite db,
> indexed properly, is ok. The problem is that, on a daily basis, a
> couple of hours is a big price to pay. Jerome can correct me, but he
> still hasn't added anything new to this discussion; hope he will.

His original question was about the todo list found at the wiki.
Not sure if any of the core developers will answer, but I would be
interested as well...
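In concrete terms, the cache setting I keep mentioning is just this (a
rough sketch; 1024 bytes is the usual default page size, but check
PRAGMA page_size on your database before doing the math):

PRAGMA page_size;             -- typically 1024 bytes per page
PRAGMA cache_size = 500000;   -- 500,000 pages * appx. 1KB = appx. 500MB
CREATE INDEX idx_TestTable ON TestTable ([Text]);

A 2,000,000 page cache would then mean appx. 2GB, which already pushes
against the 32bit address space limit you mention.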
Anyway, thanks for discussing.

best
Marcus

> Nice approach with the hash and collision resolving inside the query,
> will keep it in mind for future use )
>
> Max
>
> On Sun, Feb 14, 2010 at 4:03 PM, Marcus Grimm
> <mgr...@medcom-online.de> wrote:
>
>> Just for my curiosity:
>>
>> Have you tried to increase the cache as already suggested?
>>
>> I ran into a similar problem while playing with an artificial
>> test database with appx. 10 million records and creating an index.
>> Without drastically increasing the cache size sqlite appears
>> not to be able to create an index on a certain field - it
>> never stops within, say, 2 hours.
>> Sounds quite disappointing keeping in mind how fast sqlite usually
>> operates, but it becomes clear when we consider that sqlite
>> is told to use only a few MB of memory per default. Also the
>> quicksort you mention below will be very slow if it needs to
>> reload data from disk all the time.
>> So in my case it helps to tell sqlite to use appx. 500MB of memory
>> via pragma cache_size.
>>
>> Please note that if you create an index on a text field sqlite
>> will basically make a copy of the whole table in your case.
>>
>> Depending on how you are using that Text field in a filter statement
>> you may consider adding an integer hash (e.g. CRC32) entry to your
>> table, creating an index on that, and slightly changing your queries
>> like:
>> SELECT * FROM TestTable WHERE TextHash=12312 AND Text='Text to search';
>> Unfortunately that works only for that simple form of "=" statement.
>>
>> Marcus
>>
>> > Jerome,
>> >
>> > It's an interesting challenge, thanks for the post.
>> > I tried to research more and did some tests.
>> > My test database contains a table with 10,000,000 records of text
>> > 100 chars in length:
>> >
>> > CREATE TABLE [TestTable] (
>> >   [Id] INTEGER PRIMARY KEY AUTOINCREMENT,
>> >   [Text] TEXT
>> > )
>> >
>> > I suppose your data is different, but at least this one has more
>> > than 1M records and a not-so-small record size. My final base was
>> > about 1GB in size.
>> >
>> > The default index creation
>> >
>> > CREATE INDEX [idx_TestTable] ON [TestTable] ([Text])
>> >
>> > took a very long time (about two hours or so).
>> > Having the index in place before the data insert didn't change
>> > anything: the first records had a good append speed (about 10,000
>> > records/sec), slowing significantly when the number of records
>> > exceeded 1-2M.
>> >
>> > So there was no way to avoid some external-files approach, and I
>> > did it by filling a memory-mapped file with the contents of the
>> > Text field, while also filling an in-memory array saving the
>> > offsets and lengths of the strings. After that, a quicksort of
>> > that offset array took about 5 minutes, and inserting the texts
>> > in sorted order into the sqlite base another 5 minutes, so about
>> > 10 minutes in total. The first 5 minutes were possible since we
>> > exchange only offsets, not data, and the other 5 minutes since
>> > inserting sorted data into a B-tree is really a fast operation.
>> >
>> > Although real-life data can be different, the things that worked
>> > might be the same. So anyone can use this method, occupying not
>> > more than the size of the sqlite file itself for temporary
>> > storage, and ending up with the data in the necessary order
>> > inside the sqlite database. I know that there are many things to
>> > take into account, like the memory size and the size of the
>> > actual data, but it's just a proof of concept.
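To make Max's recipe above concrete: as I understand it, the sqlite
side of it boils down to the sketch below. The actual sorting of the
Text values happens outside sqlite, in his memory-mapped file; the
TestTable2 name and the placeholder value are mine, not his:

CREATE TABLE [TestTable2] (
  [Id] INTEGER PRIMARY KEY AUTOINCREMENT,
  [Text] TEXT
);
BEGIN;
-- one INSERT per string, walking the externally sorted offset array:
INSERT INTO [TestTable2] ([Text]) VALUES ('...');
COMMIT;
-- rowid order now equals Text order, so the index B-tree receives
-- its keys in ascending order instead of randomly:
CREATE INDEX [idx_TestTable2] ON [TestTable2] ([Text]);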
>> > Also I think sqlite could use the same approach internally for
>> > creating an index on existing data. The db is probably already
>> > exclusively locked while CREATE INDEX is in progress, so having a
>> > temporary array accessing and storing, for example, the file
>> > offsets of particular records should not be a problem.
>> >
>> > Max
>> >
>> > On Sat, Feb 13, 2010 at 5:00 PM, Jérôme Magnin
>> > <jerome.mag...@bluewin.ch> wrote:
>> >
>> >> Hi,
>> >>
>> >> This post is a question directed to D. Richard Hipp:
>> >>
>> >> I have been using SQLite for 3 years in a records-linkage software
>> >> package I have developed. My organization recently had to use the
>> >> package to perform daily linkage of large administrative
>> >> governmental registers (up to 20 million records each). During the
>> >> linkage process, auxiliary tables containing record "fingerprints"
>> >> must be created, and two columns indexed in them.
>> >>
>> >> SQLite provides decent indexing times for such tables with up to
>> >> 1M rows, but beyond this size the (already well-discussed)
>> >> critical slowing down of indexing performance due to disk
>> >> nonlocality kicks in. The only workaround I could imagine to ease
>> >> the problem would be to duplicate the auxiliary table and load
>> >> pre-sorted rows into it, with the sort key being the column I
>> >> intend to index on. This is unfortunately too costly in terms of
>> >> disk space used.
>> >>
>> >> I therefore had to develop an alternate datasource type (based on
>> >> flat files) in order for my system to be able to efficiently
>> >> handle big files. Which is a pity, since SQLite provides great
>> >> features I would still like to be able to rely upon when dealing
>> >> with large files.
>> >>
>> >> Now my question: in the "To do" section of SQLite's wiki, you
>> >> mention "Develop a new sort implementation that does much less
>> >> disk seeking. Use to improve indexing performance on large
>> >> tables.". I have been seeing this entry for 3 years but nothing
>> >> concrete seems to have happened on this front in the meantime. Do
>> >> you have any idea about if (and when) you will work on this in
>> >> the future? Can I nourish reasonable hopes that the situation
>> >> will improve on this aspect within the next 2 years? This really
>> >> has become a critical factor for me to decide on my future
>> >> development strategy with this product.
>> >>
>> >> Thanks in advance for any useful information.
>> >>
>> >> Jerome

_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users