Re: [sqlite] Prohibitive indexing time with large tables
Hello, Jérôme,

Nice to hear you finally joined us in this really interesting discussion )

> > in sorted order to sqlite base other 5 minutes, so about 10 minutes in
> > total. First 5 minutes was possible since we exchange only offsets, not
> > data, and other 5 minutes since inserting sorted data into B-tree is
> > really a fast operation.
>
> Nice solution (of the type I already fiddled around with, actually, as you
> can imagine). This variant still poses 2 problems for me:
>
> 1) Its workability is RAM-limited, and therefore not necessarily robust to
> an increase in dataset size beyond a certain limit I am already close to
> (Win32-based processes are still bound to max. 2GB/process, unfortunately).
>
> 2) I need to create 2 indices on 2 different columns whose contents are
> totally uncorrelated with respect to sort order. Your solution would nicely
> reduce indexing time of the 1st column, but what about the 2nd one?...

You addressed real problems, and when I try to run my test on a system with
lower RAM the results confirm these observations. But at least we found some
way to increase the speed for some datasets and some hardware systems.

Maybe some other approaches can improve the solution. The suggestion from
Ibrahim about using a RAM drive, for example, was interesting; I'd also
mention using different hard drives together with a merge sort. But all these
solutions break the beauty of sqlite imho, and as a consequence its
flexibility.

The second problem is really hard to solve; that's where sqlite internally
could take advantage of low-level data access, but I doubt this is an easy
task. I suppose making any special sorting with direct file access could even
break the beauty of the vdbe, not to mention the danger of changing the code
significantly.

By the way, you didn't mention the size of your "fingerprints". Can you
calculate the average index record size, or the total index size, in the
case of your 20M records?
Max
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
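Max's closing question, the total index size for 20M records, can be ballparked with simple arithmetic. A sketch; only the row count comes from the thread, while the fingerprint length and per-entry overhead below are assumptions, since Jérôme never states them:

```python
# Rough estimate of the total index size for Jérôme's case. Only the
# row count (20M) is from the thread; the other figures are guesses.
n_rows      = 20_000_000
key_bytes   = 16   # assumed fingerprint size, not stated in the thread
rowid_bytes = 6    # varint-encoded rowid, rough upper bound
overhead    = 4    # cell pointer + record header, rough guess

index_bytes = n_rows * (key_bytes + rowid_bytes + overhead)
print(round(index_bytes / 2**30, 2))  # estimated size in GiB
```

Even under these modest assumptions the index alone approaches half a gigabyte, which is well past any default page cache.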
Re: [sqlite] Prohibitive indexing time with large tables
> I'm really surprised you're not seeing something from an increased
> cache size. It has always made a very noticeable difference in my own
> manipulations, but I think the largest table I've ever worked with
> "only" had 6M rows.

Jay, increasing cache size really helps, but only up to a certain size, as I
mentioned. And it really makes sense. I can assume what is going on. While
we're not out of cache, all this B-tree-structured data is completely inside
RAM and can be updated/changed very quickly. But when the time for flushing
comes, we have many pages forming this B-tree, and the question is which
pages we should flush (to free that part of the cache) so that the next
1,000,000 records (with absolutely unpredictable content) use the tree most
effectively. Honestly, I'm not aware of such an algorithm. So every next
mass insert produces many "collisions", moments when we need to write to
some page but it's not in the write cache.

But I guess there is something sqlite could do to use the existing cache
more effectively. I don't know the internals very well, but it seems that
both a read and a write cache exist in sqlite. The read cache is sometimes
almost unnecessary since that data is already present in the system file
cache. In this case, accessing some page in the cache and accessing it with
the xRead interface of the VFS will probably take the same or similar time
(sure, if there's no encryption or other overhead in the VFS). But I'm not
sure about this guess, since it's sometimes hard to separate read and write
cache, and also for B-tree inserts most of the pages just read will probably
be written soon.

By the way, Alexey Pechnikov's recent post about his tests shows that
cache_size does not always help, and that you should take into account not
only the number of rows but also the total index size (quote from
http://geomapx.blogspot.com/2009/11/degradation-of-indexing-speed.html: "As
we can see the index creating speed degradation is result of index size
more than SQLite page cache size.").
Max
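The condition Max quotes, index size exceeding the page cache, can be checked directly from the pragmas. A minimal sketch under assumptions: the table and column names are made up, an in-memory database stands in for a real file, and the size is measured after the fact rather than predicted:

```python
import sqlite3

# Build a small table plus index, then compare the database size
# (dominated by table + index pages) against the configured cache.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, txt TEXT)")
db.executemany("INSERT INTO t (txt) VALUES (?)",
               (("x" * 100,) for _ in range(10_000)))
db.execute("CREATE INDEX idx_t ON t (txt)")
db.commit()

page_size  = db.execute("PRAGMA page_size").fetchone()[0]
page_count = db.execute("PRAGMA page_count").fetchone()[0]
cache      = db.execute("PRAGMA cache_size").fetchone()[0]

db_bytes = page_size * page_count
# A negative cache_size means "that many KiB" rather than pages.
cache_bytes = cache * page_size if cache > 0 else -cache * 1024

# Once the index pages no longer fit in cache_bytes, the slowdown
# described in the thread is expected to begin.
print(db_bytes, cache_bytes)
```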
Re: [sqlite] Prohibitive indexing time with large tables
Hello!

See tests and some interpretation:
http://geomapx.blogspot.com/2009/11/degradation-of-indexing-speed.html

Best regards, Alexey Pechnikov.
http://pechnikov.tel/
Re: [sqlite] Prohibitive indexing time with large tables
Jay A. Kreibich wrote:

> You said you're creating two databases. Are you doing those one at a
> time or ATTACHing the other databases?

Doing them one at a time, indeed.
Re: [sqlite] Prohibitive indexing time with large tables
On Mon, Feb 15, 2010 at 08:09:26PM +0100, Jérôme Magnin scratched on the wall:

> > The database is 2GB but the least amount of memory I have is 4GB. I
> > also use a 32kb page size, larger cache etc.
>
> I have of course tried to increase cache size and database page size (up
> to 32K) but the effect was not really measurable. I therefore (maybe too
> quickly) have classified such parameters as weak-working ones for my
> particular problem, although I admit I also had great hopes about how
> they would impact the indexing performance. This lack of observable
> effect remains a mystery to me. The only clue could be that the Python
> SQLite wrapper (my package is Python-based) does actually not correctly
> forward the "PRAGMA xxx" commands to the SQLite engine.

You said you're creating two databases. Are you doing those one at a time
or ATTACHing the other databases?

The reason I ask is because each database has its own cache, so you need
to be sure you bump up the cache size on the database that has the index
in it. If that isn't the "main" database, you need to use the format
"PRAGMA database.cache_size = value", or you're changing the wrong cache.

I'm really surprised you're not seeing something from an increased cache
size. It has always made a very noticeable difference in my own
manipulations, but I think the largest table I've ever worked with "only"
had 6M rows.

-j

--
Jay A. Kreibich < J A Y @ K R E I B I.C H >

"Our opponent is an alien starship packed with atomic bombs. We have a
protractor." "I'll go home and see if I can scrounge up a ruler and a
piece of string." --from Anathem by Neal Stephenson
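Jay's point about per-database caches can be sketched in Python's sqlite3, the wrapper Jérôme mentions using. The file names and the `aux` schema name are illustrative:

```python
import os
import sqlite3
import tempfile

tmp = tempfile.mkdtemp()
db = sqlite3.connect(os.path.join(tmp, "main.db"))
db.execute("ATTACH DATABASE ? AS aux", (os.path.join(tmp, "aux.db"),))

# Each schema in the connection has its own page cache; setting the
# main one does nothing for an index being built in the attached
# database, hence the schema-qualified pragma.
db.execute("PRAGMA main.cache_size = 100000")
db.execute("PRAGMA aux.cache_size = 100000")

print(db.execute("PRAGMA aux.cache_size").fetchone()[0])
```

If this prints the default value instead of 100000, the wrapper really is swallowing the pragma, which would also explain Jérôme's "no measurable effect" observation.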
Re: [sqlite] Prohibitive indexing time with large tables
Sorry folks, it took me time (and a brand new brain) to figure out how I
could post a reply that would appear on the mailing list (I'm new to that
kind of sport).

To Roger Binns:

> As another data point, my data set is 10M records and the 6 indices I
> need are created in under two minutes. (I also analyze.)

Wow, impressive! On my side, I am talking about 2-3 hours per table (and I
need several of them).

> The database is 2GB but the least amount of memory I have is 4GB. I also
> use a 32kb page size, larger cache etc.

I have of course tried to increase cache size and database page size (up to
32K) but the effect was not really measurable. I therefore (maybe too
quickly) have classified such parameters as weak-working ones for my
particular problem, although I admit I also had great hopes about how they
would impact the indexing performance. This lack of observable effect
remains a mystery to me. The only clue could be that the Python SQLite
wrapper (my package is Python-based) does actually not correctly forward
the "PRAGMA xxx" commands to the SQLite engine.

> Surely the index generation is a one time thing being done at upgrade
> time.

At table population time, actually. The table is then further used
exclusively in read-only mode.

> Couldn't that be done late at night so taking several hours would be ok?

Well, yes and no. This software does not run on a server but is operated by
some of my collaborators on their office workstations. They cannot always
predict when they need to perform a linkage task, but when the situation
suddenly occurs, they should not be constrained to wait until the next
workday in order to dispose of a usable database. Besides, I really find
the situation inelegant and want my software to be efficient even for large
datasets.
To Max Vlasov:

> So there was no way to ignore some external files approach, and I did it
> filling a memory-mapped file with the contents of the Text field, while
> filling also an array in memory saving offsets and lengths of the
> strings. After that, quicksort of that offset array took about 5 minutes
> and inserting the texts in sorted order into the sqlite base another 5
> minutes, so about 10 minutes in total. First 5 minutes was possible
> since we exchange only offsets, not data, and other 5 minutes since
> inserting sorted data into B-tree is really a fast operation.
>
> Although real life data can be different, the things that worked might
> be the same. So anyone can use this method occupying not more than the
> sqlite file itself for temporary storage and ending up with the data in
> necessary order inside the sqlite database after that. I know that there
> are many things to take into account, like the memory size and the size
> of the actual data, but it's just a proof of concept.
>
> Also I think sqlite could use the same approach internally for creating
> an index for existing data. The db is probably already exclusively
> locked while CREATE INDEX is in process, so having a temporary array
> accessing and storing for example file offsets of particular records
> should not be a problem.

Nice solution (of the type I already fiddled around with, actually, as you
can imagine). This variant still poses 2 problems for me:

1) Its workability is RAM-limited, and therefore not necessarily robust to
an increase in dataset size beyond a certain limit I am already close to
(Win32-based processes are still bound to max. 2GB/process, unfortunately).

2) I need to create 2 indices on 2 different columns whose contents are
totally uncorrelated with respect to sort order. Your solution would nicely
reduce indexing time of the 1st column, but what about the 2nd one?...

To Jay Kreibich:

> If you haven't already, try increasing the default cache size by 100x or
> higher. Just be sure you stay in physical memory.

I have, but without apparent effect (see my reply to Roger Binns).

> Yes... Virtual Tables will end up doing flat table-scans in most cases
> (unless you get very complex with the code) but are often extremely
> fast, especially if you're able to mmap the file. I've used them with
> great success for many "read-only" type tables where the typical access
> pattern is a scan and aggregation. Log files, for example. Although, if
> you're talking about creating multiple indexes, I assume that isn't
> going to work in this situation.

Right. Then there would also be another problem for me in using virtual
tables: as I am using SQLite through Python, in order to be able to use
virtual tables I would have to migrate my code to a Python wrapper that is
closer to SQLite's native API, like APSW. Such a wrapper is not DB-API
compliant and would break code genericity (my system is architectured to be
able to connect to various database engines, but the price to pay is to
remain DB-API compliant).

* * *

It seems that my post has spawned some debates that sometimes rely on false
assumptions about the nature of what I am
Re: [sqlite] Prohibitive indexing time with large tables
> His original question was about the todo list found at the wiki. Not
> sure if any of the core developers will answer, but I would be
> interested as well...

I too have long been interested in when this might be addressed, since we
see similar performance drop-offs with large numbers of rows.
Re: [sqlite] Prohibitive indexing time with large tables
On 14.02.2010 18:53, Max Vlasov wrote:

>> This is appx. 500MB cache, why not trying with 2,000,000 cache size ? :-)
>
> Hmm, managed to increase it to only 1,000,000 (x1024) size; larger
> values finally bring "Out of memory", and this value (1GB) allows up to
> 6,000,000 fast records for a 100-byte field per record index. Still
> good, such an extreme cache method can work in many cases I think.

For such problems I would suggest a different solution practiced by many
companies: simply buying a battery-buffered RAM drive. The price (as an
example, for the ASUS drive) is about 250 $. For a governmental solution,
250 $ + RAM costs won't really matter. The speed improvement (RAM access is
orders of magnitude faster than HD access) will make the costs ignorable.

On the software side, simply inserting new index entries a few thousand at
a time into a completely new (in-memory) B-tree, and merging existing pages
with these new ones, would make special sorting or other techniques
unnecessary. It's the simplest and fastest solution.
Re: [sqlite] Prohibitive indexing time with large tables
> This is appx. 500MB cache, why not trying with 2,000,000 cache size ? :-)

Hmm, managed to increase it to only 1,000,000 (x1024) size; larger values
finally bring "Out of memory", and this value (1GB) allows up to 6,000,000
fast records for a 100-byte field per record index. Still good, such an
extreme cache method can work in many cases I think.

Max
Re: [sqlite] Prohibitive indexing time with large tables
> Marcus,
>
> although increasing cache size is a good method, it may sometimes give
> unpredictable results (in terms of performance).

hm... why ?

> I looked at the vdbe code (EXPLAIN CREATE INDEX ...) of the index
> creation and it seems like there is no special sorting algorithm (CMIIW
> please).

I guess you are right; it doesn't make sense to assume a sorting. Sorry.

> Excluding all "make once" lines from the vdbe code left this loop:
>
> "19" "Rowid"
> "20" "Column"
> "21" "MakeRecord"
> "22" "IdxInsert"
> "23" "Next" .. 19
>
> which leads to the conclusion that either a single "indexed" insert or a
> step of index creation is actually just an insert of data inside the
> B-tree without special preprocessing.

But to insert that index data, sqlite needs to at least partly compare
existing values with the one it is about to insert. Here, I guess, the
cache comes into account: when sqlite is not able to hold the running index
in memory and/or the entries to be inserted are unsorted, it has to free
and reload pages randomly. Anyway, I should better stop guessing and leave
this to people who are more into the sqlite internal code... :-)

> So we just consider mass insert into a B-tree. As I suppose, the cost of
> such an insert is sometimes low, sometimes high, depending on different
> factors including the occasional need for rebalancing. As long as all
> operations are in memory and all pages are "dirty" (not saved to disk)
> the results look very promising. I tried to change cache size to 500,000
> (x1024) for the same base

This is appx. 500MB cache, why not trying with 2,000,000 cache size ? :-)

> (10M records, 100 bytes record size) and adding from scratch was very
> fast until about 3M records (taskman showing about 800Mb of memory used
> at the moment) and dropped significantly after that. Whether this bad
> point is 3M or not depends on the size of the data used in the index.
>
> So if any search inside the tree is a parsing through the cached pages
> (either by sqlite or the underlying file system), and any page write
> operation involved in new tree operations hits dirty pages that will be
> written later, there is no problem. Otherwise there is. The other
> problem is when the total data size exceeds the 32-bit address space;
> then the cache stops giving significant improvement.

Yes, I was assuming that the index will more or less fit into RAM. In my
experiments this appears to be the case, but I had an index on a hash value
rather than on the text column itself, thus reducing the memory required.

> I know that the quicksorting I suggested has disadvantages, but in some
> way its results are more predictable. For example, we postpone any
> sqlite writing operation until the data is ready. Unfortunately, for an
> index data size bigger than the 32-bit address space the file mapping
> doesn't work, but it can be changed to simple seek access, and at least
> the file system can do a decent job keeping frequently accessed data in
> cache.
>
> Marcus, I didn't understand your comment about frequent reloading. As I
> read from the initial post, he has a large chunk of unsorted data, 20M
> records, that needs to be accessed on a daily basis. So in his case any
> way that leads to data placed inside a sqlite db, indexed properly, is
> ok. The problem is that on a daily basis a couple of hours is a big
> price to pay. Jerome can correct me, but he still didn't add anything
> new to this discussion; I hope he will.

His original question was about the todo list found at the wiki. Not sure
if any of the core developers will answer, but I would be interested as
well...

Anyway, thanks for discussing.

best
Marcus

> Nice approach with the hash and collision resolving inside the query,
> will keep it in mind for future use )
>
> Max
>
> On Sun, Feb 14, 2010 at 4:03 PM, Marcus Grimm wrote:
>
>> Just for my curiosity:
>>
>> Have you tried to increase the cache as already suggested ?
>>
>> I ran into a similar problem while playing with an artificial test
>> database with appx. 10 Mio records and creating an index. Without
>> drastically increasing the cache size sqlite appears not to be able to
>> create an index on a certain field - it never stops within, say, 2
>> hours. Sounds quite disappointing keeping in mind how fast sqlite
>> usually operates, but it becomes clear when we consider that sqlite is
>> told to use only a few MB of memory per default. Also your quicksort
>> mentioned below will be very slow if it needs to reload data from disk
>> all the time. So in my case it helps to tell sqlite to use appx 500MB
>> memory via pragma cache_size.
>>
>> Please note that if you create an index on a text field sqlite will
>> basically make a copy of the whole table in your case.
>>
>> Depending on how you are using that Text field in a filter statement,
>> you may consider adding an integer hash (e.g. CRC32) entry in your
>> table, creating an index on that, and slightly changing your queries
>> like:
>> SELECT *
Re: [sqlite] Prohibitive indexing time with large tables
Marcus,

although increasing cache size is a good method, it may sometimes give
unpredictable results (in terms of performance).

I looked at the vdbe code (EXPLAIN CREATE INDEX ...) of the index creation
and it seems like there is no special sorting algorithm (CMIIW please).
Excluding all "make once" lines from the vdbe code left this loop:

"19" "Rowid"
"20" "Column"
"21" "MakeRecord"
"22" "IdxInsert"
"23" "Next" .. 19

which leads to the conclusion that either a single "indexed" insert or a
step of index creation is actually just an insert of data inside the B-tree
without special preprocessing.

So we just consider mass insert into a B-tree. As I suppose, the cost of
such an insert is sometimes low, sometimes high, depending on different
factors including the occasional need for rebalancing. As long as all
operations are in memory and all pages are "dirty" (not saved to disk) the
results look very promising. I tried to change cache size to 500,000
(x1024) for the same base (10M records, 100 bytes record size) and adding
from scratch was very fast until about 3M records (taskman showing about
800Mb of memory used at the moment) and dropped significantly after that.
Whether this bad point is 3M or not depends on the size of the data used in
the index.

So if any search inside the tree is a parsing through the cached pages
(either by sqlite or the underlying file system), and any page write
operation involved in new tree operations hits dirty pages that will be
written later, there is no problem. Otherwise there is. The other problem
is when the total data size exceeds the 32-bit address space; then the
cache stops giving significant improvement.

I know that the quicksorting I suggested has disadvantages, but in some way
its results are more predictable. For example, we postpone any sqlite
writing operation until the data is ready. Unfortunately, for an index data
size bigger than the 32-bit address space the file mapping doesn't work,
but it can be changed to simple seek access, and at least the file system
can do a decent job keeping frequently accessed data in cache.

Marcus, I didn't understand your comment about frequent reloading. As I
read from the initial post, he has a large chunk of unsorted data, 20M
records, that needs to be accessed on a daily basis. So in his case any way
that leads to data placed inside a sqlite db, indexed properly, is ok. The
problem is that on a daily basis a couple of hours is a big price to pay.
Jerome can correct me, but he still didn't add anything new to this
discussion; I hope he will.

Nice approach with the hash and collision resolving inside the query, will
keep it in mind for future use )

Max

On Sun, Feb 14, 2010 at 4:03 PM, Marcus Grimm wrote:

> Just for my curiosity:
>
> Have you tried to increase the cache as already suggested ?
>
> I ran into a similar problem while playing with an artificial test
> database with appx. 10 Mio records and creating an index. Without
> drastically increasing the cache size sqlite appears not to be able to
> create an index on a certain field - it never stops within, say, 2
> hours. Sounds quite disappointing keeping in mind how fast sqlite
> usually operates, but it becomes clear when we consider that sqlite is
> told to use only a few MB of memory per default. Also your quicksort
> mentioned below will be very slow if it needs to reload data from disk
> all the time. So in my case it helps to tell sqlite to use appx 500MB
> memory via pragma cache_size.
>
> Please note that if you create an index on a text field sqlite will
> basically make a copy of the whole table in your case.
>
> Depending on how you are using that Text field in a filter statement,
> you may consider adding an integer hash (e.g. CRC32) entry in your
> table, creating an index on that, and slightly changing your queries
> like:
> SELECT * From TestTable WHERE TextHash=12312 AND Text='Text to search';
> Unfortunately that works only for that simple form of "=" statements.
>
> Marcus
>
>> Jerome,
>>
>> It's an interesting challenge, thanks for the post. I tried to
>> research more and did some tests. My test database contains a table
>> with 10,000,000 records of text 100 chars in length:
>>
>> CREATE TABLE [TestTable] (
>> [Id] INTEGER PRIMARY KEY AUTOINCREMENT,
>> [Text] TEXT
>> )
>>
>> I suppose your data is different, but at least this one has more than
>> 1M records and not so small a record size. My final base was about 1G
>> in size.
>>
>> The default index creation
>>
>> CREATE INDEX [idx_TestTable] ON [TestTable] ([Text])
>>
>> took a very long time (about two hours or so). Having the index before
>> the data insert didn't change anything; the first records had a good
>> speed of append (about 10,000 records/sec), significantly slowing when
>> the number of records exceeded 1-2M.
>>
>> So there was no way to ignore some external files approach, and I did
>> it filling a memory-mapped file with the contents of the Text field,
>> while filling also an array in memory
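Max's opcode listing above came from EXPLAIN, and the same inspection can be reproduced from Python. A sketch; note that SQLite versions released after this thread added an external merge sorter to CREATE INDEX, so the opcodes (and their numbering) will differ from the 2010 listing:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE TestTable (Id INTEGER PRIMARY KEY, Text TEXT)")

# EXPLAIN returns the VDBE program for the statement without
# executing it; each row is one opcode.
rows = db.execute(
    "EXPLAIN CREATE INDEX idx_TestTable ON TestTable (Text)").fetchall()
opcodes = [r[1] for r in rows]  # column 1 holds the opcode name
print(opcodes)
```

On a current SQLite, look for Sorter* opcodes in the output; their presence is exactly the "special sorting" Max found missing at the time.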
Re: [sqlite] Prohibitive indexing time with large tables
Just for my curiosity:

Have you tried to increase the cache as already suggested ?

I ran into a similar problem while playing with an artificial test database
with appx. 10 Mio records and creating an index. Without drastically
increasing the cache size sqlite appears not to be able to create an index
on a certain field - it never stops within, say, 2 hours. Sounds quite
disappointing keeping in mind how fast sqlite usually operates, but it
becomes clear when we consider that sqlite is told to use only a few MB of
memory per default. Also your quicksort mentioned below will be very slow
if it needs to reload data from disk all the time. So in my case it helps
to tell sqlite to use appx 500MB memory via pragma cache_size.

Please note that if you create an index on a text field sqlite will
basically make a copy of the whole table in your case.

Depending on how you are using that Text field in a filter statement, you
may consider adding an integer hash (e.g. CRC32) entry in your table,
creating an index on that, and slightly changing your queries like:

SELECT * From TestTable WHERE TextHash=12312 AND Text='Text to search';

Unfortunately that works only for that simple form of "=" statements.

Marcus

> Jerome,
>
> It's an interesting challenge, thanks for the post. I tried to research
> more and did some tests. My test database contains a table with
> 10,000,000 records of text 100 chars in length:
>
> CREATE TABLE [TestTable] (
> [Id] INTEGER PRIMARY KEY AUTOINCREMENT,
> [Text] TEXT
> )
>
> I suppose your data is different, but at least this one has more than 1M
> records and not so small a record size. My final base was about 1G in
> size.
>
> The default index creation
>
> CREATE INDEX [idx_TestTable] ON [TestTable] ([Text])
>
> took a very long time (about two hours or so). Having the index before
> the data insert didn't change anything; the first records had a good
> speed of append (about 10,000 records/sec), significantly slowing when
> the number of records exceeded 1-2M.
>
> So there was no way to ignore some external files approach, and I did it
> filling a memory-mapped file with the contents of the Text field, while
> filling also an array in memory saving offsets and lengths of the
> strings. After that, quicksort of that offset array took about 5 minutes
> and inserting the texts in sorted order into the sqlite base another 5
> minutes, so about 10 minutes in total. First 5 minutes was possible
> since we exchange only offsets, not data, and other 5 minutes since
> inserting sorted data into B-tree is really a fast operation.
>
> Although real life data can be different, the things that worked might
> be the same. So anyone can use this method occupying not more than the
> sqlite file itself for temporary storage and ending up with the data in
> necessary order inside the sqlite database after that. I know that there
> are many things to take into account, like the memory size and the size
> of the actual data, but it's just a proof of concept.
>
> Also I think sqlite could use the same approach internally for creating
> an index for existing data. The db is probably already exclusively
> locked while CREATE INDEX is in process, so having a temporary array
> accessing and storing for example file offsets of particular records
> should not be a problem.
>
> Max
>
> On Sat, Feb 13, 2010 at 5:00 PM, Jérôme Magnin wrote:
>
>> Hi,
>>
>> This post is a question directed to D. Richard Hipp:
>>
>> I have been using SQLite for 3 years in a records linkage software
>> package I have developed. My organization recently had to use the
>> package to perform daily linkage of large administrative governmental
>> registers (up to 20 million records each). During the linkage process,
>> auxiliary tables containing record "fingerprints" must be created, and
>> two columns be indexed in them.
>>
>> SQLite provides decent indexing times for such tables with up to 1M
>> rows, but beyond this size the (already well-discussed) critical
>> slowing down of indexing performance due to disk nonlocality kicks in.
>> The only workaround I could imagine to ease the problem would be to
>> duplicate the auxiliary table and load pre-sorted rows in it, with the
>> sort key being the column I intend to index on. This is unfortunately
>> too costly in terms of disk space used.
>>
>> I therefore had to develop an alternate datasource type (based on flat
>> files) in order for my system to be able to efficiently handle big
>> files. Which is a pity, since SQLite provides great features I still
>> would like to be able to rely upon when dealing with large files.
>>
>> Now my question: in the "To do" section of SQLite's wiki, you mention
>> "Develop a new sort implementation that does much less disk seeking.
>> Use to improve indexing performance on large tables.". I have been
>> seeing this entry for 3 years but nothing concrete seems to have
>> happened on this front in the meantime. Do you have any idea about if
>> (and when) you will work on
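Marcus's hash trick can be sketched end to end in Python, with `zlib.crc32` standing in for whatever CRC implementation is at hand; the table and column names follow his example, with the `TextHash` column name assumed:

```python
import sqlite3
import zlib

# Index a 32-bit CRC of the text instead of the text itself, then
# re-check the full value in the WHERE clause to weed out collisions.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE TestTable (
                  Id INTEGER PRIMARY KEY AUTOINCREMENT,
                  Text TEXT,
                  TextHash INTEGER)""")

def insert(text):
    db.execute("INSERT INTO TestTable (Text, TextHash) VALUES (?, ?)",
               (text, zlib.crc32(text.encode("utf-8"))))

for s in ("alpha", "beta", "Text to search"):
    insert(s)
db.execute("CREATE INDEX idx_hash ON TestTable (TextHash)")

needle = "Text to search"
row = db.execute(
    "SELECT Id, Text FROM TestTable WHERE TextHash = ? AND Text = ?",
    (zlib.crc32(needle.encode("utf-8")), needle)).fetchone()
print(row)
```

The index then holds small fixed-size integers rather than 100-byte strings, which is also why Marcus's own index fit in RAM in his tests above.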
Re: [sqlite] Prohibitive indexing time with large tables
Jerome, It's an an interesting challenge, thanks for the post I tried to research more and did some tests. My test database contains a table with 10,000,000 records of the text 100 chars in length CREATE TABLE [TestTable] ( [Id] INTEGER PRIMARY KEY AUTOINCREMENT, [Text] TEXT ) I suppose your data is different, but at least this one has more than 1M records and not so small record size. My final base was about 1G in size. The default index creation CREATE INDEX [idx_TestTable] ON [TestTable] ([Text] ) took very long time (about two hours or so). Having the index before data insert didn't change anything, first records had a good speed of append (about 10,000 records/sec significantly slowing when the number of records exceeded 1-2M). So there was no way to ignore some external files approach and I did it filing memory-mapped file with the contents of Text field, while filling also array in memory saving offsets and length of the strings. After that quicksort of that offset array took about 5 minutes and inserting the textes in sorted order to sqlite base other 5 minutes, so about 10 minutes it total. First 5 minutes was possible since we exchange only offsets, not data and other 5 minutes since inserting sorted data into B -tree is really a fast operation. Although real life data can be different, the things that worked might be the same. So anyone can use this method occupying not more than the sqlite file itself for temporary storage and ending up with the data in necessary order inside sqlite database after that. I know that there are many things to take into account like the memory size and the size of the actual data but it's just a proof of concept. Also I think sqlite could use the same approach internally for creating index for existing data. The db is probably already exclusively locked while CREATE INDEX is in process so having temporary array accessing and storing for example file offsets of particular records should not be a problem. 
Max On Sat, Feb 13, 2010 at 5:00 PM, Jérôme Magninwrote: > Hi, > > This post is a question directed to D. Richard Hipp : > > I have been using SQLite for 3 years in a records linkage software > package I have developed. My organization recently had to use the > package to perform daily linkage of large administrative governmental > registers (up to 20 million records each). During the linkage process, > auxiliary tables containing records "fingerprints" must be created, and > two columns be indexed in them. > > SQLite provides decent indexing times for such tables with up to 1M > rows, but beyond this size the (already well-discussed) critical slowing > down of indexing performance due to disk nonlocality kicks in. The only > workaround I could imagine to ease the problem would be to duplicate the > auxiliary table and load pre-sorted rows in it, with sort key being the > column I intend to index on. This is unfortunately too costly in terms > of disk space used. > > I therefore had to develop an alternate datasource type (based on flat > files) in order for my system to be able to efficiently handle big > files. Which is a pity since SQLite provides great features I still > would like to be able to rely upon when dealing with large files. > > Now my question: in the "To do" section of SQLite's wiki, you mention > "Develop a new sort implementation that does much less disk seeking. Use > to improve indexing performance on large tables.". I have been seeing > this entry for 3 years but nothing concrete seems to have happened on > this front in the meantime. Do you have any idea about if (and when) you > will work on this in the future ? Can I nourish reasonable hopes that > the situation will improve on this aspect within the next 2 years ? This > really has become a critical factor for me to decide on my future > development strategy with this product. > > Thanks in advance for any useful information. 
> Jerome
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Prohibitive indexing time with large tables
On Sat, Feb 13, 2010 at 11:21:25AM -0800, Roger Binns scratched on the wall:
> Jérôme Magnin wrote:
> > SQLite provides decent indexing times for such tables with up to 1M
> > rows,
>
> As another data point, my data set is 10M records and the 6 indices I need
> are created in under two minutes. (I also analyze.) The database is 2GB
> but the least amount of memory I have is 4GB. I also use a 32kb page size,
> larger cache etc.

A larger cache makes a huge difference with index creation. As the website indicates, there are some known issues with the index sorting algorithm that, to my knowledge, have never bubbled to the top of the development team's TODO list. The existing insert algorithm does its job, but it tends to need access to a large number of pages, which can cause major cache thrashing if the index pages overflow the cache size.

If you haven't already, try increasing the default cache size by 100x or more. Just be sure you stay within physical memory.

> > I therefore had to develop an alternate datasource type (based on flat
> > files) in order for my system to be able to efficiently handle big
> > files. Which is a pity since SQLite provides great features I still
> > would like to be able to rely upon when dealing with large files.
>
> You can also make virtual tables which will let you have the appearance of
> everything being in SQLite.

Yes... virtual tables will end up doing flat table scans in most cases (unless you get very complex with the code), but they are often extremely fast, especially if you're able to mmap the file. I've used them with great success for many "read-only" type tables where the typical access pattern is a scan and aggregation. Log files, for example. Although, if you're talking about creating multiple indexes, I assume that isn't going to work in this situation.

Virtual tables that link to external data sources are really cool and hugely useful, however. I'm surprised they aren't used more.
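To make the cache suggestion concrete, here is a small sketch (in Python's sqlite3 module; the 200000 figure is illustrative, roughly 100x the old 2000-page default). The cache is measured in pages, so the memory used is cache_size times page_size; pick a value that stays within physical RAM.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Raise the page cache well above the default before building the index.
# Memory used is roughly cache_size * page_size bytes.
con.execute("PRAGMA cache_size = 200000")
cache = con.execute("PRAGMA cache_size").fetchone()[0]

con.execute("CREATE TABLE t (x TEXT)")
con.execute("CREATE INDEX idx_t ON t (x)")  # benefits from the larger cache
```

The pragma applies per connection, so it must be issued on the same connection that runs CREATE INDEX.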
If nothing else, they make great data importers.

-j

--
Jay A. Kreibich < J A Y @ K R E I B I.C H >

"Our opponent is an alien starship packed with atomic bombs. We have a protractor." "I'll go home and see if I can scrounge up a ruler and a piece of string." --from Anathem by Neal Stephenson
Re: [sqlite] Prohibitive indexing time with large tables
Jérôme Magnin wrote:
> SQLite provides decent indexing times for such tables with up to 1M
> rows,

As another data point, my data set is 10M records and the 6 indices I need are created in under two minutes. (I also analyze.) The database is 2GB but the least amount of memory I have is 4GB. I also use a 32kb page size, a larger cache, etc.

> The only workaround I could imagine to ease the problem would be to
> duplicate the auxiliary table and load pre-sorted rows in it, with sort
> key being the column I intend to index on. This is unfortunately too
> costly in terms of disk space used.

What numbers are you getting? One man's unacceptable is another's OK. Surely the index generation is a one-time thing being done at upgrade time. Couldn't it be done late at night, so that taking several hours would be acceptable?

> I therefore had to develop an alternate datasource type (based on flat
> files) in order for my system to be able to efficiently handle big
> files. Which is a pity since SQLite provides great features I still
> would like to be able to rely upon when dealing with large files.

You can also make virtual tables, which will let you have the appearance of everything being in SQLite. Note that SQLite is unlikely to beat a custom storage format designed specifically for a particular data set, as it is general-purpose code.

> Can I nourish reasonable hopes that the situation will improve on this
> aspect within the next 2 years? This really has become a critical factor
> for me to decide on my future development strategy with this product.

Unless you post a representative schema and data set somewhere, no one else can make it better for what you are doing.
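As an aside on the 32kb page size Roger mentions: the page size has to be set before the database file is populated (or applied afterwards with VACUUM), so a setup along these lines is needed when creating the database. This is just a sketch; the file name and cache figure are illustrative.

```python
import sqlite3
import tempfile
import os

# Page size only takes effect on a fresh (empty) database file.
path = os.path.join(tempfile.mkdtemp(), "big.db")
con = sqlite3.connect(path)
con.execute("PRAGMA page_size = 32768")   # must come before any tables exist
con.execute("PRAGMA cache_size = 10000")  # 10000 pages * 32 KB ~= 312 MB cache
con.execute("CREATE TABLE t (fingerprint TEXT)")
con.commit()

page = con.execute("PRAGMA page_size").fetchone()[0]
```

With a larger page size, the same cache_size setting covers far more of the database, which is part of why Roger's index builds stay fast.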
Alternatively, see the first section of http://sqlite.org/support.html

Roger