Re: [sqlite] indexing large databases
On May 10, 2007, at 11:08 PM, Juri Wichanow wrote:

> For "create index.." in large database : "pragma default_cache_size = 2000;" For "select ..." -- "pragma default_cache_size = 1200;"

Hmm, quite interesting. I would like to share my naive observations, which led me to believe the cache_size was important (although I am now beginning to doubt this; see below).

First we run the indexing on server 1 - I am essentially the only user, but we only have 2 GB of RAM. I see (this is using top, as is everything else I say about performance) that very quickly the memory usage caps out and the CPU usage drops. My conclusion: an I/O bottleneck. So I increase cache_size, and sure enough, the program runs longer before it seems to bottleneck.

So I switch to another server (about the same speed, 16 GB of RAM, unknown I/O performance compared to the first one) and increase the cache_size dramatically, hypothesizing that I could just keep the entire db in RAM (I now think I understand that I will still have to do I/O when writing the index). And sure enough, I see no drop in CPU usage over the five days the program runs before the server gets rebooted.

At this point (yesterday) I was convinced that the new server was dramatically better. Nevertheless, I had never terminated my old runs on server 1, and lo and behold, yesterday I accidentally discovered they had finished in around 7-8 days. This makes no sense to me, but it could have something to do with how top measures CPU usage, or I could have made a mistaken observation.

Well, I am curious now, so I am going to time this carefully on various servers (and report my results back to the list, of course).

Kasper

- To unsubscribe, send email to [EMAIL PROTECTED] -
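For reference, the cache size being tuned above can also be set per connection rather than persistently. A minimal sketch using Python's sqlite3 module (the page count is illustrative, not a value from the thread): default_cache_size is written into the database file and applies to future connections, while cache_size affects only the current connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# cache_size is measured in database pages (a positive value),
# and applies only to this connection, unlike default_cache_size,
# which persists in the database file.
conn.execute("PRAGMA cache_size = 20000")

# Read the setting back to confirm it took effect.
(pages,) = conn.execute("PRAGMA cache_size").fetchone()
print(pages)
conn.close()
```

Checking the value like this is a cheap way to confirm the pragma actually applied before starting a multi-day run.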
[sqlite] indexing large databases
For "create index.." in large database : "pragma default_cache_size = 2000;"
For "select ..." -- "pragma default_cache_size = 1200;"

Juri
Re: [sqlite] indexing large databases
Kasper Daniel Hansen <[EMAIL PROTECTED]> wrote:

> Having said that, I can understand that sorting and disk cache and so on factor in - but my initial database is already very large (1.3 GB, 145,000,000 rows), and surely the disk cache would already factor in at that stage?

As long as your table does not contain out-of-order indices, each new row will be inserted at the end. So the last few disk pages will be in cache and everything will go quickly. It is when you have to start inserting at random places in the middle of the database - on disk pages that are not cached - that things slow down.

--
D. Richard Hipp <[EMAIL PROTECTED]>
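The point above can be sketched concretely: if rows are sorted by the index key before insertion, the index B-tree is built by appending to its rightmost pages, which stay in cache. A toy illustration in Python (the table and data here are hypothetical, not the poster's schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hits (var1 INTEGER, var2 INTEGER)")

# Toy data: sorting by (var1, var2) before inserting means a
# subsequent unique index on (var1, var2) sees rows already in
# key order - inserts land at the end of the B-tree, not in the
# middle on uncached pages.
rows = [(2, 1), (1, 2), (1, 1), (2, 2)]
rows.sort()  # Python tuples sort lexicographically, matching the index order

conn.executemany("INSERT INTO hits VALUES (?, ?)", rows)
conn.execute("CREATE UNIQUE INDEX hits_idx ON hits (var1, var2)")

# Force a query through the new index to confirm it is usable.
(count,) = conn.execute(
    "SELECT COUNT(*) FROM hits INDEXED BY hits_idx").fetchone()
print(count)
conn.close()
```

At four rows this makes no measurable difference, of course; the locality argument only bites once the working set exceeds the page cache.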
Re: [sqlite] indexing large databases
On May 10, 2007, at 3:04 PM, [EMAIL PROTECTED] wrote:

> Kasper Daniel Hansen <[EMAIL PROTECTED]> wrote:
>> Hi
>>
>> We are using SQLite for a fairly big (but simple) calculation, and have some problems when creating an index on a database with 600,000,000 rows. Specifically, it has not finished even after 5 days of running. We have done it successfully on 25% of the full database, and are now wondering what we can do.
>
> This is likely a locality of reference problem. It comes up about every two weeks on this mailing list. You can read more about it by looking at the mailing list archives. I think I know how to fix the problem, but I have not yet had an opportunity to implement the fix.

Thank you for that pointer. I have now read up a bit on the references. From what I understand, it has to do with the ordering of the data and the disk cache and so on. I could without too much trouble (although it will take some time) sort the data so that it is in the right order for the index (by "right" I mean that if I create a (VAR1, VAR2) index, I want VAR2 to be sorted in increasing order within VAR1). Would that help? Or does the UNIQUE keyword make SQLite do some checking that will slow everything down?

Having said that, I can understand that sorting and disk cache and so on factor in - but my initial database is already very large (1.3 GB, 145,000,000 rows), and surely the disk cache would already factor in at that stage?

Thanks,
Kasper
Re: [sqlite] indexing large databases
Kasper Daniel Hansen <[EMAIL PROTECTED]> wrote:

> Hi
>
> We are using SQLite for a fairly big (but simple) calculation, and have some problems when creating an index on a database with 600,000,000 rows. Specifically, it has not finished even after 5 days of running. We have done it successfully on 25% of the full database, and are now wondering what we can do.

This is likely a locality of reference problem. It comes up about every two weeks on this mailing list. You can read more about it by looking at the mailing list archives. I think I know how to fix the problem, but I have not yet had an opportunity to implement the fix.

--
D. Richard Hipp <[EMAIL PROTECTED]>
[sqlite] indexing large databases
Hi

We are using SQLite for a fairly big (but simple) calculation, and have some problems when creating an index on a database with 600,000,000 rows. Specifically, it has not finished even after 5 days of running. We have done it successfully on 25% of the full database, and are now wondering what we can do.

Details: We have a very simple database with just two columns, VAR1 and VAR2, both integers. Every combination of (VAR1, VAR2) is unique. We want to add an index using the following command (table and index names generalized):

CREATE UNIQUE INDEX myidx ON mytable (VAR1, VAR2);

After creating the index we have to do approx. 100,000,000 lookups on (VAR1, VAR2). We have benchmarked the results on a small(er) database with 145,000,000 rows, where the indexing takes around 1 hour and the speed of the lookup is _very_ acceptable. However, when running the indexing command on the full database, it does not terminate even after 5-6 days (unfortunately our server got rebooted while it was running). We are wondering about the rather drastic increase in run time when going from 145,000,000 rows to 600,000,000 rows.

?: does the UNIQUE keyword matter, i.e. does it make lookups faster and does it slow down the index creation?
?: does the order of the data matter a lot - right now it is partially sorted, in that all combinations involving a single value of VAR2 (not VAR1) are next to each other.
?: is it possible to enable some kind of logging/debugging mode so we can monitor progression and see how it is doing?

We run the program on a 2600 MHz Opteron with 16 GB of RAM, and so far it seems we are not getting killed by I/O or RAM: using top frequently shows around 99% CPU usage and less than 40% memory usage.
We are using the following program (including all pragmas):

sqlite3 blasthits.db "pragma default_cache_size = 1200;"
sqlite3 blasthits.db "pragma synchronous = off; pragma temp_store = 2; create unique index probeseqIDX on blasthits (probeidx,prokMSA);"

(here blasthits.db is the database name and probeidx, prokMSA are VAR1, VAR2).

?: Is there anything we can do to speed it up further?

Thanks a lot for any help, even if it is "it does not seem to be possible to improve the speed of this operation".

Jim and Kasper
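On the progress-monitoring question above: the sqlite3 command-line shell offers no progress output, but SQLite's C API has a progress-handler hook, exposed in Python's sqlite3 module as Connection.set_progress_handler. A sketch of the same workflow against a toy in-memory database (the real run would open blasthits.db; the row count and callback interval here are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # the real run would open blasthits.db
conn.execute("PRAGMA synchronous = OFF")
conn.execute("PRAGMA temp_store = 2")  # keep temporary storage in memory

conn.execute("CREATE TABLE blasthits (probeidx INTEGER, prokMSA INTEGER)")
conn.executemany("INSERT INTO blasthits VALUES (?, ?)",
                 ((i, i) for i in range(10000)))

# The handler fires every N SQLite virtual-machine opcodes, so a
# long-running CREATE INDEX can at least report that it is alive.
ticks = 0
def progress():
    global ticks
    ticks += 1
    return 0  # returning nonzero would abort the running statement

conn.set_progress_handler(progress, 1000)
conn.execute(
    "CREATE UNIQUE INDEX probeseqIDX ON blasthits (probeidx, prokMSA)")
print(ticks)
conn.close()
```

In a real run the handler would print a counter or timestamp rather than just increment; the opcode interval is coarse, so it measures liveness, not percent complete.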