Thanks! Good ideas. SSD is still a little exotic price-wise, but closing and defragging is easy to try.
Come to think of it - your suggestion to break the process into batches is probably the way to go. By placing each table in its own DB, not only are they smaller and easier to manage, but the vacuum becomes a single-table vacuum instead of a multi-table vacuum - no redoing tables that are already streamlined. Very cool.

In which case - how about not combining them at all, and attaching instead? In other words - is the performance of one large DB containing all the data tables plus the "work" tables about equivalent to a "work only" DB that attaches, as needed, to multiple external DBs, each containing one table? With some databases, cross-database communication is much slower than keeping everything local. Is there substantial overhead to using multiple ATTACH statements in SQLite? (A rough sketch of what I mean is in the P.S. below.)

Thanks,

Udi

On Thu, Feb 2, 2012 at 3:51 PM, Simon Slavin <slav...@bigfraud.org> wrote:

> On 2 Feb 2012, at 11:31pm, Udi Karni wrote:
>
> > Given how clever and compelling SQLite is - I am testing how it scales
> > to tables in the 100GB / 200 million row range. This is for a strictly
> > "read only" application - but first the tables must be populated in a
> > one-time process. As is often the case with Big Data - the data is a
> > little dirty - so the process involves importing - selecting - counting
> > - inspecting - updating some rows - deleting some rows - selecting -
> > counting - inspecting, etc. until clean.
> >
> > Placing the SQLite database on a traditional C: drive - I/O was too
> > slow. At 15 MB/sec - reading a 50GB table would take an hour. So I
> > moved it to an external RAID array, where I ran across an interesting
> > find. I/O wasn't that much faster - until I vacuumed the database -
> > which increased I/O 10X to 150 MB/sec - with the same CPU utilization.
> >
> > This is good news for the final implementation of this read-only
> > database - but still a dilemma at the data-load phase. After a
> > ".vacuum" - issuing a single DML statement against a table - even a
> > DELETE which deletes no rows at all - causes I/O to drop back down to
> > 15 MB/sec on the table I'm selecting / DMLing - which makes the data
> > loading / cleansing phase very long.
>
> Nice description of your situation and requirements. Makes it easier to
> answer your questions.
>
> > So I have 2 questions -
> >
> > (1) Why would simple DML cause such an extreme slowdown as compared
> > with "post vacuum" speeds?
>
> Spins. Reading one area of the database file means waiting for rotations
> of the disk. If the data you need is fragmented you end up wasting a lot
> of time. You can't speed it up because you are just sitting there
> waiting for the disk to turn to be in the right place. Instead of using
> a RAID, try using a non-rotating medium like solid-state storage.
>
> > (2) Any knobs to turn to try and maintain the higher speeds post DML -
> > without resorting to ".vacuum"?
>
> There are a ton of things which will help, but probably not enough to
> make it worthwhile for a one-shot job. You can presort your data into
> batches: searching an index which is already in order is faster. You can
> close the database, use your OS's facilities to defragment the file,
> then reopen the database. (This helps a lot under Windows, which is very
> sensitive to fragmentation, somewhat less so for other OSen.) You can do
> your initial insertions into smaller databases, then merge them.
>
> But your telling of your situation suggests to me that this isn't worth
> doing. You have a one-time-only activity.
> It's probably gonna take you longer to do the programming than you'll
> save.
>
> Simon.
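P.S. Here is roughly the "work only" layout I have in mind - just a sketch, with made-up file and table names (a sales.db holding an "orders" table, a customers.db holding a "customers" table), and the work DB opened as main:

  -- each big data table lives in its own file, attached on demand
  ATTACH DATABASE 'sales.db'     AS sales;
  ATTACH DATABASE 'customers.db' AS cust;

  -- cross-database join; the "work" table is created in the main DB
  CREATE TABLE work_orders AS
  SELECT s.*, c.region
  FROM   sales.orders   AS s
  JOIN   cust.customers AS c ON c.id = s.customer_id;

  DETACH DATABASE sales;
  DETACH DATABASE cust;

From what I can tell, an attached database is just another file handle and pager, so queries against it should run at close to native speed - though the default build only allows 10 attached databases at a time (SQLITE_MAX_ATTACHED), so with many table files the attach-as-needed / detach pattern matters.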
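And if I do end up merging batches the way Simon suggests, I assume it would look something like this (again with made-up names):

  -- each batch was loaded into its own small file first
  ATTACH DATABASE 'batch1.db' AS b1;
  INSERT INTO main.orders SELECT * FROM b1.orders;
  DETACH DATABASE b1;
  -- repeat for batch2.db, batch3.db, ...

Each INSERT ... SELECT is a single statement, so it is atomic on its own - no extra transaction wrapping needed per batch.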
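On question (2) - the only knobs I have seen mentioned for bulk loads (Simon didn't list these, so this is just my guess at what applies) are the durability pragmas. They trade crash-safety for speed, which seems acceptable for a one-time load that can simply be rerun:

  PRAGMA journal_mode = OFF;    -- no rollback journal; rebuild the DB if the load dies
  PRAGMA synchronous  = OFF;    -- skip fsync calls; same caveat
  PRAGMA cache_size   = 100000; -- in pages; ~100 MB at the default 1 KB page size
  PRAGMA temp_store   = MEMORY; -- keep temporary B-trees in RAM

They won't fix fragmentation the way VACUUM does, but they should cut the cost of the update / delete passes.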