On 2 Feb 2012, at 11:31pm, Udi Karni wrote:

> Given how clever and compelling SQLite is - I am testing how it scales to
> tables in the 100GB / 200 million row range. This is for a strictly "read
> only" application - but first the tables must be populated in a one time
> process. As is often the case with Big Data - the data is a little dirty -
> so the process involves importing - selecting - counting - inspecting -
> updating some rows - deleting some rows - selecting - counting -
> inspecting, etc. until clean.
> 
> Placing the SQLite database on a traditional C: drive - IO was too slow. At
> 15 MB/sec - reading a 50GB table would take an hour. So I moved it to an
> external RAID array where I ran across an interesting find. IO wasn't that
> much faster - until I vacuumed the database - which increased IO 10X to 150
> MB/sec - with the same CPU utilization.
> 
> This is good news for the final implementation of this read-only database -
> but still a dilemma at the data load phase. After a ".vacuum" - issuing a
> single DML statement against a table - even a DELETE which deletes no rows at
> all - causes IO to drop back down to 15 MB/sec on the table I'm selecting /
> DMLing against - which makes the data loading / cleansing phase very long.

Nice description of your situation and requirements.  Makes it easier to answer 
your questions.

> So I have 2 questions -
> 
> (1) Why would simple DML cause such an extreme slowdown as compared with
> "post-vacuum" speeds?

Spins.  Reading scattered areas of the database file means waiting for the disk 
to rotate into position.  If the data you need is fragmented you end up wasting 
a lot of time: you can't speed it up, because you are just sitting there waiting 
for the disk to turn to the right place.  A VACUUM rewrites the whole file with 
each table's pages laid out contiguously, which is why sequential reads get so 
much faster afterwards; any subsequent DML starts scattering pages again.  
Instead of a RAID array of rotating disks, try a non-moving medium such as 
solid-state storage.
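
If you want a rough idea of how much reclaimable dead space the file is carrying 
before deciding whether another VACUUM is worth it, a quick check from the 
sqlite3 shell (just an illustrative sketch - run it against your own database) is:

    -- Rough measure of free/reclaimable space in the database file.
    PRAGMA page_size;       -- bytes per page
    PRAGMA page_count;      -- total pages currently in the file
    PRAGMA freelist_count;  -- pages sitting unused on the free list
    -- reclaimable bytes is roughly page_size * freelist_count;
    -- VACUUM rewrites the file and drops this back to (near) zero.

A large freelist after your DELETE/UPDATE passes is a hint that the table pages 
are no longer packed, which lines up with the slowdown you're seeing.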

> (2) Any knobs to turn to try and maintain the higher speeds post-DML -
> without resorting to ".vacuum"?

There are a ton of things which will help, but probably not enough to make it 
worthwhile for a one-shot job.  You can presort your data into batches: building 
or searching an index whose entries arrive already in order is faster than 
working from random order.  You can close the database, use your OS's facilities 
to defragment the file, then reopen the database.  (This helps a lot under 
Windows, which is very sensitive to fragmentation, somewhat less under other 
OSen.)  You can do your initial insertions into smaller databases and then merge 
them.
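
As a sketch of that last idea - the table and file names here are made up for 
illustration - the merge step can be done entirely inside SQLite with ATTACH:

    -- Load each chunk of raw data into its own small database first
    -- (part1.db, part2.db, ...), clean it there, then merge into the
    -- final database.  All names below are hypothetical.
    ATTACH DATABASE 'part1.db' AS part1;
    BEGIN;
    INSERT INTO main.big_table SELECT * FROM part1.big_table;
    COMMIT;
    DETACH DATABASE part1;
    -- Repeat for part2.db, part3.db, ...
    -- Create any indexes on main.big_table only after the last merge,
    -- so they are built once instead of being updated row by row.

Each chunk goes in under its own transaction, which keeps the rollback journal 
manageable, and a freshly written part file tends to be laid out contiguously, 
so the merge reads are mostly sequential.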

But your description of the situation suggests to me that this isn't worth doing.  
It's a one-time-only activity, and it's probably going to take you longer to do 
the programming than the time you'd save.

Simon.
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
