Thank you very much for your insights, and I agree with you on the role
of disk speed vs. CPU in this process. However, in the bioinformatics
application I am maintaining, we have huge databases that easily
exceed the size of regular SSD drives (512 GB, for example), so we have
to rely on disk arrays of regular hard drives, with less-than-optimal
random access speed.

One of the problems we have is the performance of analyzing large
tables one by one. For example, we are puzzled that copying tables of
similar sizes from one database to another takes anywhere from 40
seconds to 10 minutes.
You should be puzzled... that cannot be right, unless the copying involves multi-level queries feeding the data in one instance and not in another.
Is this behaviour random, or does the same copy/query always take the same amount
of time?
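
If it helps: a quick way to answer that is to time the identical copy several
times in a row. A minimal sketch using Python's sqlite3 module (file and
table names are invented):

    import sqlite3
    import time

    SRC = "source.db"        # invented paths; substitute your real files
    DST = "copy_test.db"
    TABLE = "some_table"

    for run in range(5):
        con = sqlite3.connect(DST)
        con.execute("ATTACH DATABASE ? AS src", (SRC,))
        con.execute(f"DROP TABLE IF EXISTS {TABLE}")
        con.commit()
        t0 = time.time()
        # the whole copy as one statement, as in your scenario
        con.execute(f"CREATE TABLE {TABLE} AS SELECT * FROM src.{TABLE}")
        con.commit()
        con.close()
        print(f"run {run}: {time.time() - t0:.1f} s")

Bear in mind only the first run is truly cold; after that the OS page cache
is warm, so if the times still swing wildly across warm runs, the disk is
probably not the culprit.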

We are still trying to understand the problem, but
I suspect that the layout of the records plays a role here. For
example, if we create 1000 tables and insert records into them in batches,
the data of each table will likely be spread evenly across the whole
database, and almost all pages need to be read when we go through the
records of one table. One of the things I am trying (to improve
the performance of analysis) is to copy tables to several temporary
databases before analysis, and I am hoping that parallelization can
speed up the copy process. Ideally, if we are reading from one
database and feeding several destination databases, we only need to
read the source database once, thanks to the caching of disk contents.
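
Concretely, the single read pass feeding several destinations could be made
explicit rather than left to the disk cache. A rough sketch using Python's
sqlite3 module (all names below are invented, and the destination files are
assumed to be fresh):

    import sqlite3

    SRC = "main.db"
    DESTS = ["tmp1.db", "tmp2.db", "tmp3.db"]
    TABLE = "variants"
    BATCH = 10000

    src = sqlite3.connect(SRC)
    # recreate the table's schema in every destination
    (schema,) = src.execute(
        "SELECT sql FROM sqlite_master WHERE type = 'table' AND name = ?",
        (TABLE,)).fetchone()
    dests = [sqlite3.connect(p) for p in DESTS]
    for d in dests:
        d.execute(schema)

    cur = src.execute(f"SELECT * FROM {TABLE}")
    placeholders = ",".join("?" * len(cur.description))
    insert = f"INSERT INTO {TABLE} VALUES ({placeholders})"

    # one sequential read of the source; each batch goes to every destination
    while rows := cur.fetchmany(BATCH):
        for d in dests:
            d.executemany(insert, rows)

    for d in dests:
        d.commit()
        d.close()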
The caches on normal drives are 32 to 64 MB; on disk arrays you can raise this a bit, but nothing will come close to your 512 GB+ sizes, so there is hardly any advantage to be gained in the disk-caching department. This is what I tried to explain last time: whether you read it once, top to bottom, or at several intervals, there is no performance to be gained in the read operation, nor in the write operation where the intended result is a single DB file once again. The only place you can use the multi-track system to gain performance is if your analysis tools take a lot of processing and you try to analyze chunks of the data on different processors - or - if you can split the DB into smaller DBs and keep it split, always using it like that.
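
To illustrate the "keep it split" route: once the data lives in several
smaller files, the per-chunk analysis parallelizes naturally, along these
lines (a sketch; file and table names are invented, and COUNT(*) stands in
for the real analysis):

    import sqlite3
    from multiprocessing import Pool

    PARTS = ["part1.db", "part2.db", "part3.db"]  # pre-split DB files

    def analyze(path):
        con = sqlite3.connect(path)
        # stand-in for whatever per-table work the pipeline really does
        (n,) = con.execute("SELECT COUNT(*) FROM variants").fetchone()
        con.close()
        return path, n

    if __name__ == "__main__":
        with Pool(len(PARTS)) as pool:
            for path, n in pool.map(analyze, PARTS):
                print(path, n)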
Anyway, I see a patch that adds a 'READONLY' keyword to 'ATTACH
DATABASE' at
http://old.nabble.com/Re%3A-Question-regarding-modified-ATTACH-command-p26178713.html.
It applies to sqlite-3.6.17, but I will try to adapt it to the current
version of sqlite and see if it helps. I will also try to vacuum the
main database so that the data of each table clusters together (I am
not sure this will happen, or whether it helps performance), which
will take days to complete.
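
For what it's worth, if the "current version" is 3.7.7 or newer, the patch
may not be needed: SQLite's URI filenames already allow a read-only attach.
A minimal sketch via Python's sqlite3 module (invented file names; the main
connection must be opened with uri=True so that ATTACH will also accept a
URI):

    import sqlite3

    # uri=True opens the connection with SQLITE_OPEN_URI, which also lets
    # ATTACH interpret URI filenames (SQLite >= 3.7.7)
    con = sqlite3.connect("file:work.db", uri=True)
    con.execute("ATTACH DATABASE 'file:big_source.db?mode=ro' AS src")

    # reads work; any write into src should fail with
    # "attempt to write a readonly database"
    for row in con.execute("SELECT name FROM src.sqlite_master LIMIT 5"):
        print(row)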
Vacuuming the DB should help, but to be frank, I think you are searching for the proverbial lost coin where the light shines, not where you actually lost it. The wildly varying table-copy times mentioned above are a clear indicator that something is fundamentally wrong with either the database or the hardware on which it resides. Solve that, and the other problems should go away.

Please understand I'm not criticising, but I've made these mistakes aplenty in my life, so I understand the phenomenon very well: the multi-tracking (for lack of a better word) of the data into split tables and a rejoined end result is a construct of your mind's logic (a hoped-for solution, as it were), but in reality it has no benefit for the problem you are trying to solve. I also think that the obviously ludicrous amounts of time it takes to test any theory on such large databases deplete your patience, so you try to come up with less boring solutions that feel like they would improve speed. (I feel your pain; I've made the mistake of trying fixes out of desperation rather than research plenty of times in my life.)

Don't you have a smaller test bed in which you can test these theories on a smaller subset of the table(s), so that it doesn't take days to complete?
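
Even something crude would do, e.g. sampling every N-th row of each table
into a scratch database. A sketch (invented names; indexes, views and
triggers are ignored):

    import sqlite3

    SRC = "huge.db"
    SUB = "testbed.db"
    N = 100          # keep roughly 1 row in 100

    sub = sqlite3.connect(SUB)
    sub.execute("ATTACH DATABASE ? AS big", (SRC,))
    tables = [name for (name,) in sub.execute(
        "SELECT name FROM big.sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'")]
    for t in tables:
        # deterministic 1/N sample so test runs are repeatable
        sub.execute(f"CREATE TABLE {t} AS "
                    f"SELECT * FROM big.{t} WHERE rowid % {N} = 0")
    sub.commit()
    sub.close()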

