Hi, RSmith,

>> One of the problems we have is the performance of analyzing large
>> tables one by one. For example, we are puzzled that it takes from 40
>> seconds to 10 minutes to copy tables with similar sizes from a
>> database to another.
>
> You should be puzzled... that cannot be right unless the copying involves
> multi level queries feeding the data in one instance, and not so in another.
> Is this behaviour random, or does the same copy/query always take the same
> amount of time?

We are still measuring the performance of these operations under
different conditions. It seems that performance gets worse as the
database grows larger (with more tables of roughly the same size),
and when multiple instances of the application are running. The
latter is understandable, and we are addressing it by moving some
disk I/O to another hard drive; the former seems to be caused by
the 'data spread evenly across the database' issue that I described.
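
For what it is worth, here is roughly how we plan to spread the I/O,
by attaching a scratch database that lives on the second drive (the
paths and table names below are made up for illustration):

import sqlite3

# Hypothetical paths: the main database stays on the first drive,
# while a scratch database for intermediate results lives on a
# second physical drive, so the heavy writes do not compete with
# reads on the main file.
MAIN_DB = "/data1/main.db"
SCRATCH_DB = "/data2/scratch.db"

conn = sqlite3.connect(MAIN_DB)
conn.execute("ATTACH DATABASE ? AS scratch", (SCRATCH_DB,))

# Intermediate results get written into the attached database
# ("big_table" is just a placeholder name here).
conn.execute("""
    CREATE TABLE IF NOT EXISTS scratch.intermediate AS
    SELECT * FROM big_table WHERE 0
""")
conn.commit()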

> The caches on normal drives are 32 to 64 Megs, on disk arrays you can up
> this a bit, but nothing will come close to your 512GB+ sizes, so there is
> hardly any advantage to be gained in the disk caching department. This is
> what I tried to explain last time, whether you read it once top to bottom,
> or at several intervals, there is no performance to be gained from the read
> operation, nor in the writing operation where the intended result is a
> single DB file once again. The only place you can use the multi-track system
> to gain performance is if your analysis tools take a lot of processing, and
> you try to analyze chunks of the data on different processors - or - if you
> can split the DB into smaller DBs and keep it split, always using it like
> that.

You are absolutely right. I did some testing last night and saw
only worse performance with parallel reading. (Instead of using a
read-only main db plus a read-write attached db, I used two
separate database connections opened from Python.)
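
For the record, the parallel-read test looked roughly like this
(simplified; the table names, and the use of threads rather than
separate processes, are just for illustration):

import sqlite3
import threading

DB = "/data1/main.db"  # hypothetical path

def scan(table):
    # Each worker gets its own connection and scans one table.
    conn = sqlite3.connect(DB)
    for row in conn.execute("SELECT * FROM %s" % table):
        pass  # the real analysis would go here
    conn.close()

# Two tables scanned concurrently from separate connections; on a
# single disk the two readers mostly just compete for the same
# spindle, which matches the slowdown I saw.
threads = [threading.Thread(target=scan, args=(name,))
           for name in ("table_a", "table_b")]
for t in threads:
    t.start()
for t in threads:
    t.join()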

> Please understand I'm not criticising, but I've made these mistakes aplenty
> in my life so I understand the phenomenon very well:  The multi-tracking
> (for lack of a better word) of the data into split tables and rejoined end
> result is a construct of your mind's logic thinking, (a hope of a solution
> as it were) but in reality has no benefit to the problem you are trying to
> solve. I also think that the obviously ludicrous amounts of time it takes to
> test any theory on such large databases depletes your patience and you try
> to come up with less-boring solutions which feels like it would improve
> speed. (I feel your pain, I've made the mistake in my life of trying fixes
> out of desperation rather than research).

Again, you are reading my mind. I am wasting my life here waiting
for the data to be imported and processed (days to import, days to
analyze), and I felt that I had to do something to save my life. It
was suggested that I move to Oracle or another "more advanced"
DBMS, but initial tests with MySQL were very disappointing.

> Don't you have a smaller test bed in which you can test these theories with
> a smaller subset of the table(s) so it doesnt take days to complete?

The problems we have only happen with large databases, and it is
difficult to create test cases that reproduce them. Because
parallel reading does not really help, I think I will have to
split the database into smaller ones (which will be much more
difficult to manage). Before that, I might try to create a feeder
process that prepares the next batch of data while the previous
batch is being analyzed.
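
The feeder idea is a plain producer/consumer setup; a rough sketch
(paths, table names and batch size are made up, and error handling
is omitted):

import multiprocessing as mp
import sqlite3

DB = "/data1/main.db"  # hypothetical path

def feeder(tables, queue, batch_size=100000):
    # Reads the next batch of rows while the main process is still
    # analyzing the previous one.
    conn = sqlite3.connect(DB)
    for table in tables:
        cur = conn.execute("SELECT * FROM %s" % table)
        while True:
            rows = cur.fetchmany(batch_size)
            if not rows:
                break
            queue.put((table, rows))
    queue.put(None)  # tell the consumer there is no more data
    conn.close()

def analyze(table, rows):
    pass  # placeholder for the real analysis

if __name__ == "__main__":
    # Keep at most two prepared batches ahead so the feeder's reads
    # overlap with the analysis instead of filling up memory.
    queue = mp.Queue(maxsize=2)
    tables = ["table_a", "table_b", "table_c"]  # made-up names
    mp.Process(target=feeder, args=(tables, queue)).start()
    while True:
        item = queue.get()
        if item is None:
            break
        analyze(*item)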

Thank you again for your help,
Bo
