> boun...@sqlite.org] On Behalf Of Bo Peng
> 
> I will do this multiple times, with different conditions (e.g. SELECT
> MAX(c) FROM TABLE_X WHERE b > 1.0), so maintaining the number of rows
> would not help. I intentionally avoided TRIGGERs because of the large
> amount (billions) of data inserted.
> 
> Other than using an SSD to speed up random access, I hope a VACUUM
> operation would copy tables one by one, so the contents of the tables
> would not be scattered around the whole database. If this is the case,
> disk caching should work much better after VACUUM... fingers crossed.

Doing a VACUUM on a 288 GB database is probably going to take some time. I
wonder if it would be better to organize the data before loading it, so that
the records in each of the 5000 tables would be stored contiguously. Of
course, that also depends on how much new data will be added to the tables.
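
Something along these lines, just a rough sketch in Python (the file name
"bigdata.db", the TABLE_0001-style names and the columns id, b, c are my
assumptions, taken loosely from your MAX(c) query): fill one table completely,
in primary-key order, before moving on to the next, so that its pages end up
next to each other in the file.

import sqlite3

con = sqlite3.connect("bigdata.db")            # hypothetical file name
con.execute("PRAGMA synchronous = OFF")        # acceptable for a one-off bulk load

def rows_for(table):
    # Placeholder for however the raw data is produced; it should yield
    # (id, b, c) tuples for the given table.
    return []

for table in ("TABLE_0001", "TABLE_0002"):     # ... and so on for all 5000 tables
    with con:                                  # one transaction per table
        con.executemany(
            "INSERT INTO %s (id, b, c) VALUES (?, ?, ?)" % table,
            sorted(rows_for(table)))           # insert in primary-key order
con.close()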

Having worked with large databases and read through this mail thread, some
questions and ideas come to mind:

How many new tables/records will be added per day/month?

Are records divided amongst the 5000 tables based on time, so that new
records will go into new tables rather than be inserted evenly among the
5000?

How many fields are in the tables? (I understand the 5000 tables are
identical in structure.)
What type of data is it?

Are there any indexes besides the primary key?
Unless your calculations are always or mostly on the same fields, it is
probably best not to have any indexes.
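
If the conditions really do filter on the same column most of the time (b in
your quoted MAX(c) example), one index on that column might pay for itself.
A hedged sketch, with the file, table and column names assumed:

import sqlite3

con = sqlite3.connect("bigdata.db")   # hypothetical file name
# Only worthwhile if b appears in most WHERE clauses; every extra index
# slows down the billions of inserts and makes the database file larger.
con.execute("CREATE INDEX IF NOT EXISTS idx_table_x_b ON TABLE_X(b)")
con.commit()
con.close()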

Is there any redundancy in the data, e.g. character values which could be
normalized into separate tables using an integer key reference, thus reducing
the size of the data-carrying tables? Converting field contents to integer
or bigint wherever possible may give improvements in both size and
performance.
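
A sketch of what I mean by such normalization (the "label" lookup table and
the TABLE_X columns are made-up names):

import sqlite3

con = sqlite3.connect("bigdata.db")   # hypothetical file name
con.executescript("""
    -- Each distinct character value is stored once in a small lookup table...
    CREATE TABLE IF NOT EXISTS label (
        id   INTEGER PRIMARY KEY,
        name TEXT UNIQUE NOT NULL
    );
    -- ...and the big data-carrying tables keep only the integer reference.
    CREATE TABLE IF NOT EXISTS TABLE_X (
        id       INTEGER PRIMARY KEY,
        label_id INTEGER REFERENCES label(id),
        b        REAL,
        c        REAL
    );
""")
con.close()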

Is the primary key an integer, so that it is in fact the rowid, thus saving
some space?
(See http://sqlite.org/lang_createtable.html, the section "ROWIDs and the
INTEGER PRIMARY KEY".)
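
In other words, a declaration along these lines (column names just for
illustration); note that only the exact type name INTEGER gets this
treatment, "INT PRIMARY KEY" does not:

import sqlite3

con = sqlite3.connect("bigdata.db")   # hypothetical file name
con.execute("""
    CREATE TABLE IF NOT EXISTS TABLE_X (
        id INTEGER PRIMARY KEY,   -- alias for the rowid: no separate key column is stored
        b  REAL,
        c  REAL
    )
""")
con.close()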

If there are several calculations to be done for each table, it would perhaps
be better to work with several threads concentrating on a single table at a
time. This would be particularly effective if your cache_size were set large
enough for a table to be contained entirely in memory, as the threads could
share the same connection and (as I understand it) memory cache (see
http://www.sqlite.org/sharedcache.html). That way the CPU and memory can be
put to good use, rather than working with several independent connections
which do not utilize CPU and memory efficiently but just push the disk heads
around.
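
To illustrate, a rough Python sketch (the file name and the list of queries
are my assumptions, with the WHERE clause borrowed from your MAX(c) example).
In Python each thread would open its own connection, with cache=shared giving
them a common page cache, which is the closest analogue to sharing one
connection:

import sqlite3
import threading

DB_URI = "file:bigdata.db?cache=shared"   # hypothetical file name

QUERIES = [                               # several calculations on the same table
    "SELECT COUNT(*) FROM TABLE_X WHERE b > 1.0",
    "SELECT MAX(c)   FROM TABLE_X WHERE b > 1.0",
    "SELECT AVG(c)   FROM TABLE_X WHERE b > 1.0",
]

def run(sql, out, i):
    con = sqlite3.connect(DB_URI, uri=True)
    con.execute("PRAGMA cache_size = -1000000")   # roughly 1 GB, adjust as needed
    out[i] = con.execute(sql).fetchone()[0]
    con.close()

results = [None] * len(QUERIES)
threads = [threading.Thread(target=run, args=(q, results, i))
           for i, q in enumerate(QUERIES)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)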

One might even consider a different strategy:
Do not use the aggregate functions of SQLite (count, average, max, etc.), but
access the records individually in your code. This requires that the data are
loaded according to the primary key and that the threads accessing them do so
in a turn-based fashion, using the modulo function to decide which thread gets
to handle each record, e.g.:

Key    Thread
101    1
102    2
103    3
104    4
105    1
106    2
Etc.

(Sorry if the explanation is too explicit :-).
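
A rough sketch of the idea in Python (TABLE_X, b and c are from your query;
the id primary key, the file name and the choice of four threads are my
assumptions; and since the Python GIL limits real CPU parallelism, take it
purely as an illustration of the dispatch pattern):

import sqlite3
import threading

DB_URI = "file:bigdata.db?cache=shared"   # hypothetical file name
NUM_THREADS = 4

def worker(thread_no, results):
    con = sqlite3.connect(DB_URI, uri=True)
    con.execute("PRAGMA cache_size = -1000000")   # large cache shared by the threads
    n, max_c, total_c = 0, None, 0.0
    # Every thread walks the table in primary-key order, so they all touch
    # the same (cached) pages, but each one only handles its share of rows.
    for key, b, c in con.execute("SELECT id, b, c FROM TABLE_X ORDER BY id"):
        if key % NUM_THREADS != thread_no:
            continue
        if b > 1.0:                               # the condition from the MAX(c) query
            n += 1
            total_c += c
            if max_c is None or c > max_c:
                max_c = c
    results[thread_no] = (n, max_c, total_c)
    con.close()

results = [None] * NUM_THREADS
threads = [threading.Thread(target=worker, args=(i, results))
           for i in range(NUM_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Combine the per-thread partial results into count, max and average.
count = sum(r[0] for r in results)
max_c = max((r[1] for r in results if r[1] is not None), default=None)
avg_c = sum(r[2] for r in results) / count if count else None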

Thus you can have an army of threads using CPU and memory most efficiently,
as the threads will be handling the same pages of the database, which will be
cached in memory. Again, this requires that the records are physically
stored according to the key.
There is perhaps some overhead incurred by accessing the individual records
rather than letting SQLite do the count, max, etc. However, if there are
several calculations to be done on the same data, it may prove more
efficient this way.

You could even consider having fewer tables, so that the threads can work
for longer without having to switch to the next table.


/Frank Missel

