> boun...@sqlite.org] On Behalf Of Bo Peng
>
> I will do this multiple times, with different conditions (e.g. SELECT
> MAX(c) FROM TABLE_X WHERE b > 1.0), so maintaining the number of rows
> would not help. I intentionally avoided TRIGGERs because of the large
> amount (billions) of rows inserted.
>
> Other than using an SSD to speed up random access, I hope a VACUUM
> operation would copy tables one by one, so the content of the tables
> would not be scattered around the whole database. If this is the case,
> disk caching should work much better after VACUUM... fingers crossed.
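(For reference, not part of the original message: a minimal sketch of the rebuild-then-query pattern being discussed, using Python's stdlib sqlite3 module. The file, table, and column names are invented; the poster's 288 GB file would of course take far longer to VACUUM.)

```python
import sqlite3

# Hypothetical small database standing in for the poster's 288 GB file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b REAL, c REAL)")
conn.executemany("INSERT INTO t VALUES (?, ?, ?)",
                 [(i, i * 0.5, i * 2.0) for i in range(100)])
conn.commit()

# VACUUM rewrites the whole database file, storing each table's pages
# contiguously -- which is what the poster hopes will improve caching.
# It must run outside a transaction, hence the commit() above.
conn.execute("VACUUM")

# The kind of aggregate query the poster runs repeatedly:
max_c = conn.execute("SELECT MAX(c) FROM t WHERE b > 1.0").fetchone()[0]
conn.close()
```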
Doing a VACUUM on a 288 GB database is probably going to take some time. I wonder if it would be better to organize the data before loading it, so that the records in each of the 5000 tables would be stored contiguously. Of course, that also depends on how much new data will be added to the tables.

Having worked with large databases and read through this mail thread, some questions and ideas come to mind:

- How many new tables/records will be added per day/month?
- Are records divided amongst the 5000 tables based on time, so that new records go into new tables rather than being inserted evenly among the 5000?
- How many fields are in the tables? (I understand the 5000 tables are identical in structure.)
- What type of data is it?
- Are there any indexes besides the primary key? Unless your calculations are always or mostly on the same fields, it is probably best not to have any indexes.
- Is there any redundancy in the data, e.g. character values which could be normalized into separate tables referenced by an integer key, thus reducing the size of the data-carrying tables? Converting field contents to integer or bigint wherever possible may give improvements in both size and performance.
- Is the primary key an integer, so that it is in fact the rowid, thus saving some space? (See http://sqlite.org/lang_createtable.html, the section "ROWIDs and the INTEGER PRIMARY KEY".)

If there are several calculations to be done for each table, it would perhaps be better to work with several threads concentrating on a single table at a time. This would be particularly effective if cache_size was set large enough for a table to be contained entirely in memory, as the threads could share the same connection and (as I understand it) memory cache (see http://www.sqlite.org/sharedcache.html). That way the CPU and memory can be put to good use, rather than working with several independent connections which do not utilize CPU and memory efficiently but just push the disk heads around.
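(Illustration of the rowid point above, not from the original post; table and column names are made up. When a column is declared INTEGER PRIMARY KEY, SQLite makes it an alias for the built-in rowid, so no separate key column or index is stored.)

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# "id INTEGER PRIMARY KEY" makes id an alias for the table's rowid.
conn.execute("CREATE TABLE samples (id INTEGER PRIMARY KEY, b REAL, c REAL)")
conn.execute("INSERT INTO samples (b, c) VALUES (2.5, 7.0)")

# id and rowid refer to the same stored value for such a table.
row = conn.execute("SELECT id, rowid FROM samples").fetchone()
conn.close()
```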
One might even consider a different strategy: do not use the aggregate functions of SQLite (count, average, max, etc.) but access the records individually in your code. This requires that data are loaded according to the primary key and that the threads accessing them do so in a turn-based fashion, using the modulo function to decide which thread gets to handle each record, e.g.:

Key   Thread
101   1
102   2
103   3
104   4
105   1
106   2
Etc.

(Sorry if the explanation is too explicit :-)

Thus you can have an army of threads using CPU and memory most efficiently, as the threads will be handling the same pages of the database, which will be cached in memory. Again, this requires that the records are physically stored according to the key. There is perhaps some overhead incurred by accessing the individual records rather than letting SQLite do the count, max, etc. However, if there are several calculations to be done on the same data, it may prove more efficient this way. You could even consider having fewer tables so that the threads can work for longer without having to switch to the next table.

/Frank Missel

_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
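(A sketch of the turn-based modulo dispatch described above, not from the original post. This is a single-process illustration: rows are scanned in key order and dropped into one bucket per worker; real code would give each bucket to a thread sharing one connection/cache. The pool size, key range, and round-robin formula are assumptions.)

```python
import sqlite3

N_THREADS = 4  # assumed pool size, as in the example table

def thread_for_key(key):
    # Round-robin assignment by primary key: 101 -> 1, 102 -> 2,
    # 103 -> 3, 104 -> 4, 105 -> 1, and so on.
    return (key - 1) % N_THREADS + 1

# Build a small table whose keys are contiguous, mirroring the
# requirement that records be physically stored in key order.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, c REAL)")
conn.executemany("INSERT INTO t (id, c) VALUES (?, ?)",
                 [(k, float(k)) for k in range(101, 109)])

# Scan in key order; each record goes to its worker's bucket, so
# neighbouring (same-page) rows are spread evenly across workers.
buckets = {n: [] for n in range(1, N_THREADS + 1)}
for key, c in conn.execute("SELECT id, c FROM t ORDER BY id"):
    buckets[thread_for_key(key)].append(c)
conn.close()
```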