> Doing vacuum on a 288 Gb database is probably going to take some time.

I submitted the command last night and nothing seems to have happened
after 8 hours (sqlite3 is running and there is disk activity, but I do
not see a .journal file).

> I wonder if it would be better on just having the data organized before
> loading it, so that the records in each of the 5000 tables would be
> contiguously stored. Of course, that also depends on how much new data will
> be added to the tables.

I have been thinking about this too, but I do not know how to do it
efficiently. The options I can think of are:

1. Create a temporary database and copy the tables one by one to the
main database. This will take a very long time for a database with,
for example, 5000 tables.

2. Create single-table databases and merge them into the main
database. Because there is a limit on the number of simultaneously
attached databases, I might not be able to create and attach 5000
databases, insert the records, and copy the tables to the main
database afterwards. I might be able to change SQLITE_LIMIT_ATTACHED
dynamically, but I do not know how to do this from the Python sqlite3
module (see the sketch after this list).

3. Create 5000 files, insert records into them, and import the files
into the main database. This seems to be the best option, although I
need to pay attention to the OS limit on open files.

4. Feed the records to some external tool (e.g. a LevelDB database, or
even a pipe to the sort command) to sort them before they are inserted
into sqlite.

Any suggestion is welcome.
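
To make option 2 concrete, here is a minimal sketch of what I have in
mind, assuming the default limit of 10 attached databases (the file
and table names are invented, and each staging file is assumed to hold
exactly one table named after the file). As far as I know the attach
limit can only be lowered at run time, not raised above the
compile-time maximum, so the staging databases have to be merged in
small batches:

    import os
    import sqlite3

    MAIN_DB = "main.db"        # hypothetical file names
    STAGING_DIR = "staging"    # one small database per table, written earlier
    BATCH = 9                  # stay below the default limit of 10 attached dbs

    files = sorted(os.listdir(STAGING_DIR))
    # autocommit mode so we control transactions ourselves
    con = sqlite3.connect(MAIN_DB, isolation_level=None)

    for start in range(0, len(files), BATCH):
        batch = files[start:start + BATCH]
        # attach a handful of staging databases outside the copy transaction
        for j, fname in enumerate(batch):
            con.execute("ATTACH DATABASE ? AS src%d" % j,
                        (os.path.join(STAGING_DIR, fname),))
        con.execute("BEGIN")
        for j, fname in enumerate(batch):
            table = os.path.splitext(fname)[0]   # assume file name == table name
            con.execute("CREATE TABLE main.%s AS SELECT * FROM src%d.%s"
                        % (table, j, table))
        con.execute("COMMIT")
        for j in range(len(batch)):
            con.execute("DETACH DATABASE src%d" % j)

    con.close()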

> How many new tables/records will be added per day/month?

Tables are added in batches and then kept unchanged. For example, a
database might get 1000 new tables one day and another 2000 later. All
operations are on single tables.

> Are records divided amongst the 5000 tables  based on time so that new
> records will go into new tables rather than be inserted evenly among the
> 5000?

No. The records have to be inserted into their associated tables.

> How many fields in the tables (I understand the 5000 tables are identical in
> structure)
> What type of data is it?

Each table is for one 'sample'. All tables have one column for 'item
id', and optional columns (zero to perhaps six) for item attributes,
which can be INT or FLOAT.
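
So a typical table looks roughly like this (the column names are
invented for illustration; the real tables have zero to about six
attribute columns):

    import sqlite3

    con = sqlite3.connect("main.db")   # hypothetical file name
    con.execute("""
        CREATE TABLE sample_0001 (
            item_id INTEGER,   -- the 'item id' column, present in every table
            qt      FLOAT,     -- optional attribute columns, INT or FLOAT
            weight  FLOAT,
            flag    INTEGER
        )
    """)
    con.commit()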

> Are there any indexes besides the primary key?

There is no index and no key because I was concerned about insertion
performance and database size. My understanding is that an index or
key would not help simple aggregation operations because all records
will be iterated sequentially anyway.

> Unless your calculations are always or mostly on the same fields it is
> probably best not to have any indexes.

I agree.

> Are there any redundancy in the data, e.g. character values which could be
> normalized to separate tables using an integer key reference thus reducing
> the size of the data carrying tables. Converting field contents to integer
> or bigint wherever it is possible may give improvements in both size and
> performance.

Unfortunately no, all columns are int or float. Some columns have a
large proportion of NULLs, but I do not know which ones until all the
data are inserted, so I cannot separate them into auxiliary tables.

> Is the primary key an integer so that it in fact is the rowid thus saving
> some space?
> (see http://sqlite.org/lang_createtable.html, the section: "ROWIDs and the
> INTEGER PRIMARY KEY".)

This is certainly worth trying, as long as the insertion performance
is acceptable. It would be good if I could tell sqlite that there are
no duplicate or missing values so it does not have to check.
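
A minimal sketch of what I might try, reusing the invented names from
above and following the page referenced in the quote: declaring the
item id as INTEGER PRIMARY KEY should make it an alias for the rowid,
so no separate index is stored for it:

    import sqlite3

    con = sqlite3.connect("main.db")   # hypothetical file name
    con.execute("""
        CREATE TABLE sample_0002 (
            item_id INTEGER PRIMARY KEY,   -- alias for the rowid
            qt      FLOAT,
            weight  FLOAT
        )
    """)
    # the uniqueness/NOT NULL check on the rowid still happens on insert;
    # I do not know of a way to turn it off when the input is known clean
    con.executemany("INSERT INTO sample_0002 VALUES (?, ?, ?)",
                    [(1, 0.5, 2.0), (2, 0.7, 1.5)])
    con.commit()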

> If there are several calculations to be done for each table it would perhaps
> be better to work with several threads concentrating on a single table at a
> time. This would be particularly effective if your cache_size was set large
> enough for a table to be contained entirely in memory as the threads could
> share the same connection and (as I understand it) memory cache (see
> http://www.sqlite.org/sharedcache.html).

I am running queries such as SELECT COUNT(*), MAX(QT) FROM TABLE_XX;
combining several aggregates in one query as long as the WHERE clause
is the same. I am also running queries in multiple threads, which
works well on an SSD but not on a mechanical drive, where random-access
speed becomes the bottleneck. I am not sure we are talking about the
same cache, but because the tables are spread sparsely across the
database, caching more pages did not really help.
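
For what it is worth, the per-table queries look roughly like this (a
sketch only; the table names, thread count, and WHERE clause are made
up, and each worker thread uses its own connection):

    import sqlite3
    from concurrent.futures import ThreadPoolExecutor

    DB = "main.db"                                    # hypothetical file name
    TABLES = ["TABLE_%04d" % i for i in range(5000)]  # made-up table names

    def aggregate(table):
        # one connection per thread; sqlite3 connections are not shared
        # across threads here
        con = sqlite3.connect(DB)
        try:
            row = con.execute(
                "SELECT COUNT(*), MAX(QT) FROM %s WHERE QT > 0" % table
            ).fetchone()
            return table, row
        finally:
            con.close()

    with ThreadPoolExecutor(max_workers=4) as pool:
        for table, (count, max_qt) in pool.map(aggregate, TABLES):
            print(table, count, max_qt)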

> One might even consider a different strategy:
> Do not use the statistical function of SQLite (count, average, max, etc.)
> but access the records individually in your code. This requires that data
> are loaded according to the primary key and that the threads accessing them
> do so in a turn based fashion, using the modulo function to decide which
> thread gets to handle the record, e.g.:

If all tables had the same set of item IDs, this could be a really
good idea. Unfortunately this is not the case: each table has a
different set of IDs, although 70% to 90% of the keys overlap. I do
not even know all the available IDs in advance.

Cheers,
Bo
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
