I'm trying to store simple incrementing integer counters, but I'm having trouble finding an efficient way to do it.
My database looks like this:

CREATE TABLE counters (
    k1,
    k2,
    count,      -- how many we've seen
    expires,
    PRIMARY KEY (k1, k2)
);
CREATE INDEX counters_expires_idx ON counters(expires);

It is about 1.9 GB and contains ~22 million of these rows. A given transaction updates or creates between 10k and 100k of them.

At first I was just doing something like this pseudocode:

    update_counter(k1, k2, count=count+1, expires=now+count*1day)
    if rows_updated != 1:
        insert_counter(k1, k2, count=1, expires=now+1day)

but I was having serious performance problems that seemed to be confined to those lines. So I converted it to INSERT OR REPLACE, which had no noticeable impact on performance. Convinced the problem was in my code, I decided to offload as much as possible to SQLite. Now my code looks like this:

======= cut here =========
PRAGMA synchronous=OFF;
PRAGMA temp_store=MEMORY;

CREATE TEMPORARY TABLE trans_counters(k1, k2);

-- (add all of the items to that temporary table)

CREATE TEMPORARY VIEW trans_counters_v AS
    SELECT k1 AS k1,
           k2 AS k2,
           COUNT(*) AS count
    FROM trans_counters
    GROUP BY k1, k2;

INSERT OR REPLACE INTO counters
    SELECT c.k1 AS k1,
           c.k2 AS k2,
           COALESCE((SELECT count FROM counters
                     WHERE k1 = c.k1 AND k2 = c.k2), 0) + c.count AS count,
           (COALESCE((SELECT count FROM counters
                      WHERE k1 = c.k1 AND k2 = c.k2), 0) + c.count) * 24*60*60 + ? AS expires
    FROM trans_counters_v AS c;
======= cut here =========

Now the code that inserts all of the rows into the in-memory temporary table executes nearly instantly, but the big INSERT takes 15+ minutes, and the journal (in either rollback or WAL mode) balloons to over 300 MB. The temporary table itself is only about 1.8 MB of data (102,603 rows, 94,064 unique), so where is all of the journal traffic coming from?

The process uses nearly no CPU during this time, the disk becomes very active (but with low throughput, reading and writing maybe 200 KB/s judging by the rate of growth of the journal), and sampling the process with OS X's Activity Monitor while it's busy gives:

100% 2869 _pysqlite_query_execute (in _sqlite3.so) + 1886 [0x101945e5e]
  100% 2869 pysqlite_step (in _sqlite3.so) + 47 [0x10194893f]
    100% 2869 sqlite3_step (in libsqlite3.dylib) + 1883 [0x7fff8d95ca5b]
      100% 2869 sqlite3VdbeExec (in libsqlite3.dylib) + 3327 [0x7fff8d95e3af]
        100% 2869 sqlite3BtreeMovetoUnpacked (in libsqlite3.dylib) + 761 [0x7fff8d97ab89]
          100% 2869 moveToChild (in libsqlite3.dylib) + 146 [0x7fff8d96c872]
            100% 2869 sqlite3PagerAcquire (in libsqlite3.dylib) + 194 [0x7fff8d93dc22]
              100% 2869 sqlite3PcacheFetch (in libsqlite3.dylib) + 475 [0x7fff8d93e02b]
                100% 2869 pagerStress (in libsqlite3.dylib) + 670 [0x7fff8d9c407e]
                  100% 2869 pager_write_pagelist (in libsqlite3.dylib) + 149 [0x7fff8d999a35]
                    100% 2869 unixWrite (in libsqlite3.dylib) + 83 [0x7fff8d98bd73]
                      100% 2869 pwrite (in libsystem_kernel.dylib) + 10 [0x7fff8130bab6]

That is, 2869 of 2869 samples (100% of the time) were spent inside sqlite3_step writing data to disk. Further samples look basically the same, with an occasional read path taking up to ~10% of the time. Running VACUUM and ANALYZE doesn't appear to have any effect.

I'm running SQLite 3.7.7 on Mac OS X 10.7.5 via the Python sqlite3 module.

So I feel like something about what I'm doing is fundamentally at odds with SQLite's performance model. All I want is a count of the number of times I've seen each (k1, k2) pair. Is there a better way to do this without storing every occurrence individually and grouping them later? (That would be prohibitively large.)
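For concreteness, the UPDATE-then-INSERT pseudocode above is driven from Python roughly like the sketch below. This is a simplified sketch, not my actual code: the function name bump_counter, the connection handling, and the assumption that "count" in the expiry formula means the new count (to match the batch SQL above) are all filled in for illustration.

import time
import sqlite3

ONE_DAY = 24 * 60 * 60  # seconds

def bump_counter(conn, k1, k2, now=None):
    """Increment one (k1, k2) counter, creating the row if it doesn't exist."""
    if now is None:
        now = int(time.time())
    # Try the UPDATE first; column references in SET see the old row, so
    # "count + 1" is the new count, matching (new count) * 1 day + now.
    cur = conn.execute(
        "UPDATE counters SET count = count + 1, expires = (count + 1) * ? + ? "
        "WHERE k1 = ? AND k2 = ?",
        (ONE_DAY, now, k1, k2))
    if cur.rowcount != 1:
        # No existing row was touched, so create a fresh one.
        conn.execute(
            "INSERT INTO counters (k1, k2, count, expires) VALUES (?, ?, 1, ?)",
            (k1, k2, now + ONE_DAY))

# Usage, e.g.:
#   conn = sqlite3.connect("counters.db")
#   for k1, k2 in pairs:
#       bump_counter(conn, k1, k2)
#   conn.commit()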
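And the batch version is driven roughly as follows. Again a sketch under simplifying assumptions: flush_batch is a hypothetical name, the real code accumulates the (k1, k2) pairs elsewhere and manages transaction boundaries differently, and the SQL is just the "cut here" block above restated inline.

import time
import sqlite3

# The merge statement from the block above; the single ? is bound to "now".
MERGE_SQL = """
INSERT OR REPLACE INTO counters
    SELECT c.k1, c.k2,
           COALESCE((SELECT count FROM counters
                     WHERE k1 = c.k1 AND k2 = c.k2), 0) + c.count,
           (COALESCE((SELECT count FROM counters
                      WHERE k1 = c.k1 AND k2 = c.k2), 0) + c.count) * 24*60*60 + ?
    FROM trans_counters_v AS c
"""

def flush_batch(conn, pairs, now=None):
    """Merge a batch of (k1, k2) pairs into the counters table in one statement."""
    if now is None:
        now = int(time.time())
    conn.execute("PRAGMA synchronous=OFF")
    conn.execute("PRAGMA temp_store=MEMORY")
    conn.execute("CREATE TEMPORARY TABLE IF NOT EXISTS trans_counters(k1, k2)")
    conn.execute("DELETE FROM trans_counters")
    # Filling the temporary table finishes nearly instantly.
    conn.executemany("INSERT INTO trans_counters (k1, k2) VALUES (?, ?)", pairs)
    conn.execute("""
        CREATE TEMPORARY VIEW IF NOT EXISTS trans_counters_v AS
            SELECT k1, k2, COUNT(*) AS count
            FROM trans_counters
            GROUP BY k1, k2
    """)
    # This is the statement that takes 15+ minutes and grows the journal.
    conn.execute(MERGE_SQL, (now,))
    conn.commit()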