Re: [sqlite] Efficient way to store counters
> Actually my understanding would suggest that INSERT OR REPLACE should
> execute slower than UPDATE + INSERT (or INSERT + UPDATE).

After some experimentation, I think they turn out to be a wash in my case. INSERT OR REPLACE is implemented in SQLite as a DELETE followed by an INSERT, which, given my access patterns, puts the really common (k1, k2) pairs together at the end of the database (or at least with the highest row ids, which tend to group together), and that increases the chances that those common pairs share pages.

> And about journal size: imagine that you've got "lucky" and
> all those 94k rows are each in its own page in the counters table.
> SQLite will have to save each of those pages in the journal which will
> give journal size of about 94k * 4096 ~ 400M.

Maybe it makes sense, then, to drop my page size all the way down to the disk's sector size. That reduces the chances that a given set of pairs shares pages, but clearly those chances are already very low. It would at least reduce the number of bytes that need to be written, even if the number of seeks stays essentially the same.

But since it's the seeks that are probably dominating the I/O, maybe it makes more sense to increase the transaction batch size, so that we're overwriting most of the DB on every commit rather than overwriting (say) 10% of it ten times as often. Does that make sense?
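Purely as a sketch of those two knobs, not something from the thread: assuming Python's sqlite3 module, with 512 standing in for the sector size, an arbitrary batch size, and a hypothetical flush() helper that performs the actual counter UPDATE/INSERTs:

    import sqlite3

    con = sqlite3.connect("counters.db")

    # The page size only takes effect on a brand-new database, or after a
    # VACUUM rebuilds the file (and it cannot be changed once in WAL mode).
    con.execute("PRAGMA page_size = 512")
    con.execute("VACUUM")

    BATCH_SIZE = 1_000_000   # placeholder: bigger batches = fewer, larger commits

    def process(con, work_items, flush):
        # flush(con, batch) is a hypothetical helper that applies one batch of
        # counter updates inside a single transaction.
        batch = []
        for item in work_items:
            batch.append(item)
            if len(batch) >= BATCH_SIZE:
                flush(con, batch)
                batch = []
        if batch:
            flush(con, batch)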
Re: [sqlite] Efficient way to store counters
> > > BTW, in case you don't do that yet your best performance will be if
> > > you prepare your UPDATE and INSERT statements only once and then do
> > > bind + step + reset in that 100k times loop.
> >
> > In principle I agree, but since the temporary-table version is blindingly
> > fast up to the update-the-disk portion it's definitely not a bottleneck at
> > this point
>
> I was talking about your initial implementation, when you did 100k times:
>
> > update_counter(k1, k2, count=count+1, expires=now+count*1day)
> > if rows_updated != 1: insert_counter(k1, k2, count=1, expires=now+1day)
>
> Not about your final version with the one INSERT OR REPLACE. Was your
> statement about the same thing? If so, I didn't understand what you
> meant.

I just meant that the naïve way of making the prepared statements with Python's sqlite3 module (which may or may not cache them; I assume it doesn't) was already so fast that I'm not worried about shaving off the few milliseconds spent re-preparing the statements every time, when the actual problem occurs at a lower level than that. So yes, preparing the statements once and re-binding them every time would speed things up, but by so little that I'd rather solve the real problem of reducing the time taken by the disk writes first.
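As an aside not drawn from the thread: Python's sqlite3 module does keep a per-connection cache of compiled statements (its size is configurable at connect time), so identical SQL text is normally not re-prepared from scratch. A minimal illustration, with the cache size chosen arbitrarily:

    import sqlite3

    # Repeated execute() calls with the same SQL text reuse the already
    # compiled statement from this per-connection cache.
    con = sqlite3.connect("counters.db", cached_statements=256)
    for k1, k2 in [("a", "b"), ("a", "b")]:
        con.execute("SELECT count FROM counters WHERE k1 = ? AND k2 = ?", (k1, k2))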
Re: [sqlite] Efficient way to store counters
On Wed, Mar 13, 2013 at 11:48 AM, David King wrote:
>> BTW, in case you don't do that yet your best performance will be if
>> you prepare your UPDATE and INSERT statements only once and then do
>> bind + step + reset in that 100k times loop.
>
> In principle I agree, but since the temporary-table version is blindingly
> fast up to the update-the-disk portion it's definitely not a bottleneck at
> this point

I was talking about your initial implementation, when you did 100k times:

> update_counter(k1, k2, count=count+1, expires=now+count*1day)
> if rows_updated != 1: insert_counter(k1, k2, count=1, expires=now+1day)

Not about your final version with the one INSERT OR REPLACE. Was your statement about the same thing? If so, I didn't understand what you meant.

Pavel
Re: [sqlite] Efficient way to store counters
On Wed, 13 Mar 2013 14:19:05 -0400 Igor Tandetnik wrote:
> > I'm not sure about SQLite, but in principle the query optimizer can
> > often use the base table's index for a derived value. Consider
> >
> >     A join B on A.a = 1 + B.a
> > or
> >     A join B on A.a < sqrt(B.a)
> >
> > An index on B.a is useful to finding the values meeting the
> > criterion.
>
> You seem to expect the optimizer to solve equations - to effectively
> rewrite the conditions as "B.a = A.a - 1" and "B.a >= 0 and B.a >
> (case when A.a < 0 then 0 else A.a * A.a end)". I'm pretty sure no
> major DBMS does that. Definitely SQLite doesn't.

Thanks for clarifying that. As for major DBMSs, I dealt with Microsoft's for years, from Sybase days. Joins based on date functions and integer arithmetic observably use indexes. Floating point I'm not sure of; I can't recall a table indexed on a floating point column.

--jkl
Re: [sqlite] Efficient way to store counters
> > The logic is, "keep a given (k1, k2) pair around for one day for each
> > time it's been seen". I could calculate it when it's needed, but
> > since it's a computed value, I couldn't then have an index on it.
>
> I'm not sure about SQLite, but in principle the query optimizer can
> often use the base table's index for a derived value. Consider
>
>     A join B on A.a = 1 + B.a
> or
>     A join B on A.a < sqrt(B.a)
>
> An index on B.a is useful to finding the values meeting the criterion.
>
> But perhaps you've measured this. How much faster is the updating
> process you're concerned with than the SELECT that would avoid it?

I've measured the performance gain of leaving off that column (and therefore the index) entirely. It buys me less than my rounding error in performance for the updates. I only left it in the problem description for completeness.
Re: [sqlite] Efficient way to store counters
> BTW, in case you don't do that yet your best performance will be if
> you prepare your UPDATE and INSERT statements only once and then do
> bind + step + reset in that 100k times loop.

In principle I agree, but since the temporary-table version is blindingly fast up to the update-the-disk portion, it's definitely not a bottleneck at this point.
Re: [sqlite] Efficient way to store counters
On Wednesday, 13 March, 2013 at 06:15, Michael Black wrote:
> You're simply missing the where clause on your update so you're updating the
> entire database every time you do an insert.
>
>   update_counter(k1,k2 count=count+1,expires=now+count*1day) where field1=k1 and field2=k2;
>
> And a (k1,k2) index would help that update a lot.

That's pseudocode. In real life it looks more like:

    UPDATE counters SET count=count+1, expires=?+count*(24*60*60) WHERE k1=? AND k2=?

As in the schema I listed, (k1, k2) is the primary key, so there's already an index on it.
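For what it's worth, one way to double-check that this UPDATE really does use the primary-key index is EXPLAIN QUERY PLAN; a sketch assuming Python's sqlite3 module, with dummy bound values:

    import sqlite3

    con = sqlite3.connect("counters.db")
    plan = con.execute(
        "EXPLAIN QUERY PLAN "
        "UPDATE counters SET count = count + 1, "
        "expires = ? + count * (24*60*60) "
        "WHERE k1 = ? AND k2 = ?",
        (0, "a", "b")).fetchall()
    # Expect a SEARCH on counters via the automatic (k1, k2) primary-key index
    # rather than a full table scan.
    for row in plan:
        print(row)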
Re: [sqlite] Efficient way to store counters
On 3/13/2013 8:49 AM, James K. Lowden wrote:
> I'm not sure about SQLite, but in principle the query optimizer can
> often use the base table's index for a derived value. Consider
>
>     A join B on A.a = 1 + B.a
> or
>     A join B on A.a < sqrt(B.a)
>
> An index on B.a is useful to finding the values meeting the criterion.

You seem to expect the optimizer to solve equations - to effectively rewrite the conditions as "B.a = A.a - 1" and "B.a >= 0 and B.a > (case when A.a < 0 then 0 else A.a * A.a end)". I'm pretty sure no major DBMS does that. Definitely SQLite doesn't.
--
Igor Tandetnik
Re: [sqlite] Efficient way to store counters
On Tue, Mar 12, 2013 at 11:03 PM, David King wrote:
>> > At first I was just doing something like this pseudocode:
>> > update_counter(k1, k2, count=count+1, expires=now+count*1day)
>> > if rows_updated != 1: insert_counter(k1, k2, count=1, expires=now+1day)
>>
>> Assuming these 2 statements constitute each of the 10k-100k steps you
>> mentioned above and all of these steps are wrapped up in BEGIN-COMMIT
>> block this is probably the most efficient way of doing this. The only
>> improvement could be if you are doing creates more often than updates.
>> Then you can switch and do INSERT first and then UPDATE if necessary.
>> It could gain you a little time.
>
> Yeah. I even tried keeping track of how many hits/misses I had and
> re-ordering the attempted INSERT/UPDATE as appropriate. A batch of 100k of
> these is done in a single transaction.
>
>> > but was having serious performance problems that seemed to be confined to
>> > those lines. So I converted it to INSERT OR REPLACE which had no
>> > noticeable impact on performance.
>>
>> Actually my understanding would suggest that INSERT OR REPLACE should
>> execute slower than UPDATE + INSERT (or INSERT + UPDATE).
>
> […]
>
>> > Convinced the problem was in my code, I decided to offload as much as
>> > possible to sqlite. Now my code looks like:
>>
>> This should be much-much slower than UPDATE + INSERT.
>
> That's unfortunate, because the overall performance was about the same ±10%
> between all three approaches :(
>
>> First of all in the statement above you don't gain benefit from
>> uniqueness and replace about 10k rows twice.
>
> Are you sure? The SELECT in the INSERT OR REPLACE selects "FROM
> trans_counters_v AS c", the grouped temporary view. So it should only see any
> given key pair once before it starts doing any inserting at all.
>
>> Second with such low
>> repeatability you don't gain much from doing it with such complicated
>> INSERT. And about journal size: imagine that you've got "lucky" and
>> all those 94k rows are each in its own page in the counters table.
>> SQLite will have to save each of those pages in the journal which will
>> give journal size of about 94k * 4096 ~ 400M.
>
> I hadn't thought about it that way; that's true. And it's probably wildly
> seeking all over the disk to do it. The reads are probably fine because the
> machine has plenty of RAM to devote to page cache; it's the random writes
> that are killing it.
>
>> I don't think there's anything better than what you did initially.
>
> As for the fundamental approach, I figured as much. The rearrangement into
> the giant INSERT OR REPLACE was just to prove to myself that the problem
> wasn't elsewhere in my code.
>
> For optimising it on the sqlite front, I've played with page sizes,
> journaling modes, and changing the transaction batch size without much luck.
> I don't have strong consistency requirements for e.g. power failures or OS
> crashes, but I do need an application crash to not take it out, so I can't
> just go without the journal altogether (which does help the problem, but
> isn't huge).

BTW, in case you don't do that yet, your best performance will be if you prepare your UPDATE and INSERT statements only once and then do bind + step + reset in that 100k times loop.

Pavel
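In Python terms, a rough analogue of the prepare-once / bind-step-reset pattern is to reuse one statement text for the whole batch via executemany. A sketch only, under the schema from the original post; it covers the UPDATE half, and the fallback INSERT would still need per-row handling because executemany does not report per-row rowcounts:

    DAY = 24 * 60 * 60

    def bump_all(con, pairs, now):
        # pairs is an iterable of (k1, k2); the UPDATE is compiled once and
        # re-bound for every pair, all inside one transaction.
        with con:
            con.executemany(
                "UPDATE counters SET count = count + 1, "
                "expires = ? + count * ? WHERE k1 = ? AND k2 = ?",
                ((now, DAY, k1, k2) for k1, k2 in pairs))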
Re: [sqlite] Efficient way to store counters
You're simply missing the where clause on your update, so you're updating the entire database every time you do an insert:

    update_counter(k1,k2 count=count+1,expires=now+count*1day) where field1=k1 and field2=k2;

And a (k1,k2) index would help that update a lot.
Re: [sqlite] Efficient way to store counters
On Tue, 12 Mar 2013 21:20:11 -0700 David King wrote:
> > > At first I was just doing something like this pseudocode:
> > > update_counter(k1, k2, count=count+1, expires=now+count*1day)
> > > if rows_updated != 1: insert_counter(k1, k2, count=1, expires=now+1day)
> >
> > Might I suggest that instead of trying to store an ever-changing
> > value, you simply figure it out when it's needed ?
>
> The logic is, "keep a given (k1, k2) pair around for one day for each
> time it's been seen". I could calculate it when it's needed, but
> since it's a computed value, I couldn't then have an index on it.

I'm not sure about SQLite, but in principle the query optimizer can often use the base table's index for a derived value. Consider

    A join B on A.a = 1 + B.a
or
    A join B on A.a < sqrt(B.a)

An index on B.a is useful to finding the values meeting the criterion.

But perhaps you've measured this. How much faster is the updating process you're concerned with than the SELECT that would avoid it?

--jkl
Re: [sqlite] Efficient way to store counters
>> First of all in the statement above you don't gain benefit from
>> uniqueness and replace about 10k rows twice.
>
> Are you sure? The SELECT in the INSERT OR REPLACE selects "FROM
> trans_counters_v AS c", the grouped temporary view. So it should only see any
> given key pair once before it starts doing any inserting at all.

Sorry, you are right. I missed the GROUP BY part...

Pavel
Re: [sqlite] Efficient way to store counters
> > At first I was just doing something like this pseudocode:
> > update_counter(k1, k2, count=count+1, expires=now+count*1day)
> > if rows_updated != 1: insert_counter(k1, k2, count=1, expires=now+1day)
>
> Assuming these 2 statements constitute each of the 10k-100k steps you
> mentioned above and all of these steps are wrapped up in BEGIN-COMMIT
> block this is probably the most efficient way of doing this. The only
> improvement could be if you are doing creates more often than updates.
> Then you can switch and do INSERT first and then UPDATE if necessary.
> It could gain you a little time.

Yeah. I even tried keeping track of how many hits/misses I had and re-ordering the attempted INSERT/UPDATE as appropriate. A batch of 100k of these is done in a single transaction.

> > but was having serious performance problems that seemed to be confined to
> > those lines. So I converted it to INSERT OR REPLACE which had no noticeable
> > impact on performance.
>
> Actually my understanding would suggest that INSERT OR REPLACE should
> execute slower than UPDATE + INSERT (or INSERT + UPDATE).

[…]

> > Convinced the problem was in my code, I decided to offload as much as
> > possible to sqlite. Now my code looks like:
>
> This should be much-much slower than UPDATE + INSERT.

That's unfortunate, because the overall performance was about the same ±10% between all three approaches :(

> First of all in the statement above you don't gain benefit from
> uniqueness and replace about 10k rows twice.

Are you sure? The SELECT in the INSERT OR REPLACE selects "FROM trans_counters_v AS c", the grouped temporary view. So it should only see any given key pair once before it starts doing any inserting at all.

> Second with such low
> repeatability you don't gain much from doing it with such complicated
> INSERT. And about journal size: imagine that you've got "lucky" and
> all those 94k rows are each in its own page in the counters table.
> SQLite will have to save each of those pages in the journal which will
> give journal size of about 94k * 4096 ~ 400M.

I hadn't thought about it that way; that's true. And it's probably wildly seeking all over the disk to do it. The reads are probably fine because the machine has plenty of RAM to devote to page cache; it's the random writes that are killing it.

> I don't think there's anything better than what you did initially.

As for the fundamental approach, I figured as much. The rearrangement into the giant INSERT OR REPLACE was just to prove to myself that the problem wasn't elsewhere in my code.

For optimising it on the sqlite front, I've played with page sizes, journaling modes, and changing the transaction batch size without much luck. I don't have strong consistency requirements for e.g. power failures or OS crashes, but I do need an application crash to not take it out, so I can't just go without the journal altogether (which does help the problem, but isn't huge).
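A sketch of pragma settings consistent with that stated requirement (survive an application crash, accept the risk on OS crash or power loss), assuming Python's sqlite3 module:

    import sqlite3

    con = sqlite3.connect("counters.db")
    # Keeping a journal (rollback or WAL) lets SQLite roll back a transaction
    # interrupted by an application crash; synchronous=OFF then only gives up
    # protection against OS crashes and power failures.
    con.execute("PRAGMA journal_mode = WAL")
    con.execute("PRAGMA synchronous = OFF")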
Re: [sqlite] Efficient way to store counters
On Tue, Mar 12, 2013 at 8:29 PM, David King wrote:
> I'm trying to find an efficient way to store simple incrementing integers but
> I'm having trouble finding an efficient way to do it
>
> My database looks like:
>
> CREATE TABLE counters (
>   k1, k2,
>   count, -- how many we've seen
>   expires,
>   PRIMARY KEY (k1, k2)
> );
> CREATE INDEX counters_expires_idx ON counters(expires);
>
> It is about 1.9gb and contains ~22 million of these rows. A given transaction
> updates or creates between 10k and 100k of them.
>
> At first I was just doing something like this pseudocode:
>
> update_counter(k1, k2, count=count+1, expires=now+count*1day)
> if rows_updated != 1: insert_counter(k1, k2, count=1, expires=now+1day)

Assuming these 2 statements constitute each of the 10k-100k steps you mentioned above, and all of these steps are wrapped up in a BEGIN-COMMIT block, this is probably the most efficient way of doing this. The only improvement could be if you are doing creates more often than updates. Then you can switch and do INSERT first and then UPDATE if necessary. It could gain you a little time.

> but was having serious performance problems that seemed to be confined to
> those lines. So I converted it to INSERT OR REPLACE which had no noticeable
> impact on performance.

Actually my understanding would suggest that INSERT OR REPLACE should execute slower than UPDATE + INSERT (or INSERT + UPDATE).

> Convinced the problem was in my code, I decided to offload as much as
> possible to sqlite. Now my code looks like:
>
> === cut here =
>
> PRAGMA synchronous=OFF;
> PRAGMA temp_store=MEMORY;
>
> CREATE TEMPORARY TABLE trans_counters(k1, k2);
>
> -- (add all of the items to that temporary table)
>
> CREATE TEMPORARY VIEW trans_counters_v AS
>   SELECT k1 AS k1,
>          k2 AS k2,
>          COUNT(*) AS count
>     FROM trans_counters
> GROUP BY (k1, k2);
>
> INSERT OR REPLACE INTO counters
>   SELECT c.k1 AS k1,
>          c.k2 AS k2,
>          COALESCE((SELECT count FROM counters WHERE k1 = c.k1 AND k2 = c.k2),
>                   0)+c.count AS count,
>          (COALESCE((SELECT count FROM counters WHERE k1 = c.k1 AND k2 = c.k2),
>                    0)+c.count)*24*60*60+? AS expires
>     FROM trans_counters_v AS c
>
> === cut here =

This should be much-much slower than UPDATE + INSERT.

> Now the code that inserts all of the rows into the memory table executes
> nearly instantly, but the big INSERT takes 15+ minutes. Meanwhile the journal
> (in either rollback or wal mode) balloons to over 300mb in size. The
> temporary table itself is only about 1.8mb of data (102,603 rows, 94,064
> unique) so where is all of the journal coming from?

First of all, in the statement above you don't gain benefit from uniqueness and replace about 10k rows twice. Second, with such low repeatability you don't gain much from doing it with such a complicated INSERT. And about journal size: imagine that you've got "lucky" and all those 94k rows are each in its own page in the counters table. SQLite will have to save each of those pages in the journal, which will give a journal size of about 94k * 4096 ~ 400M.

> The process takes nearly 0 CPU during this time, the disk becomes very active
> (but low throughput, reading and writing maybe 200k/s judging by the rate of
> growth of the journal), and sampling the process with OS X's Activity Monitor
> while it's busy outputs:
>
> 100% 2869 _pysqlite_query_execute (in _sqlite3.so) + 1886 [0x101945e5e]
> 100% 2869 pysqlite_step (in _sqlite3.so) + 47 [0x10194893f]
> 100% 2869 sqlite3_step (in libsqlite3.dylib) + 1883 [0x7fff8d95ca5b]
> 100% 2869 sqlite3VdbeExec (in libsqlite3.dylib) + 3327 [0x7fff8d95e3af]
> 100% 2869 sqlite3BtreeMovetoUnpacked (in libsqlite3.dylib) + 761 [0x7fff8d97ab89]
> 100% 2869 moveToChild (in libsqlite3.dylib) + 146 [0x7fff8d96c872]
> 100% 2869 sqlite3PagerAcquire (in libsqlite3.dylib) + 194 [0x7fff8d93dc22]
> 100% 2869 sqlite3PcacheFetch (in libsqlite3.dylib) + 475 [0x7fff8d93e02b]
> 100% 2869 pagerStress (in libsqlite3.dylib) + 670 [0x7fff8d9c407e]
> 100% 2869 pager_write_pagelist (in libsqlite3.dylib) + 149 [0x7fff8d999a35]
> 100% 2869 unixWrite (in libsqlite3.dylib) + 83 [0x7fff8d98bd73]
> 100% 2869 pwrite (in libsystem_kernel.dylib) + 10 [0x7fff8130bab6]
>
> That is, 2869 of 2869 samples, 100% of the time, was spent in sqlite3_step
> writing the data to disk. Further samples look basically the same with an
> occasional read-path taking up to ~10% of the time.
>
> VACUUM ANALYZE doesn't look to have any effect. I'm running sqlite 3.7.7 on
> Mac OS X 10.7.5 via the Python sqlite3 module
>
> So I feel like something about what I'm doing is fundamentally flawed given
> something about sqlite's performance model. All I want is a count of the
> number of times that I've seen each pair (k1, k2), is there a better way to
> do this without storing them all individually and grouping them later? (This
> would be prohibitively large.)

I don't think there's anything better than what you did initially.

Pavel
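For concreteness, a minimal sketch of the insert-first variant suggested above, assuming Python's sqlite3 module and the counters schema from the original post (the IntegrityError comes from the (k1, k2) primary key):

    import sqlite3

    DAY = 24 * 60 * 60

    def bump(con, k1, k2, now):
        # Try to create the row first; if the (k1, k2) pair already exists,
        # fall back to incrementing it.
        try:
            con.execute(
                "INSERT INTO counters (k1, k2, count, expires) "
                "VALUES (?, ?, 1, ?)",
                (k1, k2, now + DAY))
        except sqlite3.IntegrityError:
            con.execute(
                "UPDATE counters SET count = count + 1, "
                "expires = ? + count * ? WHERE k1 = ? AND k2 = ?",
                (now, DAY, k1, k2))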
Re: [sqlite] Efficient way to store counters
> > At first I was just doing something like this pseudocode:
> >
> > update_counter(k1, k2, count=count+1, expires=now+count*1day)
> > if rows_updated != 1: insert_counter(k1, k2, count=1, expires=now+1day)
>
> Might I suggest that instead of trying to store an ever-changing value, you
> simply figure it out when it's needed ? I don't quite understand the logic
> you're applying to calculate your 'expires' value but I think it could be
> calculated with a SELECT whenever you needed it rather than being stored.

The logic is, "keep a given (k1, k2) pair around for one day for each time it's been seen". I could calculate it when it's needed, but since it's a computed value, I couldn't then have an index on it. That said, I've removed this in several tests and it doesn't appear to have any bearing on the performance issue.

> > Now the code that inserts all of the rows into the memory table executes
> > nearly instantly, but the big INSERT takes 15+ minutes. Meanwhile the
> > journal (in either rollback or wal mode) balloons to over 300mb in size.
>
> You don't list any indexes which would help with your WHERE clauses, so I
> suspect SQLite is having to look through all the records in 'counters' in
> order to find the rows it needs for each COALESCE. The large size of the
> journal is because you are replacing every row in the database.

Maybe I wasn't clear: the ---cut here--- bit is in addition to the existing schema (after all, the INSERT OR REPLACE updates it, so surely it must already exist). In the INSERT OR REPLACE operation there is no WHERE clause. (k1, k2) is the primary key on the 'counters' table, so the INSERT OR REPLACE takes each value out of the temporary trans_counters_v view of the in-memory trans_counters temporary table and either replaces or inserts a value for each corresponding entry in 'counters'.

AFAICT, no amount of indexing here would help. There's no reason to index the temporary table: the sort for the GROUP BY is n*log(n), and updating an index for each individual entry would still sum to n*log(n). The temporary table is single-use and will have to be scanned every time regardless. (And if I isolate that GROUP BY operation, it's blindingly fast.)

Here's the EXPLAIN QUERY PLAN for the INSERT OR REPLACE:

    (0, 0, 0, u'SCAN TABLE trans_counters (132971 rows)')
    (0, 0, 0, u'USE TEMP B-TREE FOR GROUP BY')
    (0, 0, 0, u'EXECUTE CORRELATED SCALAR SUBQUERY 1')
    (1, 0, 0, u'SEARCH TABLE counters USING INDEX sqlite_autoindex_counters_1 (k1=? AND k2=?) (~1 rows)')
    (0, 0, 0, u'EXECUTE CORRELATED SCALAR SUBQUERY 2')
    (2, 0, 0, u'SEARCH TABLE counters USING INDEX sqlite_autoindex_counters_1 (k1=? AND k2=?) (~1 rows)')

So no, it's not having to replace every entry in the 'counters' table. Also, if it were replacing every row in the database, then the journal would grow to equal the (VACUUM'd) size of the database, but it doesn't get that big. It gets to 300mb+, but not to the full size of 1.9gb.

> > So I feel like something about what I'm doing is fundamentally flawed given
> > something about sqlite's performance model. All I want is a count of the
> > number of times that I've seen each pair (k1, k2), is there a better way to
> > do this without storing them all individually and grouping them later?
>
> If you have a table with two columns k1, k2 in, and you want to count the
> times each pair occurs, you can do it in software far faster by having this
> index and using this SELECT:
>
>   CREATE INDEX myTable_keypair ON myTable (k1,k2)
>   SELECT k1,k2 from myTable ORDER BY k1,k2
>
> you might even use one of the following if you know it will always return a
> unique value for unique keys:
>
>   SELECT k1||k2 from myTable ORDER BY k1,k2
>   SELECT k1||':'||k2 from myTable ORDER BY k1,k2
>
> Just count the unique values in your programming language as they go past.
> Yes, you can use horrendous complication to make SQLite present a neatly
> formatted return with the counts included, but defining that in SQL makes
> SQLite do more work than your programming language would need to do.

The problem isn't grouping them to count them. If I evaluate the TEMPORARY VIEW that does the GROUP BY clause above, it completes in just under a second for all 100k items in the table it references. That part is very fast. The slow bit is incorporating those counts into the versions on disk.
Re: [sqlite] Efficient way to store counters
On 13 Mar 2013, at 3:29am, David King wrote:
> At first I was just doing something like this pseudocode:
>
> update_counter(k1, k2, count=count+1, expires=now+count*1day)
> if rows_updated != 1: insert_counter(k1, k2, count=1, expires=now+1day)

Might I suggest that instead of trying to store an ever-changing value, you simply figure it out when it's needed ? I don't quite understand the logic you're applying to calculate your 'expires' value but I think it could be calculated with a SELECT whenever you needed it rather than being stored.

> Now the code that inserts all of the rows into the memory table executes
> nearly instantly, but the big INSERT takes 15+ minutes. Meanwhile the journal
> (in either rollback or wal mode) balloons to over 300mb in size.

You don't list any indexes which would help with your WHERE clauses, so I suspect SQLite is having to look through all the records in 'counters' in order to find the rows it needs for each COALESCE. The large size of the journal is because you are replacing every row in the database.

> So I feel like something about what I'm doing is fundamentally flawed given
> something about sqlite's performance model. All I want is a count of the
> number of times that I've seen each pair (k1, k2), is there a better way to
> do this without storing them all individually and grouping them later?

If you have a table with two columns k1, k2 in, and you want to count the times each pair occurs, you can do it in software far faster by having this index and using this SELECT:

    CREATE INDEX myTable_keypair ON myTable (k1,k2)
    SELECT k1,k2 from myTable ORDER BY k1,k2

you might even use one of the following if you know it will always return a unique value for unique keys:

    SELECT k1||k2 from myTable ORDER BY k1,k2
    SELECT k1||':'||k2 from myTable ORDER BY k1,k2

Just count the unique values in your programming language as they go past. Yes, you can use horrendous complication to make SQLite present a neatly formatted return with the counts included, but defining that in SQL makes SQLite do more work than your programming language would need to do.

Simon.
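A sketch of that count-in-the-application idea, assuming Python and the myTable name used above (the file name is a placeholder; the ORDER BY is kept only to mirror the SELECT above, a Counter does not need sorted input):

    import sqlite3
    from collections import Counter

    con = sqlite3.connect("mydata.db")   # hypothetical file name
    # Each row comes back as a (k1, k2) tuple; Counter tallies the pairs.
    counts = Counter(con.execute("SELECT k1, k2 FROM myTable ORDER BY k1, k2"))
    for (k1, k2), n in counts.most_common():
        print(k1, k2, n)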
[sqlite] Efficient way to store counters
I'm trying to find an efficient way to store simple incrementing integers, but I'm having trouble finding one.

My database looks like:

    CREATE TABLE counters (
      k1, k2,
      count,   -- how many we've seen
      expires,
      PRIMARY KEY (k1, k2)
    );
    CREATE INDEX counters_expires_idx ON counters(expires);

It is about 1.9gb and contains ~22 million of these rows. A given transaction updates or creates between 10k and 100k of them.

At first I was just doing something like this pseudocode:

    update_counter(k1, k2, count=count+1, expires=now+count*1day)
    if rows_updated != 1: insert_counter(k1, k2, count=1, expires=now+1day)

but was having serious performance problems that seemed to be confined to those lines. So I converted it to INSERT OR REPLACE, which had no noticeable impact on performance.

Convinced the problem was in my code, I decided to offload as much as possible to sqlite. Now my code looks like:

=== cut here =

PRAGMA synchronous=OFF;
PRAGMA temp_store=MEMORY;

CREATE TEMPORARY TABLE trans_counters(k1, k2);

-- (add all of the items to that temporary table)

CREATE TEMPORARY VIEW trans_counters_v AS
  SELECT k1 AS k1,
         k2 AS k2,
         COUNT(*) AS count
    FROM trans_counters
GROUP BY (k1, k2);

INSERT OR REPLACE INTO counters
  SELECT c.k1 AS k1,
         c.k2 AS k2,
         COALESCE((SELECT count FROM counters WHERE k1 = c.k1 AND k2 = c.k2),
                  0)+c.count AS count,
         (COALESCE((SELECT count FROM counters WHERE k1 = c.k1 AND k2 = c.k2),
                   0)+c.count)*24*60*60+? AS expires
    FROM trans_counters_v AS c

=== cut here =

Now the code that inserts all of the rows into the memory table executes nearly instantly, but the big INSERT takes 15+ minutes. Meanwhile the journal (in either rollback or wal mode) balloons to over 300mb in size. The temporary table itself is only about 1.8mb of data (102,603 rows, 94,064 unique), so where is all of the journal coming from?

The process takes nearly 0 CPU during this time, the disk becomes very active (but with low throughput, reading and writing maybe 200k/s judging by the rate of growth of the journal), and sampling the process with OS X's Activity Monitor while it's busy outputs:

    100% 2869 _pysqlite_query_execute (in _sqlite3.so) + 1886 [0x101945e5e]
    100% 2869 pysqlite_step (in _sqlite3.so) + 47 [0x10194893f]
    100% 2869 sqlite3_step (in libsqlite3.dylib) + 1883 [0x7fff8d95ca5b]
    100% 2869 sqlite3VdbeExec (in libsqlite3.dylib) + 3327 [0x7fff8d95e3af]
    100% 2869 sqlite3BtreeMovetoUnpacked (in libsqlite3.dylib) + 761 [0x7fff8d97ab89]
    100% 2869 moveToChild (in libsqlite3.dylib) + 146 [0x7fff8d96c872]
    100% 2869 sqlite3PagerAcquire (in libsqlite3.dylib) + 194 [0x7fff8d93dc22]
    100% 2869 sqlite3PcacheFetch (in libsqlite3.dylib) + 475 [0x7fff8d93e02b]
    100% 2869 pagerStress (in libsqlite3.dylib) + 670 [0x7fff8d9c407e]
    100% 2869 pager_write_pagelist (in libsqlite3.dylib) + 149 [0x7fff8d999a35]
    100% 2869 unixWrite (in libsqlite3.dylib) + 83 [0x7fff8d98bd73]
    100% 2869 pwrite (in libsystem_kernel.dylib) + 10 [0x7fff8130bab6]

That is, 2869 of 2869 samples, 100% of the time, was spent in sqlite3_step writing the data to disk. Further samples look basically the same, with an occasional read path taking up to ~10% of the time.

VACUUM ANALYZE doesn't look to have any effect. I'm running sqlite 3.7.7 on Mac OS X 10.7.5 via the Python sqlite3 module.

So I feel like something about what I'm doing is fundamentally flawed, given something about sqlite's performance model. All I want is a count of the number of times that I've seen each pair (k1, k2). Is there a better way to do this without storing them all individually and grouping them later? (This would be prohibitively large.)