Re: [sqlite] Efficient way to store counters

2013-03-20 Thread David King
> Actually my understanding would suggest that INSERT OR REPLACE should
> execute slower than UPDATE + INSERT (or INSERT + UPDATE).


After some experimentation, I think they turn out to be a wash in my case. 
INSERT OR REPLACE is implemented in sqlite as a DELETE followed by an INSERT, 
which, given my access patterns, puts the really common (k1, k2) pairs together 
at the end of the database (or at least gives them the highest row ids, which 
tend to be grouped together), increasing the chance that those common pairs 
share pages.
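
A quick sketch of that behaviour (a throwaway in-memory database with the same 
counters schema): the REPLACE deletes the conflicting row and inserts a 
brand-new one, which lands at the next, highest rowid:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE counters (k1, k2, count, expires, PRIMARY KEY (k1, k2))")

conn.execute("INSERT INTO counters VALUES ('a', 'b', 1, 0)")
conn.execute("INSERT INTO counters VALUES ('c', 'd', 1, 0)")

# Bumping ('a', 'b') via INSERT OR REPLACE deletes the old row and inserts a
# fresh one, so the hot pair ends up with the highest rowid in the table.
conn.execute("INSERT OR REPLACE INTO counters VALUES ('a', 'b', 2, 0)")
print(conn.execute("SELECT rowid, k1, k2 FROM counters ORDER BY rowid").fetchall())
# [(2, 'c', 'd'), (3, 'a', 'b')]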

> And about journal size: imagine that you've got "lucky" and
> all those 94k rows are each in its own page in the counters table.
> SQLite will have to save each of those pages in the journal which will
> give journal size of about 94k * 4096 ~ 400M.


Maybe it makes sense, then, to drop my page size all the way down to the disc's 
sector size. That reduces the chance that a given set of pairs shares a page, 
but clearly that chance is already very low. It would at least reduce the number 
of bytes needing to be written, even if the number of seeks stays essentially 
the same.
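
For reference, a minimal sketch of that change (a hypothetical 512-byte sector 
size; the new page size only takes effect once a VACUUM rebuilds the file, and 
a WAL-mode database can't change its page size after creation at all):

import sqlite3

# isolation_level=None keeps the connection in autocommit so the VACUUM isn't
# wrapped in a transaction.
conn = sqlite3.connect("counters.db", isolation_level=None)  # hypothetical path

print(conn.execute("PRAGMA page_size").fetchone())  # current size, e.g. (4096,)
conn.execute("PRAGMA page_size = 512")
conn.execute("VACUUM")  # rewrites the whole file using the new page size
print(conn.execute("PRAGMA page_size").fetchone())  # (512,)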

But since it's seeks that are probably dominating the I/O, maybe it makes sense 
to increase the transaction batch size so that we're overwriting most of the DB 
on every commit, rather than overwriting (say) 10% of it ten times as often. 
Does that make sense?
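
Something like this rough sketch is what I have in mind (hypothetical helper; 
only the UPDATE half of the counter logic is shown, the point being one big 
transaction per batch):

import sqlite3, time

def apply_batch(conn, pairs):
    # One transaction around the whole batch: each dirty page is journalled and
    # written back once per large batch instead of once per small one.
    now = int(time.time())
    with conn:
        conn.executemany(
            "UPDATE counters SET count = count + 1, expires = ? + count * 86400 "
            "WHERE k1 = ? AND k2 = ?",
            [(now, k1, k2) for (k1, k2) in pairs],
        )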




Re: [sqlite] Efficient way to store counters

2013-03-13 Thread David King
> > > BTW, in case you don't do that yet your best performance will be if
> > > you prepare your UPDATE and INSERT statements only once and then do
> > > bind + step + reset in that 100k times loop.
> > >  
> >  
> > In principle I agree, but since the temporary-table version is blindingly 
> > fast up to the update-the-disk portion it's definitely not a bottleneck at 
> > this point
> >  
>  
> I was talking about your initial implementation when you did 100k times
> > update_counter(k1, k2, count=count+1, expires=now+count*1day)
> > if rows_updated != 1: insert_counter(k1, k2, count=1, expires=now+1day)
> >  
>  
> Not about your final version with one INSERT OR REPLACE. Was your
> statement about the same thing? If yes I didn't understand what you
> meant.
>  
>  


I just meant that the naïve way of issuing the prepared statements with python's 
sqlite3 module (which may or may not cache them; I assume it doesn't) was 
already so fast that I'm not worried about shaving a few milliseconds off 
re-preparing the statements every time, when the actual problem occurs at a 
lower level than that.

So yeah, preparing the statement once and re-binding it every time would speed 
things up, but so little that I'd rather solve the real problem of reducing the 
time taken by the disk-writes first.
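
For what it's worth, the sqlite3 module does keep a per-connection statement 
cache (connect() takes a cached_statements argument, defaulting to 100), so 
repeated execute() calls with an identical SQL string reuse the already-prepared 
statement. A minimal sketch:

import sqlite3

conn = sqlite3.connect(":memory:", cached_statements=200)
conn.execute("CREATE TABLE t (x)")

sql = "INSERT INTO t VALUES (?)"   # keep the string identical on every call
for i in range(100000):
    conn.execute(sql, (i,))        # prepared once, then just re-bound each time
conn.commit()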



Re: [sqlite] Efficient way to store counters

2013-03-13 Thread Pavel Ivanov
On Wed, Mar 13, 2013 at 11:48 AM, David King  wrote:
>> BTW, in case you don't do that yet your best performance will be if
>> you prepare your UPDATE and INSERT statements only once and then do
>> bind + step + reset in that 100k times loop.
>
>
> In principle I agree, but since the temporary-table version is blindingly 
> fast up to the update-the-disk portion it's definitely not a bottleneck at 
> this point

I was talking about your initial implementation when you did 100k times

> update_counter(k1, k2, count=count+1, expires=now+count*1day)
> if rows_updated != 1: insert_counter(k1, k2, count=1, expires=now+1day)

Not about your final version with one INSERT OR REPLACE. Was your
statement about the same thing? If yes I didn't understand what you
meant.


Pavel


Re: [sqlite] Efficient way to store counters

2013-03-13 Thread James K. Lowden
On Wed, 13 Mar 2013 14:19:05 -0400
Igor Tandetnik  wrote:

> > I'm not sure about SQLite, but in principle the query optimizer can
> > often use the base table's index for a derived value.  Consider
> >
> > A join B on A.a = 1 + B.a
> > or
> > A join B on A.a < sqrt(B.a)
> >
> > An index on B.a is useful for finding the values meeting the
> > criterion.
> 
> You seem to expect the optimizer to solve equations - to effectively 
> rewrite the conditions as "B.a = A.a - 1" and "B.a >= 0 and B.a >
> (case when A.a < 0 then 0 else A.a * A.a end)". I'm pretty sure no
> major DBMS does that. Definitely SQLite doesn't.

Thanks for clarifying that.  

As for major DBMSs, I dealt with Microsoft's for years, from Sybase
days.  Joins based on date functions and integer arithmetic observably
use indexes.  Floating point I'm not sure of; I can't recall a table
indexed on a floating point column.  

--jkl


Re: [sqlite] Efficient way to store counters

2013-03-13 Thread David King
> > The logic is, "keep a given (k1, k2) pair around for one day for each
> > time it's been seen". I could calculate it when it's needed, but
> > since it's a computed value, I couldn't then have an index on it.
> 
> I'm not sure about SQLite, but in principle the query optimizer can
> often use the base table's index for a derived value. Consider
> 
> A join B on A.a = 1 + B.a
> or
> A join B on A.a < sqrt(B.a)
> 
> An index on B.a is useful for finding the values meeting the criterion. 
> 
> But perhaps you've measured this. How much faster is the updating
> process you're concerned with than the SELECT that would avoid it?


I've measured the performance gain of leaving off that column (and therefore 
index) entirely. It buys me less than my rounding error in performance for the 
updates. I only left it in the problem description for completeness.


Re: [sqlite] Efficient way to store counters

2013-03-13 Thread David King
> BTW, in case you don't do that yet your best performance will be if
> you prepare your UPDATE and INSERT statements only once and then do
> bind + step + reset in that 100k times loop.


In principle I agree, but since the temporary-table version is blindingly fast 
up to the update-the-disk portion it's definitely not a bottleneck at this 
point.




Re: [sqlite] Efficient way to store counters

2013-03-13 Thread David King
On Wednesday, 13 March, 2013 at 06:15, Michael Black wrote:
> You're simply missing the where clause on your update so you're updating the
> entire database every time you do an insert.
> update_counter(k1, k2, count=count+1, expires=now+count*1day) where field1=k1
> and field2=k2;
> 
> And a (k1,k2) index would help that update a lot.

That's pseudocode. In real life it looks more like:

UPDATE counters
SET count=count+1, expires=?+count*(24*60*60)
WHERE k1=? AND k2=?

As in the schema I listed, (k1,k2) is the primary key so there's already an 
index on it.
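
For what it's worth, that implicit index is easy to confirm (a minimal sketch, 
with a hypothetical path to the database file):

import sqlite3

conn = sqlite3.connect("counters.db")  # hypothetical path
# The composite PRIMARY KEY gets an automatic index (sqlite_autoindex_counters_1),
# which is what the UPDATE's WHERE clause searches on.
print(conn.execute("PRAGMA index_list('counters')").fetchall())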




Re: [sqlite] Efficient way to store counters

2013-03-13 Thread Igor Tandetnik

On 3/13/2013 8:49 AM, James K. Lowden wrote:

> I'm not sure about SQLite, but in principle the query optimizer can
> often use the base table's index for a derived value.  Consider
> 
> A join B on A.a = 1 + B.a
> or
> A join B on A.a < sqrt(B.a)
> 
> An index on B.a is useful for finding the values meeting the criterion.


You seem to expect the optimizer to solve equations - to effectively 
rewrite the conditions as "B.a = A.a - 1" and "B.a >= 0 and B.a > (case 
when A.a < 0 then 0 else A.a * A.a end)". I'm pretty sure no major DBMS 
does that. Definitely SQLite doesn't.
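
For illustration, a small sketch of doing that rewrite by hand so the index 
does get used (hypothetical tables A and B, unrelated to the counters schema):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (a INTEGER);
    CREATE TABLE B (a INTEGER);
    CREATE INDEX B_a ON B (a);
""")

# As written, the indexed column sits inside an expression, so the index on
# B.a can't be probed for the join.
print(conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM A JOIN B ON A.a = 1 + B.a").fetchall())

# Rewritten by hand into "B.a = <expression on A>", the join can look up B.a
# through the index.
print(conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM A JOIN B ON B.a = A.a - 1").fetchall())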

--
Igor Tandetnik



Re: [sqlite] Efficient way to store counters

2013-03-13 Thread Pavel Ivanov
On Tue, Mar 12, 2013 at 11:03 PM, David King  wrote:
>> > At first I was just doing something like this pseudocode:
>> > update_counter(k1, k2, count=count+1, expires=now+count*1day)
>> > if rows_updated != 1: insert_counter(k1, k2, count=1, expires=now+1day)
>>
>> Assuming these 2 statements constitute each of the 10k-100k steps you
>> mentioned above and all of these steps are wrapped up in BEGIN-COMMIT
>> block this is probably the most efficient way of doing this. The only
>> improvement could be if you are doing creates more often than updates.
>> Then you can switch and do INSERT first and then UPDATE if necessary.
>> It could gain you a little time.
>
>
> Yeah. I even tried keeping track of how many hits/misses I had and 
> re-ordering the attempted INSERT/UPDATE as appropriate. A batch of 100k of 
> these is done in a single transaction
>
>> > but was having serious performance problems that seem to be confined to 
>> > those lines. So I converted it to INSERT OR REPLACE which had no 
>> > noticeable impact on performance.
>> Actually my understanding would suggest that INSERT OR REPLACE should
>> execute slower than UPDATE + INSERT (or INSERT + UPDATE).
>
> […]
>> > Convinced the problem was in my code, I decided to offload as much as 
>> > possible to sqlite. Now my code looks like:
>>
>> This should be much-much slower than UPDATE + INSERT.
>
>
> That's unfortunate because the overall performance was about the same ±10% 
> between all three approaches :(
>
>> First of all in the statement above you don't gain benefit from
>> uniqueness and replace about 10k rows twice.
>
>
> Are you sure? The SELECT in the INSERT OR REPLACE selects "FROM 
> trans_counters_v AS c", the grouped temporary view. So it should only see any 
> given key pair once before it starts doing any inserting at all
>
>> Second with such low
>> repeatability you don't gain much from doing it with such complicated
>> INSERT. And about journal size: imagine that you've got "lucky" and
>> all those 94k rows are each in its own page in the counters table.
>> SQLite will have to save each of those pages in the journal which will
>> give journal size of about 94k * 4096 ~ 400M.
>
>
> I hadn't thought about it that way, that's true. And it's probably wildly 
> seeking all over the disk to do it. The reads are probably fine because the 
> machine has plenty of RAM to devote to page cache, it's the random writes 
> that are killing it.
>
>> I don't think there's anything better than what you did initially.
>
> As for the fundamental approach, I figured as much. The rearrangement into 
> the giant INSERT OR REPLACE was just to prove to myself that the problem 
> wasn't elsewhere in my code
>
> For optimising it on the sqlite front, I've played with page sizes, 
> journaling modes, and changing the transaction batch size without much luck. 
> I don't have strong consistency requirements for e.g. power failures or OS 
> crashes but I do need an application crash to not take it out so I can't just 
> go without the journal altogether (which does help the problem, but isn't 
> huge).

BTW, in case you don't do that yet your best performance will be if
you prepare your UPDATE and INSERT statements only once and then do
bind + step + reset in that 100k times loop.


Pavel


Re: [sqlite] Efficient way to store counters

2013-03-13 Thread Michael Black
You're simply missing the where clause on your update so you're updating the
entire database every time you do an insert.
update_counter(k1, k2, count=count+1, expires=now+count*1day) where field1=k1
and field2=k2;

And a (k1,k2) index would help that update a lot.





Re: [sqlite] Efficient way to store counters

2013-03-13 Thread James K. Lowden
On Tue, 12 Mar 2013 21:20:11 -0700
David King  wrote:

> > > At first I was just doing something like this pseudocode:
> > > update_counter(k1, k2, count=count+1, expires=now+count*1day)
> > > if rows_updated != 1: insert_counter(k1, k2, count=1, expires=now
> > > +1day)
> > 
> > Might I suggest that instead of trying to store an ever-changing
> > value, you simply figure it out when it's needed ? 
> 
> 
> The logic is, "keep a given (k1, k2) pair around for one day for each
> time it's been seen". I could calculate it when it's needed, but
> since it's a computed value, I couldn't then have an index on it.

I'm not sure about SQLite, but in principle the query optimizer can
often use the base table's index for a derived value.  Consider

A join B on A.a = 1 + B.a
or
A join B on A.a < sqrt(B.a)

An index on B.a is useful for finding the values meeting the criterion.  

But perhaps you've measured this.  How much faster is the updating
process you're concerned with than the SELECT that would avoid it?  

--jkl


Re: [sqlite] Efficient way to store counters

2013-03-12 Thread Pavel Ivanov
>> First of all in the statement above you don't gain benefit from
>> uniqueness and replace about 10k rows twice.
>
> Are you sure? The SELECT in the INSERT OR REPLACE selects "FROM 
> trans_counters_v AS c", the grouped temporary view. So it should only see any 
> given key pair once before it starts doing any inserting at all

Sorry, you are right. I missed the GROUP BY part...


Pavel


On Tue, Mar 12, 2013 at 11:03 PM, David King  wrote:
>> > At first I was just doing something like this pseudocode:
>> > update_counter(k1, k2, count=count+1, expires=now+count*1day)
>> > if rows_updated != 1: insert_counter(k1, k2, count=1, expires=now+1day)
>>
>> Assuming these 2 statements constitute each of the 10k-100k steps you
>> mentioned above and all of these steps are wrapped up in BEGIN-COMMIT
>> block this is probably the most efficient way of doing this. The only
>> improvement could be if you are doing creates more often than updates.
>> Then you can switch and do INSERT first and then UPDATE if necessary.
>> It could gain you a little time.
>
>
> Yeah. I even tried keeping track of how many hits/misses I had and 
> re-ordering the attempted INSERT/UPDATE as appropriate. A batch of 100k of 
> these is done in a single transaction
>
>> > but was having serious performance problems that seem to be confined to 
>> > those lines. So I converted it to INSERT OR REPLACE which had no 
>> > noticeable impact on performance.
>> Actually my understanding would suggest that INSERT OR REPLACE should
>> execute slower than UPDATE + INSERT (or INSERT + UPDATE).
>
> […]
>> > Convinced the problem was in my code, I decided to offload as much as 
>> > possible to sqlite. Now my code looks like:
>>
>> This should be much-much slower than UPDATE + INSERT.
>
>
> That's unfortunate because the overall performance was about the same ±10% 
> between all three approaches :(
>
>> First of all in the statement above you don't gain benefit from
>> uniqueness and replace about 10k rows twice.
>
>
> Are you sure? The SELECT in the INSERT OR REPLACE selects "FROM 
> trans_counters_v AS c", the grouped temporary view. So it should only see any 
> given key pair once before it starts doing any inserting at all
>
>> Second with such low
>> repeatability you don't gain much from doing it with such complicated
>> INSERT. And about journal size: imagine that you've got "lucky" and
>> all those 94k rows are each in its own page in the counters table.
>> SQLite will have to save each of those pages in the journal which will
>> give journal size of about 94k * 4096 ~ 400M.
>
>
> I hadn't thought about it that way, that's true. And it's probably wildly 
> seeking all over the disk to do it. The reads are probably fine because the 
> machine has plenty of RAM to devote to page cache, it's the random writes 
> that are killing it.
>
>> I don't think there's anything better than what you did initially.
>
> As for the fundamental approach, I figured as much. The rearrangement into 
> the giant INSERT OR REPLACE was just to prove to myself that the problem 
> wasn't elsewhere in my code
>
> For optimising it on the sqlite front, I've played with page sizes, 
> journaling modes, and changing the transaction batch size without much luck. 
> I don't have strong consistency requirements for e.g. power failures or OS 
> crashes but I do need an application crash to not take it out so I can't just 
> go without the journal altogether (which does help the problem, but isn't 
> huge).
>
>


Re: [sqlite] Efficient way to store counters

2013-03-12 Thread David King
> > At first I was just doing something like this pseudocode:
> > update_counter(k1, k2, count=count+1, expires=now+count*1day)
> > if rows_updated != 1: insert_counter(k1, k2, count=1, expires=now+1day)
>  
> Assuming these 2 statements constitute each of the 10k-100k steps you
> mentioned above and all of these steps are wrapped up in BEGIN-COMMIT
> block this is probably the most efficient way of doing this. The only
> improvement could be if you are doing creates more often than updates.
> Then you can switch and do INSERT first and then UPDATE if necessary.
> It could gain you a little time.


Yeah. I even tried keeping track of how many hits/misses I had and re-ordering 
the attempted INSERT/UPDATE as appropriate. A batch of 100k of these is done in 
a single transaction

> > but was having serious performance problems that seem to be confined to 
> > those lines. So I converted it to INSERT OR REPLACE which had no noticeable 
> > impact on performance.
> Actually my understanding would suggest that INSERT OR REPLACE should
> execute slower than UPDATE + INSERT (or INSERT + UPDATE).

[…]  
> > Convinced the problem was in my code, I decided to offload as much as 
> > possible to sqlite. Now my code looks like:
>  
> This should be much-much slower than UPDATE + INSERT.


That's unfortunate because the overall performance was about the same ±10% 
between all three approaches :(
  
> First of all in the statement above you don't gain benefit from
> uniqueness and replace about 10k rows twice.


Are you sure? The SELECT in the INSERT OR REPLACE selects "FROM trans_counters_v 
AS c", the grouped temporary view. So it should only see any given key pair 
once before it starts doing any inserting at all

> Second with such low
> repeatability you don't gain much from doing it with such complicated
> INSERT. And about journal size: imagine that you've got "lucky" and
> all those 94k rows are each in its own page in the counters table.
> SQLite will have to save each of those pages in the journal which will
> give journal size of about 94k * 4096 ~ 400M.


I hadn't thought about it that way, that's true. And it's probably wildly 
seeking all over the disk to do it. The reads are probably fine because the 
machine has plenty of RAM to devote to page cache, it's the random writes that 
are killing it.

> I don't think there's anything better than what you did initially.

As for the fundamental approach, I figured as much. The rearrangement into the 
giant INSERT OR REPLACE was just to prove to myself that the problem wasn't 
elsewhere in my code

For optimising it on the sqlite front, I've played with page sizes, journaling 
modes, and changing the transaction batch size without much luck. I don't have 
strong consistency requirements for e.g. power failures or OS crashes but I do 
need an application crash to not take it out so I can't just go without the 
journal altogether (which does help the problem, but isn't huge).




Re: [sqlite] Efficient way to store counters

2013-03-12 Thread Pavel Ivanov
On Tue, Mar 12, 2013 at 8:29 PM, David King  wrote:
> I'm trying to find an efficient way to store simple incrementing integers but 
> I'm having trouble finding an efficient way to do it
>
> My database looks like:
>
> CREATE TABLE counters (
>   k1, k2,
>   count, -- how many we've seen
>   expires,
>   PRIMARY KEY (k1, k2)
> );
> CREATE INDEX counters_expires_idx ON counters(expires);
>
> It is about 1.9gb and contains ~22 million of these rows. A given transaction 
> updates or creates between 10k and 100k of them.
>
> At first I was just doing something like this pseudocode:
>
> update_counter(k1, k2, count=count+1, expires=now+count*1day)
> if rows_updated != 1: insert_counter(k1, k2, count=1, expires=now+1day)

Assuming these 2 statements constitute each of the 10k-100k steps you
mentioned above and all of these steps are wrapped up in BEGIN-COMMIT
block this is probably the most efficient way of doing this. The only
improvement could be if you are doing creates more often than updates.
Then you can switch and do INSERT first and then UPDATE if necessary.
It could gain you a little time.
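
A rough sketch of that ordering, in terms of the pseudocode above (try the 
INSERT, catch the primary key conflict, fall back to the UPDATE):

import sqlite3, time

def bump_counter(conn, k1, k2):
    now = int(time.time())
    try:
        # Try the create first; if the (k1, k2) row already exists the primary
        # key constraint fires and we fall back to the update.
        conn.execute(
            "INSERT INTO counters (k1, k2, count, expires) VALUES (?, ?, 1, ?)",
            (k1, k2, now + 86400))
    except sqlite3.IntegrityError:
        conn.execute(
            "UPDATE counters SET count = count + 1, expires = ? + count * 86400 "
            "WHERE k1 = ? AND k2 = ?",
            (now, k1, k2))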

> but was having serious performance problems that seem to be confined to 
> those lines. So I converted it to INSERT OR REPLACE which had no noticeable 
> impact on performance.

Actually my understanding would suggest that INSERT OR REPLACE should
execute slower than UPDATE + INSERT (or INSERT + UPDATE).

> Convinced the problem was in my code, I decided to offload as much as 
> possible to sqlite. Now my code looks like:
>
> === cut here =
>
> PRAGMA synchronous=OFF;
> PRAGMA temp_store=MEMORY;
>
>
>
> CREATE TEMPORARY TABLE trans_counters(k1, k2);
>
> -- (add all of the items to that temporary table)
>
> CREATE TEMPORARY VIEW trans_counters_v AS
> SELECT k1 AS k1,
> k2 AS k2,
> COUNT(*) AS count
> FROM trans_counters
> GROUP BY (k1, k2);
>
>
>
> INSERT OR REPLACE INTO counters
> SELECT c.k1 AS k1,
> c.k2 AS k2,
> COALESCE((SELECT count FROM counters WHERE k1 = c.k1 AND k2 = c.k2),
> 0)+c.count AS count,
> (COALESCE((SELECT count FROM counters WHERE k1 = c.k1 AND k2 = c.k2),
> 0)+c.count)*24*60*60+? AS expires
> FROM trans_counters_v AS c

This should be much-much slower than UPDATE + INSERT.

> === cut here =
>
> Now the code that inserts all of the rows into the memory table executes 
> nearly instantly, but the big INSERT takes 15+ minutes. Meanwhile the journal 
> (in either rollback or wal mode) balloons to over 300mb in size. The 
> temporary table itself is only about 1.8mb of data (102,603 rows, 94,064 
> unique) so where is all of the journal coming from?

First of all in the statement above you don't gain benefit from
uniqueness and replace about 10k rows twice. Second with such low
repeatability you don't gain much from doing it with such complicated
INSERT. And about journal size: imagine that you've got "lucky" and
all those 94k rows are each in its own page in the counters table.
SQLite will have to save each of those pages in the journal which will
give journal size of about 94k * 4096 ~ 400M.

> The process takes nearly 0 CPU during this time, the disk becomes very active 
> (but low throughput, reading and writing maybe 200k/s judging by the rate of 
> growth of the journal), and sampling the process with OS X's Activity Monitor 
> while it's busy outputs:
>
> 100% 2869 _pysqlite_query_execute (in _sqlite3.so) + 1886 [0x101945e5e]
> 100% 2869 pysqlite_step (in _sqlite3.so) + 47 [0x10194893f]
> 100% 2869 sqlite3_step (in libsqlite3.dylib) + 1883 [0x7fff8d95ca5b]
> 100% 2869 sqlite3VdbeExec (in libsqlite3.dylib) + 3327 [0x7fff8d95e3af]
> 100% 2869 sqlite3BtreeMovetoUnpacked (in libsqlite3.dylib) + 761 
> [0x7fff8d97ab89]
> 100% 2869 moveToChild (in libsqlite3.dylib) + 146 [0x7fff8d96c872]
> 100% 2869 sqlite3PagerAcquire (in libsqlite3.dylib) + 194 [0x7fff8d93dc22]
> 100% 2869 sqlite3PcacheFetch (in libsqlite3.dylib) + 475 [0x7fff8d93e02b]
> 100% 2869 pagerStress (in libsqlite3.dylib) + 670 [0x7fff8d9c407e]
> 100% 2869 pager_write_pagelist (in libsqlite3.dylib) + 149 [0x7fff8d999a35]
> 100% 2869 unixWrite (in libsqlite3.dylib) + 83 [0x7fff8d98bd73]
> 100% 2869 pwrite (in libsystem_kernel.dylib) + 10 [0x7fff8130bab6]
>
>
>
> That is, 2869 of 2869 samples, 100% of the time, was spent in sqlite3_step 
> writing the data to disk. Further samples look basically the same with an 
> occasional read-path taking up to ~10% of the time.
>
> VACUUM ANALYZE doesn't look to have any effect. I'm running sqlite 3.7.7 on 
> Mac OS X 10.7.5 via the Python sqlite3 module
>
> So I feel like something about what I'm doing is fundamentally flawed given 
> something about sqlite's performance model. All I want is a count of the 
> number of times that I've seen each pair (k1, k2), is there a better way to 
> do this without storing them all individually and grouping them later? (This 
> would be prohibitively large.)

I don't think there's anything better than what you did initially.


Pavel

Re: [sqlite] Efficient way to store counters

2013-03-12 Thread David King
> > At first I was just doing something like this pseudocode:
> > update_counter(k1, k2, count=count+1, expires=now+count*1day)
> > if rows_updated != 1: insert_counter(k1, k2, count=1, expires=now+1day)
> 
> Might I suggest that instead of trying to store an ever-changing value, you 
> simply figure it out when it's needed ? I don't quite understand the logic 
> you're applying to calculate your 'expires' value but I think it could be 
> calculated with a SELECT whenever you needed it rather than being stored.


The logic is, "keep a given (k1, k2) pair around for one day for each time it's 
been seen". I could calculate it when it's needed, but since it's a computed 
value, I couldn't then have an index on it. That said, I've removed this in 
several tests and it doesn't appear to have any bearing on the performance issue.

> > Now the code that inserts all of the rows into the memory table executes 
> > nearly instantly, but the big INSERT takes 15+ minutes. Meanwhile the 
> > journal (in either rollback or wal mode) balloons to over 300mb in size.
> You don't list any indexes which would help with your WHERE clauses, so I 
> suspect SQLite is having to look through all the records in 'counters' in 
> order to find the rows it needs for each COALESCE. The large size of the 
> journal is because you are replacing every row in the databases.


Maybe I wasn't clear: the ---cut here--- bit is in addition to the existing 
schema (after all, the INSERT OR REPLACE updates it, so surely it must already 
exist).

In the INSERT OR REPLACE operation there is no WHERE clause. (k1, k2) is the 
primary key on the 'counters' table, so the INSERT OR REPLACE takes each value 
out of the temporary trans_counters_v view of the in-memory trans_counters 
temporary table and either replaces or inserts a value for each corresponding 
entry in 'counters'. AFAICT, no amount of indexing here would help. There's no 
reason to index the temporary table, since the sort for the GROUP BY is 
n*log(n), and updating the index for each individual entry would still sum to 
n*log(n). The temporary table is single-use and will have to be scanned every 
time regardless. (And if I isolate that GROUP BY operation, it's blindingly 
fast.) Here's the EXPLAIN QUERY PLAN for the INSERT OR REPLACE:

(0, 0, 0, u'SCAN TABLE trans_counters (132971 rows)')
(0, 0, 0, u'USE TEMP B-TREE FOR GROUP BY')
(0, 0, 0, u'EXECUTE CORRELATED SCALAR SUBQUERY 1')
(1, 0, 0, u'SEARCH TABLE counters USING INDEX sqlite_autoindex_counters_1 (k1=? 
AND k2=?) (~1 rows)')
(0, 0, 0, u'EXECUTE CORRELATED SCALAR SUBQUERY 2')
(2, 0, 0, u'SEARCH TABLE counters USING INDEX sqlite_autoindex_counters_1 (k1=? 
AND k2=?) (~1 rows)')





So no, it's not having to replace every entry in the 'counters' table. Also, if 
it were replacing every row in the database, then the journal would grow to 
equal the (VACUUM'd) size of the database, but it doesn't get that big. It gets 
to 300mb+, but not to the full size of 1.9gb.
 
> > So I feel like something about what I'm doing is fundamentally flawed given 
> > something about sqlite's performance model. All I want is a count of the 
> > number of times that I've seen each pair (k1, k2), is there a better way to 
> > do this without storing them all individually and grouping them later?
> 
> If you have a table with two columns k1, k2 in, and you want to count the 
> times each pair occurs, you can do it in software far faster by having this 
> index and using this SELECT
> CREATE INDEX myTable_keypair ON myTable (k1,k2)
> SELECT k1,k2 from myTable ORDER BY k1,k2
> you might even use one of the following if you know it will always return a 
> unique value for unique keys
> SELECT k1||k2 from myTable ORDER BY k1,k2
> SELECT k1||':'||k2 from myTable ORDER BY k1,k2
> Just count the unique values in your programming language as they go past. 
> Yes, you can use horrendous complication to make SQLite present a neatly 
> formatted return with the counts included, but defining that in SQL makes 
> SQLite do more work than your programming language would need to do.


The problem isn't grouping them to count them. If I evaluate the TEMPORARY VIEW 
that does the GROUP BY clause above it completes in just under a second for all 
100k items in the table it references. That part is very fast. The slow bit is 
incorporating those counts into the versions on disk.





Re: [sqlite] Efficient way to store counters

2013-03-12 Thread Simon Slavin

On 13 Mar 2013, at 3:29am, David King  wrote:

> At first I was just doing something like this pseudocode:
> 
> update_counter(k1, k2, count=count+1, expires=now+count*1day)
> if rows_updated != 1: insert_counter(k1, k2, count=1, expires=now+1day)

Might I suggest that instead of trying to store an ever-changing value, you 
simply figure it out when it's needed ?  I don't quite understand the logic 
you're applying to calculate your 'expires' value but I think it could be 
calculated with a SELECT whenever you needed it rather than being stored.

> Now the code that inserts all of the rows into the memory table executes 
> nearly instantly, but the big INSERT takes 15+ minutes. Meanwhile the journal 
> (in either rollback or wal mode) balloons to over 300mb in size.

You don't list any indexes which would help with your WHERE clauses, so I 
suspect SQLite is having to look through all the records in 'counters' in order 
to find the rows it needs for each COALESCE.  The large size of the journal is 
because you are replacing every row in the database.

> So I feel like something about what I'm doing is fundamentally flawed given 
> something about sqlite's performance model. All I want is a count of the 
> number of times that I've seen each pair (k1, k2), is there a better way to 
> do this without storing them all individually and grouping them later?

If you have a table with two columns k1, k2 in, and you want to count the times 
each pair occurs, you can do it in software far faster by having this index and 
using this SELECT

CREATE INDEX myTable_keypair ON myTable (k1,k2)

SELECT k1,k2 from myTable ORDER BY k1,k2

you might even use one of the following if you know it will always return a 
unique value for unique keys

SELECT k1||k2 from myTable ORDER BY k1,k2
SELECT k1||':'||k2 from myTable ORDER BY k1,k2

Just count the unique values in your programming language as they go past.  
Yes, you can use horrendous complication to make SQLite present a neatly 
formatted return with the counts included, but defining that in SQL makes 
SQLite do more work than your programming language would need to do.
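
A rough sketch of that single pass (hypothetical database path, same 
hypothetical myTable): because the rows come back ordered by (k1,k2), equal 
pairs arrive adjacent and one pass over the cursor is enough:

import sqlite3
from itertools import groupby

conn = sqlite3.connect("example.db")  # hypothetical path
cur = conn.execute("SELECT k1, k2 FROM myTable ORDER BY k1, k2")

# Consecutive equal (k1, k2) tuples form one group each, so they can be
# counted as they stream past.
for (k1, k2), rows in groupby(cur):
    print(k1, k2, sum(1 for _ in rows))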

Simon.


[sqlite] Efficient way to store counters

2013-03-12 Thread David King
I'm trying to store simple incrementing integer counters, but I'm having 
trouble finding an efficient way to do it.

My database looks like:

CREATE TABLE counters (
  k1, k2,
  count, -- how many we've seen
  expires,
  PRIMARY KEY (k1, k2)
);
CREATE INDEX counters_expires_idx ON counters(expires);

It is about 1.9gb and contains ~22 million of these rows. A given transaction 
updates or creates between 10k and 100k of them.

At first I was just doing something like this pseudocode:

update_counter(k1, k2, count=count+1, expires=now+count*1day)
if rows_updated != 1: insert_counter(k1, k2, count=1, expires=now+1day)

but was having serious performance problems that seem to be confined to those 
lines. So I converted it to INSERT OR REPLACE which had no noticeable impact on 
performance. Convinced the problem was in my code, I decided to offload as much 
as possible to sqlite. Now my code looks like:

=== cut here =

PRAGMA synchronous=OFF;
PRAGMA temp_store=MEMORY;



CREATE TEMPORARY TABLE trans_counters(k1, k2);

-- (add all of the items to that temporary table)

CREATE TEMPORARY VIEW trans_counters_v AS
SELECT k1 AS k1,
k2 AS k2,
COUNT(*) AS count
FROM trans_counters
GROUP BY (k1, k2);



INSERT OR REPLACE INTO counters
SELECT c.k1 AS k1,
c.k2 AS k2,
COALESCE((SELECT count FROM counters WHERE k1 = c.k1 AND k2 = c.k2),
0)+c.count AS count,
(COALESCE((SELECT count FROM counters WHERE k1 = c.k1 AND k2 = c.k2),
0)+c.count)*24*60*60+? AS expires
FROM trans_counters_v AS c





=== cut here =

Now the code that inserts all of the rows into the memory table executes nearly 
instantly, but the big INSERT takes 15+ minutes. Meanwhile the journal (in 
either rollback or wal mode) balloons to over 300mb in size. The temporary 
table itself is only about 1.8mb of data (102,603 rows, 94,064 unique) so where 
is all of the journal coming from? The process takes nearly 0 CPU during this 
time, the disk becomes very active (but low throughput, reading and writing 
maybe 200k/s judging by the rate of growth of the journal), and sampling the 
process with OS X's Activity Monitor while it's busy outputs:

100% 2869 _pysqlite_query_execute (in _sqlite3.so) + 1886 [0x101945e5e]
100% 2869 pysqlite_step (in _sqlite3.so) + 47 [0x10194893f]
100% 2869 sqlite3_step (in libsqlite3.dylib) + 1883 [0x7fff8d95ca5b]
100% 2869 sqlite3VdbeExec (in libsqlite3.dylib) + 3327 [0x7fff8d95e3af]
100% 2869 sqlite3BtreeMovetoUnpacked (in libsqlite3.dylib) + 761 
[0x7fff8d97ab89]
100% 2869 moveToChild (in libsqlite3.dylib) + 146 [0x7fff8d96c872]
100% 2869 sqlite3PagerAcquire (in libsqlite3.dylib) + 194 [0x7fff8d93dc22]
100% 2869 sqlite3PcacheFetch (in libsqlite3.dylib) + 475 [0x7fff8d93e02b]
100% 2869 pagerStress (in libsqlite3.dylib) + 670 [0x7fff8d9c407e]
100% 2869 pager_write_pagelist (in libsqlite3.dylib) + 149 [0x7fff8d999a35]
100% 2869 unixWrite (in libsqlite3.dylib) + 83 [0x7fff8d98bd73]
100% 2869 pwrite (in libsystem_kernel.dylib) + 10 [0x7fff8130bab6]



That is, 2869 of 2869 samples, 100% of the time, was spent in sqlite3_step 
writing the data to disk. Further samples look basically the same with an 
occasional read-path taking up to ~10% of the time.

VACUUM ANALYZE doesn't look to have any effect. I'm running sqlite 3.7.7 on Mac 
OS X 10.7.5 via the Python sqlite3 module

So I feel like something about what I'm doing is fundamentally flawed given 
something about sqlite's performance model. All I want is a count of the number 
of times that I've seen each pair (k1, k2), is there a better way to do this 
without storing them all individually and grouping them later? (This would be 
prohibitively large.) 

