Re: [sqlite] Optimizing for space and speed

2006-01-14 Thread Dan Kennedy


--- Martin O'Leary <[EMAIL PROTECTED]> wrote:

> Hi guys,
> 
> I have a table like the following:
> 
> CREATE TABLE user_actions (
> uid INTEGER NOT NULL,
> actionid INTEGER NOT NULL,
> time INTEGER NOT NULL,
> status INTEGER NOT NULL,
> PRIMARY KEY (uid, actionid, time, status)
> );
> 
> And I want to carry out a query something like this:
> 
> SELECT uid, actionid, MAX(time), status FROM user_actions GROUP BY
> uid, actionid;
> 
> i.e. finding the last time each user performed each action (numbers of
> users and distinct actions are small in comparison to number of times
> each user performs each action).
> 
> Here, I find two problems. Firstly, because my table doesn't have an
> INTEGER PRIMARY KEY, I get an autogenerated index on my compound key.
> No problem there. However, and I may be missing something, it seems
> that there's a lot of data duplication going on. All the data from my
> main table is available in the index (status shouldn't really be part
> of the key, but I added it in order to increase query performance). As
> far as I can tell, this means that the main table is never consulted,
> but just sits there, doubling the size of my database. Is there any
> way around this?

You can't do anything about that unfortunately. 

> Secondly, query performance is quite slow. It seems to me that no
> optimisation is being carried out on the MAX(time) expression. Is this
> the case, and if so, why not? Surely it's possible to do this in a
> nice, logarithmic way.

New versions of SQLite should execute this query in O(N) time. Are you
seeing otherwise? I don't really see how you could improve on that 
given the query and index.


__
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com 



Re: [sqlite] Optimizing for space and speed

2006-01-11 Thread drh
Martin O'Leary <[EMAIL PROTECTED]> wrote:
> On 1/11/06, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> > Martin O'Leary <[EMAIL PROTECTED]> wrote:
> > >
> > > Also, with regard to 3.3.0, the alpha release seems to slow down my
> > > example query by about 20% (The virtual machine opcodes are
> > > identical). Is this a bug?
> > >
> >
> > A bug is when it gets the wrong answer.
> >
> > Nevertheless we are concerned about performance.  What
> > are you comparing 3.3.0 against and what platform are
> > you running on?
> 
> Compared to 3.2.8, running on Linux (2.6.10-5-amd64-k8, Ubuntu 5.04)
> on a 1.8 GHz machine with 3 GB RAM and ample disk space.
> 

Thanks.  We will work the problem.  Correctness first, then speed.
--
D. Richard Hipp <[EMAIL PROTECTED]>



Re: [sqlite] Optimizing for space and speed

2006-01-11 Thread Martin O'Leary
On 1/11/06, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> Martin O'Leary <[EMAIL PROTECTED]> wrote:
> >
> > Also, with regard to 3.3.0, the alpha release seems to slow down my
> > example query by about 20% (The virtual machine opcodes are
> > identical). Is this a bug?
> >
>
> A bug is when it gets the wrong answer.
>
> Nevertheless we are concerned about performance.  What
> are you comparing 3.3.0 against and what platform are
> you running on?

Compared to 3.2.8, running on Linux (2.6.10-5-amd64-k8, Ubuntu 5.04)
on a 1.8 GHz machine with 3 GB RAM and ample disk space.

Martin


Re: [sqlite] Optimizing for space and speed

2006-01-11 Thread drh
Martin O'Leary <[EMAIL PROTECTED]> wrote:
> 
> Also, with regard to 3.3.0, the alpha release seems to slow down my
> example query by about 20% (The virtual machine opcodes are
> identical). Is this a bug?
> 

A bug is when it gets the wrong answer.

Nevertheless we are concerned about performance.  What
are you comparing 3.3.0 against and what platform are
you running on?

--
D. Richard Hipp <[EMAIL PROTECTED]>



Re: [sqlite] Optimizing for space and speed

2006-01-11 Thread Martin O'Leary
On 1/11/06, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> Martin O'Leary <[EMAIL PROTECTED]> wrote:
> > Hi guys,
> >
> > I have a table like the following:
> >
> > CREATE TABLE user_actions (
> > uid INTEGER NOT NULL,
> > actionid INTEGER NOT NULL,
> > time INTEGER NOT NULL,
> > status INTEGER NOT NULL,
> > PRIMARY KEY (uid, actionid, time, status)
> > );
> >
> > And I want to carry out a query something like this:
> >
> > SELECT uid, actionid, MAX(time), status FROM user_actions GROUP BY
> > uid, actionid;
>
> Are you thinking that the value of "status" returned will
> be the one which has the maximum value for "time"?  SQL doesn't
> work that way.  To understand why not, consider this query:

Ah, yes. I took over this SQL from someone else and hadn't really
thought about its correctness, merely its speed. Thanks for pointing
it out.

> All data is duplicated - it appears in both the table and in the
> index.  There is no way around that.
>
Wasn't really expecting it, but thought it couldn't hurt to ask.

> Repeat until done.  Runtime is O(MlogN) where N is the number
> of rows in the table and M is the number of rows of output.
> Since M is likely much less than N, this approach will be much
> faster.

Yes, that's more or less the way I'd optimised the query in my head,
and I was hoping there was a way to force SQLite to do things that
way. I guess I'll just implement it in the application. Thanks for
your help.

Also, with regard to 3.3.0, the alpha release seems to slow down my
example query by about 20% (The virtual machine opcodes are
identical). Is this a bug?

Thanks again,
Martin


Re: [sqlite] Optimizing for space and speed

2006-01-11 Thread drh
[EMAIL PROTECTED] wrote:
> 
> The query as you have specified it runs in O(NlogN) time
> where N is the number of rows in the table.

Actually, the original query runs in O(N) time.  I was mistaken.
But I still think I am right in saying that my "manual" scheme
runs in O(MlogN) time.  So the manual scheme will only be faster 
if O(N)>O(MlogN).  So your choice of algorithm will depend a
lot on the values of M and N.

> 
> I suggest you do the query "manually".  To begin with, use
> SQLite version 3.3.0 (which supports DESC indices) and create
> your table with a separate index like this:
> 
>  CREATE TABLE user_actions (
>  uid INTEGER NOT NULL,
>  actionid INTEGER NOT NULL,
>  time INTEGER NOT NULL,
>  status INTEGER NOT NULL
>  );
>  CREATE INDEX user_actions_pk
>  ON user_actions(uid,actionid,time DESC,status);
> 
> The DESC attribute on the time field of the index means that
> index entries will be in descending time order instead of
> the default ascending.  Hence, the first entry in the index
> for a particular uid+actionid will be the one with the largest
> time value.
> 
> Find the first record in O(logN) time like this:
> 
>   SELECT * FROM user_actions 
>ORDER BY uid, actionid, time DESC
>LIMIT 1;
> 
> Find all other actions for the same user this way:
> 
>   SELECT * FROM user_actions
>WHERE uid=:previous_uid AND actionid>:previous_actionid
>ORDER BY uid, actionid, time DESC
>LIMIT 1;
> 
> Once you have all records for a single user, advance to the
> next user this way:
> 
>   SELECT * FROM user_actions
>WHERE uid>:previous_uid
>ORDER BY uid, actionid, time DESC
>LIMIT 1
> 
> Repeat until done.  Runtime is O(MlogN) where N is the number
> of rows in the table and M is the number of rows of output.
> Since M is likely much less than N, this approach will be much
> faster.
> 
--
D. Richard Hipp <[EMAIL PROTECTED]>



Re: [sqlite] Optimizing for space and speed

2006-01-11 Thread drh
Martin O'Leary <[EMAIL PROTECTED]> wrote:
> Hi guys,
> 
> I have a table like the following:
> 
> CREATE TABLE user_actions (
> uid INTEGER NOT NULL,
> actionid INTEGER NOT NULL,
> time INTEGER NOT NULL,
> status INTEGER NOT NULL,
> PRIMARY KEY (uid, actionid, time, status)
> );
> 
> And I want to carry out a query something like this:
> 
> SELECT uid, actionid, MAX(time), status FROM user_actions GROUP BY
> uid, actionid;

Are you thinking that the value of "status" returned will
be the one which has the maximum value for "time"?  SQL doesn't
work that way.  To understand why not, consider this query:

   SELECT uid, actionid, MAX(time), MIN(time), status
 FROM user_actions
GROUP BY uid, actionid;


> 
> Here, I find two problems. Firstly, because my table doesn't have an
> INTEGER PRIMARY KEY, I get an autogenerated index on my compound key.
> No problem there. However, and I may be missing something, it seems
> that there's a lot of data duplication going on. All the data from my
> main table is available in the index (status shouldn't really be part
> of the key, but I added it in order to increase query performance). As
> far as I can tell, this means that the main table is never consulted,
> but just sits there, doubling the size of my database. Is there any
> way around this?

All data is duplicated - it appears in both the table and in the
index.  There is no way around that.

> 
> Secondly, query performance is quite slow. It seems to me that no
> optimisation is being carried out on the MAX(time) expression. Is this
> the case, and if so, why not? Surely it's possible to do this in a
> nice, logarithmic way.
> 

The query as you have specified it runs in O(NlogN) time
where N is the number of rows in the table.
Perhaps an enterprise-class RDBMS with a really big and
really expensive query optimizer can do better, but SQLite
isn't that smart.

I suggest you do the query "manually".  To begin with, use
SQLite version 3.3.0 (which supports DESC indices) and create
your table with a separate index like this:

 CREATE TABLE user_actions (
 uid INTEGER NOT NULL,
 actionid INTEGER NOT NULL,
 time INTEGER NOT NULL,
 status INTEGER NOT NULL
 );
 CREATE INDEX user_actions_pk
 ON user_actions(uid,actionid,time DESC,status);

The DESC attribute on the time field of the index means that
index entries will be in descending time order instead of
the default ascending.  Hence, the first entry in the index
for a particular uid+actionid will be the one with the largest
time value.

Find the first record in O(logN) time like this:

  SELECT * FROM user_actions 
   ORDER BY uid, actionid, time DESC
   LIMIT 1;

Find all other actions for the same user this way:

  SELECT * FROM user_actions
   WHERE uid=:previous_uid AND actionid>:previous_actionid
   ORDER BY uid, actionid, time DESC
   LIMIT 1;

Once you have all records for a single user, advance to the
next user this way:

  SELECT * FROM user_actions
   WHERE uid>:previous_uid
   ORDER BY uid, actionid, time DESC
   LIMIT 1

Repeat until done.  Runtime is O(MlogN) where N is the number
of rows in the table and M is the number of rows of output.
Since M is likely much less than N, this approach will be much
faster.

--
D. Richard Hipp <[EMAIL PROTECTED]>



Re: [sqlite] Optimizing for space and speed

2006-01-11 Thread Martin O'Leary
On 1/11/06, Dan Kennedy <[EMAIL PROTECTED]> wrote:
>
>
> --- Martin O'Leary <[EMAIL PROTECTED]> wrote:
>
> > Hi guys,
> >
> > I have a table like the following:
> >
> > CREATE TABLE user_actions (
> > uid INTEGER NOT NULL,
> > actionid INTEGER NOT NULL,
> > time INTEGER NOT NULL,
> > status INTEGER NOT NULL,
> > PRIMARY KEY (uid, actionid, time, status)
> > );
> >
> > And I want to carry out a query something like this:
> >
> > SELECT uid, actionid, MAX(time), status FROM user_actions GROUP BY
> > uid, actionid;
> >
> > i.e. finding the last time each user performed each action (numbers of
> > users and distinct actions are small in comparison to number of times
> > each user performs each action).
>
> > Secondly, query performance is quite slow. It seems to me that no
> > optimisation is being carried out on the MAX(time) expression. Is this
> > the case, and if so, why not? Surely it's possible to do this in a
> > nice, logarithmic way.
>
> New versions of SQLite should execute this query in O(N) time. Are you
> seeing otherwise? I don't really see how you could improve on that
> given the query and index.

Surely it can be done in O(log N), assuming constant numbers of user
and action ids? All that needs to be done is to find all distinct
(uid, actionid) pairs (which I believe should be O(log N)), and then
find the first entry in the index for each (again, O(log N)).

Martin


Re: [sqlite] Optimizing for space and speed

2006-01-11 Thread Dan Kennedy


--- Martin O'Leary <[EMAIL PROTECTED]> wrote:

> Hi guys,
> 
> I have a table like the following:
> 
> CREATE TABLE user_actions (
> uid INTEGER NOT NULL,
> actionid INTEGER NOT NULL,
> time INTEGER NOT NULL,
> status INTEGER NOT NULL,
> PRIMARY KEY (uid, actionid, time, status)
> );
> 
> And I want to carry out a query something like this:
> 
> SELECT uid, actionid, MAX(time), status FROM user_actions GROUP BY
> uid, actionid;
> 
> i.e. finding the last time each user performed each action (numbers of
> users and distinct actions are small in comparison to number of times
> each user performs each action).
> 
> Here, I find two problems. Firstly, because my table doesn't have an
> INTEGER PRIMARY KEY, I get an autogenerated index on my compound key.
> No problem there. However, and I may be missing something, it seems
> that there's a lot of data duplication going on. All the data from my
> main table is available in the index (status shouldn't really be part
> of the key, but I added it in order to increase query performance). As
> far as I can tell, this means that the main table is never consulted,
> but just sits there, doubling the size of my database. Is there any
> way around this?

You can't do anything about that unfortunately. 

> Secondly, query performance is quite slow. It seems to me that no
> optimisation is being carried out on the MAX(time) expression. Is this
> the case, and if so, why not? Surely it's possible to do this in a
> nice, logarithmic way.

New versions of SQLite should execute this query in O(N) time. Are you
seeing otherwise? I don't really see how you could improve on that 
given the query and index.


__
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com