On 12/11/2011 5:02 p.m., Simon Slavin wrote:
On 12 Nov 2011, at 3:43am, Josh Gibbs wrote:

We are struggling to find a way to rework this query in a way
that performs efficiently for large data sets.
It's all about the indexes.

The problem with this query seems to be related to the aggregation
of the counts, which of course we can't add an index to.  The final order
by is where the performance suddenly takes the hit.

The goal is to find the top x most active senders of e-mails
within a date range.

Do you frequently do this for many of random date ranges ?  Or do you have a 
set of standard requirements, like 'Most popular posters today, this week, this 
month.' ?

We have some predefined selections, but we also have a self-selection
range.  We are trying to avoid creating pre-aggregated sets if it's at all
possible.

WHERE
    CAST (messages.TimeStamp AS INTEGER) BETWEEN 0 AND 9999999999
It's a little late at night, so forgive me.  What is this for ?  What formats 
are the TimeStamps without CASTing ?  Can you store the TimeStamps as integers, 
or even get rid of this entirely ?  Oh hold on ... this is just for testing and 
in real life you set it to a smaller span ?

The timestamps are already integers.  We stumbled across that CAST operation
optimization purely by accident. I don't remember what led to it, but we found that it gave a measurable performance boost casting the integer as an integer.
You might want to check into that one.

And yes, those numbers are just for testing. Normally that range is an epoch time range spanning the requested time. I left that in there because it is part of the
query we are using to be complete.

ORDER BY Total DESC
This is what you say is killing you.  Just noting it.

To reiterate from my original question, if we don't add the order by then
the results come back at a very acceptable speed.  But of course we'd
then have to iterate through the results ourselves to siphon off the
top item set that we're after.  I'd really like the DB to do that for me.

CREATE INDEX messages_timestamp_index ON messages (TimeStamp);
Because you do a CAST in your 'WHERE' clause, I don't know whether this index 
would actually be used.
As mentioned previously, the CAST makes the query quicker.  However in our
tests right now we're using a smaller data set of around a million records of
messages/senders and not actually using the WHERE clause at all.  It doesn't
make much difference to the overall query time.


_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Reply via email to