Hi Paul, all!

Paul Halliday wrote:
> On Wed, Dec 9, 2009 at 2:24 PM, Joerg Bruehe <joerg.bru...@sun.com> wrote:
>> Hi everybody!
>>
>>
>> Neil Aggarwal wrote:
>>> Paul:
>>>
>>>> SELECT COUNT(event.src_ip) AS count, INET_NTOA(event.src_ip),
>>>> mappings.cc FROM event, mappings WHERE event.timestamp BETWEEN
>>>> '2009-12-06 20:00' and '2009-12-07 20:00:00' AND event.src_ip BETWEEN
>>>> 'mappings.start_ip' AND 'mappings.end_ip' GROUP BY event.src_ip ORDER
>>>> BY count DESC LIMIT 20;
>> I am surprised by the quotes you have around the "start_ip" and "end_ip"
>> columns; to me, this makes that look like strings.
> 
> That because I don't know what I am doing :). No quotes on integers; got it!

Like Andy replied: No quotes around column names (or expressions) you
want evaluated as variables, these quotes mark (constant) strings.
(There is another kind of quotes which serves to solve problems around
using large and small letters in names, I'll ignore that here.)

> 
>> From your posting, I see the result you hope to get but not the one you
>> actually get. IMO, just dropping the single quotes around the two column
>> names should produce the data you want to get.
>>
>> Or what is the result you receive?
> 
> Removing the quotes does work. The query however took 1h15m to complete. Yuck.

In general, Andy is right recommending "explain".


This case, however, may be a bit different:
> 
> I am guessing this is because even though there is a limit, it is
> still doing the lookup on everything past 20. Also, if the first
> address has a count of say 2000, it would be doing the lookup 2000
> times for a single address.
> 
> Is that right?

Yes, that may be.

First, don't forget that the query has to evaluate all events (within
your date/time limits) to find which IP addresses are the 20 with the
highest count.

Second, I fear the server may first perform the join on all your data
and only then do the WHERE, the GROUP BY, and the LIMIT. Which means it
would handle much more data than needed.

I guess you might see an enormous boost if you split up your statement
into two parts (syntax not checked):

INSERT INTO events_to_locate
   SELECT COUNT(src_ip) AS count, src_ip FROM event
   WHERE timestamp BETWEEN '2009-12-06 20:00' and '2009-12-07 20:00:00'
   GROUP BY src_ip ORDER BY count DESC LIMIT 20;

SELECT count, INET_NTOA(events_to_locate.src_ip), mappings.cc
   FROM events_to_locate, mappings
   WHERE events_to_locate.src_ip BETWEEN mappings.start_ip AND
                                         mappings.end_ip;

If your "mappings" table does not have a "src_ip" column (your column
names are unique in the tables listed in the FROM clause), you need not
give the table names in that second statement:

SELECT count, INET_NTOA(src_ip), cc
   FROM events_to_locate, mappings
   WHERE src_ip BETWEEN start_ip AND end_ip;

"Division of complexity":
Maybe the optimizer is not clever enough to apply this optimization,
then using a temporary table which holds just the 20 IP addresses you
are interested in should significantly reduce the amount of work.

> 
> If it is I guess I will have to post process the results. Which is
> fine, I just like to keep as much in the queries as I can.

In general, doing work in the server is better, because it means you
need to transfer less data between server and client.
As usual, there are exceptions.


HTH,
Jörg

-- 
Joerg Bruehe,  MySQL Build Team,  joerg.bru...@sun.com
Sun Microsystems GmbH,   Komturstraße 18a,   D-12099 Berlin
Geschaeftsfuehrer: Thomas Schroeder, Wolfgang Engels, Wolf Frenkel
Vorsitzender des Aufsichtsrates: Martin Haering     Muenchen: HRB161028


--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:    http://lists.mysql.com/mysql?unsub=arch...@jab.org

Reply via email to