Re: how to efficiently query for the next in MySQL Community Edition 5.1.34?

Mike Spreitzer Sat, 20 Jun 2009 20:49:45 -0700

Sorry I have not been careful enough.  Following is a very concrete, 
worked example -- so I think I have finally gotten the bugs out.  After 
the example I resume with unanswered questions.

Remember I did not say each integer appears only once, and consider this 
dataset:

create table t (s char(1), i int);
insert into t values ('a', 1), ('b', 1), ('b', 3), ('c', 3), ('b', 4), 
('a', 5), ('c', 5);
mysql> select * from t;
+------+------+
| s    | i    |
+------+------+
| a    |    1 | 
| b    |    1 | 
| b    |    3 | 
| c    |    3 | 
| b    |    4 | 
| a    |    5 | 
| c    |    5 | 
+------+------+
7 rows in set (0.00 sec)

Here is the ineffecient way to compute the desired answer:

mysql> select t1.*, t2.i as j from t as t1, t as t2 where t1.s=t2.s and 
t1.i < t2.i and not exists (select * from t as t12 where t12.s=t1.s and 
t1.i < t12.i  and t12.i < t2.i);
+------+------+------+
| s    | i    | j    |
+------+------+------+
| b    |    1 |    3 | 
| b    |    3 |    4 | 
| a    |    1 |    5 | 
| c    |    3 |    5 | 
+------+------+------+
4 rows in set (0.01 sec)

Here is the better way, using min() (the order of the rows is unimportant 
here):

mysql> select t1.*, min(t2.i) as j from t as t1, t as t2 where t1.s=t2.s 
and t1.i<t2.i group by t1.s, t1.i;
+------+------+------+
| s    | i    | j    |
+------+------+------+
| a    |    1 |    5 | 
| b    |    1 |    3 | 
| b    |    3 |    4 | 
| c    |    3 |    5 | 
+------+------+------+
4 rows in set (0.00 sec)

Code that assumes uniqueness of the integers does not work:

mysql> SELECT a.i, MIN(b.i) AS j 
    -> FROM t AS a 
    -> JOIN t AS b ON b.i > a.i AND a.s = b.s
    -> GROUP BY  a.i;
+------+------+
| i    | j    |
+------+------+
|    1 |    3 | 
|    3 |    4 | 
+------+------+
2 rows in set (0.00 sec)

What remains unclear to me is how fast the correct min-based query (the 
"better way" above) will run.  You and I know that we could walk a BTREE 
on (s, i) and compute the answer in linear time.  As I read the MySQL 
documentation, however, this query does not fit the constraints for fast 
range-based indexing into t2 because the "t1.i < t2.i" comparison does not 
compare t2's value with something that meets the criteria for "a 
constant".  It looks like the query planner plans to do a nested 
enumeration of the integers associated with s, for each (s, i) row in the 
outer enumeration:

mysql> alter table t add primary key (s, i);
Query OK, 7 rows affected (0.01 sec)
Records: 7  Duplicates: 0  Warnings: 0

mysql> explain select t1.*, min(t2.i) as j from t as t1, t as t2 where 
t1.s=t2.s and t1.i<t2.i group by t1.s, t1.i;
+----+-------------+-------+-------+---------------+---------+---------+-----------------+------+--------------------------+
| id | select_type | table | type  | possible_keys | key     | key_len | 
ref             | rows | Extra                    |
+----+-------------+-------+-------+---------------+---------+---------+-----------------+------+--------------------------+
|  1 | SIMPLE      | t1    | index | PRIMARY       | PRIMARY | 5       | 
NULL            |    7 | Using index              | 
|  1 | SIMPLE      | t2    | ref   | PRIMARY       | PRIMARY | 1       | 
mjs090605a.t1.s |    3 | Using where; Using index | 
+----+-------------+-------+-------+---------------+---------+---------+-----------------+------+--------------------------+

That is an improvement over my original formulation:

mysql> explain select t1.*, t2.i as j from t as t1, t as t2 where 
t1.s=t2.s and t1.i < t2.i and not exists (select * from t as t12 where 
t12.s=t1.s and t1.i < t12.i  and t12.i < t2.i);
+----+--------------------+-------+-------+---------------+---------+---------+-----------------+------+--------------------------+
| id | select_type        | table | type  | possible_keys | key     | 
key_len | ref             | rows | Extra                    |
+----+--------------------+-------+-------+---------------+---------+---------+-----------------+------+--------------------------+
|  1 | PRIMARY            | t1    | index | PRIMARY       | PRIMARY | 5  | 
NULL            |    7 | Using index              | 
|  1 | PRIMARY            | t2    | ref   | PRIMARY       | PRIMARY | 1  | 
mjs090605a.t1.s |    3 | Using where; Using index | 
|  2 | DEPENDENT SUBQUERY | t12   | ref   | PRIMARY       | PRIMARY | 1  | 
mjs090605a.t1.s |    3 | Using where; Using index | 
+----+--------------------+-------+-------+---------------+---------+---------+-----------------+------+--------------------------+

We have reduced the time complexity from O( (size of T) * (avg num 
integers per string)^2 ) to O( (size of T) * (avg num integers per 
string)^1 ).  That's great.  We have saved about a factor of 100 in my 
real application (a given string is paired with something on the order of 
100 different integers).  But we could save another factor of 100.  How do 
I save (even a portion of) that?

Thanks,
Mike Spreitzer

Peter Brawley <[email protected]> 
06/20/09 06:56 PM
Please respond to
[email protected]

To
Mike Spreitzer/Watson/i...@ibmus
cc
[email protected]
Subject
Re: how to efficiently query for the next in MySQL Community Edition 
5.1.34?

Mike,

>Oops, I did not read your original query closely enough. 
>You actually meant to group by S, not I, right?  

No, it's a query for next i values with matching s values, so it groups by 
i.

>I can get S, I, and J with 

>SELECT a.s, a.i, MIN(b.i) AS j 
>FROM t AS a 
>JOIN t AS b ON b.i > a.i AND a.s = b.s
>GROUP BY  a.s 

For this dataset ...

drop table if exists t;
create table t(i int,s char(1));
insert into t 
values(1,'a'),(2,'b'),(3,'c'),(4,'a'),(5,'a'),(6,'d'),(7,'e'),(8,'d');

are these the correct next values of i?

+------+------+
| i    | j    |
+------+------+
|    1 |    4 |
|    4 |    5 |
|    6 |    8 |
+------+------+

Your query doesn't return that.

PB

-----

Mike Spreitzer wrote: 

Oops, I did not read your original query closely enough.  You actually 
meant to group by S, not I, right?  I can get S, I, and J with 

SELECT a.s, a.i, MIN(b.i) AS j 
FROM t AS a 
JOIN t AS b ON b.i > a.i AND a.s = b.s
GROUP BY  a.s 

Right? 

My integers are not unique; a given integer can be paired with several 
strings. 

Thanks, 
Mike Spreitzer
SMTP: [email protected], Lotus Notes: Mike Spreitzer/Watson/IBM
Office phone: +1-914-784-6424 (IBM T/L 863-)
AOL Instant Messaging: M1k3Sprtzr 

Peter Brawley <[email protected]> 
06/20/09 03:59 PM 

Please respond to
[email protected]

To
Mike Spreitzer/Watson/i...@ibmus 
cc
[email protected] 
Subject
Re: how to efficiently query for the next in MySQL Community Edition 
5.1.34?

>I do not follow why you suggested a join to get the associated S, 
>that can be done in the original query (and I did NOT say a given 
>integer I is associated with only one string S): 

A Group By query returns arbitrary values for a column which (i) does not 
Group By, (ii) does not aggregate, and (iii) does not have a 1:1 
relationship with the grouping expression.

PB

-----

Mike Spreitzer wrote: 

Ah, yes, the MIN should be very helpful.  Can I expect that ordering the 
storage by (S, I) or having an (S, I) index will make that MIN take O(1) 
time, for both MyISAM and InnoDB? 

I do not follow why you suggested a join to get the associated S, that can 
be done in the original query (and I did NOT say a given integer I is 
associated with only one string S): 

SELECT a.s, a.i, MIN(b.i) AS j 
FROM t AS a 
JOIN t AS b ON b.i > a.i AND a.s = b.s
GROUP BY  a.i 

Thanks, 
Mike Spreitzer

Peter Brawley <[email protected]> 
06/20/09 12:39 PM 

Please respond to
[email protected]

To
Mike Spreitzer/Watson/i...@ibmus 
cc
[email protected] 
Subject
Re: how to efficiently query for the next in MySQL Community Edition 
5.1.34?

Mike, 
>Yes, for each (S, I) pair the goal is to efficiently find the next 
largest 
>integer associated with S in T.  For the highest integer I associated 
with 
>S in T, there is no next larger. 
Here's a more efficient query for the next i values with matching s 
values:

SELECT a.i, MIN(b.i) AS j 
FROM t AS a 
JOIN t AS b ON b.i > a.i AND a.s = b.s
GROUP BY  a.i

To fetch the matching s values, join the above to the original table:

SELECT n.i, t.s, n.j
FROM (
SELECT a.i, MIN(b.i) AS j 
FROM t AS a 
JOIN t AS b ON b.i > a.i AND a.s = b.s
GROUP BY  a.i
) AS n JOIN t USING (i);

PB

-----

Mike Spreitzer wrote: 
Yes, for each (S, I) pair the goal is to efficiently find the next largest 

integer associated with S in T.  For the highest integer I associated with 

S in T, there is no next larger.

Thanks,
Mike Spreitzer

Peter Brawley <[email protected]> 
06/20/09 08:56 AM
Please respond to
[email protected]

To
Mike Spreitzer/Watson/i...@ibmus
cc
[email protected]
Subject
Re: how to efficiently query for the next in MySQL Community Edition 
5.1.34?

Mike

J holding the next integer that T has for S 

You mean for each i, the next value of i with that s?

(U having no row for the last integer of each string).

I do not understand that at all.

PB

Mike Spreitzer wrote: 
Suppose I have a table T with two column, S holding strings (say, 
VARCHAR(200)) and I holding integers.  No row appears twice.  A given 
string appears many times, on average about 100 times.  Suppose I have 
millions of rows.  I want to make a table U holding those same columns 
plus one more, J holding the next integer that T has for S (U having no 
row for the last integer of each string).  I could index T on (S,I) and 
write this query as

select t1.*, t2.I as J from T as t1, T as t2
where t1.S=t2.S and t1.I < t2.I
and not exists (select * from T as t12 where t12.S=t1.S and t1.I < t12.I 
and t12.I < t2.I)

but the query planner says this is quite expensive to run: it will 
enumerate all of T as t1, do a nested enumeration of all t2's entries for 
S=t1.S, and inside that do a further nested enumeration of t12's entries 
for S=t1.S --- costing about 10,000 times the size of T.  There has to be 
a better way!

Thanks,
Mike Spreitzer

No virus found in this incoming message.
Checked by AVG - www.avg.com 
Version: 8.5.364 / Virus Database: 270.12.80/2187 - Release Date: 06/19/09 

06:53:00

No virus found in this incoming message.
Checked by AVG - www.avg.com 
Version: 8.5.364 / Virus Database: 270.12.81/2189 - Release Date: 06/20/09 
06:15:00

No virus found in this incoming message.
Checked by AVG - www.avg.com 
Version: 8.5.364 / Virus Database: 270.12.81/2189 - Release Date: 06/20/09 
06:15:00

No virus found in this incoming message.
Checked by AVG - www.avg.com 
Version: 8.5.364 / Virus Database: 270.12.81/2189 - Release Date: 06/20/09 
06:15:00

Re: how to efficiently query for the next in MySQL Community Edition 5.1.34?

Reply via email to