Re: select Distinct question

SGreen Thu, 13 May 2004 14:22:14 -0700

Hello List,

Please forgive this rather lengthy post. I thought I had something worked
out to answer Rob's question but when I put it to the test I found what MAY
be a bug in 4.1.1a-alpha-nt-log. Here is what I did.


I created two tables, tablea and tableb. Here are their defs:

mysql> show create table tablea \G
*************************** 1. row ***************************
       Table: tablea
Create Table: CREATE TABLE `tablea` (
  `IP` varchar(16) default NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1
1 row in set (0.00 sec)

mysql> show create table tableb \G
*************************** 1. row ***************************
       Table: tableb
Create Table: CREATE TABLE `tableb` (
  `IP` varchar(16) default NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1
1 row in set (0.00 sec)

I then populated them with sample data. I used odd IP addresses for tablea
and even IP addresses for tableb except for .3 which exists in both. I
created multiple rows for each IP to simulate a set of (extremely
simplified) web server logs.

Here are the table dumps:

mysql> select * from tablea;
+-------------+
| IP          |
+-------------+
| 192.168.1.1 |
| 192.168.1.1 |
| 192.168.1.1 |
| 192.168.1.3 |
| 192.168.1.3 |
| 192.168.1.3 |
| 192.168.1.5 |
| 192.168.1.5 |
| 192.168.1.5 |
| 192.168.1.3 |
+-------------+
10 rows in set (0.00 sec)

mysql> select * from tableb;
+-------------+
| IP          |
+-------------+
| 192.168.1.2 |
| 192.168.1.2 |
| 192.168.1.2 |
| 192.168.1.4 |
| 192.168.1.4 |
| 192.168.1.4 |
| 192.168.1.6 |
| 192.168.1.6 |
| 192.168.1.6 |
| 192.168.1.8 |
| 192.168.1.8 |
| 192.168.1.8 |
| 192.168.1.3 |
| 192.168.1.3 |
+-------------+
14 rows in set (0.00 sec)

What I was attempting to do was to try to work around the lack of FULL
OUTER JOIN capability in order to help Rob find out how many unique IP
addresses exist between the two tables.  I crafted the following statement:

SELECT COUNT(DISTINCT IFNULL(a.ip, c.ip)) as IP_Count
FROM tablea a
LEFT JOIN tableb b
      ON b.ip = a.ip
RIGHT JOIN tableb c
      ON c.ip = a.ip;

The answer I got back was 3...... hmmmm. To try to see what was being
"counted distinctly" I rewrote the query to read:

SELECT a.ip as a, b.ip as b, c.ip as b
FROM tablea a
LEFT JOIN tableb b
      ON b.ip = a.ip
RIGHT JOIN tableb c
      ON c.ip = a.ip;

and got 148 rows. What had actually happened was the query engine created a
cartesian product of the two tables. Here is a partial dump of the output:
+-------------+-------------+-------------+
| a           | b           | c           |
+-------------+-------------+-------------+
| 192.168.1.1 | NULL        | 192.168.1.2 |
       ----- clipped for space -----
| 192.168.1.5 | NULL        | 192.168.1.8 |
| 192.168.1.5 | NULL        | 192.168.1.8 |
| 192.168.1.5 | NULL        | 192.168.1.8 |
| 192.168.1.3 | NULL        | 192.168.1.8 |
| 192.168.1.1 | NULL        | 192.168.1.3 |
| 192.168.1.1 | NULL        | 192.168.1.3 |
| 192.168.1.1 | NULL        | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 | 192.168.1.3 |
| 192.168.1.5 | NULL        | 192.168.1.3 |
| 192.168.1.5 | NULL        | 192.168.1.3 |
| 192.168.1.5 | NULL        | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 | 192.168.1.3 |
| 192.168.1.1 | NULL        | 192.168.1.3 |
| 192.168.1.1 | NULL        | 192.168.1.3 |
| 192.168.1.1 | NULL        | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 | 192.168.1.3 |
| 192.168.1.5 | NULL        | 192.168.1.3 |
| 192.168.1.5 | NULL        | 192.168.1.3 |
| 192.168.1.5 | NULL        | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 | 192.168.1.3 |
+-------------+-------------+-------------+
148 rows in set (0.00 sec)

All of the rows that were clipped were combinations of tablea and tableb
values with the "b" column always coming back as null. Column "a" and
column "c" were never NULL.

Here is why I think it's a bug.  If we break down the FROM clause we get
two situations. I will refer to the tables by their aliases so that the
physical names don't get confusing:

FROM tablea A
LEFT JOIN tableb B
      ON B.ip = A.ip

and

FROM tablea A
RIGHT JOIN tableb C
      ON C.ip = B.ip;

If A left joins to B first we get a set that consists of all of A's rows
with some colmns of B data populated in where they match (8 rows of B data
thanks to the duplicate matches on 192.168.1.3). If we then call that
result set T and right join C to T on a.ip (of T) matching c.ip we will get
all of C's rows with a few of the columns from T populated where they
match. Shouldn't I have gotten just 28 rows back (the 12 rows of
non-matching C values plus the 16 rows where the two datasets match up on
the value 192.168.1.3)?

(subset T)
+-------------+-------------+
| a.ip        | b.ip        |
+-------------+-------------+
| 192.168.1.1 | NULL        |
| 192.168.1.1 | NULL        |
| 192.168.1.1 | NULL        |
| 192.168.1.3 | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 |
| 192.168.1.5 | NULL        |
| 192.168.1.5 | NULL        |
| 192.168.1.5 | NULL        |
| 192.168.1.3 | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 |
+-------------+-------------+
14 rows



If I work the joins in the opposite order: Right Join C to A first then
left join those results (T2) to B. C to A results in 8 rows of table A data
and all of the rest of the C data for 20 rows total (12 rows where C didn't
match up with A plus 8 rows of 192.168.1.3 data) == T2.  Now left join T2
to B (joining the A data to match the B data) and I should get the 8 rows
of the A data of T2 times the 2 rows of B (matching at 192.168.1.3) giving
me 16 rows of B populated data plus the other 12 rows of T2 data where the
A-sourced columns were empty. That only gives me 38 rows.

(subset T2)
+-------------+-------------+
| a.ip        | c.ip        |
+-------------+-------------+
| NULL        | 192.168.1.2 |
| NULL        | 192.168.1.2 |
| NULL        | 192.168.1.2 |
| NULL        | 192.168.1.4 |
| NULL        | 192.168.1.4 |
| NULL        | 192.168.1.4 |
| NULL        | 192.168.1.6 |
| NULL        | 192.168.1.6 |
| NULL        | 192.168.1.6 |
| NULL        | 192.168.1.8 |
| NULL        | 192.168.1.8 |
| NULL        | 192.168.1.8 |
| 192.168.1.3 | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 |
+-------------+-------------+
20 rows

Have I analyzed this situation correctly? I understand how a cartesian
product can result in 148 rows (thanks to the duplicate matches in each
table) but I don't think that was what I asked the engine to produce. Is
this a flaw in the engine or a flaw in my logic. Please help!!!

Thanks in advance!

Shawn Green
Database Administrator
Unimin Corporation - Spruce Pine



-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:    http://lists.mysql.com/[EMAIL PROTECTED]

Re: select Distinct question

Reply via email to