Hello List, Please forgive this rather lengthy post. I thought I had something worked out to answer Rob's question but when I put it to the test I found what MAY be a bug in 4.1.1a-alpha-nt-log. Here is what I did.
I created two tables, tablea and tableb. Here are their defs: mysql> show create table tablea \G *************************** 1. row *************************** Table: tablea Create Table: CREATE TABLE `tablea` ( `IP` varchar(16) default NULL ) ENGINE=MyISAM DEFAULT CHARSET=latin1 1 row in set (0.00 sec) mysql> show create table tableb \G *************************** 1. row *************************** Table: tableb Create Table: CREATE TABLE `tableb` ( `IP` varchar(16) default NULL ) ENGINE=MyISAM DEFAULT CHARSET=latin1 1 row in set (0.00 sec) I then populated them with sample data. I used odd IP addresses for tablea and even IP addresses for tableb except for .3 which exists in both. I created multiple rows for each IP to simulate a set of (extremely simplified) web server logs. Here are the table dumps: mysql> select * from tablea; +-------------+ | IP | +-------------+ | 192.168.1.1 | | 192.168.1.1 | | 192.168.1.1 | | 192.168.1.3 | | 192.168.1.3 | | 192.168.1.3 | | 192.168.1.5 | | 192.168.1.5 | | 192.168.1.5 | | 192.168.1.3 | +-------------+ 10 rows in set (0.00 sec) mysql> select * from tableb; +-------------+ | IP | +-------------+ | 192.168.1.2 | | 192.168.1.2 | | 192.168.1.2 | | 192.168.1.4 | | 192.168.1.4 | | 192.168.1.4 | | 192.168.1.6 | | 192.168.1.6 | | 192.168.1.6 | | 192.168.1.8 | | 192.168.1.8 | | 192.168.1.8 | | 192.168.1.3 | | 192.168.1.3 | +-------------+ 14 rows in set (0.00 sec) What I was attempting to do was to try to work around the lack of FULL OUTER JOIN capability in order to help Rob find out how many unique IP addresses exist between the two tables. I crafted the following statement: SELECT COUNT(DISTINCT IFNULL(a.ip, c.ip)) as IP_Count FROM tablea a LEFT JOIN tableb b ON b.ip = a.ip RIGHT JOIN tableb c ON c.ip = a.ip; The answer I got back was 3...... hmmmm. To try to see what was being "counted distinctly" I rewrote the query to read: SELECT a.ip as a, b.ip as b, c.ip as b FROM tablea a LEFT JOIN tableb b ON b.ip = a.ip RIGHT JOIN tableb c ON c.ip = a.ip; and got 148 rows. What had actually happened was the query engine created a cartesian product of the two tables. Here is a partial dump of the output: +-------------+-------------+-------------+ | a | b | c | +-------------+-------------+-------------+ | 192.168.1.1 | NULL | 192.168.1.2 | ----- clipped for space ----- | 192.168.1.5 | NULL | 192.168.1.8 | | 192.168.1.5 | NULL | 192.168.1.8 | | 192.168.1.5 | NULL | 192.168.1.8 | | 192.168.1.3 | NULL | 192.168.1.8 | | 192.168.1.1 | NULL | 192.168.1.3 | | 192.168.1.1 | NULL | 192.168.1.3 | | 192.168.1.1 | NULL | 192.168.1.3 | | 192.168.1.3 | 192.168.1.3 | 192.168.1.3 | | 192.168.1.3 | 192.168.1.3 | 192.168.1.3 | | 192.168.1.3 | 192.168.1.3 | 192.168.1.3 | | 192.168.1.3 | 192.168.1.3 | 192.168.1.3 | | 192.168.1.3 | 192.168.1.3 | 192.168.1.3 | | 192.168.1.3 | 192.168.1.3 | 192.168.1.3 | | 192.168.1.5 | NULL | 192.168.1.3 | | 192.168.1.5 | NULL | 192.168.1.3 | | 192.168.1.5 | NULL | 192.168.1.3 | | 192.168.1.3 | 192.168.1.3 | 192.168.1.3 | | 192.168.1.3 | 192.168.1.3 | 192.168.1.3 | | 192.168.1.1 | NULL | 192.168.1.3 | | 192.168.1.1 | NULL | 192.168.1.3 | | 192.168.1.1 | NULL | 192.168.1.3 | | 192.168.1.3 | 192.168.1.3 | 192.168.1.3 | | 192.168.1.3 | 192.168.1.3 | 192.168.1.3 | | 192.168.1.3 | 192.168.1.3 | 192.168.1.3 | | 192.168.1.3 | 192.168.1.3 | 192.168.1.3 | | 192.168.1.3 | 192.168.1.3 | 192.168.1.3 | | 192.168.1.3 | 192.168.1.3 | 192.168.1.3 | | 192.168.1.5 | NULL | 192.168.1.3 | | 192.168.1.5 | NULL | 192.168.1.3 | | 192.168.1.5 | NULL | 192.168.1.3 | | 192.168.1.3 | 192.168.1.3 | 192.168.1.3 | | 192.168.1.3 | 192.168.1.3 | 192.168.1.3 | +-------------+-------------+-------------+ 148 rows in set (0.00 sec) All of the rows that were clipped were combinations of tablea and tableb values with the "b" column always coming back as null. Column "a" and column "c" were never NULL. Here is why I think it's a bug. If we break down the FROM clause we get two situations. I will refer to the tables by their aliases so that the physical names don't get confusing: FROM tablea A LEFT JOIN tableb B ON B.ip = A.ip and FROM tablea A RIGHT JOIN tableb C ON C.ip = B.ip; If A left joins to B first we get a set that consists of all of A's rows with some colmns of B data populated in where they match (8 rows of B data thanks to the duplicate matches on 192.168.1.3). If we then call that result set T and right join C to T on a.ip (of T) matching c.ip we will get all of C's rows with a few of the columns from T populated where they match. Shouldn't I have gotten just 28 rows back (the 12 rows of non-matching C values plus the 16 rows where the two datasets match up on the value 192.168.1.3)? (subset T) +-------------+-------------+ | a.ip | b.ip | +-------------+-------------+ | 192.168.1.1 | NULL | | 192.168.1.1 | NULL | | 192.168.1.1 | NULL | | 192.168.1.3 | 192.168.1.3 | | 192.168.1.3 | 192.168.1.3 | | 192.168.1.3 | 192.168.1.3 | | 192.168.1.3 | 192.168.1.3 | | 192.168.1.3 | 192.168.1.3 | | 192.168.1.3 | 192.168.1.3 | | 192.168.1.5 | NULL | | 192.168.1.5 | NULL | | 192.168.1.5 | NULL | | 192.168.1.3 | 192.168.1.3 | | 192.168.1.3 | 192.168.1.3 | +-------------+-------------+ 14 rows If I work the joins in the opposite order: Right Join C to A first then left join those results (T2) to B. C to A results in 8 rows of table A data and all of the rest of the C data for 20 rows total (12 rows where C didn't match up with A plus 8 rows of 192.168.1.3 data) == T2. Now left join T2 to B (joining the A data to match the B data) and I should get the 8 rows of the A data of T2 times the 2 rows of B (matching at 192.168.1.3) giving me 16 rows of B populated data plus the other 12 rows of T2 data where the A-sourced columns were empty. That only gives me 38 rows. (subset T2) +-------------+-------------+ | a.ip | c.ip | +-------------+-------------+ | NULL | 192.168.1.2 | | NULL | 192.168.1.2 | | NULL | 192.168.1.2 | | NULL | 192.168.1.4 | | NULL | 192.168.1.4 | | NULL | 192.168.1.4 | | NULL | 192.168.1.6 | | NULL | 192.168.1.6 | | NULL | 192.168.1.6 | | NULL | 192.168.1.8 | | NULL | 192.168.1.8 | | NULL | 192.168.1.8 | | 192.168.1.3 | 192.168.1.3 | | 192.168.1.3 | 192.168.1.3 | | 192.168.1.3 | 192.168.1.3 | | 192.168.1.3 | 192.168.1.3 | | 192.168.1.3 | 192.168.1.3 | | 192.168.1.3 | 192.168.1.3 | | 192.168.1.3 | 192.168.1.3 | | 192.168.1.3 | 192.168.1.3 | +-------------+-------------+ 20 rows Have I analyzed this situation correctly? I understand how a cartesian product can result in 148 rows (thanks to the duplicate matches in each table) but I don't think that was what I asked the engine to produce. Is this a flaw in the engine or a flaw in my logic. Please help!!! Thanks in advance! Shawn Green Database Administrator Unimin Corporation - Spruce Pine -- MySQL General Mailing List For list archives: http://lists.mysql.com/mysql To unsubscribe: http://lists.mysql.com/[EMAIL PROTECTED]