Re: JOIN/WHERE and index confusion

Michael Stassen Mon, 02 Aug 2004 10:18:12 -0700

Shawn,

I see what you are saying, but I don't believe it works that way. If that were the case, putting the condition in the ON clause rather than the WHERE clause would be at the top of the list of recommended optimizations in the manual, but it's not. In fact, the manual recommends against it, "You should generally not have any conditions in the ON part that are used to restrict which rows you want in the result set, but rather specify these conditions in the WHERE clause. There are exceptions to this rule." <http://dev.mysql.com/doc/mysql/en/JOIN.html>.

I think you are giving the optimizer (and its authors) way too little credit. If what you say were so,

  SELECT * FROM a, b WHERE a.id=b.id AND a.val = 2;

or

  SELECT * FROM a JOIN b ON a.id=b.id WHERE a.val = 2;

would be dramatically slower (and take a lot more memory) than

  SELECT * FROM a JOIN b ON a.id=b.id AND a.val = 2;

I think that's your point, but, in fact, the EXPLAINs for those 3 queries are identical.

EXPLAIN SELECT * FROM a, b WHERE a.id=b.id AND a.val=2;
+-------+--------+-----------------+---------+-------+------+-------------+
| table | type   | possible_keys   | key     | ref   | rows | Extra       |
+-------+--------+-----------------+---------+-------+------+-------------+
| a     | ref    | PRIMARY,val_idx | val_idx | const | 1883 | Using where |
| b     | eq_ref | PRIMARY         | PRIMARY | a.id  |    1 |             |
+-------+--------+-----------------+---------+-------+------+-------------+

I'm not sure what you mean by "Why does the EXPLAIN of a JOIN show which indexes are being used for each new table if they aren't going to be used?" In all 3 cases the optimizer compares making a depend on b to making b depend on a, and sees that far fewer rows are to be selected in a based on the a.val=2 condition than are to be selected in b (no condition). It uses the index on a.val to quickly select the subset of a's rows with val=2, then uses the index on b.id to match up rows in b.

The STRAIGHT JOIN modifier potentially does 2 things. With 'a STRAIGHT JOIN b', you are declaring that b depends on a. Thus, the optimizer does not consider the 'a depends on b' case. This can speed up queries in the very rare case that the optimizer mistakenly chooses a to depend on b when you know it should do the reverse. Even when the optimizer will get it right, this could speed up a complex join when you know which table should come first by telling the optimizer to skip considering the other possible dependencies and simply use the one you specified. This is documented here <http://dev.mysql.com/doc/mysql/en/LEFT_JOIN_optimisation.html>.

The LEFT JOIN is similar to the STRAIGHT JOIN, in that 'a LEFT JOIN b' declares that b depends on a. So, WHERE conditions can be used to restrict rows in table a pre join, but WHERE conditions can't be used to determine which rows to fetch from b to join to a, they have to be applied afterward. That's how we can look for NULLs in b with the WHERE clause, as you said. That's also why moving a condition from the WHERE clause to the ON clause could help in *certain situations*. If you want the rows of table a with a.val=2 whether they exist in b or not (a.id=b.id), but when they do match you only want the rows where b.type='c', moving the b.type='c' condition to the ON clause would reduce the "virtual table as it is being constructed" and keep the NULL rows, while leaving it in the WHERE would have to filter it afterward and would drop the NULL rows (making the LEFT JOIN pointless).

Michael

[EMAIL PROTECTED] wrote:

Why do I include some WHERE restrictions in my ON clauses and encourage others to do it, too? Because I believe that the WHERE clause is applied AFTER the temporary virtual table (the result of applying all of the JOIN and ON clauses in the full statement) is complete. That is also why I am such a strong proponent of using the explicit JOIN forms of INNER JOIN, RIGHT JOIN, and LEFT JOIN.

I believe that the ON clause in *each* JOIN works to reduce the size of the virtual table *as it is being constructed*. For me, the smaller the virtual table, the better the query's performance. If it doesn't work this way then how are we able to do a LEFT JOIN on two tables and look for null values from the right-hand table as a means of detecting non-matches? Why does the EXPLAIN of a JOIN show which indexes are being used for each new table if they aren't going to be used? Why is there a STRAIGHT JOIN modifier? I believe that if the order of the JOINs and the presence of the ON clauses were not important to performance then there wouldn't be so many tools to help us optimize their performance.
Humbly,
Shawn Green
Database Administrator
Unimin Corporation - Spruce Pine

--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:    http://lists.mysql.com/[EMAIL PROTECTED]

Re: JOIN/WHERE and index confusion

Reply via email to