You are absolutely right, I was giving the developers too little credit. I humbly apologize to all who contributed to the query optimizer. It was never my intent to impugn the fine work that has gone into MySQL, my optimization strategy is merely a reflection of my own pessimism. Please forgive my horribly negative attitude but I have been bit in the a** far too many times by those fine developers in Redmond that create a commercial SQL-based database server (and the associated desktop and personal database products) to allow me to assume much when it comes to performance. MySQL is one of those few excellent products that can actually do what it says it can do. Old habits are dying hard.
I agree with your analysis. If WHERE conditions and ON conditions were not pre-selected and processed during the construction of the virtual table, the performance of MySQL would not be a great as it is.(http://dev.mysql.com/doc/mysql/en/Where_optimisations.html) However, my pessimist side assumes that any optimization will be overlooked and that I need to help every way I can. I tend to believe that my WHEREs will not be analyzed until after the virtual table is built (worst case, again). That way I can consciously attempt to minimize the size of the virtual table (even though I *know* that some of my WHERE conditions will be applied as appropriate as it is being constructed) Your analysis of whether to put a condition of a LEFT JOIN into the WHERE clause vs. the ON clause is bang on. As you say, that is the nature of a LEFT JOIN. I also recognize that JOINing is one of the few times that MySQL will apply more than one index to a query. Each joined table has the opportunity to have one if its indexes used during the building of the virtual table. Then, if I understand the process correctly, one of the remaining indexes from one of the source tables participating in the virtual table could be chosen to assist with any remaining WHERE and GROUP BY conditions. That's why the EXPLAIN shows that more than one index is in use when the docs clearly state that at most 1 index (per table) can be used in the processing of any statement. My confusing question about the EXPLAIN output was meant to be more rhetorical than to represent actual puzzlement. I had two reasons why I suggested that Eamon try the STRAIGHT JOIN. First, the biggest impact on his virtual table size would come from applying his WHERE clause directly to table a. Second, according to "Where_optimizations" (referenced above), "If all columns in ORDER BY and GROUP BY clauses come from the same table, that table is preferred first when joining. ". His GROUP BY was on two columns from his c table which meant that the optimizer would have started at c and could have continued to build the virtual table by JOINing tables d and e to it until it finally added a (which should have stripped out most of c, d, and e from the virtual table). This was my imagined worst case scenario, that a would be JOINed to the virtual table second to last. (sorry designers! :-) ) One day really soon I am going to have to sit down and crawl through the optimizer code and convince myself that I don't have to be such a back-seat driver. I just haven't had the time. Warmest Wishes! Shawn Green Database Administrator Unimin Corporation - Spruce Pine Michael Stassen <[EMAIL PROTECTED]> wrote on 08/02/2004 01:19:02 PM: > Shawn, > > I see what you are saying, but I don't believe it works that way. If that > were the case, putting the condition in the ON clause rather than the WHERE > clause would be at the top of the list of recommended optimizations in the > manual, but it's not. In fact, the manual recommends against it, "You > should generally not have any conditions in the ON part that are used to > restrict which rows you want in the result set, but rather specify these > conditions in the WHERE clause. There are exceptions to this rule." > <http://dev.mysql.com/doc/mysql/en/JOIN.html>. > > I think you are giving the optimizer (and its authors) way too little > credit. If what you say were so, > > SELECT * FROM a, b WHERE a.id=b.id AND a.val = 2; > > or > > SELECT * FROM a JOIN b ON a.id=b.id WHERE a.val = 2; > > would be dramatically slower (and take a lot more memory) than > > SELECT * FROM a JOIN b ON a.id=b.id AND a.val = 2; > > I think that's your point, but, in fact, the EXPLAINs for those 3 queries > are identical. > > EXPLAIN SELECT * FROM a, b WHERE a.id=b.id AND a.val=2; > +-------+--------+-----------------+---------+-------+------+-------------+ > | table | type | possible_keys | key | ref | rows | Extra | > +-------+--------+-----------------+---------+-------+------+-------------+ > | a | ref | PRIMARY,val_idx | val_idx | const | 1883 | Using where | > | b | eq_ref | PRIMARY | PRIMARY | a.id | 1 | | > +-------+--------+-----------------+---------+-------+------+-------------+ > > I'm not sure what you mean by "Why does the EXPLAIN of a JOIN show which > indexes are being used for each new table if they aren't going to be used?" > In all 3 cases the optimizer compares making a depend on b to making b > depend on a, and sees that far fewer rows are to be selected in a based on > the a.val=2 condition than are to be selected in b (no condition). It uses > the index on a.val to quickly select the subset of a's rows with val=2, then > uses the index on b.id to match up rows in b. > > The STRAIGHT JOIN modifier potentially does 2 things. With 'a STRAIGHT JOIN > b', you are declaring that b depends on a. Thus, the optimizer does not > consider the 'a depends on b' case. This can speed up queries in the very > rare case that the optimizer mistakenly chooses a to depend on b when you > know it should do the reverse. Even when the optimizer will get it right, > this could speed up a complex join when you know which table should come > first by telling the optimizer to skip considering the other possible > dependencies and simply use the one you specified. This is documented here > <http://dev.mysql.com/doc/mysql/en/LEFT_JOIN_optimisation.html>. > > The LEFT JOIN is similar to the STRAIGHT JOIN, in that 'a LEFT JOIN b' > declares that b depends on a. So, WHERE conditions can be used to restrict > rows in table a pre join, but WHERE conditions can't be used to determine > which rows to fetch from b to join to a, they have to be applied afterward. > That's how we can look for NULLs in b with the WHERE clause, as you said. > That's also why moving a condition from the WHERE clause to the ON clause > could help in *certain situations*. If you want the rows of table a with > a.val=2 whether they exist in b or not (a.id=b.id), but when they do match > you only want the rows where b.type='c', moving the b.type='c' condition to > the ON clause would reduce the "virtual table as it is being constructed" > and keep the NULL rows, while leaving it in the WHERE would have to filter > it afterward and would drop the NULL rows (making the LEFT JOIN pointless). > > Michael > > [EMAIL PROTECTED] wrote: > > Why do I include some WHERE restrictions in my ON clauses and encourage > > others to do it, too? Because I believe that the WHERE clause is applied > > AFTER the temporary virtual table (the result of applying all of the JOIN > > and ON clauses in the full statement) is complete. That is also why I am > > such a strong proponent of using the explicit JOIN forms of INNER JOIN, > > RIGHT JOIN, and LEFT JOIN. > > > > I believe that the ON clause in *each* JOIN works to reduce the size of > > the virtual table *as it is being constructed*. For me, the smaller the > > virtual table, the better the query's performance. If it doesn't work > > this way then how are we able to do a LEFT JOIN on two tables and look for > > null values from the right-hand table as a means of detecting non-matches? > > Why does the EXPLAIN of a JOIN show which indexes are being used for each > > new table if they aren't going to be used? Why is there a STRAIGHT JOIN > > modifier? I believe that if the order of the JOINs and the presence of the > > ON clauses were not important to performance then there wouldn't be so > > many tools to help us optimize their performance. > > > > Humbly, > > Shawn Green > > Database Administrator > > Unimin Corporation - Spruce Pine >