Re: JOIN/WHERE and index confusion

SGreen Mon, 02 Aug 2004 12:04:01 -0700

You are absolutely right, I was giving the developers too little credit. I 
humbly apologize to all who contributed to the query optimizer. It was 
never my intent to impugn the fine work that has gone into MySQL, my 
optimization strategy is merely a reflection of my own pessimism. Please 
forgive my horribly negative attitude but I have been bit in the a** far 
too many times by those fine developers in Redmond that create a 
commercial SQL-based database server (and the associated desktop and 
personal database products) to allow me to assume much when it comes to 
performance.  MySQL is one of those few excellent products that can 
actually do what it says it can do.  Old habits are dying hard.

I agree with your analysis. If WHERE conditions and ON conditions were not 
pre-selected and processed during the construction of the virtual table, 
the performance of MySQL would not be a great as it 
is.(http://dev.mysql.com/doc/mysql/en/Where_optimisations.html)  However, 
my pessimist side assumes that any optimization will be overlooked and 
that I need to help every way I can.  I tend to believe that my WHEREs 
will not be analyzed until after the virtual table is built (worst case, 
again). That way I can consciously attempt to minimize the size of the 
virtual table (even though I *know* that some of my WHERE conditions will 
be applied as appropriate as it  is being constructed)

Your analysis of whether to put a condition of a LEFT JOIN  into the WHERE 
clause vs. the ON clause is bang on. As you say, that is the nature of a 
LEFT JOIN.

I also recognize that JOINing is one of the few times that MySQL will 
apply more than one index to a query. Each joined table has the 
opportunity to have one if its indexes used during the building of the 
virtual table. Then, if I understand the process correctly, one of the 
remaining indexes from one of the source tables participating in the 
virtual table could be chosen to assist with any remaining WHERE and GROUP 
BY conditions. That's why the EXPLAIN shows that more than one index is in 
use when the docs clearly state that at most 1 index (per table) can be 
used in the processing of any statement. My confusing question about the 
EXPLAIN output was meant to be more rhetorical than to represent actual 
puzzlement.

I had two reasons why I suggested that Eamon try the STRAIGHT JOIN. First, 
the biggest impact on his virtual table size would come from applying his 
WHERE clause directly to table a. Second, according to 
"Where_optimizations" (referenced above), "If all columns in ORDER BY and 
GROUP BY clauses come from the same table, that table is preferred first 
when joining. ". His GROUP BY was on two columns from his c table which 
meant that the optimizer would have started at c and could have continued 
to build the virtual table by JOINing tables d and e to it until it 
finally added a (which should have stripped out most of c, d, and e from 
the virtual table). This was my imagined worst case scenario, that a would 
be JOINed to the virtual table second to last. (sorry designers! :-) ) 

One day really soon I am going to have to sit down and crawl through the 
optimizer code and convince myself that I don't have to be such a 
back-seat driver. I just haven't had the time.

Warmest Wishes!
Shawn Green
Database Administrator
Unimin Corporation - Spruce Pine

Michael Stassen <[EMAIL PROTECTED]> wrote on 08/02/2004 01:19:02 
PM:

> Shawn,
> 
> I see what you are saying, but I don't believe it works that way.  If 
that 
> were the case, putting the condition in the ON clause rather than the 
WHERE 
> clause would be at the top of the list of recommended optimizations in 
the 
> manual, but it's not.  In fact, the manual recommends against it, "You 
> should generally not have any conditions in the ON part that are used to 

> restrict which rows you want in the result set, but rather specify these 

> conditions in the WHERE clause. There are exceptions to this rule." 
> <http://dev.mysql.com/doc/mysql/en/JOIN.html>.
> 
> I think you are giving the optimizer (and its authors) way too little 
> credit.  If what you say were so,
> 
>    SELECT * FROM a, b WHERE a.id=b.id AND a.val = 2;
> 
> or
> 
>    SELECT * FROM a JOIN b ON a.id=b.id WHERE a.val = 2;
> 
> would be dramatically slower (and take a lot more memory) than
> 
>    SELECT * FROM a JOIN b ON a.id=b.id AND a.val = 2;
> 
> I think that's your point, but, in fact, the EXPLAINs for those 3 
queries 
> are identical.
> 
> EXPLAIN SELECT * FROM a, b WHERE a.id=b.id AND a.val=2;
> 
+-------+--------+-----------------+---------+-------+------+-------------+
> | table | type   | possible_keys   | key     | ref   | rows | Extra  |
> 
+-------+--------+-----------------+---------+-------+------+-------------+
> | a     | ref    | PRIMARY,val_idx | val_idx | const | 1883 | Using 
where |
> | b     | eq_ref | PRIMARY         | PRIMARY | a.id  |    1 |  |
> 
+-------+--------+-----------------+---------+-------+------+-------------+
> 
> I'm not sure what you mean by "Why does the EXPLAIN of a JOIN show which 

> indexes are being used for each new table if they aren't going to be 
used?"
> In all 3 cases the optimizer compares making a depend on b to making b 
> depend on a, and sees that far fewer rows are to be selected in a based 
on 
> the a.val=2 condition than are to be selected in b (no condition).  It 
uses 
> the index on a.val to quickly select the subset of a's rows with val=2, 
then 
> uses the index on b.id to match up rows in b.
> 
> The STRAIGHT JOIN modifier potentially does 2 things.  With 'a STRAIGHT 
JOIN 
> b', you are declaring that b depends on a.  Thus, the optimizer does not 

> consider the 'a depends on b' case.  This can speed up queries in the 
very 
> rare case that the optimizer mistakenly chooses a to depend on b when 
you 
> know it should do the reverse.  Even when the optimizer will get it 
right, 
> this could speed up a complex join when you know which table should come 

> first by telling the optimizer to skip considering the other possible 
> dependencies and simply use the one you specified.  This is documented 
here 
> <http://dev.mysql.com/doc/mysql/en/LEFT_JOIN_optimisation.html>.
> 
> The LEFT JOIN is similar to the STRAIGHT JOIN, in that 'a LEFT JOIN b' 
> declares that b depends on a.  So, WHERE conditions can be used to 
restrict 
> rows in table a pre join, but WHERE conditions can't be used to 
determine 
> which rows to fetch from b to join to a, they have to be applied 
afterward. 
>   That's how we can look for NULLs in b with the WHERE clause, as you 
said. 
> That's also why moving a condition from the WHERE clause to the ON 
clause 
> could help in *certain situations*.  If you want the rows of table a 
with 
> a.val=2 whether they exist in b or not (a.id=b.id), but when they do 
match 
> you only want the rows where b.type='c', moving the b.type='c' condition 
to 
> the ON clause would reduce the "virtual table as it is being 
constructed" 
> and keep the NULL rows, while leaving it in the WHERE would have to 
filter 
> it afterward and would drop the NULL rows (making the LEFT JOIN 
pointless).
> 
> Michael
> 
> [EMAIL PROTECTED] wrote:
> > Why do I include some WHERE restrictions in my ON clauses and 
encourage 
> > others to do it, too? Because I believe that the WHERE clause is 
applied 
> > AFTER the temporary virtual table (the result of applying all of the 
JOIN 
> > and ON clauses in the full statement) is complete.  That is also why I 
am 
> > such a strong proponent of using the explicit JOIN forms of INNER 
JOIN, 
> > RIGHT JOIN, and LEFT JOIN. 
> > 
> > I believe that the ON clause in *each* JOIN works to reduce the size 
of 
> > the virtual table *as it is being constructed*.  For me, the smaller 
the 
> > virtual table, the better the query's performance.  If it doesn't work 

> > this way then how are we able to do a LEFT JOIN on two tables and look 
for 
> > null values from the right-hand table as a means of detecting 
non-matches? 
> > Why does the EXPLAIN of a JOIN show which indexes are being used for 
each 
> > new table if they aren't going to be used? Why is there a STRAIGHT 
JOIN 
> > modifier? I believe that if the order of the JOINs and the presence of 
the 
> > ON clauses were not important to performance then there wouldn't be so 

> > many tools to help us optimize their performance.
> > 
> > Humbly,
> > Shawn Green
> > Database Administrator
> > Unimin Corporation - Spruce Pine
>

Re: JOIN/WHERE and index confusion

Reply via email to