Re: [sqlite] Equiv stmts, different explain plans

Keith Medcalf Tue, 05 Mar 2019 11:09:37 -0800

On Tuesday, 5 March, 2019 04:09, Simon Slavin <slav...@bigfraud.org> wrote:

>On 5 Mar 2019, at 2:06am, kk <kmccord0...@gmail.com> wrote:

>>>> select * from t1
>>>>    where c=1 and d in (select d from t2 where c=1);
>>>> select * from t1
>>>>    where c=1 and d in (select d from t2 where t2.c=t1.c);

>> DRH, many thanks for your reply, I was expecting same output
>because I believe stmts to be equivalent, so was not sure why query
>plan was different. I see the explain plans are very similar.
>> But I believe original stmts mentioned are still equivalent?

>How do you expect a SQL engine to approach the above statements ?
>Should it process the inner SELECT first, or the outer SELECT first ?

>If it processes the inner SELECT first, where does it get the value
>it needs for t1.c ?

>If it processes the outer SELECT first, what strategy does it use for
>selecting on t1.d when it doesn't yet know whether there's going to
>be no, a single, or multiple values ?

>> Do you agree? And in SQLite what is best way to write such stmt (or
>in other terms, what is difference)?
>
>Using a JOIN.
>
>SELECT t1.* FROM t1
>    INNER JOIN t2 ON t2.c=1 AND t2.d = t1.d
>    WHERE t1.c=1;

Technically this is invalid.  t2.c == 1 is NOT a equijoin condition and should 
NOT appear in the ON clause (though it would be valid in the case of an outer 
join).  However, since [INNER] JOIN is merely syntactic sugar for a , and the 
ON is merely syntactic sugar for a WHERE clause, this does not really matter 
much.  Moreover, the query plans will be different because in this case you are 
joining T1 against T2 using only the common column d, and then filtering the 
interim results based on t1.c and t2.c.  (The actual plan will use the c==1 
condition to constrain the outer loop, then loop through the inner table based 
on the join column d, then filter the result of that with the "other" c==1 
condition -- which table is chosen as the outer table is up to the query 
planner (and it will choose whichever one it things the constraint c==1 will 
produce the least rows)).

This is entirely different from the below query where you are joining T1 and T2 
on the common columns c and d, then filtering for a specific value of c in t1 
(which will constrain the outer loop).

The total number of candidate solutions (the number of intersects in the nested 
loops) can be quite different.

>SELECT t1.* FROM t1
>    INNER JOIN t2 ON t2.c=t1.c AND t2.d = t1.d
>    WHERE t1.c=1;

>The INNER JOIN (as opposed to OUTER JOIN) means that a row must exist
>in t2 for the equivalent row in t1 to be returned.  INNER is the
>default kind of JOIN.  Of the two statements, it seems that the fist
>one requires less processing.

Actually, the latter (the properly phrased equijoin) will require the least 
processing since it gives the query planner the greatest latitude to generate 
an optimal solution.  In one case (the former) you have the condition "t1.c == 
1 AND t2.c == 1".  The query planner cannot possibly derive from this the fact 
that "t1.c == t2.c".  However, in the latter case, you have the condition "t1.c 
== t2.c AND t1.c == 1" from which the query planner can determine the fact that 
"t2.c == 1".  Depending on the shape of the data and the available statistics 
this may have a great effect on the performance of the query because the query 
planner has more information and may choose a more optimal solution (that is, 
it may now choose whether t1 or t2 is the outer loop, and the descent into the 
inner table uses both columns).

This is, of course, also fully equivalent to the properly phrased JOIN but it 
does constrain the solution (it is equivalent to specifying "t1 CROSS JOIN t2" 
in the above properly phrased equijoin in that it constrains the order of 
traversal of the tables):

SELECT *
  FROM t1
 WHERE c == 1
   AND EXISTS (SELECT * FROM t2 WHERE c == t1.c AND d == t1.d)
;

since no actual data from t2 needs to be returned.  Which "phrasing" of the 
myriad of perhaps different queries producing (quite possibly) the same (or 
perhaps different) results probably depends on how one "translates" the 
original problem statement from English into SQL (and the adequacy of that 
problem statement) together with the maintainability of the application and the 
actual schema including the indexes ...

>Internally, SQLite does comparison and conversion when faced with
>different ways of phrasing your query.  But that's not your problem.
>Phrase what you want in as specific terms as possible, and let SQLite
>pick its preferred way of solving the problem.

Of course, which one chooses depends entirely on the original problem 
statement.  If one "assumes" that the t1.c=1 and t2.c=1 in the original 
statement:

select * from t1
   where c=1 and d in (select d from t2 where c=1);

is merely a lazy programmer expressing that t1.c == t2.c without actually 
saying so, then the properly phrased equijoin is the most efficient because it 
gives the query planner the most latitude to generate an optimal solution.  
This can only be determined by reference to the original statement of the 
problem that the SQL statement was designed to satisfy (since they are perhaps 
all the same yet different at the same time, and there is really insufficient 
information from which to make a decision).

-- 
The fact that there's a Highway to Hell but only a Stairway to Heaven says a 
lot about anticipated traffic volume.

_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

Re: [sqlite] Equiv stmts, different explain plans

Reply via email to