Re: HashJoinStrategy.feasible

Army Wed, 11 Oct 2006 15:33:24 -0700

James Synge wrote:

Army mentioned earlier the analysis he wrote for Derby-781 (Subquery
materialization Hash Join). I've read most of that,

Thanks for reading the DERBY-781 writeup. It's always nice to know other peopleare double-checking the analysis and asking questions--especially with theoptimizer since, as you probably know by now, there's an endless supply ofanalysis to do and questions to ask...

and am confused by his analysis of this select:

Q2. select 1 from tt, (t left outer join (s left outer join r on (f =
i)) on (b = e)) where (j=g)
Judging from the analysis, this query could be less ambiguouslyexpressed as:
select 1
from
   tt as T0,
   (t as T1 left outer join
    (s as T2 left outer join r as T3 on (T2.f = T3.i))
    on (T1.b = T2.e))
where (T0.j=T3.g)

In particular, I don't understand the claim that the subqueries can
not be materialized because of the fact that the (j=g) and (b=e)predicates create a correlation between the levels.

I'm glad you mentioned the predicate (b=e). In the discussion I focused on(j=g) but as it turns out, it's actually (b=e) that makes the difference in thisparticular query. The logic for both queries is the same (w.r.t to thedifferent nesting levels) but for this specific scenario, (b=e) is the predicateof importance.

I can't see any theoretical reason why the following table expression
couldn't be materialized:

   s as T2 left outer join r as T3 on (T2.f = T3.i)

This expression in and of itself can in fact be materialized. In this*isolated* query the predicate is between two tables that are at the samenesting level--namely, T2 and T3. So if all we look at is (T2.f=T3.i), then yes,as you say, there is no correlation between this expression and the rest of theFROM clause, and thus this piece can be safely materialized.

But in the bigger query, (T2.f=T3.i) is not the only predicate. We also have(T0.j=T3.g) and (T1.b=T2.e). And as I mentioned above, it's the predicate(T1.b=T2.e) which, for this example, renders materialization unsafe.

Furthermore, there is no correlation between:

   (t as T1 left outer join (s as T2 left outer join r as T3 on (T2.f
= T3.i)) on (T1.b = T2.e))

And the rest of the from clause (just TT in this case).

The correlation comes via the predicate (T0.j=T3.g), which is specified as partof the WHERE clause in the top-level query. That predicate joins TT with theHalfOuterJoinNode, hence the correlation. (Is that what you were asking?)

My understanding of the semantics of select are that (logically) the
FROM clause is executed first to produce a working table (in thiscase a CROSS JOIN of TT with the result of the left outer joins,which are evaluated independently of TT).

Logically, I think you are right here. Put differently, I think what you'reexpecting is that the HalfOuterJoin node should be evaluated to some result set,and then the hash join can occur between TT and that result set using(T0.j=T3.g) as the equijoin predicate.

Or, in the case of (T1.b=T2.e), JoinNode[4] should be evaluated to some resultset and then the hash join can occur between T1 and that JoinNode.

I agree that logically it seems like this should work. But alas, it seems likesomething under the covers does not follow this logical separation of hash joinvs. result-set-creation. Take, for example, the following tables and data, andthen run the less ambiguous version of your query:


CREATE TABLE T (A INT NOT NULL, B DECIMAL(10,3) NOT NULL, C VARCHAR(5) NOT 
NULL);
INSERT INTO T VALUES (1, 1.0, '1'), (2, 2.0, '2'), (3, 3.0, '3');

CREATE TABLE S (D INT NOT NULL, E DECIMAL(10,3) NOT NULL, F VARCHAR(5) NOT 
NULL);
INSERT INTO S VALUES (2, 2.0, '2'), (3, 3.0, '3'), (4, 4.0, '4');

CREATE TABLE R (G INT NOT NULL, H DECIMAL(10,3) NOT NULL, I VARCHAR(5) NOT 
NULL);
INSERT INTO R VALUES (3, 3.0, '3'), (4, 4.0, '4'), (5, 5.0, '5');

CREATE TABLE TT (J INT NOT NULL, K DECIMAL(10,3) NOT NULL, L VARCHAR(5) NOT 
NULL);
INSERT INTO TT VALUES (2, 2.0, '2'), (3, 3.0, '3'), (4, 4.0, '4');

Currently Derby will (correctly) return one row:

ij> select t0.*
 from
    tt as T0,
    (t as T1 left outer join
     (s as T2 left outer join r as T3 on (T2.f = T3.i))
     on (T1.b = T2.e))
 where (T0.j=T3.g);
J          |K           |L
------------------------------
3          |3.000       |3

But without the changes for DERBY-781 that led to this discussion, the queryreturns zero rows. The difference between the two plans is that, in the latter,the optimizer does a hash join between T1 and JoinNode[4], but in the former itdoes not.

So given that I came to the conclusion that hash joins are unsafe if thepredicate references FromTables at different nesting levels. And I think thatconclusion is still correct based on the Derby behavior.

I cannot, however, say for certain *why* such a hash join is unsafe. I thinkyou're right in asking this question and I'm glad you did...but I don't have agood answer. Perhaps something is in fact amiss somewhere?

My thinking (fwiw): the in-memory hash table for non-base-tables is representedby a HashTableNode that is created in the "modifyAccessPaths()" method ofProjectRestrictNode. At generation time the join predicate is generated as aqualifier for the HashTableNode's child--i.e. for JoinNode[4] in our example.(See the generateMinion() method in HashTableNode). My understanding is that,at execution time, the JoinNode[4] result set is materialized and probed withthe first row in T1. Then for subsequent rows in T1, the materialized resultset is simply probed with the appropriate column value from the T1 row.

If that's correct, I wonder if HashTableNode's qualifier is somehow beingevaluated incorrectly against the materialization of JoinNode[4], leading tomissing rows? Unfortunately, I don't know what the answer to that question is...

So long story short: theoretically it seems like hash joins between FromTablesat different nesting levels should be possible. But in actuality such joins canlead to incorrect results. I addressed this by disallowing such hash joins asdescribed in the DERBY-781 document. But you're right, maybe this is somethingworth investigating more...?


Army

Re: HashJoinStrategy.feasible

Reply via email to