James Synge wrote:
Army mentioned earlier the analysis he wrote for Derby-781 (Subquery
materialization Hash Join). I've read most of that,

Thanks for reading the DERBY-781 writeup. It's always nice to know other people are double-checking the analysis and asking questions--especially with the optimizer since, as you probably know by now, there's an endless supply of analysis to do and questions to ask...

and am confused by his analysis of this select:

Q2. select 1 from tt, (t left outer join (s left outer join r on (f =
i)) on (b = e)) where (j=g)

Judging from the analysis, this query could be less ambiguously expressed as:

select 1
from
   tt as T0,
   (t as T1 left outer join
    (s as T2 left outer join r as T3 on (T2.f = T3.i))
    on (T1.b = T2.e))
where (T0.j=T3.g)

In particular, I don't understand the claim that the subqueries can
not be materialized because of the fact that the (j=g) and (b=e) predicates create a correlation between the levels.

I'm glad you mentioned the predicate (b=e). In the discussion I focused on (j=g) but as it turns out, it's actually (b=e) that makes the difference in this particular query. The logic for both queries is the same (w.r.t to the different nesting levels) but for this specific scenario, (b=e) is the predicate of importance.

I can't see any theoretical reason why the following table expression
couldn't be materialized:

   s as T2 left outer join r as T3 on (T2.f = T3.i)

This expression in and of itself can in fact be materialized. In this *isolated* query the predicate is between two tables that are at the same nesting level--namely, T2 and T3. So if all we look at is (T2.f=T3.i), then yes, as you say, there is no correlation between this expression and the rest of the FROM clause, and thus this piece can be safely materialized.

But in the bigger query, (T2.f=T3.i) is not the only predicate. We also have (T0.j=T3.g) and (T1.b=T2.e). And as I mentioned above, it's the predicate (T1.b=T2.e) which, for this example, renders materialization unsafe.

Furthermore, there is no correlation between:

   (t as T1 left outer join (s as T2 left outer join r as T3 on (T2.f
= T3.i)) on (T1.b = T2.e))

And the rest of the from clause (just TT in this case).

The correlation comes via the predicate (T0.j=T3.g), which is specified as part of the WHERE clause in the top-level query. That predicate joins TT with the HalfOuterJoinNode, hence the correlation. (Is that what you were asking?)

My understanding of the semantics of select are that (logically) the
FROM clause is executed first to produce a working table (in this case a CROSS JOIN of TT with the result of the left outer joins, which are evaluated independently of TT).

Logically, I think you are right here. Put differently, I think what you're expecting is that the HalfOuterJoin node should be evaluated to some result set, and then the hash join can occur between TT and that result set using (T0.j=T3.g) as the equijoin predicate.

Or, in the case of (T1.b=T2.e), JoinNode[4] should be evaluated to some result set and then the hash join can occur between T1 and that JoinNode.

I agree that logically it seems like this should work. But alas, it seems like something under the covers does not follow this logical separation of hash join vs. result-set-creation. Take, for example, the following tables and data, and then run the less ambiguous version of your query:

CREATE TABLE T (A INT NOT NULL, B DECIMAL(10,3) NOT NULL, C VARCHAR(5) NOT 
NULL);
INSERT INTO T VALUES (1, 1.0, '1'), (2, 2.0, '2'), (3, 3.0, '3');

CREATE TABLE S (D INT NOT NULL, E DECIMAL(10,3) NOT NULL, F VARCHAR(5) NOT 
NULL);
INSERT INTO S VALUES (2, 2.0, '2'), (3, 3.0, '3'), (4, 4.0, '4');

CREATE TABLE R (G INT NOT NULL, H DECIMAL(10,3) NOT NULL, I VARCHAR(5) NOT 
NULL);
INSERT INTO R VALUES (3, 3.0, '3'), (4, 4.0, '4'), (5, 5.0, '5');

CREATE TABLE TT (J INT NOT NULL, K DECIMAL(10,3) NOT NULL, L VARCHAR(5) NOT 
NULL);
INSERT INTO TT VALUES (2, 2.0, '2'), (3, 3.0, '3'), (4, 4.0, '4');

Currently Derby will (correctly) return one row:

ij> select t0.*
 from
    tt as T0,
    (t as T1 left outer join
     (s as T2 left outer join r as T3 on (T2.f = T3.i))
     on (T1.b = T2.e))
 where (T0.j=T3.g);
J          |K           |L
------------------------------
3          |3.000       |3

But without the changes for DERBY-781 that led to this discussion, the query returns zero rows. The difference between the two plans is that, in the latter, the optimizer does a hash join between T1 and JoinNode[4], but in the former it does not.

So given that I came to the conclusion that hash joins are unsafe if the predicate references FromTables at different nesting levels. And I think that conclusion is still correct based on the Derby behavior.

I cannot, however, say for certain *why* such a hash join is unsafe. I think you're right in asking this question and I'm glad you did...but I don't have a good answer. Perhaps something is in fact amiss somewhere?

My thinking (fwiw): the in-memory hash table for non-base-tables is represented by a HashTableNode that is created in the "modifyAccessPaths()" method of ProjectRestrictNode. At generation time the join predicate is generated as a qualifier for the HashTableNode's child--i.e. for JoinNode[4] in our example. (See the generateMinion() method in HashTableNode). My understanding is that, at execution time, the JoinNode[4] result set is materialized and probed with the first row in T1. Then for subsequent rows in T1, the materialized result set is simply probed with the appropriate column value from the T1 row.

If that's correct, I wonder if HashTableNode's qualifier is somehow being evaluated incorrectly against the materialization of JoinNode[4], leading to missing rows? Unfortunately, I don't know what the answer to that question is...

So long story short: theoretically it seems like hash joins between FromTables at different nesting levels should be possible. But in actuality such joins can lead to incorrect results. I addressed this by disallowing such hash joins as described in the DERBY-781 document. But you're right, maybe this is something worth investigating more...?

Army

Reply via email to