James Synge wrote:
Army mentioned earlier the analysis he wrote for Derby-781 (Subquery
materialization Hash Join). I've read most of that,
Thanks for reading the DERBY-781 writeup. It's always nice to know other people
are double-checking the analysis and asking questions--especially with the
optimizer since, as you probably know by now, there's an endless supply of
analysis to do and questions to ask...
and am confused by his analysis of this select:
Q2. select 1 from tt, (t left outer join (s left outer join r on (f =
i)) on (b = e)) where (j=g)
Judging from the analysis, this query could be less ambiguously
expressed as:
select 1
from
tt as T0,
(t as T1 left outer join
(s as T2 left outer join r as T3 on (T2.f = T3.i))
on (T1.b = T2.e))
where (T0.j=T3.g)
In particular, I don't understand the claim that the subqueries can
not be materialized because of the fact that the (j=g) and (b=e)
predicates create a correlation between the levels.
I'm glad you mentioned the predicate (b=e). In the discussion I focused on
(j=g) but as it turns out, it's actually (b=e) that makes the difference in this
particular query. The logic for both queries is the same (w.r.t to the
different nesting levels) but for this specific scenario, (b=e) is the predicate
of importance.
I can't see any theoretical reason why the following table expression
couldn't be materialized:
s as T2 left outer join r as T3 on (T2.f = T3.i)
This expression in and of itself can in fact be materialized. In this
*isolated* query the predicate is between two tables that are at the same
nesting level--namely, T2 and T3. So if all we look at is (T2.f=T3.i), then yes,
as you say, there is no correlation between this expression and the rest of the
FROM clause, and thus this piece can be safely materialized.
But in the bigger query, (T2.f=T3.i) is not the only predicate. We also have
(T0.j=T3.g) and (T1.b=T2.e). And as I mentioned above, it's the predicate
(T1.b=T2.e) which, for this example, renders materialization unsafe.
Furthermore, there is no correlation between:
(t as T1 left outer join (s as T2 left outer join r as T3 on (T2.f
= T3.i)) on (T1.b = T2.e))
And the rest of the from clause (just TT in this case).
The correlation comes via the predicate (T0.j=T3.g), which is specified as part
of the WHERE clause in the top-level query. That predicate joins TT with the
HalfOuterJoinNode, hence the correlation. (Is that what you were asking?)
My understanding of the semantics of select are that (logically) the
FROM clause is executed first to produce a working table (in this
case a CROSS JOIN of TT with the result of the left outer joins,
which are evaluated independently of TT).
Logically, I think you are right here. Put differently, I think what you're
expecting is that the HalfOuterJoin node should be evaluated to some result set,
and then the hash join can occur between TT and that result set using
(T0.j=T3.g) as the equijoin predicate.
Or, in the case of (T1.b=T2.e), JoinNode[4] should be evaluated to some result
set and then the hash join can occur between T1 and that JoinNode.
I agree that logically it seems like this should work. But alas, it seems like
something under the covers does not follow this logical separation of hash join
vs. result-set-creation. Take, for example, the following tables and data, and
then run the less ambiguous version of your query:
CREATE TABLE T (A INT NOT NULL, B DECIMAL(10,3) NOT NULL, C VARCHAR(5) NOT
NULL);
INSERT INTO T VALUES (1, 1.0, '1'), (2, 2.0, '2'), (3, 3.0, '3');
CREATE TABLE S (D INT NOT NULL, E DECIMAL(10,3) NOT NULL, F VARCHAR(5) NOT
NULL);
INSERT INTO S VALUES (2, 2.0, '2'), (3, 3.0, '3'), (4, 4.0, '4');
CREATE TABLE R (G INT NOT NULL, H DECIMAL(10,3) NOT NULL, I VARCHAR(5) NOT
NULL);
INSERT INTO R VALUES (3, 3.0, '3'), (4, 4.0, '4'), (5, 5.0, '5');
CREATE TABLE TT (J INT NOT NULL, K DECIMAL(10,3) NOT NULL, L VARCHAR(5) NOT
NULL);
INSERT INTO TT VALUES (2, 2.0, '2'), (3, 3.0, '3'), (4, 4.0, '4');
Currently Derby will (correctly) return one row:
ij> select t0.*
from
tt as T0,
(t as T1 left outer join
(s as T2 left outer join r as T3 on (T2.f = T3.i))
on (T1.b = T2.e))
where (T0.j=T3.g);
J |K |L
------------------------------
3 |3.000 |3
But without the changes for DERBY-781 that led to this discussion, the query
returns zero rows. The difference between the two plans is that, in the latter,
the optimizer does a hash join between T1 and JoinNode[4], but in the former it
does not.
So given that I came to the conclusion that hash joins are unsafe if the
predicate references FromTables at different nesting levels. And I think that
conclusion is still correct based on the Derby behavior.
I cannot, however, say for certain *why* such a hash join is unsafe. I think
you're right in asking this question and I'm glad you did...but I don't have a
good answer. Perhaps something is in fact amiss somewhere?
My thinking (fwiw): the in-memory hash table for non-base-tables is represented
by a HashTableNode that is created in the "modifyAccessPaths()" method of
ProjectRestrictNode. At generation time the join predicate is generated as a
qualifier for the HashTableNode's child--i.e. for JoinNode[4] in our example.
(See the generateMinion() method in HashTableNode). My understanding is that,
at execution time, the JoinNode[4] result set is materialized and probed with
the first row in T1. Then for subsequent rows in T1, the materialized result
set is simply probed with the appropriate column value from the T1 row.
If that's correct, I wonder if HashTableNode's qualifier is somehow being
evaluated incorrectly against the materialization of JoinNode[4], leading to
missing rows? Unfortunately, I don't know what the answer to that question is...
So long story short: theoretically it seems like hash joins between FromTables
at different nesting levels should be possible. But in actuality such joins can
lead to incorrect results. I addressed this by disallowing such hash joins as
described in the DERBY-781 document. But you're right, maybe this is something
worth investigating more...?
Army