Hi all,

During the vote on 1.23.0 RC1, there were some discussions about the solution to
CALCITE-3997.

Some may think it is a matter of best practice, but in many cases it is a 
matter of right or wrong.

Here is the rationale.

The original query in CALCITE-3997 is 

SELECT t1.k1, t2.k1 FROM t1 JOIN t2
ON t1.k1 = t2.k1
WHERE t1.n1 + 1 = t2.n3

After FilterIntoJoinRule and MergeJoinRule fire, the planner generates an alternative:
MergeJoin on t1.k1 = t2.k1 and t1.n1 + 1 = t2.n3

The MergeJoin's collation is [k1].

This MergeJoin matches JoinPushExpressionsRule, which pushes the expression in the
join condition down and generates a Project on the child:
Project t1.k1, t2.k1
  -> MergeJoin on t1.k1=t2.k1 and t1.col = t2.n3
       -> Project k1, n1+1 as col
            -> TableScan t1
       -> TableScan t2

The new MergeJoin now has 2 equi-conditions, but its collation is the same as the
old MergeJoin's, still [k1], which is wrong. It should be updated to [k1, col] on
the left side and [k1, n3] on the right. Hence the error.
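To make that concrete, here is a minimal sketch of the collations the rewritten
MergeJoin should carry, written with Calcite's RelCollations API. The class name
and the field ordinals are assumed from the plan above (left input =
Project(k1, n1+1 AS col)) purely for illustration:

import org.apache.calcite.rel.RelCollation;
import org.apache.calcite.rel.RelCollations;
import org.apache.calcite.rel.RelFieldCollation;

class CollationSketch {
  // Left input must be sorted by [k1, col]: ordinals 0 and 1 of the Project.
  static final RelCollation LEFT = RelCollations.of(
      new RelFieldCollation(0), new RelFieldCollation(1));
  // Right input must be sorted by [k1, n3]; the ordinals of k1 and n3 in t2
  // are assumed to be 0 and 1 here, again only for illustration.
  static final RelCollation RIGHT = RelCollations.of(
      new RelFieldCollation(0), new RelFieldCollation(1));
}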

A similar issue has happened before; see CALCITE-3353. And it is very easy to
trigger the same error with a different query, even without FilterIntoJoinRule.
If we change WHERE to AND in the query,
SELECT t1.k1, t2.k1 FROM t1 JOIN t2
ON t1.k1 = t2.k1
AND t1.n1 + 1 = t2.n3

or use the following query with join-condition constant reduction enabled, we can
still see the issue.
SELECT t1.k1, t2.k1 FROM t1 JOIN t2
ON t1.k1 = t2.k1
AND (t1.n1 + 1 = t2.n3 OR t1.n1 + 1 = t2.n3)

Can we use an approach similar to the one in CALCITE-3353 to fix the issue?
We can, but that just hides the issue. It would generate a MergeJoin with the
correct trait set, but the rule is still producing an invalid plan, because the
MergeJoin's new input is a LogicalProject generated by RelBuilder.

In Calcite, RelNode is a very special class. In the Cascades and Volcano papers,
logical/physical operators and plan expressions are separate, so when a
logical/physical expression tree is constructed, an operator doesn't require any
trait from its children. A RelNode in Calcite, however, not only represents an
operator but also a plan expression, and it can even represent a MEMO group with a
required trait set, namely RelSubset. When constructing a RelNode XXX, we need to
specify the input RelNodes, and the input nodes' trait sets become XXX's
requirements on those inputs, even if we are not aware that we are requesting
them, which is extremely error-prone.
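To illustrate the trap, here is a minimal sketch, not the actual rule code; the
method name and parameters are made up for illustration:

import java.util.List;
import org.apache.calcite.rel.RelNode;
import org.apache.calcite.rel.core.Join;
import org.apache.calcite.rex.RexNode;
import org.apache.calcite.tools.RelBuilder;

class TraitRequirementSketch {
  static Join pushLeftExpressions(Join join, List<RexNode> leftExprs,
      RexNode newCondition, RelBuilder relBuilder) {
    // RelBuilder produces a LogicalProject here: convention NONE,
    // collation EMPTY.
    RelNode newLeft = relBuilder.push(join.getLeft())
        .project(leftExprs)
        .build();
    // If 'join' is a physical node such as EnumerableMergeJoin, the copy
    // below silently makes "NONE convention, no collation" the join's
    // requirement on its new left input.
    return join.copy(join.getTraitSet(), newCondition,
        newLeft, join.getRight(), join.getJoinType(), join.isSemiJoinDone());
  }
}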

In the above case, the new MergeJoin's left input is a LogicalProject, which means
the physical MergeJoin is requesting its left input to be a logical operator with
the NONE convention, which has infinite cost. That means the newly generated
MergeJoin is never implementable.

What if we use a physical RelBuilder to generate the Project in the rule? The
consequence is worse.
With a logical RelBuilder, the rule merely generates an invalid plan; we are still
safe because it will never be selected as the best plan. With a physical
RelBuilder, the rule generates a physical plan that is implementable, but it is a
wrong plan: in RelOptUtil.pushDownJoinConditions, when creating the MergeJoin, we
only specify its trait set and inputs, and we never request the correct collation
on the MergeJoin's inputs. The Project's collation is EMPTY, which means the
MergeJoin requires no collation at all on the Project. Such a plan produces wrong
results.
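For reference, here is a hedged sketch of what a correct physical variant of the
rule would additionally have to do; the class, method and parameter names
(PhysicalPushSketch, requestSortedInput, requiredCollation) are illustrative, not
existing Calcite API:

import org.apache.calcite.adapter.enumerable.EnumerableConvention;
import org.apache.calcite.plan.RelOptRule;
import org.apache.calcite.plan.RelTraitSet;
import org.apache.calcite.rel.RelCollation;
import org.apache.calcite.rel.RelNode;

class PhysicalPushSketch {
  // Derive the collation implied by the rewritten equi-join keys
  // (e.g. [k1, col] for the left side) and explicitly request it, together
  // with the physical convention, on the rewritten input.
  static RelNode requestSortedInput(RelNode newInput,
      RelCollation requiredCollation) {
    RelTraitSet traits = newInput.getTraitSet()
        .replace(EnumerableConvention.INSTANCE)
        .replace(requiredCollation);
    return RelOptRule.convert(newInput, traits);
  }
}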

This kind of issue is not limited to JoinPushExpressionsRule; any rule that has
Join as an operand may have the same issue. In fact, all rules that have Aggregate
as an operand may suffer from it too, because in 1.23.0 we added a new physical
operator, EnumerableSortedAggregate, which requires its input to be sorted by the
group keys. AggregateProjectMergeRule can generate a wrong plan if we apply it to
EnumerableSortedAggregate (see the sketch below). In addition, we are now
discussing adding a new EnumerableMergeUnion, which requires the UNION's inputs to
be sorted. Maybe in the next few releases we will have a sort-based
EnumerableWindow operator. In the end we may find that only ProjectMergeRule,
FilterMergeRule and CalcMergeRule work correctly with physical operators, because
these operators don't require any trait from their inputs, so it just happens to
work. The change to TransformationRule looks risky at first glance; however, it
actually makes those rules less risky.
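A hedged illustration of the Aggregate case (the column ordinals are made up; this
is not an actual plan produced by Calcite): AggregateProjectMergeRule rewrites the
group keys from the Project's output ordinals to its input ordinals, so a
collation requirement stated in terms of the old ordinals becomes stale.

Before the rule fires (the aggregate requires its input sorted by field 0):
EnumerableSortedAggregate group={0}
  -> Project $2 as k
       -> TableScan t

After the Project is merged (the aggregate must now require a sort by field 2,
not field 0):
EnumerableSortedAggregate group={2}
  -> TableScan t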

Some think we should provide different rule instances / types to match logical and
physical operators. But IMO these built-in transformation rules shouldn't match
physical operators at all in the VolcanoPlanner, because most, if not all, of them
can't work correctly with physical operators. Even for the Project/Filter/Calc
merge rules, all the merge work can be done on logical operators; applying them to
physical operators just repeats the same work.
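For completeness, a minimal sketch of one way to keep a transformation rule from
matching physical operators, using the constructor-style rule configuration as in
1.23; the class name is made up and the merge logic is omitted:

import org.apache.calcite.plan.Convention;
import org.apache.calcite.plan.RelOptRule;
import org.apache.calcite.plan.RelOptRuleCall;
import org.apache.calcite.rel.logical.LogicalProject;

public class LogicalOnlyProjectMergeRule extends RelOptRule {
  public LogicalOnlyProjectMergeRule() {
    // Stating the NONE convention in the operand keeps the rule from ever
    // matching physical (e.g. Enumerable) operators.
    super(
        operand(LogicalProject.class, Convention.NONE,
            operand(LogicalProject.class, Convention.NONE, any())),
        "LogicalOnlyProjectMergeRule");
  }

  @Override public void onMatch(RelOptRuleCall call) {
    // Merge logic omitted; the point is only the operand restriction above.
  }
}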

Even if we just view this as a best practice: if we (as Calcite project members)
don't follow the best practice in Calcite's own built-in Enumerable convention,
how can we ask downstream projects to follow it? Is there any document stating the
best practices? Most downstream projects just follow what Calcite does, the way
children follow the example of their parents. If we check out the code of Apache
Druid or Apache Beam, we will find that they do things the same way as Calcite's
Enumerable convention. Beam even adds JoinCommuteRule, JoinAssociateRule,
JoinPushThroughJoinRule.RIGHT and JoinPushThroughJoinRule.LEFT all together in its
rule set. I guess join performance is not a problem for Beam.

Finally, I don't think CalcMergeRule should be an exception; nothing becomes
impossible without a physical CalcMergeRule.

Thanks,
Haisheng

