Hi all, During vote of 1.23.0 rc1, there are some discussions about the solution of CALCITE-3997.
Some may think it is a matter of best practice, but in many cases it is a matter of right or wrong. Here is the rationale. The original query in CALCITE-3997 is SELECT t1.k1, t2.k1 FROM t1, t2 ON t1.k1=t2.k1 WHERE t1.n1 + 1 = t2.n3 After FilterIntoJoin and MergeJoinRule, planner generates an alternative: MergeJoin on t1.k1=t2.k1 and t1.n1 + 1 = t2.n3 The mergejoin's collation is [k1]. This mergejoin matches JoinPushExpressionsRule, which pushes expression in join condition, and generates a Project in child. Project t1.k1, t2.k1 -> MergeJoin on t1.k1=t2.k1 and t1.col = t2.n3 -> Project k1, n1+1 as col -> TableScan t1 -> TableScan t2 The new mergejoin now has 2 equi-conditions, but the collation is the same with old mergejoin, still [k1], which is wrong. It should be updated to [k1,col] and [k1, n3]. Hence error. The similar issue already happened before, see CALCITE-3353. And it is very easy to trigger same error using different query even without FilterIntoJoin. If we change WHERE to AND in the query, SELECT t1.k1, t2.k1 FROM t1, t2 ON t1.k1=t2.k1 AND t1.n1 + 1 = t2.n3 or with the following query with join condition const reductions on, we can still see the issue. SELECT t1.k1, t2.k1 FROM t1, t2 ON t1.k1=t2.k1 AND (t1.n1 + 1 = t2.n3 or t1.n1 + 1 = t2.n3) Can we use the similar approach that in CALCITE-3353 to fix the issue? We can, but it just hide the issue. Doing that way can generate mergejoin with correct traitset, but the rule is still generating an invalid plan. Because the MergeJoin's new input is a LogicalProject, which is generated by RelBuilder. In Calcite, RelNode is a very special class. Unlike operators in Cascades or Volcano paper, where logical/physical operator and plan expression are separated, when constructing logical/physical expression tree, each operator doesn't require anything trait from its child. But RelNode in Calcite not only represents operator, but also plan expression, and can even represent a MEMO group with required traitset, which is RelSubset. When constructing a RelNode XXX, we need to specify the input RelNodes, the input nodes' traitsets will become node XXX's requirement on those inputs, even we are not aware that we are requesting it, which is extremely error-prone. In above case, the new MergeJoin's left input is a LogicalProject, which means the physical MergeJoin is requesting its left input to be logical operator with NONE convention, which has {inf} large cost. That means the newly generated MergeJoin is never implementable. What if we have a physical RelBuilder to generate the Project in the rule? The consequence is worse. With logical RelBuilder, it just generates an invalid plan, we are still safe because it won't be selected as the best plan. If we get a physical RelBuilder, which generates a physical plan that is implementable, but this is a wrong plan. Because in RelOptUtil.pushDownJoinConditions, when creating the MergeJoin, it just specifies its traitset and inputs, but it doesn't request the correct collation on the MergeJoin's inputs. The Project's collation is EMPTY, that means the MergeJoin doesn't require any collation on the Project. Apparently this will be a wrong plan. This kind of issue doesn't limit to JoinPushExpressionsRule, any rule that has Join as operand may have the same issue. In fact, all rules that has Aggregate as operand may suffer the same issue. Because in 1.23.0, we add a new physical operator EnumerableSortedAggregate, which requires its input to be sorted by group keys. AggregateProjectMergeRule can generate wrong plan if we apply it on EnumerableSortedAggregate. In addition, we are now discussing about adding a new EnumerableMergeUnion, to require the UNION's input sorted. Maybe in next few releases, we will have sort based EnumerableWindow operator. Finally we may find that only ProjectMerge, FilterMerge, CalcMerge can work correctly with physical operators, because these operators doesn't require any trait from input, so it just happens to work. The change of TransformationRule looks risky at first glance, however, it makes those rules less risky. Some thinks we should provide different rule instances / types to match logical/physical operators. But IMO, these built-in transformation rules shouldn't match physical operators at all in VolcanoPlanner, because most, if not all, of them can't worked correctly with physical operators. Even for Project/Filter/Calc merge rules, all the merge work can be done through logical operators, applying on physical operators will duplicate the process again. Even if we just view this as a best practice, and we (as Calcite project members) don't follow the best practice on Calcite's built-in EnumerableConvention, how can we tell downstream projects to follow the best practices? Is there any document stating the best practices? Most of them just follow what Calcite does. Just like children follow the example of their parents. If we checkout the code of Apache Druid, Apache Beam, we will find that they are doing the similar way as Calcite's Enumerable Convention. Beam even adds JoinCommuteRule, JoinAssociateRule, JoinPushThroughJoinRule.RIGHT, JoinPushThroughJoinRule.LEFT all together in its ruleset. I guess join performance is not a problem for BEAM. Finally, I don't think CalcMergeRule should be an exception, nothing is impossible without physical CalcMergeRule. Thanks, Haisheng