Hi,

As I mentioned before, we are building a distributed SQL engine that uses Apache Calcite for query optimization. The key problem we face is the inability to pull the physical traits of child relations efficiently. I'd like to outline my understanding of the problem (I guess it was already discussed multiple times), ask the community to confirm or refute its existence and its severity for products that use Apache Calcite, and ask for ideas on how it could be improved in the future.
I'll start with a simplified problem description and then mention more complex use cases. Consider that we have a logical tree and a set of implementation rules. Our goal is to find the optimal physical tree by applying these rules. The classical Cascades-based approach directs the optimization process from the top to the bottom (hence "top-down"). However, the actual implementation of tree nodes still happens bottom-up. For the tree L1 <- L2, we enter "optimize(L1)", which recursively delegates to "optimize(L2)". We then implement the child nodes, so the tree becomes L1 <- [P2', P2''], and return to the parent, which is now able to pick promising implementations of the child nodes and reject bad ones with the branch-and-bound approach. AFAIK Pivotal's Orca works this way.

Apache Calcite is very different because it doesn't allow such recursion, so we lose the context of which node requested the child transformation. This loss of context leads to the following problems:

1) The parent node cannot deduce its physical properties during the execution of the implementation rule, because Calcite expects the transformation to be applied before the child nodes are implemented. That is, if we are optimizing LogicalProject <- LogicalScan, we cannot set the proper distribution and collation for the to-be-created PhysicalProject, because they depend on the distribution and collation of the child, which is yet to be resolved.

2) Branch-and-bound pruning cannot be used, because it requires at least one fully built physical subtree.

As a result of these limitations, products that rely on Apache Calcite for query optimization use one or several workarounds:

*1) Guess the physical properties of parent nodes before logical children are implemented*

*Apache Flink* uses this strategy.
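To make the guessing concrete, here is a toy sketch in plain Java (illustrative names only, not Calcite's API): a parent that knows nothing about its children has no choice but to emit one physical candidate per combination of trait values.

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration: with no knowledge of the child's physical properties,
// the parent must enumerate the Cartesian product of all trait values.
public class TraitGuessing {
    static List<String> guessCandidates(List<List<String>> traitValues) {
        List<String> candidates = new ArrayList<>();
        candidates.add("");
        for (List<String> values : traitValues) {   // one trait kind at a time
            List<String> next = new ArrayList<>();
            for (String partial : candidates) {
                for (String value : values) {
                    next.add(partial.isEmpty() ? value : partial + ", " + value);
                }
            }
            candidates = next;                      // candidate count multiplies
        }
        return candidates;
    }

    public static void main(String[] args) {
        List<List<String>> traits = List.of(
            List.of("HASH[a]", "HASH[b]", "SINGLETON"),  // distribution
            List.of("a ASC", "b ASC", "NONE"));          // collation
        List<String> candidates = guessCandidates(traits);
        // 3 distributions x 3 collations = 9 physical candidates already.
        System.out.println(candidates.size() + " candidates: " + candidates);
    }
}
```

With just two trait kinds of three values each, the parent already emits 9 candidates; every additional trait kind multiplies the count again.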
The strategy is bad because the number of trait combinations grows exponentially with the number of attributes in the given RelNode, so you either explode the search space or give up optimization opportunities. Consider the following tree: LogicalSort[a ASC] <- LogicalFilter <- LogicalScan. The optimal implementation of the LogicalFilter is PhysicalFilter[collation=a ASC], because it may eliminate the parent sort. But such an optimization should happen only if we know that there is a physical implementation of the scan that allows this sort order, e.g. PhysicalIndexScan[collation=a ASC]. That is, we need to know the child's physical properties first; otherwise we fall back to speculative approaches. With the *optimistic* approach, we emit all possible combinations of physical properties, in the hope that the child will satisfy some of them, thus expanding the search space exponentially. With the *pessimistic* approach, we simply miss this optimization opportunity even if the index exists. Apache Flink uses the pessimistic approach.

*2) Use AbstractConverters*

*Apache Drill* uses this strategy. The idea is to "glue" logical and physical operators, so that the implementation of a physical child re-triggers the implementation rule of the logical parent. The flow is as follows:
- Invoke the parent implementation rule; it either produces no physical nodes or produces unoptimized physical nodes (as in the Apache Flink case)
- Invoke the children's implementation rules and create physical children
- Converters then kick in and re-trigger the parent implementation rule through the creation of an abstract converter
- Finally, the parent implementation rule fires again, and now it produces optimized node(s), since at least some physical implementations of the child nodes exist

Note that this is essentially a hack to simulate the Cascades flow!
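The re-triggering can be simulated with a toy event loop (again illustrative only, not Calcite's real planner): every new physical implementation registered under a child blindly re-queues the parent's rule, because the converter carries no context about what was already tried.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;

// Toy simulation of the converter-driven flow: count how often each
// rule fires when every physical child re-triggers the parent rule.
public class ConverterFlow {
    static Map<String, Integer> ruleFirings(String parent, String... children) {
        Map<String, Integer> firings = new HashMap<>();
        ArrayDeque<String> queue = new ArrayDeque<>();
        queue.add(parent);              // fire parent rule on the logical tree
        for (String child : children) {
            queue.add(child);           // fire child implementation rules
        }
        while (!queue.isEmpty()) {
            String node = queue.poll();
            firings.merge(node, 1, Integer::sum);
            if (!node.equals(parent)) {
                // A new physical child appeared: the abstract converter
                // blindly re-triggers the parent's implementation rule.
                queue.add(parent);
            }
        }
        return firings;
    }

    public static void main(String[] args) {
        // LogicalJoin <- [LogicalScan1, LogicalScan2], one physical scan each:
        // the join rule fires 3 times, although one firing would suffice.
        System.out.println(ruleFirings("LogicalJoin", "LogicalScan1", "LogicalScan2"));
    }
}
```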
The problem is that AbstractConverters increase the complexity of planning, because they have no context, so parent rules are re-triggered blindly. Consider the optimization of the following tree: LogicalJoin <- [LogicalScan1, LogicalScan2]. With the converter approach, the join implementation rule will be fired at least 3 times, while in reality one call should be sufficient. In our experiments with TPC-H queries, the join rule implemented that way is typically called 6-9 times more often than expected.

*3) Transformations (i.e. logical optimization) are decoupled from implementation (i.e. physical optimization)*

Normally, you would like to mix logical and physical rules in a single optimization program, because this is required for proper planning. That is, you should consider both (Ax(BxC)) and ((AxB)xC) join orderings during physical optimization, because you do not know in advance which one will produce the better plan. But in some practical implementations of Calcite-based optimizers this is not the case, and join planning is performed as a separate HEP stage. Examples are Apache Drill and Apache Flink. I believe that the lack of a Cascades-style flow and branch-and-bound is among the main reasons for this; at the very least for Apache Drill, since it uses converters, so additional logical permutations would exponentially multiply the number of fired rules, which is already very large.

Given all these problems, I'd like to ask the community to share current thoughts and ideas on future improvements to the Calcite optimizer. One of the ideas discussed in the community is "Pull-up Traits", which should allow the parent node to get the physical properties of its child nodes. But in order to do this, you effectively need to implement the children, which IMO makes the process indistinguishable from the classical recursive Cascades algorithm. Have you considered recursive transformations as an alternative solution to this problem? That is, instead of trying to guess or pull the physical properties of non-existent physical nodes, go ahead and actually implement them directly from within the parent rule. This may resolve the trait pull-up problem, as well as allow for branch-and-bound optimization.

I would appreciate your feedback or any hints for future research.

Regards,
Vladimir.
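P.S. A rough sketch of the recursive-implementation idea on the earlier LogicalSort[a ASC] <- LogicalScan example (toy cost model, hypothetical names, not Calcite's API): the parent rule implements its child first, derives its own traits from each physical child, and prunes with an upper cost bound.

```java
import java.util.List;

// Toy sketch: the parent rule recursively implements the child, then
// derives its own physical properties and applies branch-and-bound.
public class RecursiveImplementation {
    record Physical(String description, String collation, double cost) {}

    // "Implement" the child. In a real engine this would fire the child's
    // own implementation rules; here the alternatives are hard-coded.
    static List<Physical> optimizeScan(boolean hasIndexOnA) {
        return hasIndexOnA
            ? List.of(new Physical("PhysicalIndexScan", "a ASC", 15.0),
                      new Physical("PhysicalTableScan", "NONE", 10.0))
            : List.of(new Physical("PhysicalTableScan", "NONE", 10.0));
    }

    // Parent rule for LogicalSort[a ASC] <- LogicalScan: children first,
    // then pick the best plan under an upper cost bound.
    static Physical optimizeSort(boolean hasIndexOnA) {
        double upperBound = Double.POSITIVE_INFINITY;
        Physical best = null;
        for (Physical child : optimizeScan(hasIndexOnA)) { // recurse into child
            double sortCost = child.collation().equals("a ASC") ? 0.0 : 50.0;
            double total = child.cost() + sortCost;
            if (total >= upperBound) {
                continue;            // branch-and-bound: prune this subtree
            }
            upperBound = total;
            best = new Physical(
                sortCost == 0.0 ? child.description()  // sort eliminated
                                : "PhysicalSort <- " + child.description(),
                "a ASC", total);
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(optimizeSort(true));  // index scan wins, sort eliminated, cost 15
        System.out.println(optimizeSort(false)); // table scan plus explicit sort, cost 60
    }
}
```

With the index available, the pre-sorted child eliminates the parent sort, and the cheaper-but-unsorted table scan is pruned by the bound without ever building its sort on top.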