Hi Haisheng,

I am trying to model the proposal based on ideas from Cascades. This
assumes top-down control of the optimization process, i.e. the parent
drives the optimization of the child. But as a result of this top-down
propagation of optimization requests, the implementation rules are applied
bottom-up in a depth-first manner (aka "backward chaining"), resembling
HepMatchOrder.DEPTH_FIRST. This is perhaps the most profound difference
wrt VolcanoPlanner, where the optimization process is not driven by
parents. With this approach, we never end up in a situation where the
children nodes are not yet implemented.

Let me try to show another example of pseudo-code demonstrating the idea.

1) The optimization process starts with the initial request to the planner.
We generate initial equivalence sets for all the nodes. Then we create the
optimization request. Note that optimization requests may be more complex
than "satisfy the given traits". E.g. for a merge join it could be "give me
nodes with non-empty collation". Hence the need for an OptimizationRequest
class. Then we start the optimization from the root, with an infinite cost.

RelNode optimize(RelNode root, RelTraitSet traits) {
    RelSet rootSet = register(root);

    OptimizationRequest req = OptimizationRequest.satisfies(traits);

    List<RelNode> nodes = optimizeSet(rootSet, req, Cost.INFINITE);

    return minimalCost(nodes);
}
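To make the idea of richer requests concrete, here is a minimal,
hypothetical sketch of what such an OptimizationRequest class could look
like. It is not part of the proposal's API; Calcite's RelTraitSet is
simplified to a plain String, and the request is modeled as a named
predicate over trait sets so that "satisfy these exact traits" and "give
me non-empty collation" are both expressible:

```java
import java.util.Objects;
import java.util.function.Predicate;

// Hypothetical sketch: an optimization request is a named predicate over
// trait sets. The name doubles as a stable cache key for cachedResults.
final class OptimizationRequest {
    private final String name;                 // used as a cache key
    private final Predicate<String> predicate; // does a trait set qualify?

    private OptimizationRequest(String name, Predicate<String> predicate) {
        this.name = name;
        this.predicate = predicate;
    }

    // "Satisfy exactly these traits" - the common case.
    static OptimizationRequest satisfies(String traits) {
        return new OptimizationRequest("satisfies:" + traits, traits::equals);
    }

    // "Give me nodes with a non-empty collation" - e.g. for merge join.
    static OptimizationRequest nonEmptyCollation() {
        return new OptimizationRequest("nonEmptyCollation",
            t -> t.contains("collation") && !t.contains("collation=[]"));
    }

    boolean matches(String traitSet) {
        return predicate.test(traitSet);
    }

    @Override public boolean equals(Object o) {
        return o instanceof OptimizationRequest
            && name.equals(((OptimizationRequest) o).name);
    }

    @Override public int hashCode() {
        return Objects.hash(name);
    }
}
```

Equality by name is what lets two structurally identical requests hit the
same cache entry in p.2.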

2) This is the entry point for the optimization of a single equivalence
set. It returns the best plans for different trait sets. First, we return
the cached result if possible. The optimization is performed only if the
result is absent for the given request or if the cached result exceeds the
maxCost. Then we generate the list of logical alternatives. For most nodes
there will be only one alternative; the main exception is join
optimization. Note that we pass the optimization request and cost even for
logical optimization, because logical optimization may require physical
implementations! An example is bushy join planning, where we consider a
small fraction of bushy plans based on input distributions, which requires
physical inputs (see the MemSQL and SQL Server PDW papers). Next, we
perform physical optimization of the logical alternatives, registering
good plans in the equivalence set along the way. Finally, we get the best
plans for the given request, one per trait set.

List<RelNode> optimizeSet(RelSet equivalenceSet, OptimizationRequest req,
        Cost maxCost) {
    List<RelNode> cachedResult = cachedResults.get(req);

    if (cachedResult != null && cachedResult.getCost() <= maxCost) {
        return cachedResult;
    }

    Set<RelNode> logicalNodes = optimizeLogical(equivalenceSet, req,
        maxCost);

    for (RelNode logicalNode : logicalNodes) {
        optimizePhysical(logicalNode, req, maxCost);
    }

    List<RelNode> result = equivalenceSet.getBestResults(req, maxCost);

    if (result != null) {
        cachedResults.put(req, result);
    }

    return result;
}

3) The logical optimization process creates the list of optimization rules
and fires them one by one. Aggressive caching and pruning are used here
(omitted for brevity). Rule execution knows the optimization context (req,
maxCost), so it can plan the optimization flow accordingly. Finally,
qualifying logical nodes are returned. Note that we do not use any global
"rule queue" here. The optimization process is fully under our control, and
every rule has a well-defined optimization context in which it is called.

Set<RelNode> optimizeLogical(RelSet equivalenceSet, OptimizationRequest req,
        Cost maxCost) {
    List<RelOptRule> rules = createLogicalRules(equivalenceSet);

    for (RelOptRule rule : rules) {
        rule.fire(req, maxCost);
    }

    return equivalenceSet.getLogicalRels(req, maxCost);
}

4) Physical optimization. Caching is omitted for brevity. Here we invoke
the implementation rules to produce physical nodes. This may include
enforcers, which are essentially a special flavor of implementation rule.

void optimizePhysical(RelNode logicalNode, OptimizationRequest req,
        Cost maxCost) {
    List<PhysicalRule> rules = createPhysicalRules(logicalNode);

    for (PhysicalRule rule : rules) {
        rule.fire(logicalNode, req, maxCost);
    }
}
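Since enforcers are mentioned as a special flavor of implementation rule,
here is a toy sketch of what one could look like. It is purely
illustrative and not the proposed API: Calcite's node and collation types
are replaced by a simplified stand-in (String collation, plain double
cost), and the class names are invented:

```java
// Hypothetical sketch of an enforcer: if a physical child does not satisfy
// the requested collation, wrap it in an explicit Sort whose cost is added
// on top. Registered in the equivalence set like any other physical node.
final class SortEnforcer {
    static final class PhysicalNode {
        final String description;
        final String collation; // "" means no useful ordering
        final double cost;

        PhysicalNode(String description, String collation, double cost) {
            this.description = description;
            this.collation = collation;
            this.cost = cost;
        }
    }

    // Return the child itself if it already satisfies the request;
    // otherwise place an enforcer (Sort) on top of it.
    static PhysicalNode enforce(PhysicalNode child, String requiredCollation,
            double sortCost) {
        if (child.collation.equals(requiredCollation)) {
            return child; // nothing to enforce
        }
        return new PhysicalNode(
            "Sort[" + requiredCollation + "](" + child.description + ")",
            requiredCollation, child.cost + sortCost);
    }
}
```

In the flow above, such an enforcer would run alongside the regular
implementation rules, so an unsorted plan can still satisfy a collation
request at an extra cost.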

5) An example of a HashJoin rule which accepts the optimization request.
Comments are inlined.

class HashJoinRule implements PhysicalRule {
    @Override
    void fire(RelNode logicalJoin, OptimizationRequest req, Cost maxCost) {
        // The minimal self-cost of any physical join produced from this
        // logical join.
        Cost logicalCost = logicalJoin.getCost();

        // Prepare optimization requests for the left and right inputs
        // based on the parent request.
        OptimizationRequest leftReq =
            splitOptimizationRequest(req, true /* left */);
        OptimizationRequest rightReq =
            splitOptimizationRequest(req, false /* right */);

        // Recursive call to the function from p.2, exploring the left
        // child. The cost budget is adjusted.
        List<RelNode> leftNodes = optimizeSet(
            logicalJoin.getInput(0).getSet(), leftReq,
            maxCost - logicalCost);

        for (RelNode leftNode : leftNodes) {
            // Recursive call to the function from p.2, exploring the right
            // child. The cost budget is adjusted even more aggressively.
            List<RelNode> rightNodes = optimizeSet(
                logicalJoin.getInput(1).getSet(), rightReq,
                maxCost - logicalCost - leftNode.getCost());

            for (RelNode rightNode : rightNodes) {
                // Create the physical rel.
                RelNode physicalJoin = implementHashJoin(logicalJoin,
                    leftNode, rightNode);

                // One more attempt to prune.
                if (physicalJoin.getCost() <= maxCost) {
                    // This is a good physical plan. Register it in the
                    // equivalence set.
                    logicalJoin.getEquivalenceSet().register(physicalJoin);
                }
            }
        }
    }
}
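The rule above relies on splitOptimizationRequest, which is left
undefined. Here is a toy sketch of what it could do for a hash join, with
all the Calcite trait machinery replaced by plain strings (the trait
grammar and the class name are invented for illustration):

```java
// Hypothetical sketch: rewrite a parent request into per-child requests
// for a hash join. A "hash[...]" distribution request is restated in terms
// of each side's join keys; a collation request cannot be pushed through a
// hash join, so the child request degrades to "any".
final class RequestSplitter {
    static String splitForHashJoin(String parentReq, String leftKeys,
            String rightKeys, boolean left) {
        if (parentReq.startsWith("hash")) {
            // Demand the matching side's join keys as the hash keys.
            return "hash[" + (left ? leftKeys : rightKeys) + "]";
        }
        // A hash join destroys input ordering, so nothing else can be
        // demanded from the children.
        return "any";
    }
}
```

A merge join rule would split differently, e.g. by turning a collation
request into per-input sort requests on the join keys.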

I understand this pseudo-code might be hard to follow, but unfortunately I
do not have a working prototype at the moment. The key takeaways:
1) The optimization process is guided by parents, and the context is
always preserved
2) Aggressive caching of already executed requests. Note that this is not
the MEMO, but an additional cache to prevent excessive rule executions
3) Aggressive pruning - the cost is propagated top-down
4) Finally, it closely follows your idea of on-demand traits - we really
do demand traits from children nodes. But this proposal also propagates
costs to allow for branch-and-bound, and optimizes children nodes while
pulling up their traits, thus saving optimization time.

I would appreciate the community's feedback, since I still feel that I may
be missing some important details.

Regards,
Vladimir.

вт, 10 дек. 2019 г. в 04:45, Haisheng Yuan <h.y...@alibaba-inc.com>:

> Hi Vladimir,
>
> Sorry for my late reply.
> WRT join planning, it is not required to put join reordering rule into the
> HEP planner. It can also be put into Volcano planner. Indeed, it is not
> ideal for the join ordering rule to generate a single plan. We can create a
> nother rule to generate multiple alternatives and put the rule into Volcano
> planner. This way you can get what you want.
>
> The pull-up trait is not the essence of on-demand trait request, the main
> idea is link [1].
>
> >> 4.1) If the logical cost exceeds "maxCost", we stop and return. The
> whole logical subspace is pruned even before exploration.
> In many cases, the search space you pruned is just the specific operator,
> because the child operator should be a MEMO group, other parent operators
> might still need to explore it, especially when the JoinReorderingRule only
> generate a single logical optimal join order.
>
> >> 4.2) Returned physical children are already registered in proper
> set/subset, but are not used for any pattern-matching, and doesn't trigger
> more rule calls!
> That is the problem of Calcite's default behaviour. Most of the rules'
> default INSTANCE provided by Calcite not only match logical operators but
> also physical operators. I am against that. I am not sure if you have
> created your own rule instances or not.
>
> >> 4.3) Implementation rule checks the cost of the physical child.
> During implementation rule, it is possible that we are not able to
> calculate the cost yet. Depending on the rule match order, if it is
> top-down rule matching, the child operators are still logical. If it is
> bottom-up rule matching, the child operators are still not enforced, say we
> generate a MergeJoin with 2 children not sorted yet, how do we estimate the
> cost?
>
> >> If it is greater than any other already observed child with the same
> traits
> How can we observe it inside the implementation rule?
>
> [1]
> http://mail-archives.apache.org/mod_mbox/calcite-dev/201910.mbox/%3cd75b20f4-542a-4a73-897e-66ab426494c1.h.y...@alibaba-inc.com%3e
>
> - Haisheng
>
> ------------------------------------------------------------------
> 发件人:Vladimir Ozerov<ppoze...@gmail.com>
> 日 期:2019年12月06日 18:00:01
> 收件人:Haisheng Yuan<h.y...@alibaba-inc.com>
> 抄 送:dev@calcite.apache.org (dev@calcite.apache.org)<dev@calcite.apache.org
> >
> 主 题:Re: Re: Re: Volcano's problem with trait propagation: current state
> and future
>
> "all we know is their *collations*" -> "all we know is their *traits*"
>
> пт, 6 дек. 2019 г. в 12:57, Vladimir Ozerov <ppoze...@gmail.com>:
>
>> Hi Haisheng,
>>
>> Thank you for your response. Let me elaborate my note on join planning
>> first - what I was trying to say is not that rules on their own have some
>> deficiencies. What I meant is that with current planner implementation,
>> users tend to separate join planning from the core optimization process
>> like this in the pseudo-code below. As a result, only one join permutation
>> is considered during physical planning, even though join rule may
>> potentially generate multiple plans worth exploring:
>>
>> RelNode optimizedLogicalNode = doJoinPlanning(logicalNode);
>> RelNode physicalNode = doPhysicalPlanning(optimizedLogicalNode);
>>
>> Now back to the main question. I re-read your thread about on-demand
>> trait propagation [1] carefully. I'd like to admit that when I was reading
>> it for the first time about a month ago, I failed to understand some
>> details due to poor knowledge of different optimizer architectures. Now I
>> understand it much better, and we definitely concerned with exactly the
>> same problem. I feel that trait pull-up might be a step in the right
>> direction, however, it seems to me that it is not the complete solution.
>> Let me try to explain why I think so.
>>
>> The efficient optimizer should try to save CPU as much as possible
>> because it allows us to explore more plans in a sensible amount of time. To
>> achieve that we should avoid redundant operations, and detect and prune
>> inefficient paths aggressively. As far as I understand the idea of trait
>> pull-up, we essentially explore the space of possible physical properties
>> of children nodes without forcing their implementation. But after that,
>> Calcite will explore those nodes again, now in order to execute
>> implementation rules. I.e. we will do two dives - one to enumerate the
>> nodes (trait pull-up API), and the other one to implement them
>> (implementation rules), while in Cascades one dive should be sufficient
>> since exploration invokes the implementation rules as it goes. This is the
>> first issue I see.
>>
>> The second one is more important - how to prune inefficient plans?
>> Currently, nodes are implemented independently and lack of context doesn't
>> allow us to estimates children's costs when implementing the parent, hence
>> branch-and-bound is not possible. Can trait pull-up API "List<RelTraitSet>
>> deriveTraitSets(RelNode, RelMetadataQuery)" help us with this? If the
>> children nodes are not implemented before the pull-up, all we know is their
>> collations, but not their costs. And without costs, pruning is not
>> possible. Please let me know if I missed something from the proposal.
>>
>> The possible architecture I had in mind after reading multiple papers,
>> which may answer all our questions, could look like this:
>> 1) We have a queue of nodes requiring optimization. Not a queue of rules.
>> initial queue state is formed from the initial tree, top-down.
>> 2) The node is popped from the queue, and we enter
>> "node.optimize(maxCost)" call. It checks for matching rules, prioritizes
>> them, and start their execution on by one. Execution of rules may re-insert
>> the current node into the queue, in which case this step is repeated,
>> possibly many times
>> 3) Logical-logical rules (transformations) produce new logical nodes and
>> put them into the queue for further optimization
>> 4) Logical-physical rules (implementation) do the following:
>> 4.1) Costs of logical children are estimated. The cost of a logical node
>> should be less than any cost of a possible physical node that may be
>> produced out of it. If the logical cost exceeds "maxCost", we stop and
>> return. The whole logical subspace is pruned even before exploration.
>> 4.2) Recursively call "childNode.optimize(maxCost - currentLogicalCost)"
>> method, which returns a set of possible physical implementations of a
>> child. Returned physical children are already registered in proper
>> set/subset, but are not used for any pattern-matching, and doesn't trigger
>> more rule calls!
>> 4.3) Implementation rule checks the cost of the physical child. If it is
>> greater than any other already observed child with the same traits, or
>> exceeds the "maxCost", it is discarded. Otherwise, the physical
>> implementation of the current node is produced and registered in the
>> optimizer.
>>
>> The pseudocode for physical implementation flow for join (two inputs):
>>
>> Collection<RelNode> optimizePhysical(Cost maxCost) {
>>     // Estimated minimal self-cost. Any physical implementation of this
>> node should have greater self-cost
>>     Cost logicalSelfCost = optimizer.getCost(this);
>>
>>     // *Pruning #1*: whatever children we implement, the total cost will
>> be greater than the passed maxCost, so do not explore further
>>     Cost maxChildCost = maxCost - logicalSelfCost;
>>
>>     Cost logicalLeftCost = optimizer.getCost(leftLogicalNode);
>>     Cost logicalRightCost = optimizer.getCost(rightLogicalNode);
>>
>>     if (logicalLeftCost + logicalRightCost > maxChildCost) {
>>         return;
>>     }
>>
>>     // This is our equivalence set.
>>     RelSet equivalenceSet = this.getSet();
>>
>>     // Get promising physical implementations of child nodes recursively
>>     List<RelNode> leftPhysicalNodes =
>> leftLogicalNode.optimizePhysical(maxChildCost);
>>     List<RelNode> rightPhysicalNodes =
>> rightLogicalNode.optimizePhysical(maxChildCost);
>>
>>     for (RelNode leftPhysicalNode : leftPhysicalNodes) {
>>         for (RelNode rightPhysicalNode : rightPhysicalNodes) {
>>             // *Pruning #2*: Combination of physical input costs is
>> already too expensive, give up
>>             Cost physicalLeftCost = optimizer.getCost(leftPhysicalNode);
>>             Cost physicalRightCost = optimizer.getCost(rightPhysicalNode);
>>
>>             if (physicalLeftCost + physicalRightCost > maxChildCost) {
>>                 continue;
>>             }
>>
>>             // Implement possible physical nodes for the given pair of
>> inputs (maybe more than one)
>>             List<RelNode> physicalJoins = implement(leftPhysicalNode,
>> rightPhysicalNode);
>>
>>             for (RelNode physicalJoin : physicalJoins) {
>>                // *Pruning #3*: Do not consider implementation if we
>> have another one with the same trait set and smaller cost)
>>                 Cost physicalJoinCost = optimizer.getCost(physicalJoin);
>>                 Cost bestCostForTraitSet =
>> equivalenceSet.getBestCost(physicalJoin.getTraitSet());
>>
>>                 if (physicalJoinCost > bestCostForTraitSet) {
>>                     continue;
>>                 }
>>
>>                 // This is a good implementation. Register it in the set,
>> updating per-traitset best costs.
>>                 equivalenceSet.register(physicalJoin);
>>             }
>>         }
>>     }
>>
>>     // Return the best registered expressions with different traitsets
>> from the current set.
>>     return equivalenceSet.getBestExps();
>> }
>>
>> This is a very rough pseudo-code, only to demonstrate the basic idea on
>> how proper bottom-up propagation not only helps us set proper traits for
>> the new physical node but also ensures that not optimal plans are pruned as
>> early as possible. Real implementation should be better abstracted and
>> accept enforcers as well.
>>
>> Also, please notice that the pseudo-code doesn't show when logical rules
>> are fired. This is a separate question. One possible straightforward way is
>> to add the aforementioned physical routine to normal Volcano flow:
>> 1) Fire logical rule on a node and register new nodes
>> 2) Fire physical optimization as shown above, then invoke
>> "call.transformTo()" for every returned physical rel
>> 3) Re-trigger the process for newly created nodes and their parents
>>
>> A better approach is to interleave logical and physical optimizations, so
>> they trigger each other recursively. But this would require a serious
>> redesign of a "rule queue" concept.
>>
>> Does it have any common points with your proposal?
>>
>> Regards,
>> Vladimir.
>>
>> [1]
>> https://ponymail-vm.apache.org/_GUI_/thread.html/79dac47ea50b5dfbd3f234e368ed61d247fb0eb989f87fe01aedaf25@%3Cdev.calcite.apache.org%3E
>>
>>
>> пт, 6 дек. 2019 г. в 05:41, Haisheng Yuan <h.y...@alibaba-inc.com>:
>>
>>> Oh, I forgot to mention that the join planning/reordering is not a big
>>> problem. Calcite just use the rule to generate a single alternative plan,
>>> which is not ideal.  But we can't say Calcite is doing wrong.
>>>
>>> Ideally, we expect it generates multiple (neither all, nor single)
>>> bipartite graphs. The join reordering rule will cut each part into bipartite
>>> recursively and apply JoinCommutativity rule to generate more alternatives
>>> for each RelSet. It is just a different strategy. We can modify the rule,
>>> or create new join reordering rule to generate multiple plan
>>> alternatives.
>>>
>>> - Haisheng
>>>
>>> ------------------------------------------------------------------
>>> 发件人:Haisheng Yuan<h.y...@alibaba-inc.com>
>>> 日 期:2019年12月06日 09:07:43
>>> 收件人:Vladimir Ozerov<ppoze...@gmail.com>; dev@calcite.apache.org (
>>> dev@calcite.apache.org)<dev@calcite.apache.org>
>>> 主 题:Re: Re: Volcano's problem with trait propagation: current state and
>>> future
>>>
>>> Generally agree with what Vladimir said. I think what Calcite has is
>>> logical optimization or exploration, there are almost no physical
>>> optimization, Calcite leaves it to third party implementators. One of my
>>> friends at University of Wisconsin–Madison database research group told me
>>> that they gave up the idea of using Calcite in their project due to this
>>> reason.
>>>
>>> Currently physical properties are requested in implementation rules, or
>>> even logical exploration rules, But each rule is independent, the
>>> pattern-matched expression is not aware of what does the parent operator
>>> want. Using AbstractConverter doesn't help, and is not promising.
>>>
>>> >> You shouldn't register all logical rules in the planner
>>> simultaneously,... as Drill does.
>>> That is because Calcite does too many redundant or duplicate rule
>>> matching, like all kinds of transpose (can't be avoided due to current
>>> design), matching physical operators.
>>>
>>> >> decoupling the logical planning from the physical one looks
>>> a bit weird to me because it violates the idea of Cascades framework.
>>> Orca optimizer fully adopted the design principle of Cascades framework
>>> except that it separates into 3 phases: logical exploration, physical
>>> implementation, and optimization (property enforcing). And it might be
>>> easier if we want to enable parallel optimization by separating into 3
>>> phases. Orca does branch-and-bound in optimization phase, before actual
>>> property derivation and enforcement, IIRC. It is highly efficient, works
>>> pretty well, and battlefield-tested by many large financial and insurance
>>> companies.
>>>
>>> In my last thread about on-demand trait request, I gave the high-level
>>> general API for physical operators to derive and require physical
>>> properties, which is similar to Orca's design. But seems like the proposal
>>> of API change gets no love.
>>>
>>> - Haisheng
>>>
>>> ------------------------------------------------------------------
>>> 发件人:Vladimir Ozerov<ppoze...@gmail.com>
>>> 日 期:2019年12月05日 22:22:43
>>> 收件人:dev@calcite.apache.org (dev@calcite.apache.org)<
>>> dev@calcite.apache.org>
>>> 主 题:Re: Volcano's problem with trait propagation: current state and
>>> future
>>>
>>> AbstractConverter-s are attractive because they effectively emulate
>>> straightforward recursive top-down optimization (Volcano/Cascades). But
>>> instead of doing it with a recursive method call, which preserves the
>>> context, we do this in Calcite as a sequence of unrelated rule calls,
>>> thus
>>> losing the context. So with my current understanding, it could be thought
>>> of not as a search space explosion, but rather as the inefficient
>>> implementation of an otherwise straightforward parent->child->parent
>>> navigation, since we achieve this navigation indirectly through the rule
>>> queue, rather than through a normal method call. In any case, the net
>>> result is wasted CPU. Perhaps this is not exponential waste, but some
>>> multiplication of otherwise optimal planning. As I mentioned, in our
>>> experiments with TPC-H, we observed a constant factor between 6-9x
>>> between
>>> the number of joins and the number of join implementation rule
>>> invocations.
>>> It doesn't grow past 9 even for complex queries, so I hope that this is
>>> not an exponent :-)
>>>
>>> Speaking of logical vs physical optimization, IMO it makes sense in some
>>> cases. E.g. when doing predicate pushdown, you do not want to consider
>>> intermediate logical tree states for implementation, until the predicate
>>> reaches its final position. So running separate logical planning phase
>>> with
>>> Volcano optimizer makes total sense to me, because it effectively prunes
>>> a
>>> lot of not optimal logical plans before they reach the physical planning
>>> stage. The real problem to me is that we forced to remove join planning
>>> from the physical optimization stage. Because the goal of join planning
>>> not
>>> to generate a single optimal plan, like with predicate pushdown, but
>>> rather
>>> to generate a set of logical plans all of which should be implemented and
>>> estimated. And with AbstractConverter-s this is not possible, because
>>> their multiplier increases the rate of search space growth, making
>>> join
>>> planning inapplicable even for the small number of relations. So we have
>>> to
>>> move them to the logical planning stage and pick only one permutation for
>>> physical planning.
>>>
>>>
>>> чт, 5 дек. 2019 г. в 15:35, Roman Kondakov <kondako...@mail.ru.invalid>:
>>>
>>> > Vladimir,
>>> >
>>> > thank you for bringing it up. We are facing the same problems in Apache
>>> > Ignite project
>>> > and it would be great if Apache Calcite community will propose a
>>> > solution for this
>>> > issue.
>>> >
>>> > From my point of view an approach with abstract converters looks more
>>> > promising, but as
>>> > you mentioned it suffers from polluting the search space. The latter
>>> can
>>> > be mitigated by
>>> > splitting a planning stage into the several phases: you shouldn't
>>> > register all logical rules in the planner simultaneously - it looks
>>> like
>>> > it is better to have several iterations of planning stage with
>>> different
>>> > sets of rules, as Drill does.
>>> >
>>> > Also I'd like to mention that decoupling the logical planning from the
>>> > physical one looks
>>> > a bit weird to me because it violates the idea of Cascades framework.
>>> > Possibly this decoupling is the consequence of some performance issues.
>>> >
>>> >
>>> > --
>>> > Kind Regards
>>> > Roman Kondakov
>>> >
>>> > On 05.12.2019 14:24, Vladimir Ozerov wrote:
>>> > > Hi,
>>> > >
>>> > > As I mentioned before, we are building a distributed SQL engine that
>>> uses
>>> > > Apache Calcite for query optimization. The key problem we faced is
>>> the
>>> > > inability to pull the physical traits of child relations
>>> efficiently. I'd
>>> > > like to outline my understanding of the problem (I guess it was
>>> already
>>> > > discussed multiple times) and ask the community to prove or disprove
>>> the
>>> > > existence of that problem and its severity for the products which
>>> uses
>>> > > Apache Calcite and ask for ideas on how it could be improved in the
>>> > future.
>>> > >
>>> > > I'll start with the simplified problem description and mentioned more
>>> > > complex use cases then. Consider that we have a logical tree and a
>>> set of
>>> > > implementation rules. Our goal is to find the optimal physical tree
>>> by
>>> > > applying these rules. The classical Cascades-based approach directs
>>> the
>>> > > optimization process from the top to the bottom (hence "top-down").
>>> > > However, the actual implementation of tree nodes still happens
>>> bottom-up.
>>> > > For the tree L1 <- L2, we enter "optimize(L1)", which recursively
>>> > delegates
>>> > > to "optimize(L2)". We then implement children nodes L1 <- [P2',
>>> P2''],
>>> > and
>>> > > return back to the parent, which is now able to pick promising
>>> > > implementations of the children nodes and reject bad ones with the
>>> > > branch-and-bound approach. AFAIK Pivotal's Orca works this way.
>>> > >
>>> > > The Apache Calcite is very different because it doesn't allow the
>>> > recursion
>>> > > so that we lose the context on which node requested the child
>>> > > transformation. This loss of context leads to the following problems:
>>> > > 1) The parent node cannot deduce it's physical properties during the
>>> > > execution of the implementation rule, because Calcite expects the
>>> > > transformation to be applied before children nodes are implemented.
>>> That
>>> > is
>>> > > if we are optimizing LogicalProject <- LogicalScan, we cannot set
>>> proper
>>> > > distribution and collation for the to be created PhysicalProject,
>>> because
>>> > > it depends on the distribution and collation of the children which
>>> is yet
>>> > > to be resolved.
>>> > > 2) The branch-and-bound cannot be used because it requires at least
>>> one
>>> > > fully-built physical subtree.
>>> > >
>>> > > As a result of this limitation, products which rely on Apache
>>> Calcite for
>>> > > query optimization, use one or several workarounds:
>>> > >
>>> > > *1) Guess the physical properties of parent nodes before logical
>>> children
>>> > > are implemented*
>>> > > *Apache Flink* uses this strategy. The strategy is bad because of the
>>> > > number of combinations of traits growth exponentially with the
>>> number of
>>> > > attributes in the given RelNode, so you either explode the search
>>> space
>>> > or
>>> > > give up optimization opportunities. Consider the following tree:
>>> > > LogicalSort[a ASC] <- LogicalFilter <- LogicalScan
>>> > > The optimal implementation of the LogicalFilter is
>>> > PhysicalFilter[collation=a
>>> > > ASC] because it may eliminate the parent sort. But such optimization
>>> > should
>>> > > happen only if we know that there is a physical implementation of
>>> scan
>>> > > allowing for this sort order, e.g. PhysicalIndexScan[collation=a
>>> ASC].
>>> > I.e.
>>> > > we need to know the child physical properties first. Otherwise we
>>> > fallback
>>> > > to speculative approaches. With the *optimistic* approach, we emit
>>> all
>>> > > possible combinations of physical properties, with the hope that the
>>> > child
>>> > > will satisfy some of them, thus expanding the search space
>>> exponentially.
>>> > > With the *pessimistic* approach, we just miss this optimization
>>> > opportunity
>>> > > even if the index exists. Apache Flink uses the pessimistic approach.
>>> > >
>>> > > *2) Use AbstractConverters*
>>> > > *Apache Drill* uses this strategy. The idea is to "glue" logical and
>>> > > physical operators, so that implementation of a physical child
>>> > re-triggers
>>> > > implementation rule of a logical parent. The flow is as follows:
>>> > > - Invoke parent implementation rule - it either doesn't produce new
>>> > > physical nodes or produce not optimized physical nodes (like in the
>>> > Apache
>>> > > Flink case)
>>> > > - Invoke children implementation rules and create physical children
>>> > > - Then converters kick-in and re-trigger parent implementation rule
>>> > through
>>> > > the creation of an abstract converter
>>> > > - Finally, the parent implementation rule is fired again and now it
>>> > > produces optimized node(s) since at least some of the physical
>>> > > distributions of children nodes are implemented.
>>> > >
>>> > > Note that this is essentially a hack to simulate the Cascades flow!
>>> The
>>> > > problem is that AbstractConverters increase the complexity of
>>> planning
>>> > > because they do not have any context, so parent rules are just
>>> > re-triggered
>>> > > blindly. Consider the optimization of the following tree:
>>> > > LogicalJoin <- [LogicalScan1, LogicalScan2]
>>> > > With the converter approach, the join implementation rule will be
>>> fired
>>> > at
>>> > > least 3 times, while in reality, one call should be sufficient. In
>>> our
>>> > > experiments with TPC-H queries, the join rule implemented that way is
>>> > > typically called 6-9 times more often than expected.
>>> > >
>>> > > *3) Transformations (i.e. logical optimization) are decoupled from
>>> > > implementation (i.e. physical optimization)*
>>> > > Normally, you would like to mix both logical and physical rules in a
>>> > single
>>> > > optimization program, because it is required for proper planning.
>>> That
>>> > is,
>>> > > you should consider both (Ax(BxC)) and ((AxB)xC) join ordering during
>>> > > physical optimization, because you do not know which one will
>>> produce the
>>> > > better plan in advance.
>>> > > But in some practical implementations of Calcite-based optimizers,
>>> this
>>> > is
>>> > > not the case, and join planning is performed as a separate HEP stage.
>>> > > Examples are Apache Drill and Apache Flink.
>>> > > I believe that lack of Cascades-style flow and branch-and-bound are
>>> among
>>> > > the main reasons for this. At the very least for Apache Drill, since
>>> it
>>> > > uses converters, so additional logical permutations will
>>> exponentially
>>> > > multiply the number of fired rules, which is already very big.
>>> > >
>>> > > Given all these problems I'd like to ask the community to share
>>> current
>>> > > thoughts and ideas on the future improvement of the Calcite
>>> optimizer.
>>> > One
>>> > > of the ideas being discussed in the community is "Pull-up Traits",
>>> which
>>> > > should allow the parent node to get physical properties of the
>>> children
>>> > > nodes. But in order to do this, you effectively need to implement
>>> > children,
>>> > > which IMO makes this process indistinguishable from the classical
>>> > recursive
>>> > > Cascades algorithm.
>>> > >
>>> > > Have you considered recursive transformations as an alternative
>>> solution
>>> > to
>>> > > that problem? I.e. instead of trying to guess or pull the physical
>>> > > properties of non-existent physical nodes, go ahead and actually
>>> > implement
>>> > > them directly from within the parent rule? This may resolve the
>>> problem
>>> > > with trait pull-up, as well as allow for branch-and-bound
>>> optimization.
>>> > >
>>> > > I would appreciate your feedback or any hints for future research.
>>> > >
>>> > > Regards,
>>> > > Vladimir.
>>> > >
>>> >
>>>
>>>
>>>
>
