Hi Haisheng,

This is an interesting topic but somehow in my mind I thought that this
mechanism is already in place.

When an operator (logical or physical) is created its traitset is
determined in bottom-up fashion using the create
static factory method present in almost all operators. In my mind this is
in some sense the applicability function
mentioned in [1].

Now during optimization we proceed in top-down manner and we request
certain traitsets from the operators.
If it happens and they contain already the requested traits nothing needs
to be done.

In your example when we are about to create the sort-merge join we can
check what traitsets are present in the inputs
and if possible request those. Can you elaborate a bit more why do we need
a new type of metadata?

Anyway if we cannot do it at the moment it makes sense to complete the
missing bits since what you are describing
was already mentioned in the original design of the Volcano optimizer [1].

"If a move to be pursued is the exploration of a normal query processing
algorithm such as merge-join, its cost is calculated by the algorithm's
cost function. The algorithm's applicability function determines the
physical properly vectors for the algorithms inputs, and their costs and
optimal plans are found by invoking FindBestPlan for the inputs. For some
binary operators, the actual physical properties of the inputs are not as
important as the consistency of physical properties among the inputs. For
example, for a sort-based implementation of intersection, i.e., an
algorithm very similar to merge-join, any sort order of the two inputs will
suffice as long as the two inputs are sorted in the same way. Similarly,
for a parallel join, any partitioning of join inputs across multiple
processing nodes is acceptable if both inputs are partitioned using
Compatible partitioning rules. For these cases, the search engine permits
the optimizer implementor to specify a number of physical property vectors
to be tried. For example, for the intersection of two inputs R and S with
attributes A, B, and C where R is sorted on (A,B,C) and S is sorted on
(B,A,C), both these sort orders can be specified by the optimizer
implementor and will be optimized by the generated optimizer, while other
possible sort orders, e.g., (C,B,A), will be ignored. " [1]

Best,
Stamatis

[1]
https://www.cse.iitb.ac.in/infolab/Data/Courses/CS632/Papers/Volcano-graefe.pdf

On Fri, Oct 18, 2019 at 4:56 AM Haisheng Yuan <h.y...@alibaba-inc.com>
wrote:

> TL;DR
> Both top-down physical TraitSet request and bottom-up TraitSet
> derivation have their strongth and weakness, we propose
> on-demand TraitSet request to combine the above two, to reduce
> the number of plan alternatives that are genereated, especially
> in distributed system.
>
> e.g.
> select * from foo join bar on f1=b1 and f2=b2 and f3=b3;
>
> In non-distributed system, we can generate a sort merge join,
> requesting foo sorted by f1,f2,f3 and bar sorted by b1,b2,b3.
> But if foo happens to be sorted by f3,f2,f1, we may miss the
> chance of making use of the delivered ordering of foo. Because
> if we require bar to be sorted by b3,b2,b1, we don't need to
> sort on foo anymore. There are so many choices, n!, not even
> considering asc/desc and null direction. We can't request all
> the possible traitsets in top-down way, and can't derive all the
> possible traitsets in bottom-up way either.
>
> We propose on-demand traitset request by adding a new type
> of metadata DerivedTraitSets into the built-in metadata system.
>
> List<RelTraitSet> deriveTraitSets(RelNode, RelMetadataQuery)
>
> In this metadata, every operator returns several possbile traitsets
> that may be derived from this operator.
>
> Using above query as an example, the tablescan on foo should
> return traiset with collation on f3, f2, f1.
>
> In physical implementation rules, e.g. the SortMergeJoinRule,
> it gets possible traitsets from both child operators, uses the join
> keys to eliminate useless traitsets, leaves out usefull traitsets,
> and requests corresponding traitset on the other child.
>
> This relies on the feature of AbstractConverter, which is turned
> off by default, due to performance issue [1].
>
> Thoughts?
>
> [1] https://issues.apache.org/jira/browse/CALCITE-2970
>
> Haisheng
>
>

Reply via email to