Re: Exposing multiple values of a trait from the operator

Stamatis Zampetakis Wed, 26 May 2021 14:15:11 -0700

If the general guideline is to not use composite/multiple traits then we
should deprecate and remove them eventually.


Personally, I haven't used them in practice, and I haven't seen a JIRA issue on
this topic for quite some time now.

If we deprecate them and people complain that we are removing a useful
feature then we can reconsider.

Best,
Stamatis

On Tue, May 25, 2021, 8:43 PM Haisheng Yuan <hy...@apache.org> wrote:

> Negative, currently.
>
> It is indeed complicated, but still acceptable IMO. If the community reaches
> consensus on this matter, we can create a JIRA to add the feature to the
> trait: handle equivalent-key satisfaction in trait.satisfies() and column
> remapping in trait.apply(). I hope that can ease some of the pain if you
> happen to use Calcite's default distribution/collation implementation.
>
> Regards,
> Haisheng Yuan
>
> On 2021/05/25 17:45:32, Vladimir Ozerov <ppoze...@gmail.com> wrote:
> > Hi Haisheng,
> >
> > Thank you for the advice. This is exactly how I designed distribution at
> > the moment (approach 2 from my original email): as a List<int[]> instead
> > of just int[]. My main concern was the increased complexity of trait
> > propagation/derivation, as I have to manage these nested lists by hand.
> > Nevertheless, it works well. I had hoped that there were better built-in
> > approaches I could use. If the answer is negative, I'll continue using the
> > original approach, where multiple alternatives are managed manually.
> >
> > Regards,
> > Vladimir.
> >
> > On Tue, May 25, 2021 at 20:30, Haisheng Yuan <hy...@apache.org> wrote:
> >
> > > Hi Vladimir,
> > >
> > > Glad to see you raised the question.
> > >
> > > Here is the advice:
> > > Do not use RelMultipleTrait/RelCompositeTrait, which are fundamentally
> > > flawed and have many bugs. They cannot work properly with either the
> > > top-down or the bottom-up optimizer.
> > >
> > > Instead, we need to add an equivalent-keys bitmap as a property of
> > > physical traits such as RelCollation and RelDistribution.
> > >
> > > For example:
> > > class RelDistributionImpl {
> > >   // list of distribution keys
> > >   private ImmutableIntList keys;
> > >
> > >   // for each distribution key, the bit set of columns equivalent to it
> > >   private ImmutableList<ImmutableBitSet> equivBitSets;
> > > }
> > >
> > > In trait satisfaction and column remapping, we also need to take
> > > equivalent keys into consideration. Some of the work needs to be done in
> > > the Calcite core framework.
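
To make the equivalent-keys idea above concrete, here is a minimal sketch in
plain Java. It does not use Calcite's API: the class EquivHashDistribution, its
bits() helper, and the strict position-by-position matching rule are all
assumptions made for illustration, with java.util.BitSet standing in for
ImmutableBitSet.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

/** Hypothetical hash distribution that remembers the columns equivalent to each key. */
final class EquivHashDistribution {
  private final List<Integer> keys;     // distribution keys, in hash order
  private final List<BitSet> equivSets; // equivSets.get(i): columns equal to keys.get(i)

  EquivHashDistribution(List<Integer> keys, List<BitSet> equivSets) {
    if (keys.size() != equivSets.size()) {
      throw new IllegalArgumentException("one equivalence set is required per key");
    }
    this.keys = new ArrayList<>(keys);
    this.equivSets = new ArrayList<>();
    for (int i = 0; i < keys.size(); i++) {
      BitSet set = (BitSet) equivSets.get(i).clone();
      set.set(keys.get(i)); // a key is always equivalent to itself
      this.equivSets.add(set);
    }
  }

  /** Builds a bit set from column indexes (illustrative helper). */
  static BitSet bits(int... columns) {
    BitSet set = new BitSet();
    for (int c : columns) {
      set.set(c);
    }
    return set;
  }

  /**
   * Assumed rule: the required keys are satisfied when each one is equivalent
   * to the distribution key at the same position.
   */
  boolean satisfies(List<Integer> requiredKeys) {
    if (requiredKeys.size() != keys.size()) {
      return false;
    }
    for (int i = 0; i < keys.size(); i++) {
      if (!equivSets.get(i).get(requiredKeys.get(i))) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    // Join output columns: a1 = 0, a2 = 1, b1 = 2; sharded on a1, with a1 = b1.
    EquivHashDistribution joinOutput =
        new EquivHashDistribution(Arrays.asList(0), Arrays.asList(bits(0, 2)));
    System.out.println(joinOutput.satisfies(Arrays.asList(2)));
  }
}
```

With this representation the operator exposes a single trait value, and a
request for SHARDED[b1] is answered by looking up b1 in the equivalence set of
a1 rather than by registering a second trait.
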
> > >
> > > The Greenplum Orca optimizer uses a similar strategy:
> > >
> > > https://github.com/greenplum-db/gporca/blob/master/libgpopt/include/gpopt/base/CDistributionSpecHashed.h#L44
> > >
> > > Regards,
> > > Haisheng Yuan
> > >
> > > On 2021/05/25 15:37:32, Vladimir Ozerov <ppoze...@gmail.com> wrote:
> > > > Hi,
> > > >
> > > > Consider a distributed SQL engine that uses a distribution property to
> > > > model exchanges, and consider the physical tree below. To do the
> > > > distributed join, we co-locate tuples using the equijoin key. Now the
> > > > Join operator has two equivalent distributions: [a1] and [b1]. It is
> > > > critical to expose both distributions so that the top Aggregate can
> > > > take advantage of the co-location.
> > > >
> > > > Aggregate[group=b1]
> > > >   DistributedJoin[a.a1=b.b1]   // SHARDED[a1], SHARDED[b1]
> > > >     Input[a]                   // SHARDED[a1]
> > > >     Input[b]                   // SHARDED[b1]
> > > >
> > > > A similar example for the Project:
> > > > Aggregate[group=$1]
> > > >   Project[$0=a, $1=a] // SHARDED[$0], SHARDED[$1]
> > > >     Input             // SHARDED[a]
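
The Project case can be sketched as a column remapping that may return more
than one result. The snippet below is plain Java, not Calcite code: the class
ProjectDistributionRemap, the remap method, and the encoding of a projection as
a list of input column references (-1 for anything that is not a plain column
reference) are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical remapping of sharding keys through a projection. */
final class ProjectDistributionRemap {
  /**
   * @param inputKeys   sharding key columns of the input, in hash order
   * @param projectRefs projectRefs.get(i) is the input column referenced by
   *                    output column i, or -1 if it is not a plain reference
   * @return every output key list equivalent to the input distribution;
   *         empty if some key is not projected at all
   */
  static List<List<Integer>> remap(List<Integer> inputKeys, List<Integer> projectRefs) {
    List<List<Integer>> alternatives = new ArrayList<>();
    alternatives.add(new ArrayList<>()); // a single empty prefix to extend
    for (int key : inputKeys) {
      List<List<Integer>> extended = new ArrayList<>();
      for (int out = 0; out < projectRefs.size(); out++) {
        if (projectRefs.get(out) == key) {
          // Output column 'out' carries this key: extend every prefix with it.
          for (List<Integer> prefix : alternatives) {
            List<Integer> next = new ArrayList<>(prefix);
            next.add(out);
            extended.add(next);
          }
        }
      }
      alternatives = extended; // stays empty once a key has been projected away
    }
    return alternatives;
  }

  public static void main(String[] args) {
    // Project[$0=a, $1=a] over Input SHARDED[a], where a is input column 0.
    System.out.println(remap(Arrays.asList(0), Arrays.asList(0, 0)));
  }
}
```

For the Project above, the input key a is exposed as both $0 and $1, so the
remapping yields two alternatives, which is exactly the information a single
trait value cannot carry.
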
> > > >
> > > > The question is how to model this situation properly?
> > > >
> > > > First, it seems that RelMultipleTrait and RelCompositeTrait were
> > > > designed to handle this situation. However, I couldn't make them work
> > > > with the top-down optimizer. The reason is that when we register a
> > > > RelNode with a composite trait in MEMO, VolcanoPlanner flattens the
> > > > composite trait into the default trait value in RelSet.add ->
> > > > RelTraitSet.simplify. That is, the trait [SHARDED[a], SHARDED[b]] will
> > > > be converted to [ANY], so the original traits cannot be derived in the
> > > > PhysicalNode.derive methods.
> > > >
> > > > Second, we may try to model multiple sharding keys in a single trait.
> > > > But this complicates the implementation of
> > > > PhysicalNode.passThrough/derive significantly:
> > > > SHARDED[a1, a2], SHARDED[b1, b2] -> SHARDED[[a1, a2], [b1, b2]]
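
A minimal sketch of this second approach, with the alternatives held as a
List<int[]> inside one trait value. The class MultiKeyDistribution and its
methods are hypothetical, not Calcite API, and the derivation rule assumes a
co-located inner join whose output keeps the sharding of both inputs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical single trait value holding several alternative sharding key lists. */
final class MultiKeyDistribution {
  private final List<int[]> alternatives;

  MultiKeyDistribution(List<int[]> alternatives) {
    this.alternatives = new ArrayList<>();
    for (int[] alt : alternatives) {
      this.alternatives.add(alt.clone()); // defensive copy of each key list
    }
  }

  /**
   * Derives the output distribution of a co-located inner join (assumed rule):
   * left alternatives are kept as they are, right alternatives are shifted by
   * the number of left fields.
   */
  static MultiKeyDistribution forInnerJoin(
      MultiKeyDistribution left, MultiKeyDistribution right, int leftFieldCount) {
    List<int[]> merged = new ArrayList<>(left.alternatives);
    for (int[] alt : right.alternatives) {
      int[] shifted = new int[alt.length];
      for (int i = 0; i < alt.length; i++) {
        shifted[i] = alt[i] + leftFieldCount;
      }
      merged.add(shifted);
    }
    return new MultiKeyDistribution(merged);
  }

  /** True if any alternative matches the required keys exactly. */
  boolean satisfies(int[] requiredKeys) {
    for (int[] alt : alternatives) {
      if (Arrays.equals(alt, requiredKeys)) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    // a has fields (a1, a2) and is SHARDED[a1]; b has field (b1) and is SHARDED[b1].
    MultiKeyDistribution a = new MultiKeyDistribution(Collections.singletonList(new int[] {0}));
    MultiKeyDistribution b = new MultiKeyDistribution(Collections.singletonList(new int[] {0}));
    MultiKeyDistribution join = forInnerJoin(a, b, 2);
    System.out.println(join.satisfies(new int[] {2})); // is the join sharded on b1?
  }
}
```

The nested lists are what make passThrough/derive heavier: every rule has to
loop over the alternatives instead of comparing two flat key lists.
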
> > > >
> > > > Third, we may expose multiple traits using metadata. RelMdDistribution
> > > > would not work, because it exposes only a single distribution. A custom
> > > > handler could potentially fix that. However, it would still not be
> > > > integrated with the top-down optimizer, which makes the idea
> > > > questionable.
> > > >
> > > > To summarize, it seems that there is currently no easy way to handle
> > > > composite traits with a top-down optimizer. I wonder whether someone
> > > > from the dev list has already solved similar issues in Apache Calcite
> > > > or other optimizers. If so, what was the approach, or what are the best
> > > > practices? Intuitively, it seems that the
> > > > RelMultipleTrait/RelCompositeTrait approach might be the way to go. But
> > > > why do we replace the original composite trait set with the default
> > > > value in the RelTraitSet.simplify routine?
> > > >
> > > > Regards,
> > > > Vladimir.
> > > >
> > >
> >
>
