+1 for Haisheng's suggestion, the equivBitSets way.

In my understanding, once the "satisfies" method is overridden, only the
"derive" method introduces additional complexity. That complexity mainly
lies in defining how equivBitSets are translated from input to output and
how field equivalence is declared, both of which can be simplified by
helper methods. And it is safe: it is fully compatible with the original
design, and the plan changes it introduces are all under control. It would
be nice if Calcite could provide such a mechanism by default.
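
To make the idea concrete, here is a minimal sketch of such a "satisfies"
override, using plain java.util collections as stand-ins for Calcite's
ImmutableIntList/ImmutableBitSet; the class and member names here are
hypothetical, not the actual Calcite API:

```java
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a hash-distribution trait that carries, for
// each distribution key, the set of columns known to be equivalent to it
// (e.g. via an equijoin condition).
final class HashDist {
    final List<Integer> keys;            // distribution key ordinals
    final List<Set<Integer>> equivSets;  // equivSets.get(i) contains keys.get(i)

    HashDist(List<Integer> keys, List<Set<Integer>> equivSets) {
        this.keys = keys;
        this.equivSets = equivSets;
    }

    // "this" satisfies "required" if each required key is matched, position
    // by position, by some column in the corresponding equivalence set.
    boolean satisfies(HashDist required) {
        if (required.keys.size() != keys.size()) {
            return false;
        }
        for (int i = 0; i < keys.size(); i++) {
            if (!equivSets.get(i).contains(required.keys.get(i))) {
                return false;
            }
        }
        return true;
    }
}
```

For the join example from this thread, a join on a1 = b1 (say ordinals 0
and 5) would expose keys = [0] with equivSets = [{0, 5}], so it satisfies
both SHARDED[0] and SHARDED[5] without a composite trait.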

One problem I can think of: the equivBitSets solution cannot solve every
case. For example, equivBitSets can hardly represent this situation:
dataset x has c1 as its primary key and [c2, c3] as a unique key, so it
can deliver dist[c1] and dist[c2, c3] at the same time. But I think such
cases are relatively rare.
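
To illustrate the limitation (with hypothetical column ordinals c1=0,
c2=1, c3=2): the equivBitSets shape ties one equivalence set to each
distribution-key position, so the two alternative key lists, which differ
in arity, cannot be folded into a single (keys, equivBitSets) pair. One
workaround is to keep a list of alternative key lists, as sketched below:

```java
import java.util.List;

// Hypothetical sketch: a dataset with primary key c1 (ordinal 0) and
// unique key [c2, c3] (ordinals 1, 2) delivers dist[0] and dist[1, 2]
// simultaneously. No single per-position equivalence set covers both,
// but a list of alternative key lists can.
final class AltDistributions {
    static final List<List<Integer>> ALTERNATIVES =
        List.of(List.of(0),       // dist[c1]
                List.of(1, 2));   // dist[c2, c3]

    // A required distribution is satisfied if any alternative matches it.
    static boolean satisfies(List<Integer> requiredKeys) {
        return ALTERNATIVES.contains(requiredKeys);
    }
}
```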

Remind me if I misunderstand anything.

Thanks,
Jinpeng Wu



On Thu, May 27, 2021 at 5:15 AM Stamatis Zampetakis <zabe...@gmail.com>
wrote:

> If the general guideline is to not use composite/multiple traits then we
> should deprecate and remove them eventually.
>
> Personally I haven't used them in practice and I haven't seen a jira around
> this topic for quite some time now.
>
> If we deprecate them and people complain that we are removing a useful
> feature then we can reconsider.
>
> Best,
> Stamatis
>
> On Tue, May 25, 2021, 8:43 PM Haisheng Yuan <hy...@apache.org> wrote:
>
> > Negative, currently.
> >
> > It is indeed complicated, but still acceptable IMO. If the community
> > has consensus on this matter, we can create a JIRA to add the feature
> > to trait and handle equivalent keys trait satisfaction in
> > trait.satisfy() and column remapping in trait.apply(). Hope that can
> > ease some pain, if you happen to use Calcite's default
> > distribution/collation implementation.
> >
> > Regards,
> > Haisheng Yuan
> >
> > On 2021/05/25 17:45:32, Vladimir Ozerov <ppoze...@gmail.com> wrote:
> > > Hi Haisheng,
> > >
> > > Thank you for the advice. This is exactly how I designed
> > > distribution at the moment (approach 2 from my original email): as a
> > > List<int[]> instead of just int[]. My main concern was the increased
> > > complexity of the trait propagation/derivation, as I have to manage
> > > these nested lists by hand. Nevertheless, it works well. So I hoped
> > > that there are better built-in approaches that I may use. If the
> > > answer is negative, I'll continue using the original approach, with
> > > multiple alternatives managed manually.
> > >
> > > Regards,
> > > Vladimir.
> > >
> > > вт, 25 мая 2021 г. в 20:30, Haisheng Yuan <hy...@apache.org>:
> > >
> > > > Hi Vladimir,
> > > >
> > > > Glad to see you raised the question.
> > > >
> > > > Here is the advice:
> > > > Do not use RelMultipleTrait/RelCompositeTrait, which is
> > > > fundamentally flawed and has many bugs. It can't work properly
> > > > with either top-down or bottom-up.
> > > >
> > > > Instead, we need to add an equivalent-keys bitmap as a property of
> > > > physical traits like RelCollation and RelDistribution.
> > > >
> > > > For example:
> > > > class RelDistributionImpl {
> > > >   // list of distribution keys
> > > >   private ImmutableIntList keys;
> > > >
> > > >   // list of equivalent column bitsets, one per distribution key
> > > >   private ImmutableList<ImmutableBitSet> equivBitSets;
> > > > }
> > > >
> > > > In the trait satisfy and column remapping, we also need to take
> > > > equivalent keys into consideration. Some of the work needs to be
> > > > done in the Calcite core framework.
> > > >
> > > > The Greenplum Orca optimizer has a similar strategy:
> > > > https://github.com/greenplum-db/gporca/blob/master/libgpopt/include/gpopt/base/CDistributionSpecHashed.h#L44
> > > >
> > > > Regards,
> > > > Haisheng Yuan
> > > >
> > > > On 2021/05/25 15:37:32, Vladimir Ozerov <ppoze...@gmail.com> wrote:
> > > > > Hi,
> > > > >
> > > > > Consider a distributed SQL engine that uses a distribution
> > > > > property to model exchanges, and the following physical tree. To
> > > > > do the distributed join, we co-locate tuples using the equijoin
> > > > > key. Now the Join operator has two equivalent distributions, [a1]
> > > > > and [b1]. It is critical to expose both distributions so that the
> > > > > top Aggregate can take advantage of the co-location.
> > > > >
> > > > > Aggregate[group=b1]
> > > > >   DistributedJoin[a.a1=b.b1]   // SHARDED[a1], SHARDED[b1]
> > > > >     Input[a]                   // SHARDED[a1]
> > > > >     Input[b]                   // SHARDED[b1]
> > > > >
> > > > > A similar example for the Project:
> > > > > Aggregate[group=$1]
> > > > >   Project[$0=a, $1=a] // SHARDED[$0], SHARDED[$1]
> > > > >     Input             // SHARDED[a]
> > > > >
> > > > > The question is how to model this situation properly?
> > > > >
> > > > > First, it seems that RelMultipleTrait and RelCompositeTrait were
> > > > > designed to handle this situation. However, I couldn't make them
> > > > > work with the top-down optimizer. The reason is that when we
> > > > > register a RelNode with a composite trait in MEMO, VolcanoPlanner
> > > > > flattens the composite trait into the default trait value in
> > > > > RelSet.add -> RelTraitSet.simplify. That is, the trait
> > > > > [SHARDED[a], SHARDED[b]] will be converted to [ANY], so the
> > > > > original traits cannot be derived in the PhysicalNode.derive
> > > > > methods.
> > > > >
> > > > > Second, we may try to model multiple sharding keys in a single
> > > > > trait. But this complicates the implementation of
> > > > > PhysicalNode.passThrough/derive significantly:
> > > > > SHARDED[a1, a2], SHARDED[b1, b2] -> SHARDED[[a1, a2], [b1, b2]]
> > > > >
> > > > > Third, we may expose multiple traits using metadata.
> > > > > RelMdDistribution would not work, because it exposes only a
> > > > > single distribution, but a custom handler may potentially fix
> > > > > that. However, it would still not be integrated with the top-down
> > > > > optimizer, which makes the idea questionable.
> > > > >
> > > > > To summarize, it seems that there is currently no easy way to
> > > > > handle composite traits with a top-down optimizer. I wonder
> > > > > whether someone on the devlist has already solved similar issues
> > > > > in Apache Calcite or other optimizers. If so, what was the
> > > > > approach or best practice? Intuitively, it seems that the
> > > > > RelMultipleTrait/RelCompositeTrait approach might be the way to
> > > > > go. But why do we replace the original composite trait set with
> > > > > the default value in the RelTraitSet.simplify routine?
> > > > >
> > > > > Regards,
> > > > > Vladimir.
> > > > >
> > > >
> > >
> >
>
