subject:"Exposing multiple values of a trait from the operator"

Re: Exposing multiple values of a trait from the operator

2021-05-26 Thread Stamatis Zampetakis

If the general guideline is to not use composite/multiple traits then we
should deprecate and remove them eventually.

Personally I haven't used them in practice and I haven't seen a jira around
this topic for quite some time now.

If we deprecate them and people complain that we are removing a useful
feature then we can reconsider.

Best,
Stamatis

On Tue, May 25, 2021, 8:43 PM Haisheng Yuan  wrote:

> Negative, currently.
>
> It is indeed complicated, but still acceptable IMO. If the community have
> consensus on this matter, we can create a JIRA to add the feature to trait
> and handle equivalent keys trait satisfaction in trait.satisfy() and column
> remapping in trait.apply(), hope that can ease some pain, if you happen to
> use Calcite's default distribution/collation implementation.
>
> Regards,
> Haisheng Yuan
>
> On 2021/05/25 17:45:32, Vladimir Ozerov  wrote:
> > Hi Haisheng,
> >
> > Thank you for the advice. This is exactly how I designed distribution at
> > the moment (the approach 2 from my original email) - as a List
> > instead of just int[]. My main concern was the increased complexity of
> the
> > trait propagation/derivation, as I have to manage these nested lists by
> > hand. Nevertheless, it works well. So I hoped that there are better
> > built-in approaches that I may use. If the answer is negative, I'll
> > continue using the original approach, when multiple alternatives managed
> > manually.
> >
> > Regards,
> > Vladimir.
> >
> > вт, 25 мая 2021 г. в 20:30, Haisheng Yuan :
> >
> > > Hi Vladimir,
> > >
> > > Glad to see you raised the question.
> > >
> > > Here is the advice:
> > > Do not use RelMultipleTrait/RelCompositeTrait, which is fundamentally
> > > flawed and has many bugs. It can't work properly no matter for
> top-down or
> > > bottom-up.
> > >
> > > Instead, we need to add equivalent keys bitmap as the property of
> physical
> > > trait like RelCollation, RelDistribution.
> > >
> > > For example:
> > > class RelDistributionImpl {
> > >   // list of distribution keys
> > >   private ImmutableIntList keys;
> > >
> > >// list of equivalent bitset for each distribution key
> > >   private ImmutableList equivBitSets;
> > > }
> > >
> > > In the trait satisfy and column remapping, we also need to take
> equivalent
> > > keys into consideration. Some of the work need to be done in Calcite
> core
> > > framework.
> > >
> > > Greenplum Orca optimizer has similar strategy:
> > >
> > >
> https://github.com/greenplum-db/gporca/blob/master/libgpopt/include/gpopt/base/CDistributionSpecHashed.h#L44
> > >
> > > Regards,
> > > Haisheng Yuan
> > >
> > > On 2021/05/25 15:37:32, Vladimir Ozerov  wrote:
> > > > Hi,
> > > >
> > > > Consider the distributed SQL engine that uses a distribution
> property to
> > > > model exchanges. Consider the following physical tree. To do the
> > > > distributed join, we co-locate tuples using the equijoin key. Now the
> > > Join
> > > > operator has two equivalent distributions - [a1] and [b1]. It is
> critical
> > > > to expose both distributions so that the top Aggregate can take
> advantage
> > > > of the co-location.
> > > >
> > > > Aggregate[group=b1]
> > > >   DistributedJoin[a.a1=b.b1]   // SHARDED[a1], SHARDED[b1]
> > > > Input[a]   // SHARDED[a1]
> > > > Input[b]   // SHARDED[b1]
> > > >
> > > > A similar example for the Project:
> > > > Aggregate[group=$1]
> > > >   Project[$0=a, $1=a] // SHARDED[$0], SHARDED[$1]
> > > > Input // SHARDED[a]
> > > >
> > > > The question is how to model this situation properly?
> > > >
> > > > First, it seems that RelMultipleTrait and RelCompositeTrait were
> designed
> > > > to handle this situation. However, I couldn't make them work with the
> > > > top-down optimizer. The reason is that when we register a RelNode
> with a
> > > > composite trait in MEMO, VolcanoPlanner flattens the composite trait
> into
> > > > the default trait value in RelSet.add -> RelTraitSet.simplify. That
> is,
> > > the
> > > > trait [SHARDED[a], SHARDED[b]] will be converted to [ANY] so that the
> > > > original traits could not be derived in the PhysicalNode.derive
> methods.
> > > >
> > > > Second, we may try to model multiple sharding keys in a single
> trait. But
> > > > this complicates the implementation of
> PhysicalNode.passThrough/derive
> > > > significantly.
> > > > SHARDED[a1, a2], SHARDED[b1, b2] -> SHARDED[[a1, a2], [b1, b2]]
> > > >
> > > > Third, we may expose multiple traits using metadata.
> RelMdDistribution
> > > > would not work, because it exposes only a single distribution. But a
> > > custom
> > > > handler may potentially fix that. However, it will not be integrated
> with
> > > > the top-down optimizer still, which makes the idea questionable.
> > > >
> > > > To summarize, it seems that currently there is no easy way to handle
> > > > composite traits with a top-down optimizer. I wonder whether someone
> from
> > > > the devlist already solved similar

Re: Exposing multiple values of a trait from the operator

2021-05-25 Thread Haisheng Yuan

Negative, currently.

It is indeed complicated, but still acceptable IMO. If the community have 
consensus on this matter, we can create a JIRA to add the feature to trait and 
handle equivalent keys trait satisfaction in trait.satisfy() and column 
remapping in trait.apply(), hope that can ease some pain, if you happen to use 
Calcite's default distribution/collation implementation. 

Regards,
Haisheng Yuan

On 2021/05/25 17:45:32, Vladimir Ozerov  wrote: 
> Hi Haisheng,
> 
> Thank you for the advice. This is exactly how I designed distribution at
> the moment (the approach 2 from my original email) - as a List
> instead of just int[]. My main concern was the increased complexity of the
> trait propagation/derivation, as I have to manage these nested lists by
> hand. Nevertheless, it works well. So I hoped that there are better
> built-in approaches that I may use. If the answer is negative, I'll
> continue using the original approach, when multiple alternatives managed
> manually.
> 
> Regards,
> Vladimir.
> 
> вт, 25 мая 2021 г. в 20:30, Haisheng Yuan :
> 
> > Hi Vladimir,
> >
> > Glad to see you raised the question.
> >
> > Here is the advice:
> > Do not use RelMultipleTrait/RelCompositeTrait, which is fundamentally
> > flawed and has many bugs. It can't work properly no matter for top-down or
> > bottom-up.
> >
> > Instead, we need to add equivalent keys bitmap as the property of physical
> > trait like RelCollation, RelDistribution.
> >
> > For example:
> > class RelDistributionImpl {
> >   // list of distribution keys
> >   private ImmutableIntList keys;
> >
> >// list of equivalent bitset for each distribution key
> >   private ImmutableList equivBitSets;
> > }
> >
> > In the trait satisfy and column remapping, we also need to take equivalent
> > keys into consideration. Some of the work need to be done in Calcite core
> > framework.
> >
> > Greenplum Orca optimizer has similar strategy:
> >
> > https://github.com/greenplum-db/gporca/blob/master/libgpopt/include/gpopt/base/CDistributionSpecHashed.h#L44
> >
> > Regards,
> > Haisheng Yuan
> >
> > On 2021/05/25 15:37:32, Vladimir Ozerov  wrote:
> > > Hi,
> > >
> > > Consider the distributed SQL engine that uses a distribution property to
> > > model exchanges. Consider the following physical tree. To do the
> > > distributed join, we co-locate tuples using the equijoin key. Now the
> > Join
> > > operator has two equivalent distributions - [a1] and [b1]. It is critical
> > > to expose both distributions so that the top Aggregate can take advantage
> > > of the co-location.
> > >
> > > Aggregate[group=b1]
> > >   DistributedJoin[a.a1=b.b1]   // SHARDED[a1], SHARDED[b1]
> > > Input[a]   // SHARDED[a1]
> > > Input[b]   // SHARDED[b1]
> > >
> > > A similar example for the Project:
> > > Aggregate[group=$1]
> > >   Project[$0=a, $1=a] // SHARDED[$0], SHARDED[$1]
> > > Input // SHARDED[a]
> > >
> > > The question is how to model this situation properly?
> > >
> > > First, it seems that RelMultipleTrait and RelCompositeTrait were designed
> > > to handle this situation. However, I couldn't make them work with the
> > > top-down optimizer. The reason is that when we register a RelNode with a
> > > composite trait in MEMO, VolcanoPlanner flattens the composite trait into
> > > the default trait value in RelSet.add -> RelTraitSet.simplify. That is,
> > the
> > > trait [SHARDED[a], SHARDED[b]] will be converted to [ANY] so that the
> > > original traits could not be derived in the PhysicalNode.derive methods.
> > >
> > > Second, we may try to model multiple sharding keys in a single trait. But
> > > this complicates the implementation of PhysicalNode.passThrough/derive
> > > significantly.
> > > SHARDED[a1, a2], SHARDED[b1, b2] -> SHARDED[[a1, a2], [b1, b2]]
> > >
> > > Third, we may expose multiple traits using metadata. RelMdDistribution
> > > would not work, because it exposes only a single distribution. But a
> > custom
> > > handler may potentially fix that. However, it will not be integrated with
> > > the top-down optimizer still, which makes the idea questionable.
> > >
> > > To summarize, it seems that currently there is no easy way to handle
> > > composite traits with a top-down optimizer. I wonder whether someone from
> > > the devlist already solved similar issues in Apache Calcite or other
> > > optimizers. If so, what was the approach or best practices? Intuitively,
> > it
> > > seems that RelMultipleTrait/RelCompositeTrait approach might be the way
> > to
> > > go. But why do we replace the original composite trait set with the
> > default
> > > value in the RelTraitSet.simplify routine?
> > >
> > > Regards,
> > > Vladimir.
> > >
> >
>

Re: Exposing multiple values of a trait from the operator

2021-05-25 Thread Vladimir Ozerov

Hi Haisheng,

Thank you for the advice. This is exactly how I designed distribution at
the moment (the approach 2 from my original email) - as a List
instead of just int[]. My main concern was the increased complexity of the
trait propagation/derivation, as I have to manage these nested lists by
hand. Nevertheless, it works well. So I hoped that there are better
built-in approaches that I may use. If the answer is negative, I'll
continue using the original approach, when multiple alternatives managed
manually.

Regards,
Vladimir.

вт, 25 мая 2021 г. в 20:30, Haisheng Yuan :

> Hi Vladimir,
>
> Glad to see you raised the question.
>
> Here is the advice:
> Do not use RelMultipleTrait/RelCompositeTrait, which is fundamentally
> flawed and has many bugs. It can't work properly no matter for top-down or
> bottom-up.
>
> Instead, we need to add equivalent keys bitmap as the property of physical
> trait like RelCollation, RelDistribution.
>
> For example:
> class RelDistributionImpl {
>   // list of distribution keys
>   private ImmutableIntList keys;
>
>// list of equivalent bitset for each distribution key
>   private ImmutableList equivBitSets;
> }
>
> In the trait satisfy and column remapping, we also need to take equivalent
> keys into consideration. Some of the work need to be done in Calcite core
> framework.
>
> Greenplum Orca optimizer has similar strategy:
>
> https://github.com/greenplum-db/gporca/blob/master/libgpopt/include/gpopt/base/CDistributionSpecHashed.h#L44
>
> Regards,
> Haisheng Yuan
>
> On 2021/05/25 15:37:32, Vladimir Ozerov  wrote:
> > Hi,
> >
> > Consider the distributed SQL engine that uses a distribution property to
> > model exchanges. Consider the following physical tree. To do the
> > distributed join, we co-locate tuples using the equijoin key. Now the
> Join
> > operator has two equivalent distributions - [a1] and [b1]. It is critical
> > to expose both distributions so that the top Aggregate can take advantage
> > of the co-location.
> >
> > Aggregate[group=b1]
> >   DistributedJoin[a.a1=b.b1]   // SHARDED[a1], SHARDED[b1]
> > Input[a]   // SHARDED[a1]
> > Input[b]   // SHARDED[b1]
> >
> > A similar example for the Project:
> > Aggregate[group=$1]
> >   Project[$0=a, $1=a] // SHARDED[$0], SHARDED[$1]
> > Input // SHARDED[a]
> >
> > The question is how to model this situation properly?
> >
> > First, it seems that RelMultipleTrait and RelCompositeTrait were designed
> > to handle this situation. However, I couldn't make them work with the
> > top-down optimizer. The reason is that when we register a RelNode with a
> > composite trait in MEMO, VolcanoPlanner flattens the composite trait into
> > the default trait value in RelSet.add -> RelTraitSet.simplify. That is,
> the
> > trait [SHARDED[a], SHARDED[b]] will be converted to [ANY] so that the
> > original traits could not be derived in the PhysicalNode.derive methods.
> >
> > Second, we may try to model multiple sharding keys in a single trait. But
> > this complicates the implementation of PhysicalNode.passThrough/derive
> > significantly.
> > SHARDED[a1, a2], SHARDED[b1, b2] -> SHARDED[[a1, a2], [b1, b2]]
> >
> > Third, we may expose multiple traits using metadata. RelMdDistribution
> > would not work, because it exposes only a single distribution. But a
> custom
> > handler may potentially fix that. However, it will not be integrated with
> > the top-down optimizer still, which makes the idea questionable.
> >
> > To summarize, it seems that currently there is no easy way to handle
> > composite traits with a top-down optimizer. I wonder whether someone from
> > the devlist already solved similar issues in Apache Calcite or other
> > optimizers. If so, what was the approach or best practices? Intuitively,
> it
> > seems that RelMultipleTrait/RelCompositeTrait approach might be the way
> to
> > go. But why do we replace the original composite trait set with the
> default
> > value in the RelTraitSet.simplify routine?
> >
> > Regards,
> > Vladimir.
> >
>

Re: Exposing multiple values of a trait from the operator

2021-05-25 Thread Haisheng Yuan

Hi Vladimir,

Glad to see you raised the question.

Here is the advice:
Do not use RelMultipleTrait/RelCompositeTrait, which is fundamentally flawed 
and has many bugs. It can't work properly no matter for top-down or bottom-up.

Instead, we need to add equivalent keys bitmap as the property of physical 
trait like RelCollation, RelDistribution.

For example:
class RelDistributionImpl {
  // list of distribution keys
  private ImmutableIntList keys; 

   // list of equivalent bitset for each distribution key
  private ImmutableList equivBitSets; 
}

In the trait satisfy and column remapping, we also need to take equivalent keys 
into consideration. Some of the work need to be done in Calcite core framework.

Greenplum Orca optimizer has similar strategy:
https://github.com/greenplum-db/gporca/blob/master/libgpopt/include/gpopt/base/CDistributionSpecHashed.h#L44

Regards,
Haisheng Yuan

On 2021/05/25 15:37:32, Vladimir Ozerov  wrote: 
> Hi,
> 
> Consider the distributed SQL engine that uses a distribution property to
> model exchanges. Consider the following physical tree. To do the
> distributed join, we co-locate tuples using the equijoin key. Now the Join
> operator has two equivalent distributions - [a1] and [b1]. It is critical
> to expose both distributions so that the top Aggregate can take advantage
> of the co-location.
> 
> Aggregate[group=b1]
>   DistributedJoin[a.a1=b.b1]   // SHARDED[a1], SHARDED[b1]
> Input[a]   // SHARDED[a1]
> Input[b]   // SHARDED[b1]
> 
> A similar example for the Project:
> Aggregate[group=$1]
>   Project[$0=a, $1=a] // SHARDED[$0], SHARDED[$1]
> Input // SHARDED[a]
> 
> The question is how to model this situation properly?
> 
> First, it seems that RelMultipleTrait and RelCompositeTrait were designed
> to handle this situation. However, I couldn't make them work with the
> top-down optimizer. The reason is that when we register a RelNode with a
> composite trait in MEMO, VolcanoPlanner flattens the composite trait into
> the default trait value in RelSet.add -> RelTraitSet.simplify. That is, the
> trait [SHARDED[a], SHARDED[b]] will be converted to [ANY] so that the
> original traits could not be derived in the PhysicalNode.derive methods.
> 
> Second, we may try to model multiple sharding keys in a single trait. But
> this complicates the implementation of PhysicalNode.passThrough/derive
> significantly.
> SHARDED[a1, a2], SHARDED[b1, b2] -> SHARDED[[a1, a2], [b1, b2]]
> 
> Third, we may expose multiple traits using metadata. RelMdDistribution
> would not work, because it exposes only a single distribution. But a custom
> handler may potentially fix that. However, it will not be integrated with
> the top-down optimizer still, which makes the idea questionable.
> 
> To summarize, it seems that currently there is no easy way to handle
> composite traits with a top-down optimizer. I wonder whether someone from
> the devlist already solved similar issues in Apache Calcite or other
> optimizers. If so, what was the approach or best practices? Intuitively, it
> seems that RelMultipleTrait/RelCompositeTrait approach might be the way to
> go. But why do we replace the original composite trait set with the default
> value in the RelTraitSet.simplify routine?
> 
> Regards,
> Vladimir.
>

Re: Exposing multiple values of a trait from the operator

2021-05-25 Thread Vladimir Ozerov

Hi Vladimir,

Thank you for the link. It is very relevant to my problem. I see that in
these discussions, there were several ideas and claims, such as that (1) we
can get rid of "simplify" altogether, (2) composite traits are rare in
practice, (3) composite traits are not designed well in the first place
[1]. I do not have the full picture in my head, so I'll try to share some
thoughts to advance the discussion.

Regarding (1), I think that the removal of "simplify" may help with my
particular (and pretty simple) test but might lead to some unpredictable
results for more complicated queries. Suppose that we generate two
equivalent nodes with different traits: [a] and [a][b]. Depending on the
nature of the trait def, these two nodes might or might belong to the same
subset. For example, [a] and [a][b] are different subsets for
RelCollcation. At the same time, [a] and [a][b] could belong to the same
subset for some distributions. That is, if the input is hash-distributed by
either [a] or [b], it might imply that a==b for every tuple (otherwise,
hashes will not match), and therefore every RelNode in the RelSet that is
shared by [a] is also sharded by [b] and vice verse. The idea is similar to
transitive predicates. So ideally, we should let the RelTraitDef define how
to compare composite traits with other traits. Otherwise, we may lose some
optimization opportunities.

Regarding (2), perhaps the multi-collation nodes are really rare in
practice. But nodes with multiple hash distributions are widespread for
distributed engines. Because in distributed systems, the collocated hash
equijoin is the most common way of joining two inputs, and such join always
produces an additional distribution.

Regarding (3), it would be very interesting to hear suggestions and ideas
on the proper design of composite traits. The composite traits mechanics
mentioned in RelSubset Javadoc's is not a good design choice for
distribution traits. That is, if we have a node that is distributed by
[a][b], we cannot just put it into two subsets [a] and [b], because
operator parents may require both [a] and [b], otherwise unnecessary
exchanges could appear. That is, [a][b] should be propagated together. For
example, the removal of SHARDED[a1] from #1 would add the exchange between
#2 and #1, and the removal of SHARDED[b1] from #1 would add the exchange
between #3 and #2. Neither is optimal.
3: Aggregate[group=b1]
2:   Join[a.a1=c.c1]   // SHARDED[a1], SHARDED[b1], SHARDED[c1]
1: Join[a.a1=b.b1] // SHARDED[a1], SHARDED[b1]

@Haisheng Yuan , following your comment [1], would
you mind providing your ideas around the proper design of composite traits?
Are composite traits implemented in Orca?

Regards,
Vladimir.

[1]
https://issues.apache.org/jira/browse/CALCITE-2593?focusedCommentId=17081984=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17081984

вт, 25 мая 2021 г. в 19:32, Vladimir Sitnikov :

> >VolcanoPlanner flattens the composite trait into
> the default trait value in RelSet.add -> RelTraitSet.simplify
>
> Vladimir, have you tried removing that RelTraitSet.simplify?
>
> I remember I have run into that multiple times already, and I suggested
> removing that "simplify".
> For example,
>
> https://issues.apache.org/jira/browse/CALCITE-2593?focusedCommentId=16750377=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16750377
>
>
> Vladimir
>

Re: Exposing multiple values of a trait from the operator

2021-05-25 Thread Vladimir Sitnikov

>VolcanoPlanner flattens the composite trait into
the default trait value in RelSet.add -> RelTraitSet.simplify

Vladimir, have you tried removing that RelTraitSet.simplify?

I remember I have run into that multiple times already, and I suggested
removing that "simplify".
For example,
https://issues.apache.org/jira/browse/CALCITE-2593?focusedCommentId=16750377=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16750377


Vladimir

Exposing multiple values of a trait from the operator

2021-05-25 Thread Vladimir Ozerov

Hi,

Consider the distributed SQL engine that uses a distribution property to
model exchanges. Consider the following physical tree. To do the
distributed join, we co-locate tuples using the equijoin key. Now the Join
operator has two equivalent distributions - [a1] and [b1]. It is critical
to expose both distributions so that the top Aggregate can take advantage
of the co-location.

Aggregate[group=b1]
  DistributedJoin[a.a1=b.b1]   // SHARDED[a1], SHARDED[b1]
Input[a]   // SHARDED[a1]
Input[b]   // SHARDED[b1]

A similar example for the Project:
Aggregate[group=$1]
  Project[$0=a, $1=a] // SHARDED[$0], SHARDED[$1]
Input // SHARDED[a]

The question is how to model this situation properly?

First, it seems that RelMultipleTrait and RelCompositeTrait were designed
to handle this situation. However, I couldn't make them work with the
top-down optimizer. The reason is that when we register a RelNode with a
composite trait in MEMO, VolcanoPlanner flattens the composite trait into
the default trait value in RelSet.add -> RelTraitSet.simplify. That is, the
trait [SHARDED[a], SHARDED[b]] will be converted to [ANY] so that the
original traits could not be derived in the PhysicalNode.derive methods.

Second, we may try to model multiple sharding keys in a single trait. But
this complicates the implementation of PhysicalNode.passThrough/derive
significantly.
SHARDED[a1, a2], SHARDED[b1, b2] -> SHARDED[[a1, a2], [b1, b2]]

Third, we may expose multiple traits using metadata. RelMdDistribution
would not work, because it exposes only a single distribution. But a custom
handler may potentially fix that. However, it will not be integrated with
the top-down optimizer still, which makes the idea questionable.

To summarize, it seems that currently there is no easy way to handle
composite traits with a top-down optimizer. I wonder whether someone from
the devlist already solved similar issues in Apache Calcite or other
optimizers. If so, what was the approach or best practices? Intuitively, it
seems that RelMultipleTrait/RelCompositeTrait approach might be the way to
go. But why do we replace the original composite trait set with the default
value in the RelTraitSet.simplify routine?

Regards,
Vladimir.

Re: Exposing multiple values of a trait from the operator

Re: Exposing multiple values of a trait from the operator

Re: Exposing multiple values of a trait from the operator

Re: Exposing multiple values of a trait from the operator

Re: Exposing multiple values of a trait from the operator

Re: Exposing multiple values of a trait from the operator

Exposing multiple values of a trait from the operator

7 matches

Site Navigation

Mail list logo

Footer information