Re: Exposing multiple values of a trait from the operator
If the general guideline is to not use composite/multiple traits then we should deprecate and remove them eventually. Personally I haven't used them in practice and I haven't seen a jira around this topic for quite some time now. If we deprecate them and people complain that we are removing a useful feature then we can reconsider. Best, Stamatis On Tue, May 25, 2021, 8:43 PM Haisheng Yuan wrote: > Negative, currently. > > It is indeed complicated, but still acceptable IMO. If the community have > consensus on this matter, we can create a JIRA to add the feature to trait > and handle equivalent keys trait satisfaction in trait.satisfy() and column > remapping in trait.apply(), hope that can ease some pain, if you happen to > use Calcite's default distribution/collation implementation. > > Regards, > Haisheng Yuan > > On 2021/05/25 17:45:32, Vladimir Ozerov wrote: > > Hi Haisheng, > > > > Thank you for the advice. This is exactly how I designed distribution at > > the moment (the approach 2 from my original email) - as a List > > instead of just int[]. My main concern was the increased complexity of > the > > trait propagation/derivation, as I have to manage these nested lists by > > hand. Nevertheless, it works well. So I hoped that there are better > > built-in approaches that I may use. If the answer is negative, I'll > > continue using the original approach, when multiple alternatives managed > > manually. > > > > Regards, > > Vladimir. > > > > вт, 25 мая 2021 г. в 20:30, Haisheng Yuan : > > > > > Hi Vladimir, > > > > > > Glad to see you raised the question. > > > > > > Here is the advice: > > > Do not use RelMultipleTrait/RelCompositeTrait, which is fundamentally > > > flawed and has many bugs. It can't work properly no matter for > top-down or > > > bottom-up. > > > > > > Instead, we need to add equivalent keys bitmap as the property of > physical > > > trait like RelCollation, RelDistribution. > > > > > > For example: > > > class RelDistributionImpl { > > > // list of distribution keys > > > private ImmutableIntList keys; > > > > > >// list of equivalent bitset for each distribution key > > > private ImmutableList equivBitSets; > > > } > > > > > > In the trait satisfy and column remapping, we also need to take > equivalent > > > keys into consideration. Some of the work need to be done in Calcite > core > > > framework. > > > > > > Greenplum Orca optimizer has similar strategy: > > > > > > > https://github.com/greenplum-db/gporca/blob/master/libgpopt/include/gpopt/base/CDistributionSpecHashed.h#L44 > > > > > > Regards, > > > Haisheng Yuan > > > > > > On 2021/05/25 15:37:32, Vladimir Ozerov wrote: > > > > Hi, > > > > > > > > Consider the distributed SQL engine that uses a distribution > property to > > > > model exchanges. Consider the following physical tree. To do the > > > > distributed join, we co-locate tuples using the equijoin key. Now the > > > Join > > > > operator has two equivalent distributions - [a1] and [b1]. It is > critical > > > > to expose both distributions so that the top Aggregate can take > advantage > > > > of the co-location. > > > > > > > > Aggregate[group=b1] > > > > DistributedJoin[a.a1=b.b1] // SHARDED[a1], SHARDED[b1] > > > > Input[a] // SHARDED[a1] > > > > Input[b] // SHARDED[b1] > > > > > > > > A similar example for the Project: > > > > Aggregate[group=$1] > > > > Project[$0=a, $1=a] // SHARDED[$0], SHARDED[$1] > > > > Input // SHARDED[a] > > > > > > > > The question is how to model this situation properly? > > > > > > > > First, it seems that RelMultipleTrait and RelCompositeTrait were > designed > > > > to handle this situation. However, I couldn't make them work with the > > > > top-down optimizer. The reason is that when we register a RelNode > with a > > > > composite trait in MEMO, VolcanoPlanner flattens the composite trait > into > > > > the default trait value in RelSet.add -> RelTraitSet.simplify. That > is, > > > the > > > > trait [SHARDED[a], SHARDED[b]] will be converted to [ANY] so that the > > > > original traits could not be derived in the PhysicalNode.derive > methods. > > > > > > > > Second, we may try to model multiple sharding keys in a single > trait. But > > > > this complicates the implementation of > PhysicalNode.passThrough/derive > > > > significantly. > > > > SHARDED[a1, a2], SHARDED[b1, b2] -> SHARDED[[a1, a2], [b1, b2]] > > > > > > > > Third, we may expose multiple traits using metadata. > RelMdDistribution > > > > would not work, because it exposes only a single distribution. But a > > > custom > > > > handler may potentially fix that. However, it will not be integrated > with > > > > the top-down optimizer still, which makes the idea questionable. > > > > > > > > To summarize, it seems that currently there is no easy way to handle > > > > composite traits with a top-down optimizer. I wonder whether someone > from > > > > the devlist already solved similar
Re: Exposing multiple values of a trait from the operator
Negative, currently. It is indeed complicated, but still acceptable IMO. If the community have consensus on this matter, we can create a JIRA to add the feature to trait and handle equivalent keys trait satisfaction in trait.satisfy() and column remapping in trait.apply(), hope that can ease some pain, if you happen to use Calcite's default distribution/collation implementation. Regards, Haisheng Yuan On 2021/05/25 17:45:32, Vladimir Ozerov wrote: > Hi Haisheng, > > Thank you for the advice. This is exactly how I designed distribution at > the moment (the approach 2 from my original email) - as a List > instead of just int[]. My main concern was the increased complexity of the > trait propagation/derivation, as I have to manage these nested lists by > hand. Nevertheless, it works well. So I hoped that there are better > built-in approaches that I may use. If the answer is negative, I'll > continue using the original approach, when multiple alternatives managed > manually. > > Regards, > Vladimir. > > вт, 25 мая 2021 г. в 20:30, Haisheng Yuan : > > > Hi Vladimir, > > > > Glad to see you raised the question. > > > > Here is the advice: > > Do not use RelMultipleTrait/RelCompositeTrait, which is fundamentally > > flawed and has many bugs. It can't work properly no matter for top-down or > > bottom-up. > > > > Instead, we need to add equivalent keys bitmap as the property of physical > > trait like RelCollation, RelDistribution. > > > > For example: > > class RelDistributionImpl { > > // list of distribution keys > > private ImmutableIntList keys; > > > >// list of equivalent bitset for each distribution key > > private ImmutableList equivBitSets; > > } > > > > In the trait satisfy and column remapping, we also need to take equivalent > > keys into consideration. Some of the work need to be done in Calcite core > > framework. > > > > Greenplum Orca optimizer has similar strategy: > > > > https://github.com/greenplum-db/gporca/blob/master/libgpopt/include/gpopt/base/CDistributionSpecHashed.h#L44 > > > > Regards, > > Haisheng Yuan > > > > On 2021/05/25 15:37:32, Vladimir Ozerov wrote: > > > Hi, > > > > > > Consider the distributed SQL engine that uses a distribution property to > > > model exchanges. Consider the following physical tree. To do the > > > distributed join, we co-locate tuples using the equijoin key. Now the > > Join > > > operator has two equivalent distributions - [a1] and [b1]. It is critical > > > to expose both distributions so that the top Aggregate can take advantage > > > of the co-location. > > > > > > Aggregate[group=b1] > > > DistributedJoin[a.a1=b.b1] // SHARDED[a1], SHARDED[b1] > > > Input[a] // SHARDED[a1] > > > Input[b] // SHARDED[b1] > > > > > > A similar example for the Project: > > > Aggregate[group=$1] > > > Project[$0=a, $1=a] // SHARDED[$0], SHARDED[$1] > > > Input // SHARDED[a] > > > > > > The question is how to model this situation properly? > > > > > > First, it seems that RelMultipleTrait and RelCompositeTrait were designed > > > to handle this situation. However, I couldn't make them work with the > > > top-down optimizer. The reason is that when we register a RelNode with a > > > composite trait in MEMO, VolcanoPlanner flattens the composite trait into > > > the default trait value in RelSet.add -> RelTraitSet.simplify. That is, > > the > > > trait [SHARDED[a], SHARDED[b]] will be converted to [ANY] so that the > > > original traits could not be derived in the PhysicalNode.derive methods. > > > > > > Second, we may try to model multiple sharding keys in a single trait. But > > > this complicates the implementation of PhysicalNode.passThrough/derive > > > significantly. > > > SHARDED[a1, a2], SHARDED[b1, b2] -> SHARDED[[a1, a2], [b1, b2]] > > > > > > Third, we may expose multiple traits using metadata. RelMdDistribution > > > would not work, because it exposes only a single distribution. But a > > custom > > > handler may potentially fix that. However, it will not be integrated with > > > the top-down optimizer still, which makes the idea questionable. > > > > > > To summarize, it seems that currently there is no easy way to handle > > > composite traits with a top-down optimizer. I wonder whether someone from > > > the devlist already solved similar issues in Apache Calcite or other > > > optimizers. If so, what was the approach or best practices? Intuitively, > > it > > > seems that RelMultipleTrait/RelCompositeTrait approach might be the way > > to > > > go. But why do we replace the original composite trait set with the > > default > > > value in the RelTraitSet.simplify routine? > > > > > > Regards, > > > Vladimir. > > > > > >
Re: Exposing multiple values of a trait from the operator
Hi Haisheng, Thank you for the advice. This is exactly how I designed distribution at the moment (the approach 2 from my original email) - as a List instead of just int[]. My main concern was the increased complexity of the trait propagation/derivation, as I have to manage these nested lists by hand. Nevertheless, it works well. So I hoped that there are better built-in approaches that I may use. If the answer is negative, I'll continue using the original approach, when multiple alternatives managed manually. Regards, Vladimir. вт, 25 мая 2021 г. в 20:30, Haisheng Yuan : > Hi Vladimir, > > Glad to see you raised the question. > > Here is the advice: > Do not use RelMultipleTrait/RelCompositeTrait, which is fundamentally > flawed and has many bugs. It can't work properly no matter for top-down or > bottom-up. > > Instead, we need to add equivalent keys bitmap as the property of physical > trait like RelCollation, RelDistribution. > > For example: > class RelDistributionImpl { > // list of distribution keys > private ImmutableIntList keys; > >// list of equivalent bitset for each distribution key > private ImmutableList equivBitSets; > } > > In the trait satisfy and column remapping, we also need to take equivalent > keys into consideration. Some of the work need to be done in Calcite core > framework. > > Greenplum Orca optimizer has similar strategy: > > https://github.com/greenplum-db/gporca/blob/master/libgpopt/include/gpopt/base/CDistributionSpecHashed.h#L44 > > Regards, > Haisheng Yuan > > On 2021/05/25 15:37:32, Vladimir Ozerov wrote: > > Hi, > > > > Consider the distributed SQL engine that uses a distribution property to > > model exchanges. Consider the following physical tree. To do the > > distributed join, we co-locate tuples using the equijoin key. Now the > Join > > operator has two equivalent distributions - [a1] and [b1]. It is critical > > to expose both distributions so that the top Aggregate can take advantage > > of the co-location. > > > > Aggregate[group=b1] > > DistributedJoin[a.a1=b.b1] // SHARDED[a1], SHARDED[b1] > > Input[a] // SHARDED[a1] > > Input[b] // SHARDED[b1] > > > > A similar example for the Project: > > Aggregate[group=$1] > > Project[$0=a, $1=a] // SHARDED[$0], SHARDED[$1] > > Input // SHARDED[a] > > > > The question is how to model this situation properly? > > > > First, it seems that RelMultipleTrait and RelCompositeTrait were designed > > to handle this situation. However, I couldn't make them work with the > > top-down optimizer. The reason is that when we register a RelNode with a > > composite trait in MEMO, VolcanoPlanner flattens the composite trait into > > the default trait value in RelSet.add -> RelTraitSet.simplify. That is, > the > > trait [SHARDED[a], SHARDED[b]] will be converted to [ANY] so that the > > original traits could not be derived in the PhysicalNode.derive methods. > > > > Second, we may try to model multiple sharding keys in a single trait. But > > this complicates the implementation of PhysicalNode.passThrough/derive > > significantly. > > SHARDED[a1, a2], SHARDED[b1, b2] -> SHARDED[[a1, a2], [b1, b2]] > > > > Third, we may expose multiple traits using metadata. RelMdDistribution > > would not work, because it exposes only a single distribution. But a > custom > > handler may potentially fix that. However, it will not be integrated with > > the top-down optimizer still, which makes the idea questionable. > > > > To summarize, it seems that currently there is no easy way to handle > > composite traits with a top-down optimizer. I wonder whether someone from > > the devlist already solved similar issues in Apache Calcite or other > > optimizers. If so, what was the approach or best practices? Intuitively, > it > > seems that RelMultipleTrait/RelCompositeTrait approach might be the way > to > > go. But why do we replace the original composite trait set with the > default > > value in the RelTraitSet.simplify routine? > > > > Regards, > > Vladimir. > > >
Re: Exposing multiple values of a trait from the operator
Hi Vladimir, Glad to see you raised the question. Here is the advice: Do not use RelMultipleTrait/RelCompositeTrait, which is fundamentally flawed and has many bugs. It can't work properly no matter for top-down or bottom-up. Instead, we need to add equivalent keys bitmap as the property of physical trait like RelCollation, RelDistribution. For example: class RelDistributionImpl { // list of distribution keys private ImmutableIntList keys; // list of equivalent bitset for each distribution key private ImmutableList equivBitSets; } In the trait satisfy and column remapping, we also need to take equivalent keys into consideration. Some of the work need to be done in Calcite core framework. Greenplum Orca optimizer has similar strategy: https://github.com/greenplum-db/gporca/blob/master/libgpopt/include/gpopt/base/CDistributionSpecHashed.h#L44 Regards, Haisheng Yuan On 2021/05/25 15:37:32, Vladimir Ozerov wrote: > Hi, > > Consider the distributed SQL engine that uses a distribution property to > model exchanges. Consider the following physical tree. To do the > distributed join, we co-locate tuples using the equijoin key. Now the Join > operator has two equivalent distributions - [a1] and [b1]. It is critical > to expose both distributions so that the top Aggregate can take advantage > of the co-location. > > Aggregate[group=b1] > DistributedJoin[a.a1=b.b1] // SHARDED[a1], SHARDED[b1] > Input[a] // SHARDED[a1] > Input[b] // SHARDED[b1] > > A similar example for the Project: > Aggregate[group=$1] > Project[$0=a, $1=a] // SHARDED[$0], SHARDED[$1] > Input // SHARDED[a] > > The question is how to model this situation properly? > > First, it seems that RelMultipleTrait and RelCompositeTrait were designed > to handle this situation. However, I couldn't make them work with the > top-down optimizer. The reason is that when we register a RelNode with a > composite trait in MEMO, VolcanoPlanner flattens the composite trait into > the default trait value in RelSet.add -> RelTraitSet.simplify. That is, the > trait [SHARDED[a], SHARDED[b]] will be converted to [ANY] so that the > original traits could not be derived in the PhysicalNode.derive methods. > > Second, we may try to model multiple sharding keys in a single trait. But > this complicates the implementation of PhysicalNode.passThrough/derive > significantly. > SHARDED[a1, a2], SHARDED[b1, b2] -> SHARDED[[a1, a2], [b1, b2]] > > Third, we may expose multiple traits using metadata. RelMdDistribution > would not work, because it exposes only a single distribution. But a custom > handler may potentially fix that. However, it will not be integrated with > the top-down optimizer still, which makes the idea questionable. > > To summarize, it seems that currently there is no easy way to handle > composite traits with a top-down optimizer. I wonder whether someone from > the devlist already solved similar issues in Apache Calcite or other > optimizers. If so, what was the approach or best practices? Intuitively, it > seems that RelMultipleTrait/RelCompositeTrait approach might be the way to > go. But why do we replace the original composite trait set with the default > value in the RelTraitSet.simplify routine? > > Regards, > Vladimir. >
Re: Exposing multiple values of a trait from the operator
Hi Vladimir, Thank you for the link. It is very relevant to my problem. I see that in these discussions, there were several ideas and claims, such as that (1) we can get rid of "simplify" altogether, (2) composite traits are rare in practice, (3) composite traits are not designed well in the first place [1]. I do not have the full picture in my head, so I'll try to share some thoughts to advance the discussion. Regarding (1), I think that the removal of "simplify" may help with my particular (and pretty simple) test but might lead to some unpredictable results for more complicated queries. Suppose that we generate two equivalent nodes with different traits: [a] and [a][b]. Depending on the nature of the trait def, these two nodes might or might belong to the same subset. For example, [a] and [a][b] are different subsets for RelCollcation. At the same time, [a] and [a][b] could belong to the same subset for some distributions. That is, if the input is hash-distributed by either [a] or [b], it might imply that a==b for every tuple (otherwise, hashes will not match), and therefore every RelNode in the RelSet that is shared by [a] is also sharded by [b] and vice verse. The idea is similar to transitive predicates. So ideally, we should let the RelTraitDef define how to compare composite traits with other traits. Otherwise, we may lose some optimization opportunities. Regarding (2), perhaps the multi-collation nodes are really rare in practice. But nodes with multiple hash distributions are widespread for distributed engines. Because in distributed systems, the collocated hash equijoin is the most common way of joining two inputs, and such join always produces an additional distribution. Regarding (3), it would be very interesting to hear suggestions and ideas on the proper design of composite traits. The composite traits mechanics mentioned in RelSubset Javadoc's is not a good design choice for distribution traits. That is, if we have a node that is distributed by [a][b], we cannot just put it into two subsets [a] and [b], because operator parents may require both [a] and [b], otherwise unnecessary exchanges could appear. That is, [a][b] should be propagated together. For example, the removal of SHARDED[a1] from #1 would add the exchange between #2 and #1, and the removal of SHARDED[b1] from #1 would add the exchange between #3 and #2. Neither is optimal. 3: Aggregate[group=b1] 2: Join[a.a1=c.c1] // SHARDED[a1], SHARDED[b1], SHARDED[c1] 1: Join[a.a1=b.b1] // SHARDED[a1], SHARDED[b1] @Haisheng Yuan , following your comment [1], would you mind providing your ideas around the proper design of composite traits? Are composite traits implemented in Orca? Regards, Vladimir. [1] https://issues.apache.org/jira/browse/CALCITE-2593?focusedCommentId=17081984=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17081984 вт, 25 мая 2021 г. в 19:32, Vladimir Sitnikov : > >VolcanoPlanner flattens the composite trait into > the default trait value in RelSet.add -> RelTraitSet.simplify > > Vladimir, have you tried removing that RelTraitSet.simplify? > > I remember I have run into that multiple times already, and I suggested > removing that "simplify". > For example, > > https://issues.apache.org/jira/browse/CALCITE-2593?focusedCommentId=16750377=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16750377 > > > Vladimir >
Re: Exposing multiple values of a trait from the operator
>VolcanoPlanner flattens the composite trait into the default trait value in RelSet.add -> RelTraitSet.simplify Vladimir, have you tried removing that RelTraitSet.simplify? I remember I have run into that multiple times already, and I suggested removing that "simplify". For example, https://issues.apache.org/jira/browse/CALCITE-2593?focusedCommentId=16750377=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16750377 Vladimir
Exposing multiple values of a trait from the operator
Hi, Consider the distributed SQL engine that uses a distribution property to model exchanges. Consider the following physical tree. To do the distributed join, we co-locate tuples using the equijoin key. Now the Join operator has two equivalent distributions - [a1] and [b1]. It is critical to expose both distributions so that the top Aggregate can take advantage of the co-location. Aggregate[group=b1] DistributedJoin[a.a1=b.b1] // SHARDED[a1], SHARDED[b1] Input[a] // SHARDED[a1] Input[b] // SHARDED[b1] A similar example for the Project: Aggregate[group=$1] Project[$0=a, $1=a] // SHARDED[$0], SHARDED[$1] Input // SHARDED[a] The question is how to model this situation properly? First, it seems that RelMultipleTrait and RelCompositeTrait were designed to handle this situation. However, I couldn't make them work with the top-down optimizer. The reason is that when we register a RelNode with a composite trait in MEMO, VolcanoPlanner flattens the composite trait into the default trait value in RelSet.add -> RelTraitSet.simplify. That is, the trait [SHARDED[a], SHARDED[b]] will be converted to [ANY] so that the original traits could not be derived in the PhysicalNode.derive methods. Second, we may try to model multiple sharding keys in a single trait. But this complicates the implementation of PhysicalNode.passThrough/derive significantly. SHARDED[a1, a2], SHARDED[b1, b2] -> SHARDED[[a1, a2], [b1, b2]] Third, we may expose multiple traits using metadata. RelMdDistribution would not work, because it exposes only a single distribution. But a custom handler may potentially fix that. However, it will not be integrated with the top-down optimizer still, which makes the idea questionable. To summarize, it seems that currently there is no easy way to handle composite traits with a top-down optimizer. I wonder whether someone from the devlist already solved similar issues in Apache Calcite or other optimizers. If so, what was the approach or best practices? Intuitively, it seems that RelMultipleTrait/RelCompositeTrait approach might be the way to go. But why do we replace the original composite trait set with the default value in the RelTraitSet.simplify routine? Regards, Vladimir.