Aman, RelDistribution will be an interface, and there’s no reason why Drill shouldn’t have its own values or even sub-classes, as long as RelDistributionTraitDef is able to canonize them. So you could, for instance, sub-class "Hash[1, 3]" and specify which hash function is being used.
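To make the sub-classing idea concrete, here is a rough sketch in plain Java. It does not use Calcite's classes: Distribution, HashDistribution, NamedHashDistribution and canonize are hypothetical stand-ins for RelDistribution, a hash distribution value, an engine-specific sub-class and the trait def's canonization, and "murmur3" is only an example name. It assumes canonization amounts to interning equal values.

```java
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;

/** Sketch only: every type here is a hypothetical stand-in, not a Calcite class. */
class DistributionSketch {

  /** Simplified stand-in for the proposed RelDistribution interface. */
  interface Distribution {
    String type();
    List<Integer> fields();
  }

  /** Plain hash distribution on a list of field ordinals, e.g. Hash[1, 3]. */
  static class HashDistribution implements Distribution {
    private final List<Integer> fields;

    HashDistribution(List<Integer> fields) {
      this.fields = List.copyOf(fields);
    }

    @Override public String type() { return "HASH_DISTRIBUTED"; }
    @Override public List<Integer> fields() { return fields; }

    // Equality includes the runtime class, so a sub-class never equals its parent.
    @Override public boolean equals(Object o) {
      return o != null
          && o.getClass() == getClass()
          && fields.equals(((HashDistribution) o).fields);
    }

    @Override public int hashCode() { return Objects.hash(getClass(), fields); }
    @Override public String toString() { return "Hash" + fields; }
  }

  /** Engine-specific sub-class that also records which hash function is used. */
  static class NamedHashDistribution extends HashDistribution {
    private final String hashFunction;

    NamedHashDistribution(List<Integer> fields, String hashFunction) {
      super(fields);
      this.hashFunction = Objects.requireNonNull(hashFunction);
    }

    String hashFunction() { return hashFunction; }

    // super.equals has already checked that o has this exact class.
    @Override public boolean equals(Object o) {
      return super.equals(o)
          && hashFunction.equals(((NamedHashDistribution) o).hashFunction);
    }

    @Override public int hashCode() {
      return 31 * super.hashCode() + hashFunction.hashCode();
    }

    @Override public String toString() {
      return super.toString() + "(" + hashFunction + ")";
    }
  }

  /** Intern table standing in for the trait def's canonization. */
  private static final Map<Distribution, Distribution> CANON = new ConcurrentHashMap<>();

  /** Returns the single shared instance for any value equal to d. */
  static Distribution canonize(Distribution d) {
    return CANON.computeIfAbsent(d, k -> k);
  }

  public static void main(String[] args) {
    Distribution a = canonize(new NamedHashDistribution(List.of(1, 3), "murmur3"));
    Distribution b = canonize(new NamedHashDistribution(List.of(1, 3), "murmur3"));
    System.out.println(a + " canonized to one instance: " + (a == b));
  }
}
```

The point of the sketch is that canonization only needs well-behaved equals and hashCode on the sub-class; two hash distributions on the same fields with different hash functions stay distinct traits.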
I’ve addressed the comment about logical exchange already: you can go straight to physical.

On Feb 11, 2015, at 2:34 PM, Aman Sinha <[email protected]> wrote:

> I am neutral on this for now until we give it more thought. The reason
> is that Calcite is not aware of the execution engine's capability
> and configuration parameters for distribution (e.g. Drill has a few
> parameters, including true/false flags, that determine whether
> or not an Exchange node is even inserted in the plan and, if it is used,
> what type of Exchange it is). In that sense, if the logical plan
> produced by Calcite contains a LogicalExchange, it is possible that Drill
> may not be able to use it directly while building the physical plan.
>
> I do however see the benefits in terms of trait propagation, combining
> distribution and collation traits and consolidating the subsumption logic
> in some base class such that it is useful for other consumers of Calcite.
>
> Aman
>
> On Wed, Feb 11, 2015 at 2:21 PM, Jinfeng Ni <[email protected]> wrote:
>
>> Drill currently does query planning in two phases: 1) logical planning,
>> which handles join order, logical filter/project push down etc., and 2)
>> physical planning, which chooses between different physical
>> operators (different join / aggregation methods), applies filter/project
>> push down (storage-specific rules), and inserts EXCHANGE. Part of the
>> reason for two phases is that when they are merged together, the planning
>> time increases significantly (since the planner needs to enumerate
>> different join orders, multiplied by different choices of EXCHANGE).
>>
>> The new rules that you are proposing seem to want to build the plan in one
>> single logical planning phase. I'm not sure how it will impact the overall
>> planning time.
>>
>> On Wed, Feb 11, 2015 at 1:38 PM, Jinfeng Ni <[email protected]> wrote:
>>
>>> I think it's a good proposal to put Exchange/Distribution into Calcite
>>> library.
>>>
>>> Makes sense to me. +1
>>>
>>> On Wed, Feb 11, 2015 at 12:45 PM, Julian Hyde <[email protected]> wrote:
>>>
>>>> Drill guys: What do you think of the proposal?
>>>>
>>>> On Feb 11, 2015, at 11:34 AM, Ashutosh Chauhan <[email protected]> wrote:
>>>>
>>>> Overall proposal sounds good to me. +1
>>>>
>>>> On Tue, Feb 10, 2015 at 3:35 PM, Julian Hyde <[email protected]> wrote:
>>>>
>>>> I've had some discussions about adding an Exchange operator and
>>>> Distribution trait to Hive's cost-based optimizer, which uses Calcite.
>>>> Ashutosh has logged a bug [
>>>> https://issues.apache.org/jira/browse/CALCITE-594 ] and pull request
>>>> containing a proof-of-concept [
>>>> https://github.com/apache/incubator-calcite/pull/52/files ].
>>>>
>>>> I know that Drill has a Distribution trait and several sub-classes of
>>>> Exchange operator (DrillDistributionTrait, ExchangePrel,
>>>> BroadcastExchangePrel, HashToMergeExchangePrel, HashToRandomExchangePrel,
>>>> OrderedPartitionExchangePrel and SimpleMergeExchangePrel, in
>>>> https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/planner/physical
>>>> )
>>>>
>>>> I propose to create a Distribution trait and Exchange operator base class
>>>> in Calcite, with the goal that both Drill and Hive would use them. (I am
>>>> adopting Drill terminology -- Distribution rather than Partition, Exchange
>>>> rather than Shuffle -- but I am pretty sure that the concepts are the
>>>> same.)
>>>>
>>>> public abstract class Exchange extends SingleRel {
>>>>   public final RelDistribution distribution;
>>>>
>>>>   protected Exchange(RelOptCluster cluster, RelTraitSet traitSet,
>>>>       RelNode input, RelDistribution distribution) {
>>>>     super(cluster, traitSet, input);
>>>>     this.distribution = distribution;
>>>>   }
>>>> }
>>>>
>>>> public interface RelDistribution extends RelMultipleTrait {
>>>>   enum DistributionType {
>>>>     SINGLETON,
>>>>     HASH_DISTRIBUTED,
>>>>     RANGE_DISTRIBUTED,
>>>>     RANDOM_DISTRIBUTED,
>>>>     ROUND_ROBIN_DISTRIBUTED,
>>>>     BROADCAST_DISTRIBUTED
>>>>   }
>>>>
>>>>   public DistributionType getType();
>>>>   public ImmutableIntList getFields();
>>>> }
>>>>
>>>> Calcite would not contain any particular exchange algorithms. However,
>>>> since it is common to combine sort and exchange, I would create a base
>>>> class for it:
>>>>
>>>> public abstract class SortExchange extends Exchange {
>>>>   public final RelCollation collation;
>>>>
>>>>   ...
>>>> }
>>>>
>>>> The physical operators would remain in Drill/Hive and would likely be
>>>> fully specified by the distribution and collation; they would not need any
>>>> additional attributes. We would not be able to port
>>>> DrillDistributionTraitDef.convert directly -- it would create a
>>>> LogicalExchange (analogous to how RelCollationTraitDef.convert creates a
>>>> LogicalSort) and then Drill rules would need to kick in to convert that to
>>>> HashToRandomExchangePrel etc.
>>>>
>>>> I do not think that RelDistribution needs to be a "multiple" trait (compare
>>>> with RelCollation extends RelMultipleTrait, which allows a RelNode to have
>>>> more than one sort-order) but I may be wrong.
>>>>
>>>> The advantages of making Exchange a first-class operator and Distribution a
>>>> trait are clear. We will be able to build a library of rules (e.g.
>>>> FilterExchangePushRule, ExchangeRemoveRule), a RelMdDistribution metadata
>>>> interface, and start working on stats and cost model.
>>>>
>>>> Drill and Hive stakeholders, please let me know what you think of this
>>>> plan.
>>>>
>>>> Julian
