Drill guys: What do you think of the proposal? On Feb 11, 2015, at 11:34 AM, Ashutosh Chauhan <[email protected]> wrote:
Overall proposal sounds good to me. +1 On Tue, Feb 10, 2015 at 3:35 PM, Julian Hyde <[email protected]> wrote: I've had some discussions about adding an Exchange operator and Distribution trait to Hive's cost-based optimizer, which uses Calcite. Ashutosh has logged a bug [ https://issues.apache.org/jira/browse/CALCITE-594 ] and pull request containing a proof-of-concept [ https://github.com/apache/incubator-calcite/pull/52/files ]. I know that Drill has a Distribution trait and several sub-classes of Exchange operator (DrillDistributionTrait, ExchangePrel, BroadcastExchangePrel, HashToMergeExchangePrel, HashToRandomExchangePrel, OrderedPartitionExchangePrel and SimpleMergeExchangePrel, in https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/planner/physical ) I propose to create a Distribution trait and Exchange operator base class in Calcite, with the goal that both Drill and Hive would use them. (I am adopting Drill terminology -- Distribution rather than Partition, Exchange rather than Shuffle -- but I am pretty sure that the concepts are the same.) public abstract class Exchange extends SingleRel { public final RelDistribution distribution; protected Exchange(RelCluster cluster, RelTraitSet traitSet, RelNode input, RelDistribution distribution) { super(cluster, traitSet, input); this.distribution = distribution; } } public interface RelDistribution extends RelMultipleTrait { enum DistributionType { SINGLETON, HASH_DISTRIBUTED, RANGE_DISTRIBUTED, RANDOM_DISTRIBUTED, ROUND_ROBIN_DISTRIBUTED, BROADCAST_DISTRIBUTED } public DistributionType getType(); public ImmutableIntList getFields(); } Calcite would not contain any particular exchange algorithms. However, since it is common to combine sort and exchange, I would create a base class for it: public abstract class SortExchange extends Exchange { public final Collation collation; ... } The physical operators would remain in Drill/Hive and would likely be fully specified by the distribution and collation; they would not need any additional attributes. We would not be able to port DrillDistributionTraitDef.convert directly -- it would create a LogicalExchange (analogous to how RelCollationTraitDef.convert creates a LogicalSort) and then Drill rules would need to kick in to convert that to HashToRandomExchangePrel etc. I do not think that RelDistribution needs to be a "multiple" trait (compare with RelCollation extends RelMultipleTrait, which allows a RelNode to have more than one sort-order) but I may be wrong. The advantages of making Exchange a first-class operator and Distribution a trait are clear. We will be able to build a library of rules (e.g. FilterExchangePushRule, ExchangeRemoveRule), a RelMdDistribution metadata interface, and start working on stats and cost model. Drill and Hive stakeholders, please let me know what you think of this plan. Julian
