If the question is whether Exchange will carry information about the range of the partition, then I guess not; Hive doesn't plan to use it that way (for the time being).
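As a rough illustration of that point, here is a small self-contained sketch (not Calcite or Hive code) of a planner-level distribution descriptor in the shape proposed further down the thread: a type plus key column ordinals, and nothing about range boundaries. `DistributionSketch`, `SimpleDistribution` and `sameAs` are hypothetical names made up for this example; only the enum values come from the proposal.

```java
import java.util.Arrays;
import java.util.List;

public class DistributionSketch {
  // Enum values copied from the proposal quoted below.
  enum DistributionType {
    SINGLETON,
    HASH_DISTRIBUTED,
    RANGE_DISTRIBUTED,
    RANDOM_DISTRIBUTED,
    ROUND_ROBIN_DISTRIBUTED,
    BROADCAST_DISTRIBUTED
  }

  /** Hypothetical stand-in for the proposed RelDistribution: type + key columns only. */
  static final class SimpleDistribution {
    final DistributionType type;
    final List<Integer> fields;

    SimpleDistribution(DistributionType type, Integer... fields) {
      this.type = type;
      this.fields = Arrays.asList(fields);
    }

    /** Descriptors are interchangeable when type and key columns match. */
    boolean sameAs(SimpleDistribution other) {
      return type == other.type && fields.equals(other.fields);
    }

    @Override public String toString() {
      return type + fields.toString();
    }
  }

  public static void main(String[] args) {
    // Two range-partitioned inputs on columns 1 and 3. Whatever split points
    // each uses at run time, the planner-level descriptor is the same,
    // because no range bounds are recorded in it.
    SimpleDistribution a =
        new SimpleDistribution(DistributionType.RANGE_DISTRIBUTED, 1, 3);
    SimpleDistribution b =
        new SimpleDistribution(DistributionType.RANGE_DISTRIBUTED, 1, 3);
    System.out.println(a);           // RANGE_DISTRIBUTED[1, 3]
    System.out.println(a.sameAs(b)); // true
  }
}
```

An engine that did want range awareness would presumably have to carry the boundaries in its own sub-class, along the lines Julian suggests for hash functions below.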
On 2/11/15, 3:39 PM, "Jacques Nadeau" <[email protected]> wrote:

>Is there any proposal for partition range (hash or ordered) awareness?
>
>On Wed, Feb 11, 2015 at 3:29 PM, John Pullokkaran <[email protected]> wrote:
>
>> Hive uses a greedy join order algorithm (LoptOptimizeJoinRule).
>> We are thinking of dividing the join graph into sub-graphs to work around
>> scalability issues (if they arise).
>>
>> I thought in VolcanoPlanner you could specify a time bound.
>>
>>
>> On 2/11/15, 3:23 PM, "Jinfeng Ni" <[email protected]> wrote:
>>
>> >About John's comment about putting bounds on the plan search space, does
>> >Calcite allow us to specify some bounds in the planner, and stop the
>> >search with the best plan found so far once those bounds are met?
>> >
>> >AFAIK, in TpchTest, if I turn on "calcite.test.slow", then some queries
>> >like TPCH Q5, Q7, Q8 seem to not come back with a plan, after several
>> >minutes, when run on my laptop. If Calcite has the ability to specify the
>> >search bounds (say # of rules fired, or # of possible plans
>> >enumerated), then it should return a plan within a reasonable amount of
>> >time, instead of searching, searching, and searching, and
>> >possibly never ending.
>> >
>> >
>> >
>> >On Wed, Feb 11, 2015 at 2:49 PM, Julian Hyde <[email protected]> wrote:
>> >
>> >> Aman,
>> >>
>> >> RelDistribution will be an interface, and there's no reason why Drill
>> >> shouldn't have its own values or even sub-classes, as long as
>> >> RelDistributionTraitDef is able to canonize them. So you could, for
>> >> instance, sub-class "Hash[1, 3]" and specify which hash function is
>> >> being used.
>> >>
>> >> I've addressed the comment about logical exchange already - you can go
>> >> straight to physical.
>> >>
>> >>
>> >> On Feb 11, 2015, at 2:34 PM, Aman Sinha <[email protected]> wrote:
>> >>
>> >> > I am neutral on this for now until we give it more thought.
>> >> > The reason is that Calcite is not aware of the execution engine's
>> >> > capability and configuration parameters for distribution (e.g. Drill
>> >> > has a few parameters, including just true/false type of flags that
>> >> > determine whether or not an Exchange node is even inserted in the plan
>> >> > and, if it is used, what type of Exchange it is etc.). In that sense,
>> >> > if the logical plan produced by Calcite contains a LogicalExchange, it
>> >> > is possible that Drill may not be able to use it directly while
>> >> > building the physical plan.
>> >> >
>> >> > I do however see the benefits in terms of trait propagation, combining
>> >> > distribution and collation traits and consolidating the subsumption
>> >> > logic in some base class such that it is useful for other consumers of
>> >> > Calcite.
>> >> >
>> >> > Aman
>> >> >
>> >> > On Wed, Feb 11, 2015 at 2:21 PM, Jinfeng Ni <[email protected]> wrote:
>> >> >
>> >> >> Drill currently does query planning in two phases: 1) logical
>> >> >> planning, which handles join order, logical filter/project push down
>> >> >> etc., and 2) physical planning, which chooses between different
>> >> >> physical operators (different join / aggregation methods), applies
>> >> >> filter/project push down (storage-specific rules), and inserts
>> >> >> EXCHANGE. Part of the reason for splitting planning into two phases is
>> >> >> that when the two phases are merged together, the planning time
>> >> >> increases significantly (since the planner needs to enumerate
>> >> >> different join orders, multiplied by different choices of EXCHANGE).
>> >> >>
>> >> >> The new rules that you are proposing seem to want to build the plan
>> >> >> in one single logical planning phase. I'm not sure how it will impact
>> >> >> the overall planning time.
>> >> >>
>> >> >> On Wed, Feb 11, 2015 at 1:38 PM, Jinfeng Ni <[email protected]> wrote:
>> >> >>
>> >> >>> I think it's a good proposal to put Exchange/Distribution into
>> >> >>> Calcite library.
>> >> >>>
>> >> >>> Make sense to me. +1
>> >> >>>
>> >> >>>
>> >> >>> On Wed, Feb 11, 2015 at 12:45 PM, Julian Hyde <[email protected]> wrote:
>> >> >>>
>> >> >>>> Drill guys: What do you think of the proposal?
>> >> >>>>
>> >> >>>> On Feb 11, 2015, at 11:34 AM, Ashutosh Chauhan <[email protected]>
>> >> >>>> wrote:
>> >> >>>>
>> >> >>>> Overall proposal sounds good to me. +1
>> >> >>>>
>> >> >>>> On Tue, Feb 10, 2015 at 3:35 PM, Julian Hyde <[email protected]> wrote:
>> >> >>>>
>> >> >>>> I've had some discussions about adding an Exchange operator and
>> >> >>>> Distribution trait to Hive's cost-based optimizer, which uses
>> >> >>>> Calcite. Ashutosh has logged a bug [
>> >> >>>> https://issues.apache.org/jira/browse/CALCITE-594 ] and pull request
>> >> >>>> containing a proof-of-concept [
>> >> >>>> https://github.com/apache/incubator-calcite/pull/52/files ].
>> >> >>>>
>> >> >>>> I know that Drill has a Distribution trait and several sub-classes of
>> >> >>>> Exchange operator (DrillDistributionTrait, ExchangePrel,
>> >> >>>> BroadcastExchangePrel, HashToMergeExchangePrel,
>> >> >>>> HashToRandomExchangePrel, OrderedPartitionExchangePrel and
>> >> >>>> SimpleMergeExchangePrel, in
>> >> >>>> https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/planner/physical
>> >> >>>> )
>> >> >>>>
>> >> >>>> I propose to create a Distribution trait and Exchange operator base
>> >> >>>> class in Calcite, with the goal that both Drill and Hive would use
>> >> >>>> them.
>> >> >>>> (I am adopting Drill terminology -- Distribution rather than
>> >> >>>> Partition, Exchange rather than Shuffle -- but I am pretty sure that
>> >> >>>> the concepts are the same.)
>> >> >>>>
>> >> >>>> public abstract class Exchange extends SingleRel {
>> >> >>>>   public final RelDistribution distribution;
>> >> >>>>
>> >> >>>>   protected Exchange(RelCluster cluster, RelTraitSet traitSet,
>> >> >>>>       RelNode input, RelDistribution distribution) {
>> >> >>>>     super(cluster, traitSet, input);
>> >> >>>>     this.distribution = distribution;
>> >> >>>>   }
>> >> >>>> }
>> >> >>>>
>> >> >>>> public interface RelDistribution extends RelMultipleTrait {
>> >> >>>>   enum DistributionType {
>> >> >>>>     SINGLETON,
>> >> >>>>     HASH_DISTRIBUTED,
>> >> >>>>     RANGE_DISTRIBUTED,
>> >> >>>>     RANDOM_DISTRIBUTED,
>> >> >>>>     ROUND_ROBIN_DISTRIBUTED,
>> >> >>>>     BROADCAST_DISTRIBUTED
>> >> >>>>   }
>> >> >>>>
>> >> >>>>   public DistributionType getType();
>> >> >>>>   public ImmutableIntList getFields();
>> >> >>>> }
>> >> >>>>
>> >> >>>> Calcite would not contain any particular exchange algorithms.
>> >> >>>> However, since it is common to combine sort and exchange, I would
>> >> >>>> create a base class for it:
>> >> >>>>
>> >> >>>> public abstract class SortExchange extends Exchange {
>> >> >>>>   public final Collation collation;
>> >> >>>>
>> >> >>>>   ...
>> >> >>>> }
>> >> >>>>
>> >> >>>> The physical operators would remain in Drill/Hive and would likely be
>> >> >>>> fully specified by the distribution and collation; they would not
>> >> >>>> need any additional attributes. We would not be able to port
>> >> >>>> DrillDistributionTraitDef.convert directly -- it would create a
>> >> >>>> LogicalExchange (analogous to how RelCollationTraitDef.convert
>> >> >>>> creates a LogicalSort) and then Drill rules would need to kick in to
>> >> >>>> convert that to HashToRandomExchangePrel etc.
>> >> >>>>
>> >> >>>> I do not think that RelDistribution needs to be a "multiple" trait
>> >> >>>> (compare with RelCollation extends RelMultipleTrait, which allows a
>> >> >>>> RelNode to have more than one sort-order) but I may be wrong.
>> >> >>>>
>> >> >>>> The advantages of making Exchange a first-class operator and
>> >> >>>> Distribution a trait are clear. We will be able to build a library of
>> >> >>>> rules (e.g. FilterExchangePushRule, ExchangeRemoveRule), a
>> >> >>>> RelMdDistribution metadata interface, and start working on stats and
>> >> >>>> cost model.
>> >> >>>>
>> >> >>>> Drill and Hive stakeholders, please let me know what you think of
>> >> >>>> this plan.
>> >> >>>>
>> >> >>>> Julian
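
The ExchangeRemoveRule idea mentioned in the proposal could be sketched roughly as follows: an Exchange is redundant when its input already has the distribution the Exchange would produce. This is a self-contained illustration, not Calcite, Drill or Hive code. `ExchangeRemoveSketch`, `Dist`, `satisfies` and `isRedundantExchange` are hypothetical names, and the check assumes the simplest possible rule (exact match on type and key columns); a real subsumption check would likely be richer.

```java
import java.util.Arrays;
import java.util.List;

public class ExchangeRemoveSketch {
  // Enum values copied from the proposal above.
  enum DistributionType {
    SINGLETON,
    HASH_DISTRIBUTED,
    RANGE_DISTRIBUTED,
    RANDOM_DISTRIBUTED,
    ROUND_ROBIN_DISTRIBUTED,
    BROADCAST_DISTRIBUTED
  }

  /** Hypothetical value class: a distribution type plus key column ordinals. */
  static final class Dist {
    final DistributionType type;
    final List<Integer> fields;

    Dist(DistributionType type, Integer... fields) {
      this.type = type;
      this.fields = Arrays.asList(fields);
    }

    /** Assumption: only an exact match on type and fields satisfies a requirement. */
    boolean satisfies(Dist required) {
      return type == required.type && fields.equals(required.fields);
    }
  }

  /**
   * True if an exchange producing {@code required} on top of an input that is
   * already distributed as {@code actual} would do no useful work.
   */
  static boolean isRedundantExchange(Dist actual, Dist required) {
    return actual.satisfies(required);
  }

  public static void main(String[] args) {
    Dist input = new Dist(DistributionType.HASH_DISTRIBUTED, 1, 3);

    // Same type and same keys: the exchange could be removed.
    System.out.println(isRedundantExchange(
        input, new Dist(DistributionType.HASH_DISTRIBUTED, 1, 3)));

    // Different keys: the exchange must stay.
    System.out.println(isRedundantExchange(
        input, new Dist(DistributionType.HASH_DISTRIBUTED, 1)));
  }
}
```

Keeping the check in one shared place is the kind of consolidation Aman points to when he mentions putting the subsumption logic in a base class.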
