Is there any proposal for partition range (hash or ordered) awareness?

On Wed, Feb 11, 2015 at 3:29 PM, John Pullokkaran <[email protected]> wrote:
> Hive uses a greedy join-order algorithm (LoptOptimizeJoinRule).
> We are thinking of dividing the join graph into subgraphs to work around
> scalability issues (if they arise).
>
> I thought in VolcanoPlanner you could specify a time bound.
>
>
> On 2/11/15, 3:23 PM, "Jinfeng Ni" <[email protected]> wrote:
>
> >About John's comment about putting bounds on the plan search space, does
> >Calcite allow us to specify some bounds in the planner, and stop
> >searching with the best plan found so far once those bounds are met?
> >
> >AFAIK, in TpchTest, if I turn on "calcite.test.slow", then some queries
> >like TPCH Q5, Q7, Q8 seem not to come back with a plan after several
> >minutes when run on my laptop. If Calcite had the ability to specify
> >search bounds (say, the number of rules fired or the number of possible
> >plans enumerated), then it would return a plan within a reasonable amount
> >of time, instead of searching on and on and possibly never ending.
> >
> >On Wed, Feb 11, 2015 at 2:49 PM, Julian Hyde <[email protected]> wrote:
> >
> >> Aman,
> >>
> >> RelDistribution will be an interface, and there's no reason why Drill
> >> shouldn't have its own values or even sub-classes, as long as
> >> RelDistributionTraitDef is able to canonize them. So you could, for
> >> instance, sub-class "Hash[1, 3]" and specify which hash function is
> >> being used.
> >>
> >> I've addressed the comment about logical exchange already -- you can go
> >> straight to physical.
> >>
> >>
> >> On Feb 11, 2015, at 2:34 PM, Aman Sinha <[email protected]> wrote:
> >>
> >> > I am neutral on this for now until we give it more thought.
> >> > The reason is that Calcite is not aware of the execution engine's
> >> > capabilities and configuration parameters for distribution (e.g. Drill
> >> > has a few parameters, including simple true/false flags, that determine
> >> > whether or not an Exchange node is even inserted in the plan and, if
> >> > one is used, what type of Exchange it is). In that sense, if the
> >> > logical plan produced by Calcite contains a LogicalExchange, it is
> >> > possible that Drill may not be able to use it directly while building
> >> > the physical plan.
> >> >
> >> > I do however see the benefits in terms of trait propagation, combining
> >> > distribution and collation traits, and consolidating the subsumption
> >> > logic in some base class such that it is useful for other consumers of
> >> > Calcite.
> >> >
> >> > Aman
> >> >
> >> > On Wed, Feb 11, 2015 at 2:21 PM, Jinfeng Ni <[email protected]> wrote:
> >> >
> >> >> Drill currently does query planning in two phases: 1) logical
> >> >> planning, which handles join order, logical filter/project push-down,
> >> >> etc., and 2) physical planning, which chooses between different
> >> >> physical operators (different join/aggregation methods), applies
> >> >> storage-specific filter/project push-down rules, and inserts EXCHANGE.
> >> >> Part of the reason for using two phases is that when they are merged,
> >> >> planning time increases significantly (since the planner needs to
> >> >> enumerate different join orders, multiplied by different choices of
> >> >> EXCHANGE).
> >> >>
> >> >> The new rules that you are proposing seem to build the plan in a
> >> >> single logical planning phase. I'm not sure how that will impact the
> >> >> overall planning time.
> >> >>
> >> >> On Wed, Feb 11, 2015 at 1:38 PM, Jinfeng Ni <[email protected]> wrote:
> >> >>
> >> >>> I think it's a good proposal to put Exchange/Distribution into the
> >> >>> Calcite library.
> >> >>>
> >> >>> Makes sense to me. +1
> >> >>>
> >> >>> On Wed, Feb 11, 2015 at 12:45 PM, Julian Hyde <[email protected]> wrote:
> >> >>>
> >> >>>> Drill guys: What do you think of the proposal?
> >> >>>>
> >> >>>> On Feb 11, 2015, at 11:34 AM, Ashutosh Chauhan <[email protected]> wrote:
> >> >>>>
> >> >>>> Overall proposal sounds good to me. +1
> >> >>>>
> >> >>>> On Tue, Feb 10, 2015 at 3:35 PM, Julian Hyde <[email protected]> wrote:
> >> >>>>
> >> >>>> I've had some discussions about adding an Exchange operator and
> >> >>>> Distribution trait to Hive's cost-based optimizer, which uses Calcite.
> >> >>>> Ashutosh has logged a bug [
> >> >>>> https://issues.apache.org/jira/browse/CALCITE-594 ] and pull request
> >> >>>> containing a proof-of-concept [
> >> >>>> https://github.com/apache/incubator-calcite/pull/52/files ].
> >> >>>>
> >> >>>> I know that Drill has a Distribution trait and several sub-classes of
> >> >>>> Exchange operator (DrillDistributionTrait, ExchangePrel,
> >> >>>> BroadcastExchangePrel, HashToMergeExchangePrel, HashToRandomExchangePrel,
> >> >>>> OrderedPartitionExchangePrel and SimpleMergeExchangePrel, in
> >> >>>> https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/planner/physical
> >> >>>> )
> >> >>>>
> >> >>>> I propose to create a Distribution trait and Exchange operator base
> >> >>>> class in Calcite, with the goal that both Drill and Hive would use them.
> >> >>>> (I am adopting Drill terminology -- Distribution rather than Partition,
> >> >>>> Exchange rather than Shuffle -- but I am pretty sure that the concepts
> >> >>>> are the same.)
> >> >>>>
> >> >>>> public abstract class Exchange extends SingleRel {
> >> >>>>   public final RelDistribution distribution;
> >> >>>>
> >> >>>>   protected Exchange(RelCluster cluster, RelTraitSet traitSet,
> >> >>>>       RelNode input, RelDistribution distribution) {
> >> >>>>     super(cluster, traitSet, input);
> >> >>>>     this.distribution = distribution;
> >> >>>>   }
> >> >>>> }
> >> >>>>
> >> >>>> public interface RelDistribution extends RelMultipleTrait {
> >> >>>>   enum DistributionType {
> >> >>>>     SINGLETON,
> >> >>>>     HASH_DISTRIBUTED,
> >> >>>>     RANGE_DISTRIBUTED,
> >> >>>>     RANDOM_DISTRIBUTED,
> >> >>>>     ROUND_ROBIN_DISTRIBUTED,
> >> >>>>     BROADCAST_DISTRIBUTED
> >> >>>>   }
> >> >>>>
> >> >>>>   public DistributionType getType();
> >> >>>>   public ImmutableIntList getFields();
> >> >>>> }
> >> >>>>
> >> >>>> Calcite would not contain any particular exchange algorithms. However,
> >> >>>> since it is common to combine sort and exchange, I would create a base
> >> >>>> class for it:
> >> >>>>
> >> >>>> public abstract class SortExchange extends Exchange {
> >> >>>>   public final Collation collation;
> >> >>>>
> >> >>>>   ...
> >> >>>> }
> >> >>>>
> >> >>>> The physical operators would remain in Drill/Hive and would likely be
> >> >>>> fully specified by the distribution and collation; they would not need
> >> >>>> any additional attributes. We would not be able to port
> >> >>>> DrillDistributionTraitDef.convert directly -- it would create a
> >> >>>> LogicalExchange (analogous to how RelCollationTraitDef.convert creates a
> >> >>>> LogicalSort) and then Drill rules would need to kick in to convert that
> >> >>>> to HashToRandomExchangePrel etc.
> >> >>>>
> >> >>>> I do not think that RelDistribution needs to be a "multiple" trait
> >> >>>> (compare with RelCollation extends RelMultipleTrait, which allows a
> >> >>>> RelNode to have more than one sort-order) but I may be wrong.
> >> >>>>
> >> >>>> The advantages of making Exchange a first-class operator and
> >> >>>> Distribution a trait are clear. We will be able to build a library of
> >> >>>> rules (e.g. FilterExchangePushRule, ExchangeRemoveRule), a
> >> >>>> RelMdDistribution metadata interface, and start working on stats and
> >> >>>> cost model.
> >> >>>>
> >> >>>> Drill and Hive stakeholders, please let me know what you think of this
> >> >>>> plan.
> >> >>>>
> >> >>>> Julian
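
For readers following the thread, here is a rough, self-contained sketch of the kind of subsumption check Aman mentions: deciding whether data that is already distributed one way meets a required distribution, so that no Exchange has to be inserted. It does not use Calcite. The names Dist, DistSketch, satisfies and needsExchange are hypothetical, and the rule itself (same type, and identical field lists in the same order for hash and range) is an assumption made for illustration, not the API proposed above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical stand-in for the proposed distribution trait; not Calcite's API. */
final class Dist {
  enum Type {
    SINGLETON, HASH_DISTRIBUTED, RANGE_DISTRIBUTED,
    RANDOM_DISTRIBUTED, ROUND_ROBIN_DISTRIBUTED, BROADCAST_DISTRIBUTED
  }

  final Type type;
  final List<Integer> fields; // field ordinals; empty for types without keys

  Dist(Type type, List<Integer> fields) {
    this.type = type;
    this.fields = Collections.unmodifiableList(new ArrayList<>(fields));
  }

  /**
   * Assumed rule: the types must match, and for hash and range
   * distributions the field lists must be equal, in the same order.
   * A real engine may well use a looser or engine-specific rule.
   */
  boolean satisfies(Dist required) {
    if (type != required.type) {
      return false;
    }
    switch (type) {
      case HASH_DISTRIBUTED:
      case RANGE_DISTRIBUTED:
        return fields.equals(required.fields);
      default:
        return true;
    }
  }
}

final class DistSketch {
  /** An Exchange is needed only when the input does not satisfy the requirement. */
  static boolean needsExchange(Dist actual, Dist required) {
    return !actual.satisfies(required);
  }

  public static void main(String[] args) {
    Dist hash13 = new Dist(Dist.Type.HASH_DISTRIBUTED, Arrays.asList(1, 3));
    Dist hash3 = new Dist(Dist.Type.HASH_DISTRIBUTED, Arrays.asList(3));
    Dist single = new Dist(Dist.Type.SINGLETON, Collections.<Integer>emptyList());

    System.out.println(needsExchange(hash13, hash13)); // same type and keys
    System.out.println(needsExchange(hash13, hash3));  // same type, different keys
    System.out.println(needsExchange(hash13, single)); // different type
  }
}
```

Keeping the check on the trait itself is one way to put the subsumption logic in a single place; whether engine-specific details such as the hash function belong there is the open question raised in the thread.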
