Thanks, Julian, for your reply. I would be glad to contribute to Calcite. For now, I am confused about Drill's (and Calcite's) VolcanoPlanner. In my experience with Drill, cost-based query planning is preferred but performs badly: for some simple SQL queries, generating the physical plan takes longer than the actual query execution.
So I am wondering why Drill makes the VolcanoPlanner the only choice for generating the logical and physical plans in DefaultSqlHandler, and whether there is any tuning option or planned improvement to the VolcanoPlanner that could cut the planning time down.

On Tue, Jan 10, 2017 at 2:40 AM, Julian Hyde <[email protected]> wrote:

> The problem is probably not the rule (individual rules tend to run very
> quickly) but the cost of a planner phase that has a lot of rules enabled
> and thus collectively they generate a lot of possibilities. Planning
> complexity is a dragon that is difficult to slay.
>
> One trick is to use the HepPlanner to apply a small set of rules, without
> reference to cost.
>
> Are you intending to contribute the rule to Calcite?
>
> Julian
>
>
> > On Jan 9, 2017, at 12:46 AM, weijie tong <[email protected]> wrote:
> >
> > hi all:
> > I have written a Drill rule to push count(distinct) queries down to
> > Druid, but it took 2~7 seconds to get through the Volcano physical
> > planning. I want to know whether there are any tips for writing a rule
> > that performs better.
> >
> > I have implemented two rules:
> >
> > 1st:
> >   agg->agg->scan
> >   agg->agg->project->scan
> >
> > 2nd:
> >   agg->project->scan
> >   agg->scan
> >
> > The query SQL is:
> >
> >   select count(distinct position_title) from testDruid.employee
> >   where birth_date='1961-09-24' group by employee_id
> >
> > The initial-phase output is:
> >
> > LogicalProject(EXPR$0=[$1]): rowcount = 1.5, cumulative cost = {133.1875 rows, 232.5 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 9
> >   LogicalAggregate(group=[{0}], EXPR$0=[COUNT(DISTINCT $1)]): rowcount = 1.5, cumulative cost = {131.6875 rows, 231.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 8
> >     LogicalProject(employee_id=[$4], position_title=[$14]): rowcount = 15.0, cumulative cost = {130.0 rows, 231.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 7
> >       LogicalFilter(condition=[=($0, '1961-09-24')]): rowcount = 15.0, cumulative cost = {115.0 rows, 201.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 6
> >         EnumerableTableScan(table=[[testDruid, employee]]): rowcount = 100.0, cumulative cost = {100.0 rows, 101.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 5
> >
> > The physical Volcano output is:
> >
> > 15:59:58.201 [278cbd08-f2c6-306d-e78e-b479d088cd5d:foreman] DEBUG o.a.d.e.p.s.h.DefaultSqlHandler - VOLCANO:Physical Planning (7315ms):
> > ScreenPrel: rowcount = 2500.0, cumulative cost = {77750.0 rows, 970250.0 cpu, 0.0 io, 2.1504E8 network, 440000.00000000006 memory}, id = 115180
> >   UnionExchangePrel: rowcount = 2500.0, cumulative cost = {77500.0 rows, 970000.0 cpu, 0.0 io, 2.1504E8 network, 440000.00000000006 memory}, id = 115179
> >     ProjectPrel(EXPR$0=[$1]): rowcount = 2500.0, cumulative cost = {75000.0 rows, 950000.0 cpu, 0.0 io, 2.048E8 network, 440000.00000000006 memory}, id = 115178
> >       HashAggPrel(group=[{0}], EXPR$0=[$SUM0($1)]): rowcount = 2500.0, cumulative cost = {75000.0 rows, 950000.0 cpu, 0.0 io, 2.048E8 network, 440000.00000000006 memory}, id = 115177
> >         HashToRandomExchangePrel(dist0=[[$0]]): rowcount = 25000.0, cumulative cost = {50000.0 rows, 450000.0 cpu, 0.0 io, 2.048E8 network, 0.0 memory}, id = 115176
> >           ProjectPrel(employee_id=[$0], EXPR$0=[$1]): rowcount = 25000.0, cumulative cost = {25000.0 rows, 50000.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 115175
> >             ScanPrel(groupscan=[DruidGroupScan{groupScanSpec=DruidGroupScanSpec{dataSource='employee', druidQueryConvertState=MODIFIED, filter=null, allFiltersConverted=true, hasMultiValueFilters=false, groupbyColumns=[employee_id, position_title], query=StreamingGroupByQuery{groupByQuery=GroupByQuery{dataSource='employee', querySegmentSpec=LegacySegmentSpec{intervals=[1000-01-01T00:00:00.000+08:05:43/3000-01-01T00:00:00.000+08:00]}, limitSpec=NoopLimitSpec, dimFilter=null, granularity=AllGranularity, dimensions=[DefaultDimensionSpec{dimension='employee_id', outputName='employee_id'}], aggregatorSpecs=[DistinctCountAggregatorFactory{name='EXPR$0', fieldName='position_title'}], postAggregatorSpecs=[], havingSpec=null}, batchSize=50000}}, columns=[`employee_id`, `EXPR$0`], segmentIntervals=[1961-09-24T00:00:00.000+08:00/1961-09-25T00:00:00.000+08:00]}]): rowcount = 25000.0, cumulative cost = {25000.0 rows, 50000.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 115081
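
For context, here is a stripped-down sketch of how the first pattern (agg->agg->scan) is declared. The class name and the matches() guard are simplified for illustration, and the actual push-down in onMatch() is omitted:

    import org.apache.calcite.plan.RelOptRule;
    import org.apache.calcite.plan.RelOptRuleCall;
    import org.apache.calcite.rel.core.AggregateCall;
    import org.apache.calcite.rel.core.TableScan;
    import org.apache.calcite.rel.logical.LogicalAggregate;

    public class DruidCountDistinctRuleSketch extends RelOptRule {
      public static final DruidCountDistinctRuleSketch INSTANCE =
          new DruidCountDistinctRuleSketch();

      private DruidCountDistinctRuleSketch() {
        // Fire only on the exact agg -> agg -> scan shape; the narrower
        // the operand tree, the fewer times the planner tries the rule.
        super(operand(LogicalAggregate.class,
                  operand(LogicalAggregate.class,
                      operand(TableScan.class, none()))),
            "DruidCountDistinctRuleSketch");
      }

      @Override public boolean matches(RelOptRuleCall call) {
        // Cheap guard before onMatch: bail out unless the outer aggregate
        // really contains a COUNT(DISTINCT ...) to push down.
        final LogicalAggregate outer = call.rel(0);
        for (AggregateCall agg : outer.getAggCallList()) {
          if (agg.isDistinct()) {
            return true;
          }
        }
        return false;
      }

      @Override public void onMatch(RelOptRuleCall call) {
        // Build the Druid scan that evaluates the distinct count and
        // hand it back via call.transformTo(...); omitted here since it
        // is engine-specific.
      }
    }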
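
Regarding the HepPlanner trick: is something like the following what you have in mind? (A minimal sketch against the Calcite 1.x API as I understand it; the rule choice and method name are just illustrative.)

    import org.apache.calcite.plan.hep.HepPlanner;
    import org.apache.calcite.plan.hep.HepProgram;
    import org.apache.calcite.plan.hep.HepProgramBuilder;
    import org.apache.calcite.rel.RelNode;
    import org.apache.calcite.rel.rules.ProjectMergeRule;

    public class HepPhaseSketch {
      // Rewrite the given plan with a small, fixed rule set. HepPlanner
      // fires the rules greedily and never enumerates cost-based
      // alternatives, so such a phase stays fast.
      static RelNode applyFixedRules(RelNode logicalRel) {
        HepProgram program = new HepProgramBuilder()
            .addRuleInstance(ProjectMergeRule.INSTANCE) // illustrative rule
            .build();
        HepPlanner planner = new HepPlanner(program);
        planner.setRoot(logicalRel);
        return planner.findBestExp();
      }
    }

If that is the idea, I could move the push-down rewrites into a HepPlanner phase and leave only the genuinely cost-sensitive decisions to the Volcano phase.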
