Re: How to proceed with IMPALA-4086 (Benchmark for SimpleScheduler)

2016-11-16 Thread Lars Volker
Thank you Tim for your response. After talking to Marcel in person I filed
https://issues.cloudera.org/browse/IMPALA-4496 to track this effort
separately.



Re: How to proceed with IMPALA-4086 (Benchmark for SimpleScheduler)

2016-11-15 Thread Tim Armstrong
It sounds like this is a) a lot of work to do initially and b) a lot of
work to maintain as the thrift data structures evolve.

It seems like benchmarking at that granularity might not be worth the
hassle. It sounds like the lower-level microbenchmark you've added is maybe
simpler.

It could be very worthwhile to benchmark the combined planning + scheduling
process, since that would presumably require less plumbing.



How to proceed with IMPALA-4086 (Benchmark for SimpleScheduler)

2016-11-11 Thread Lars Volker
Hi all,

Here is a change that implements a
benchmark for SimpleScheduler::ComputeScanRangeAssignment() to address
IMPALA-4086.

I would like to discuss whether it is possible to run the benchmark against
the Schedule() method instead. This would require changes to the scheduler
test utility classes in simple-scheduler-test-util.h to create a
TQueryExecRequest message suitable for calling Schedule().

Currently we compute the following inputs before calling
ComputeScanRangeAssignment(). They correspond roughly to the contents of a
single plan node:

BackendConfig
vector
vector
TQueryOptions


To build a schedule object we need to build a TQueryExecRequest, which has
14 fields. The complex ones are:

optional Descriptors.TDescriptorTable desc_tbl
optional list fragments
optional list dest_fragment_idx
optional map per_node_scan_ranges
optional list mt_plan_exec_info
optional Results.TResultSetMetadata result_set_metadata
optional TFinalizeParams finalize_params
required ImpalaInternalService.TQueryCtx query_ctx
optional string query_plan
required list host_list
optional LineageGraph.TLineageGraph lineage_graph


Some of these members have further dependencies. For example, the fragments
contain the plan, which in turn contains all plan nodes:

TQueryExecRequest:
  list fragments
    partition.type
    plan.nodes[node_id]
      node_id (for dcheck)
      node.hdfs_scan_node (can be unset)
    idx (for sorting in query-schedule)
  TQueryCtx query_ctx (only for query options, which we already have)
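
To make the outline above more concrete, here is a rough sketch of what a
test helper for building such a request might look like. All type and function
names (MiniScanNode, MiniPlanNode, MiniFragment, MiniQueryExecRequest,
BuildRequest) are hypothetical stand-ins and not the Thrift-generated classes.
The "UNPARTITIONED" string is only a placeholder value, and only the members
listed in the outline are modeled.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-ins for the Thrift-generated types. Only the members
// named in the outline are modeled; everything else is omitted.
struct MiniScanNode {};  // stands in for node.hdfs_scan_node

struct MiniPlanNode {
  int node_id = 0;                  // the id that is checked via dcheck
  bool has_hdfs_scan_node = false;  // the scan node can be unset
  MiniScanNode hdfs_scan_node;
};

struct MiniFragment {
  int idx = 0;                           // used for sorting in query-schedule
  std::string partition_type;            // stands in for partition.type
  std::vector<MiniPlanNode> plan_nodes;  // stands in for plan.nodes
};

struct MiniQueryExecRequest {
  std::vector<MiniFragment> fragments;
};

// Builds a request with `num_fragments` fragments, each holding a single
// scan node. Fragment indexes and node ids are assigned sequentially from 0.
inline MiniQueryExecRequest BuildRequest(int num_fragments) {
  assert(num_fragments >= 0);
  MiniQueryExecRequest request;
  for (int i = 0; i < num_fragments; ++i) {
    MiniPlanNode node;
    node.node_id = i;
    node.has_hdfs_scan_node = true;

    MiniFragment fragment;
    fragment.idx = i;
    fragment.partition_type = "UNPARTITIONED";  // placeholder value
    fragment.plan_nodes.push_back(node);
    request.fragments.push_back(fragment);
  }
  return request;
}
```

A helper along these lines in simple-scheduler-test-util.h would have to be
kept in sync with the real Thrift structures as they evolve.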


I think it makes sense to benchmark ComputeScanRangeAssignment() in
isolation, since its implementation is reasonably complex, i.e. not just
linear in the input size. In order to benchmark Schedule(), we should first
consider writing proper unit tests for the SimpleScheduler and extend the
test utility code where necessary to do so.
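
To illustrate why timing a function in isolation across input sizes is useful
when its cost is not linear, here is a minimal sketch. FakeAssign is a
hypothetical stand-in workload and not Impala's assignment algorithm, and
TimeFakeAssignMicros is likewise a made-up helper. The stand-in picks the
least-loaded backend with a linear scan, so its cost grows with the product of
the two input sizes.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical stand-in workload, NOT Impala's algorithm: assigns each of
// `num_ranges` scan ranges to the currently least-loaded of `num_backends`
// backends using a linear scan (ties go to the lowest index), so the cost is
// O(num_ranges * num_backends). Returns the number of ranges per backend.
std::vector<int> FakeAssign(int num_ranges, int num_backends) {
  assert(num_backends > 0);
  std::vector<int> load(num_backends, 0);
  for (int r = 0; r < num_ranges; ++r) {
    std::size_t best = 0;
    for (std::size_t b = 1; b < load.size(); ++b) {
      if (load[b] < load[best]) best = b;
    }
    ++load[best];
  }
  return load;
}

// Returns the average wall-clock time in microseconds of one FakeAssign call,
// measured over `iterations` back-to-back runs.
double TimeFakeAssignMicros(int num_ranges, int num_backends, int iterations) {
  assert(iterations > 0);
  long long checksum = 0;  // consumed so the calls are not optimized away
  const auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < iterations; ++i) {
    checksum += FakeAssign(num_ranges, num_backends)[0];
  }
  const auto stop = std::chrono::steady_clock::now();
  assert(checksum >= 0);
  const std::chrono::duration<double, std::micro> elapsed = stop - start;
  return elapsed.count() / iterations;
}
```

Calling TimeFakeAssignMicros() with both sizes doubled at each step and
comparing successive results would show whether the runtime roughly doubles or
grows faster, which is the kind of behavior a benchmark in isolation is meant
to expose.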

I am curious to hear any feedback. Thanks, Lars