Re: CTAS plan showing single node?

2016-01-21 Thread Jason Altekruse
The query plans can indicate whether a query is parallelized: look for
exchanges, which are used to merge work from multiple execution fragments
or to re-distribute data for an operation. Execution fragments can run on
different threads or different machines. The best place to find out how
queries are executing is the query profiles available in the Web UI under
the "profiles" tab. There, the list of major fragments shows which
hostnames each fragment ran on; for queries with enough data volume that
parallelization will improve query performance, you should see more than
one host listed.
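
For example, you can inspect the plan from SQL before running the query.
This is a minimal sketch using the /csv/customer path from your plan, and it
assumes a dfs workspace pointing at that directory; the presence of exchange
operators such as UnionExchange or HashToRandomExchange indicates the plan
is parallelized:

-- Show the physical plan without executing the query:
EXPLAIN PLAN FOR
SELECT * FROM dfs.`/csv/customer`;

-- A parallelized plan contains exchange operators, e.g. UnionExchange or
-- HashToRandomExchange; a plan with no exchanges runs as a single fragment.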

Please see this section of the docs for more info on how to tune your Drill
queries: https://drill.apache.org/docs/performance-tuning-introduction/
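
If the scan is not parallelizing, the session options below are the usual
knobs; a hedged sketch (defaults can vary by Drill version, so check
sys.options before changing anything):

-- Show the current planner settings:
SELECT * FROM sys.options WHERE name LIKE 'planner.%';

-- Lower the slice target so smaller inputs still get parallelized
-- (the default is 100,000 estimated rows per fragment):
ALTER SESSION SET `planner.slice_target` = 10000;

-- Cap or raise the number of minor fragments allowed per node:
ALTER SESSION SET `planner.width.max_per_node` = 8;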

On Thu, Jan 21, 2016 at 9:00 AM, Matt  wrote:

> I am running a CTAS from CSV files in a 4-node HDFS cluster into a Parquet
> file, and I note that the physical plan in the Drill UI references scans of
> all the CSV sources on a single node.
>
> collectl shows read and write IO on all 4 nodes - does this mean that the
> full cluster is used for scanning the source files, or does my
> configuration have the reads pinned to a single node?
>
> Nodes in cluster: es0{5-8}:
>
> [Scan plan snipped; it appears in full in the original message below.]


CTAS plan showing single node?

2016-01-21 Thread Matt
I am running a CTAS from CSV files in a 4-node HDFS cluster into a Parquet 
file, and I note that the physical plan in the Drill UI references scans of 
all the CSV sources on a single node.


collectl shows read and write IO on all 4 nodes - does this mean that the 
full cluster is used for scanning the source files, or does my 
configuration have the reads pinned to a single node?


Nodes in cluster: es0{5-8}:

Scan(groupscan=[EasyGroupScan 
[selectionRoot=hdfs://es05:54310/csv/customer, numFiles=33, 
columns=[`*`], 
files=[hdfs://es05:54310/csv/customer/customer_20151001.csv, 
hdfs://es05:54310/csv/customer/customer_20151002.csv, 
hdfs://es05:54310/csv/customer/customer_20151003.csv, 
hdfs://es05:54310/csv/customer/customer_20151004.csv, 
hdfs://es05:54310/csv/customer/customer_20151005.csv, 
hdfs://es05:54310/csv/customer/customer_20151006.csv, 
hdfs://es05:54310/csv/customer/customer_20151007.csv, 
hdfs://es05:54310/csv/customer/customer_20151008.csv, 
hdfs://es05:54310/csv/customer/customer_20151009.csv, 
hdfs://es05:54310/csv/customer/customer_20151010.csv, 
hdfs://es05:54310/csv/customer/customer_20151011.csv, 
hdfs://es05:54310/csv/customer/customer_20151012.csv, 
hdfs://es05:54310/csv/customer/customer_20151013.csv, 
hdfs://es05:54310/csv/customer/customer_20151014.csv, 
hdfs://es05:54310/csv/customer/customer_20151015.csv, 
hdfs://es05:54310/csv/customer/customer_20151016.csv, 
hdfs://es05:54310/csv/customer/customer_20151017.csv, 
hdfs://es05:54310/csv/customer/customer_20151018.csv, 
hdfs://es05:54310/csv/customer/customer_20151019.csv, 
hdfs://es05:54310/csv/customer/customer_20151020.csv, 
hdfs://es05:54310/csv/customer/customer_20151021.csv, 
hdfs://es05:54310/csv/customer/customer_20151022.csv, 
hdfs://es05:54310/csv/customer/customer_20151023.csv, 
hdfs://es05:54310/csv/customer/customer_20151024.csv, 
hdfs://es05:54310/csv/customer/customer_20151025.csv, 
hdfs://es05:54310/csv/customer/customer_20151026.csv, 
hdfs://es05:54310/csv/customer/customer_20151027.csv, 
hdfs://es05:54310/csv/customer/customer_20151028.csv, 
hdfs://es05:54310/csv/customer/customer_20151029.csv, 
hdfs://es05:54310/csv/customer/customer_20151030.csv, 
hdfs://es05:54310/csv/customer/customer_20151031.csv, 
hdfs://es05:54310/csv/customer/customer_20151101.csv, 
hdfs://es05:54310/csv/customer/customer_20151102.csv]]]) : rowType = 
(DrillRecordRow[*]): rowcount = 2.407374395E9, cumulative cost = 
{2.407374395E9 rows, 2.407374395E9 cpu, 0.0 io, 0.0 network, 0.0 
memory}, id = 4355