I'm running a CTAS from csv files on a 4-node HDFS cluster into a Parquet file, and I notice the physical plan in the Drill UI references scans of all the csv sources on a single node.

collectl shows read and write IO on all 4 nodes - does that mean the full cluster is being used to scan the source files, or does my configuration have the reads pinned to a single node?
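
For reference, the CTAS is along these lines (the target workspace and table name below are illustrative, not the exact statement I ran):

  -- write Parquet output (parquet is the default store.format)
  alter session set `store.format` = 'parquet';

  -- source path matches the selectionRoot in the scan below
  create table dfs.tmp.customer_parquet as
    select * from dfs.`/csv/customer`;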

Nodes in cluster: es05-es08. The scan operator from the physical plan:

Scan(groupscan=[EasyGroupScan [selectionRoot=hdfs://es05:54310/csv/customer, numFiles=33, columns=[`*`], files=[hdfs://es05:54310/csv/customer/customer_20151001.csv, hdfs://es05:54310/csv/customer/customer_20151002.csv, hdfs://es05:54310/csv/customer/customer_20151003.csv, hdfs://es05:54310/csv/customer/customer_20151004.csv, hdfs://es05:54310/csv/customer/customer_20151005.csv, hdfs://es05:54310/csv/customer/customer_20151006.csv, hdfs://es05:54310/csv/customer/customer_20151007.csv, hdfs://es05:54310/csv/customer/customer_20151008.csv, hdfs://es05:54310/csv/customer/customer_20151009.csv, hdfs://es05:54310/csv/customer/customer_20151010.csv, hdfs://es05:54310/csv/customer/customer_20151011.csv, hdfs://es05:54310/csv/customer/customer_20151012.csv, hdfs://es05:54310/csv/customer/customer_20151013.csv, hdfs://es05:54310/csv/customer/customer_20151014.csv, hdfs://es05:54310/csv/customer/customer_20151015.csv, hdfs://es05:54310/csv/customer/customer_20151016.csv, hdfs://es05:54310/csv/customer/customer_20151017.csv, hdfs://es05:54310/csv/customer/customer_20151018.csv, hdfs://es05:54310/csv/customer/customer_20151019.csv, hdfs://es05:54310/csv/customer/customer_20151020.csv, hdfs://es05:54310/csv/customer/customer_20151021.csv, hdfs://es05:54310/csv/customer/customer_20151022.csv, hdfs://es05:54310/csv/customer/customer_20151023.csv, hdfs://es05:54310/csv/customer/customer_20151024.csv, hdfs://es05:54310/csv/customer/customer_20151025.csv, hdfs://es05:54310/csv/customer/customer_20151026.csv, hdfs://es05:54310/csv/customer/customer_20151027.csv, hdfs://es05:54310/csv/customer/customer_20151028.csv, hdfs://es05:54310/csv/customer/customer_20151029.csv, hdfs://es05:54310/csv/customer/customer_20151030.csv, hdfs://es05:54310/csv/customer/customer_20151031.csv, hdfs://es05:54310/csv/customer/customer_20151101.csv, hdfs://es05:54310/csv/customer/customer_20151102.csv]]]) : rowType = (DrillRecordRow[*]): rowcount = 2.407374395E9, cumulative cost = {2.407374395E9 rows, 2.407374395E9 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 4355
