Hello everyone,

How does Tez calculate the number of mappers and reducers? We have a custom StorageHandler that, when used with Tez, miscalculates the number of mappers when doing a join. I've included an EXPLAIN EXTENDED of a sample query below. One thing I have noticed is that, under properties, both numFiles and totalSize are listed as zero (highlighted in red). While our StorageHandler does use a SerDe that correctly returns SerDeStats, the optimizer seems to be ignoring these values. Additionally, I can't find any obvious place in the StorageHandler interface where I would set numFiles and totalSize. Would anyone know how to correctly set these values?
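As a point of reference, one thing I've been tempted to try is simply forcing non-zero values into the table properties by hand. This is an untested sketch, not something we've run: the totalSize value below is just the rawDataSize from the plan (I don't actually know the real on-storage size), and I'm not certain the metastore honors basic stats set this way rather than recomputing them:

```sql
-- Hypothetical workaround, not a real fix: force non-zero basic stats
-- onto the table so the optimizer stops seeing totalSize = 0.
-- The values are placeholders (totalSize copied from rawDataSize).
ALTER TABLE input_table SET TBLPROPERTIES (
  'numFiles'  = '1',
  'totalSize' = '4780011697'
);
```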
Thanks for your time,
Andrew Long

hive> EXPLAIN EXTENDED SELECT COUNT(*) FROM input_table a JOIN input_table b ON a.account_number = b.account_number;
OK
ABSTRACT SYNTAX TREE:

TOK_QUERY
   TOK_FROM
      TOK_JOIN
         TOK_TABREF
            TOK_TABNAME
               input_table
            a
         TOK_TABREF
            TOK_TABNAME
               input_table
            b
         =
            .
               TOK_TABLE_OR_COL
                  a
               account_number
            .
               TOK_TABLE_OR_COL
                  b
               account_number
   TOK_INSERT
      TOK_DESTINATION
         TOK_DIR
            TOK_TMP_FILE
      TOK_SELECT
         TOK_SELEXPR
            TOK_FUNCTIONSTAR
               COUNT

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Tez
      Edges:
        Reducer 2 <- Map 1 (SIMPLE_EDGE), Map 4 (SIMPLE_EDGE)
        Reducer 3 <- Reducer 2 (SIMPLE_EDGE)
      DagName: hadoop_20160624232828_459c67f5-c870-44c8-924e-39849cd7ae3c:1
      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: b
                  filterExpr: account_number is not null (type: boolean)
                  Statistics: Num rows: 7463603 Data size: 4780011697 Basic stats: COMPLETE Column stats: NONE
                  GatherStats: false
                  Filter Operator
                    isSamplingPred: false
                    predicate: account_number is not null (type: boolean)
                    Statistics: Num rows: 3731802 Data size: 2390006168 Basic stats: COMPLETE Column stats: NONE
                    Reduce Output Operator
                      key expressions: account_number (type: string)
                      sort order: +
                      Map-reduce partition columns: account_number (type: string)
                      Statistics: Num rows: 3731802 Data size: 2390006168 Basic stats: COMPLETE Column stats: NONE
                      tag: 1
                      auto parallelism: true
            Path -> Alias:
              hdfs://ip-172-31-9-161.ec2.internal:8020/user/hive/warehouse/input_table [b]
            Path -> Partition:
              hdfs://ip-172-31-9-161.ec2.internal:8020/user/hive/warehouse/input_table
                Partition
                  base file name: input_table
                  input format: amazon.conexio.hive.EDXManifestHiveInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                  properties:
                    COLUMN_STATS_ACCURATE true
                    EXTERNAL TRUE
                    amazon.conexio.input.format TSV
                    amazon.conexio.input.key.path hdfs:/keys/cluster-private-key.ion
                    amazon.conexio.input.manifest.path s3n://horizonhive-executer-integ-test/emr/cluster/j-3S0P05X6DPTI3/message-8053ed22-154d-489c-bfc6-e7b402f58645/input-etlm-gfs-matcher-f-f-us-r-1-daily-manifest.ion
                    amazon.conexio.schema.binding.automatic true
                    amazon.conexio.schema.binding.override s3n://horizonhive-executer-integ-test/emr/cluster/j-3S0P05X6DPTI3/message-8053ed22-154d-489c-bfc6-e7b402f58645/input-etlm-gfs-matcher-f-f-us-r-1-daily-schema.xml
                    bucket_count -1
                    columns ignored
                    columns.comments
                    columns.types int
                    file.inputformat org.apache.hadoop.mapred.SequenceFileInputFormat
                    file.outputformat org.apache.hadoop.mapred.SequenceFileOutputFormat
                    location hdfs://ip-172-31-9-161.ec2.internal:8020/user/hive/warehouse/input_table
                    name default.input_table
                    numFiles 0
                    numRows 7463603
                    rawDataSize 4780011697
                    serialization.ddl struct input_table { i32 ignored}
                    serialization.format 1
                    serialization.lib amazon.conexio.hive.serde.edx.GenericEDXSerDe
                    storage_handler amazon.conexio.hive.EDXManifestStorageHandler
                    totalSize 0
                    transient_lastDdlTime 1466809909
                  serde: amazon.conexio.hive.serde.edx.GenericEDXSerDe
                    input format: amazon.conexio.hive.EDXManifestHiveInputFormat
                    jobProperties:
                      amazon.conexio.input.format TSV
                      amazon.conexio.input.key.path hdfs:/keys/cluster-private-key.ion
                      amazon.conexio.input.manifest.path s3n://horizonhive-executer-integ-test/emr/cluster/j-3S0P05X6DPTI3/message-8053ed22-154d-489c-bfc6-e7b402f58645/input-etlm-gfs-matcher-f-f-us-r-1-daily-manifest.ion
                      amazon.conexio.schema.binding.automatic true
                      amazon.conexio.schema.binding.override s3n://horizonhive-executer-integ-test/emr/cluster/j-3S0P05X6DPTI3/message-8053ed22-154d-489c-bfc6-e7b402f58645/input-etlm-gfs-matcher-f-f-us-r-1-daily-schema.xml
                    output format: amazon.conexio.hive.EDXManifestHiveOutputFormat
                    properties:
                      COLUMN_STATS_ACCURATE true
                      EXTERNAL TRUE
                      amazon.conexio.input.format TSV
                      amazon.conexio.input.key.path hdfs:/keys/cluster-private-key.ion
                      amazon.conexio.input.manifest.path s3n://horizonhive-executer-integ-test/emr/cluster/j-3S0P05X6DPTI3/message-8053ed22-154d-489c-bfc6-e7b402f58645/input-etlm-gfs-matcher-f-f-us-r-1-daily-manifest.ion
                      amazon.conexio.schema.binding.automatic true
                      amazon.conexio.schema.binding.override s3n://horizonhive-executer-integ-test/emr/cluster/j-3S0P05X6DPTI3/message-8053ed22-154d-489c-bfc6-e7b402f58645/input-etlm-gfs-matcher-f-f-us-r-1-daily-schema.xml
                      bucket_count -1
                      columns ignored
                      columns.comments
                      columns.types int
                      file.inputformat org.apache.hadoop.mapred.SequenceFileInputFormat
                      file.outputformat org.apache.hadoop.mapred.SequenceFileOutputFormat
                      location hdfs://ip-172-31-9-161.ec2.internal:8020/user/hive/warehouse/input_table
                      name default.input_table
                      numFiles 0
                      numRows 7463603
                      rawDataSize 4780011697
                      serialization.ddl struct input_table { i32 ignored}
                      serialization.format 1
                      serialization.lib amazon.conexio.hive.serde.edx.GenericEDXSerDe
                      storage_handler amazon.conexio.hive.EDXManifestStorageHandler
                      totalSize 0
                      transient_lastDdlTime 1466809909
                    serde: amazon.conexio.hive.serde.edx.GenericEDXSerDe
                    name: default.input_table
                  name: default.input_table
            Truncated Path -> Alias:
              /input_table [b]
        Map 4
            Map Operator Tree:
                TableScan
                  alias: a
                  filterExpr: account_number is not null (type: boolean)
                  Statistics: Num rows: 7463603 Data size: 4780011697 Basic stats: COMPLETE Column stats: NONE
                  GatherStats: false
                  Filter Operator
                    isSamplingPred: false
                    predicate: account_number is not null (type: boolean)
                    Statistics: Num rows: 3731802 Data size: 2390006168 Basic stats: COMPLETE Column stats: NONE
                    Reduce Output Operator
                      key expressions: account_number (type: string)
                      sort order: +
                      Map-reduce partition columns: account_number (type: string)
                      Statistics: Num rows: 3731802 Data size: 2390006168 Basic stats: COMPLETE Column stats: NONE
                      tag: 0
                      auto parallelism: true
            Path -> Alias:
              hdfs://ip-172-31-9-161.ec2.internal:8020/user/hive/warehouse/input_table [a]
            Path -> Partition:
              hdfs://ip-172-31-9-161.ec2.internal:8020/user/hive/warehouse/input_table
                Partition
                  base file name: input_table
                  input format: amazon.conexio.hive.EDXManifestHiveInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                  properties:
                    COLUMN_STATS_ACCURATE true
                    EXTERNAL TRUE
                    amazon.conexio.input.format TSV
                    amazon.conexio.input.key.path hdfs:/keys/cluster-private-key.ion
                    amazon.conexio.input.manifest.path s3n://horizonhive-executer-integ-test/emr/cluster/j-3S0P05X6DPTI3/message-8053ed22-154d-489c-bfc6-e7b402f58645/input-etlm-gfs-matcher-f-f-us-r-1-daily-manifest.ion
                    amazon.conexio.schema.binding.automatic true
                    amazon.conexio.schema.binding.override s3n://horizonhive-executer-integ-test/emr/cluster/j-3S0P05X6DPTI3/message-8053ed22-154d-489c-bfc6-e7b402f58645/input-etlm-gfs-matcher-f-f-us-r-1-daily-schema.xml
                    bucket_count -1
                    columns ignored
                    columns.comments
                    columns.types int
                    file.inputformat org.apache.hadoop.mapred.SequenceFileInputFormat
                    file.outputformat org.apache.hadoop.mapred.SequenceFileOutputFormat
                    location hdfs://ip-172-31-9-161.ec2.internal:8020/user/hive/warehouse/input_table
                    name default.input_table
                    numFiles 0
                    numRows 7463603
                    rawDataSize 4780011697
                    serialization.ddl struct input_table { i32 ignored}
                    serialization.format 1
                    serialization.lib amazon.conexio.hive.serde.edx.GenericEDXSerDe
                    storage_handler amazon.conexio.hive.EDXManifestStorageHandler
                    totalSize 0
                    transient_lastDdlTime 1466809909
                  serde: amazon.conexio.hive.serde.edx.GenericEDXSerDe
                    input format: amazon.conexio.hive.EDXManifestHiveInputFormat
                    jobProperties:
                      amazon.conexio.input.format TSV
                      amazon.conexio.input.key.path hdfs:/keys/cluster-private-key.ion
                      amazon.conexio.input.manifest.path s3n://horizonhive-executer-integ-test/emr/cluster/j-3S0P05X6DPTI3/message-8053ed22-154d-489c-bfc6-e7b402f58645/input-etlm-gfs-matcher-f-f-us-r-1-daily-manifest.ion
                      amazon.conexio.schema.binding.automatic true
                      amazon.conexio.schema.binding.override s3n://horizonhive-executer-integ-test/emr/cluster/j-3S0P05X6DPTI3/message-8053ed22-154d-489c-bfc6-e7b402f58645/input-etlm-gfs-matcher-f-f-us-r-1-daily-schema.xml
                    output format: amazon.conexio.hive.EDXManifestHiveOutputFormat
                    properties:
                      COLUMN_STATS_ACCURATE true
                      EXTERNAL TRUE
                      amazon.conexio.input.format TSV
                      amazon.conexio.input.key.path hdfs:/keys/cluster-private-key.ion
                      amazon.conexio.input.manifest.path s3n://horizonhive-executer-integ-test/emr/cluster/j-3S0P05X6DPTI3/message-8053ed22-154d-489c-bfc6-e7b402f58645/input-etlm-gfs-matcher-f-f-us-r-1-daily-manifest.ion
                      amazon.conexio.schema.binding.automatic true
                      amazon.conexio.schema.binding.override s3n://horizonhive-executer-integ-test/emr/cluster/j-3S0P05X6DPTI3/message-8053ed22-154d-489c-bfc6-e7b402f58645/input-etlm-gfs-matcher-f-f-us-r-1-daily-schema.xml
                      bucket_count -1
                      columns ignored
                      columns.comments
                      columns.types int
                      file.inputformat org.apache.hadoop.mapred.SequenceFileInputFormat
                      file.outputformat org.apache.hadoop.mapred.SequenceFileOutputFormat
                      location hdfs://ip-172-31-9-161.ec2.internal:8020/user/hive/warehouse/input_table
                      name default.input_table
                      numFiles 0
                      numRows 7463603
                      rawDataSize 4780011697
                      serialization.ddl struct input_table { i32 ignored}
                      serialization.format 1
                      serialization.lib amazon.conexio.hive.serde.edx.GenericEDXSerDe
                      storage_handler amazon.conexio.hive.EDXManifestStorageHandler
                      totalSize 0
                      transient_lastDdlTime 1466809909
                    serde: amazon.conexio.hive.serde.edx.GenericEDXSerDe
                    name: default.input_table
                  name: default.input_table
            Truncated Path -> Alias:
              /input_table [a]
        Reducer 2
            Needs Tagging: false
            Reduce Operator Tree:
              Merge Join Operator
                condition map:
                     Inner Join 0 to 1
                condition expressions:
                  0
                  1
                Position of Big Table: 0
                Statistics: Num rows: 4104982 Data size: 2629006841 Basic stats: COMPLETE Column stats: NONE
                Select Operator
                  Statistics: Num rows: 4104982 Data size: 2629006841 Basic stats: COMPLETE Column stats: NONE
                  Group By Operator
                    aggregations: count()
                    mode: hash
                    outputColumnNames: _col0
                    Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                    Reduce Output Operator
                      sort order:
                      Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                      tag: -1
                      value expressions: _col0 (type: bigint)
                      auto parallelism: false
        Reducer 3
            Needs Tagging: false
            Reduce Operator Tree:
              Group By Operator
                aggregations: count(VALUE._col0)
                mode: mergepartial
                outputColumnNames: _col0
                Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                Select Operator
                  expressions: _col0 (type: bigint)
                  outputColumnNames: _col0
                  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                  File Output Operator
                    compressed: false
                    GlobalTableId: 0
                    directory: hdfs://ip-172-31-9-161.ec2.internal:8020/tmp/hive/hadoop/146738ed-6f94-40fb-9741-f097f254c1ba/hive_2016-06-24_23-28-17_314_5714360880755224101-1/-ext-10001
                    NumFilesPerFileSink: 1
                    Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                    Stats Publishing Key Prefix: hdfs://ip-172-31-9-161.ec2.internal:8020/tmp/hive/hadoop/146738ed-6f94-40fb-9741-f097f254c1ba/hive_2016-06-24_23-28-17_314_5714360880755224101-1/-ext-10001/
                    table:
                        input format: org.apache.hadoop.mapred.TextInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                        properties:
                          columns _col0
                          columns.types bigint
                          escape.delim \
                          hive.serialization.extend.nesting.levels true
                          serialization.format 1
                          serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                    TotalFiles: 1
                    GatherStats: false
                    MultiFileSpray: false

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

Time taken: 0.535 seconds, Fetched: 290 row(s)

hive> SELECT COUNT(*) FROM input_table a JOIN input_table b ON a.account_number = b.account_number;
Query ID = hadoop_20160624232929_381b2a05-de3f-4838-814e-72bb4eb6263a
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1466732779256_0016)

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1                RUNNING      1          0        1        0       0       0
Map 4                RUNNING      1          0        1        0       0       0
Reducer 2             INITED     19          0        0       19       0       0
Reducer 3             INITED      1          0        0        1       0       0
--------------------------------------------------------------------------------
VERTICES: 00/04  [>>--------------------------] 0%    ELAPSED TIME: 189.30 s
--------------------------------------------------------------------------------
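P.S. In case it's relevant: the only session-level knobs I'm aware of that directly influence these mapper/reducer counts are the Tez split-grouping and reducer-sizing settings. The values below are purely illustrative, not tuned for this job, and obviously don't address the underlying missing stats:

```sql
-- Illustrative session settings, not a fix for the zeroed-out stats:
set tez.grouping.min-size=16777216;                  -- min bytes per grouped split (16 MB)
set tez.grouping.max-size=1073741824;                -- max bytes per grouped split (1 GB)
set hive.exec.reducers.bytes.per.reducer=268435456;  -- target bytes per reducer (256 MB)
```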