Hello everyone,

How does Tez calculate the number of mappers and reducers?  We have a custom 
StorageHandler, that when used with tez miscalculates the number of mappers 
when doing a join.  I’ve included an EXPLAIN EXTENDED of a sample query below.  
One thing I have noticed is that under properties both numFiles and totalSize 
are listed as zero(as high lighted in red).  While our StorageHandler does 
utilize a SERDE that correctly returns SerDeStats, it seems like the optimizer 
is ignoring these values.  Additionally, I can’t seem to find any obvious place 
where I would set numFiles and totalSize from the StorageHandler interface. 
Would anyone know how to correctly set these values?

Thanks for your time
Andrew Long

hive> EXPLAIN EXTENDED SELECT COUNT(*) FROM input_table a JOIN input_table b ON 
a.account_number = b.account_number;
OK
ABSTRACT SYNTAX TREE:

TOK_QUERY
   TOK_FROM
      TOK_JOIN
         TOK_TABREF
            TOK_TABNAME
               input_table
            a
         TOK_TABREF
            TOK_TABNAME
               input_table
            b
         =
            .
               TOK_TABLE_OR_COL
                  a
               account_number
            .
               TOK_TABLE_OR_COL
                  b
               account_number
   TOK_INSERT
      TOK_DESTINATION
         TOK_DIR
            TOK_TMP_FILE
      TOK_SELECT
         TOK_SELEXPR
            TOK_FUNCTIONSTAR
               COUNT


STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Tez
      Edges:
        Reducer 2 <- Map 1 (SIMPLE_EDGE), Map 4 (SIMPLE_EDGE)
        Reducer 3 <- Reducer 2 (SIMPLE_EDGE)
      DagName: hadoop_20160624232828_459c67f5-c870-44c8-924e-39849cd7ae3c:1
      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: b
                  filterExpr: account_number is not null (type: boolean)
                  Statistics: Num rows: 7463603 Data size: 4780011697 Basic 
stats: COMPLETE Column stats: NONE
                  GatherStats: false
                  Filter Operator
                    isSamplingPred: false
                    predicate: account_number is not null (type: boolean)
                    Statistics: Num rows: 3731802 Data size: 2390006168 Basic 
stats: COMPLETE Column stats: NONE
                    Reduce Output Operator
                      key expressions: account_number (type: string)
                      sort order: +
                      Map-reduce partition columns: account_number (type: 
string)
                      Statistics: Num rows: 3731802 Data size: 2390006168 Basic 
stats: COMPLETE Column stats: NONE
                      tag: 1
                      auto parallelism: true
            Path -> Alias:
              
hdfs://ip-172-31-9-161.ec2.internal:8020/user/hive/warehouse/input_table [b]
            Path -> Partition:
              
hdfs://ip-172-31-9-161.ec2.internal:8020/user/hive/warehouse/input_table
                Partition
                  base file name: input_table
                  input format: amazon.conexio.hive.EDXManifestHiveInputFormat
                  output format: 
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                  properties:
                    COLUMN_STATS_ACCURATE true
                    EXTERNAL TRUE
                    amazon.conexio.input.format TSV
                    amazon.conexio.input.key.path 
hdfs:/keys/cluster-private-key.ion
                    amazon.conexio.input.manifest.path 
s3n://horizonhive-executer-integ-test/emr/cluster/j-3S0P05X6DPTI3/message-8053ed22-154d-489c-bfc6-e7b402f58645/input-etlm-gfs-matcher-f-f-us-r-1-daily-manifest.ion
                    amazon.conexio.schema.binding.automatic true
                    amazon.conexio.schema.binding.override 
s3n://horizonhive-executer-integ-test/emr/cluster/j-3S0P05X6DPTI3/message-8053ed22-154d-489c-bfc6-e7b402f58645/input-etlm-gfs-matcher-f-f-us-r-1-daily-schema.xml
                    bucket_count -1
                    columns ignored
                    columns.comments
                    columns.types int
                    file.inputformat 
org.apache.hadoop.mapred.SequenceFileInputFormat
                    file.outputformat 
org.apache.hadoop.mapred.SequenceFileOutputFormat
                    location 
hdfs://ip-172-31-9-161.ec2.internal:8020/user/hive/warehouse/input_table
                    name default.input_table
                    numFiles 0
                    numRows 7463603
                    rawDataSize 4780011697
                    serialization.ddl struct input_table { i32 ignored}
                    serialization.format 1
                    serialization.lib 
amazon.conexio.hive.serde.edx.GenericEDXSerDe
                    storage_handler 
amazon.conexio.hive.EDXManifestStorageHandler
                    totalSize 0
                    transient_lastDdlTime 1466809909
                  serde: amazon.conexio.hive.serde.edx.GenericEDXSerDe

                    input format: amazon.conexio.hive.EDXManifestHiveInputFormat
                    jobProperties:
                      amazon.conexio.input.format TSV
                      amazon.conexio.input.key.path 
hdfs:/keys/cluster-private-key.ion
                      amazon.conexio.input.manifest.path 
s3n://horizonhive-executer-integ-test/emr/cluster/j-3S0P05X6DPTI3/message-8053ed22-154d-489c-bfc6-e7b402f58645/input-etlm-gfs-matcher-f-f-us-r-1-daily-manifest.ion
                      amazon.conexio.schema.binding.automatic true
                      amazon.conexio.schema.binding.override 
s3n://horizonhive-executer-integ-test/emr/cluster/j-3S0P05X6DPTI3/message-8053ed22-154d-489c-bfc6-e7b402f58645/input-etlm-gfs-matcher-f-f-us-r-1-daily-schema.xml
                    output format: 
amazon.conexio.hive.EDXManifestHiveOutputFormat
                    properties:
                      COLUMN_STATS_ACCURATE true
                      EXTERNAL TRUE
                      amazon.conexio.input.format TSV
                      amazon.conexio.input.key.path 
hdfs:/keys/cluster-private-key.ion
                      amazon.conexio.input.manifest.path 
s3n://horizonhive-executer-integ-test/emr/cluster/j-3S0P05X6DPTI3/message-8053ed22-154d-489c-bfc6-e7b402f58645/input-etlm-gfs-matcher-f-f-us-r-1-daily-manifest.ion
                      amazon.conexio.schema.binding.automatic true
                      amazon.conexio.schema.binding.override 
s3n://horizonhive-executer-integ-test/emr/cluster/j-3S0P05X6DPTI3/message-8053ed22-154d-489c-bfc6-e7b402f58645/input-etlm-gfs-matcher-f-f-us-r-1-daily-schema.xml
                      bucket_count -1
                      columns ignored
                      columns.comments
                      columns.types int
                      file.inputformat 
org.apache.hadoop.mapred.SequenceFileInputFormat
                      file.outputformat 
org.apache.hadoop.mapred.SequenceFileOutputFormat
                      location 
hdfs://ip-172-31-9-161.ec2.internal:8020/user/hive/warehouse/input_table
                      name default.input_table
                      numFiles 0
                      numRows 7463603
                      rawDataSize 4780011697
                      serialization.ddl struct input_table { i32 ignored}
                      serialization.format 1
                      serialization.lib 
amazon.conexio.hive.serde.edx.GenericEDXSerDe
                      storage_handler 
amazon.conexio.hive.EDXManifestStorageHandler
                      totalSize 0
                      transient_lastDdlTime 1466809909
                    serde: amazon.conexio.hive.serde.edx.GenericEDXSerDe
                    name: default.input_table
                  name: default.input_table
            Truncated Path -> Alias:
             /input_table [b]
        Map 4
            Map Operator Tree:
                TableScan
                  alias: a
                  filterExpr: account_number is not null (type: boolean)
                  Statistics: Num rows: 7463603 Data size: 4780011697 Basic 
stats: COMPLETE Column stats: NONE
                  GatherStats: false
                  Filter Operator
                    isSamplingPred: false
                    predicate: account_number is not null (type: boolean)
                    Statistics: Num rows: 3731802 Data size: 2390006168 Basic 
stats: COMPLETE Column stats: NONE
                    Reduce Output Operator
                      key expressions: account_number (type: string)
                      sort order: +
                      Map-reduce partition columns: account_number (type: 
string)
                      Statistics: Num rows: 3731802 Data size: 2390006168 Basic 
stats: COMPLETE Column stats: NONE
                      tag: 0
                      auto parallelism: true
            Path -> Alias:
              
hdfs://ip-172-31-9-161.ec2.internal:8020/user/hive/warehouse/input_table [a]
            Path -> Partition:
              
hdfs://ip-172-31-9-161.ec2.internal:8020/user/hive/warehouse/input_table
                Partition
                  base file name: input_table
                  input format: amazon.conexio.hive.EDXManifestHiveInputFormat
                  output format: 
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                  properties:
                    COLUMN_STATS_ACCURATE true
                    EXTERNAL TRUE
                    amazon.conexio.input.format TSV
                    amazon.conexio.input.key.path 
hdfs:/keys/cluster-private-key.ion
                    amazon.conexio.input.manifest.path 
s3n://horizonhive-executer-integ-test/emr/cluster/j-3S0P05X6DPTI3/message-8053ed22-154d-489c-bfc6-e7b402f58645/input-etlm-gfs-matcher-f-f-us-r-1-daily-manifest.ion
                    amazon.conexio.schema.binding.automatic true
                    amazon.conexio.schema.binding.override 
s3n://horizonhive-executer-integ-test/emr/cluster/j-3S0P05X6DPTI3/message-8053ed22-154d-489c-bfc6-e7b402f58645/input-etlm-gfs-matcher-f-f-us-r-1-daily-schema.xml
                    bucket_count -1
                    columns ignored
                    columns.comments
                    columns.types int
                    file.inputformat 
org.apache.hadoop.mapred.SequenceFileInputFormat
                    file.outputformat 
org.apache.hadoop.mapred.SequenceFileOutputFormat
                    location 
hdfs://ip-172-31-9-161.ec2.internal:8020/user/hive/warehouse/input_table
                    name default.input_table
                    numFiles 0
                    numRows 7463603
                    rawDataSize 4780011697
                    serialization.ddl struct input_table { i32 ignored}
                    serialization.format 1
                    serialization.lib 
amazon.conexio.hive.serde.edx.GenericEDXSerDe
                    storage_handler 
amazon.conexio.hive.EDXManifestStorageHandler
                    totalSize 0
                    transient_lastDdlTime 1466809909
                  serde: amazon.conexio.hive.serde.edx.GenericEDXSerDe

                    input format: amazon.conexio.hive.EDXManifestHiveInputFormat
                    jobProperties:
                      amazon.conexio.input.format TSV
                      amazon.conexio.input.key.path 
hdfs:/keys/cluster-private-key.ion
                      amazon.conexio.input.manifest.path 
s3n://horizonhive-executer-integ-test/emr/cluster/j-3S0P05X6DPTI3/message-8053ed22-154d-489c-bfc6-e7b402f58645/input-etlm-gfs-matcher-f-f-us-r-1-daily-manifest.ion
                      amazon.conexio.schema.binding.automatic true
                      amazon.conexio.schema.binding.override 
s3n://horizonhive-executer-integ-test/emr/cluster/j-3S0P05X6DPTI3/message-8053ed22-154d-489c-bfc6-e7b402f58645/input-etlm-gfs-matcher-f-f-us-r-1-daily-schema.xml
                   output format: 
amazon.conexio.hive.EDXManifestHiveOutputFormat
                    properties:
                      COLUMN_STATS_ACCURATE true
                      EXTERNAL TRUE
                      amazon.conexio.input.format TSV
                      amazon.conexio.input.key.path 
hdfs:/keys/cluster-private-key.ion
                      amazon.conexio.input.manifest.path 
s3n://horizonhive-executer-integ-test/emr/cluster/j-3S0P05X6DPTI3/message-8053ed22-154d-489c-bfc6-e7b402f58645/input-etlm-gfs-matcher-f-f-us-r-1-daily-manifest.ion
                      amazon.conexio.schema.binding.automatic true
                      amazon.conexio.schema.binding.override 
s3n://horizonhive-executer-integ-test/emr/cluster/j-3S0P05X6DPTI3/message-8053ed22-154d-489c-bfc6-e7b402f58645/input-etlm-gfs-matcher-f-f-us-r-1-daily-schema.xml
                      bucket_count -1
                      columns ignored
                      columns.comments
                      columns.types int
                      file.inputformat 
org.apache.hadoop.mapred.SequenceFileInputFormat
                      file.outputformat 
org.apache.hadoop.mapred.SequenceFileOutputFormat
                      location 
hdfs://ip-172-31-9-161.ec2.internal:8020/user/hive/warehouse/input_table
                      name default.input_table
                      numFiles 0
                      numRows 7463603
                      rawDataSize 4780011697
                      serialization.ddl struct input_table { i32 ignored}
                      serialization.format 1
                      serialization.lib 
amazon.conexio.hive.serde.edx.GenericEDXSerDe
                      storage_handler 
amazon.conexio.hive.EDXManifestStorageHandler
                      totalSize 0
                      transient_lastDdlTime 1466809909
                    serde: amazon.conexio.hive.serde.edx.GenericEDXSerDe
                    name: default.input_table
                  name: default.input_table
            Truncated Path -> Alias:
             /input_table [a]
        Reducer 2
            Needs Tagging: false
            Reduce Operator Tree:
              Merge Join Operator
                condition map:
                     Inner Join 0 to 1
                condition expressions:
                  0
                  1
                Position of Big Table: 0
                Statistics: Num rows: 4104982 Data size: 2629006841 Basic 
stats: COMPLETE Column stats: NONE
                Select Operator
                  Statistics: Num rows: 4104982 Data size: 2629006841 Basic 
stats: COMPLETE Column stats: NONE
                  Group By Operator
                    aggregations: count()
                    mode: hash
                    outputColumnNames: _col0
                    Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE 
Column stats: NONE
                    Reduce Output Operator
                      sort order:
                      Statistics: Num rows: 1 Data size: 8 Basic stats: 
COMPLETE Column stats: NONE
                      tag: -1
                      value expressions: _col0 (type: bigint)
                      auto parallelism: false
        Reducer 3
            Needs Tagging: false
            Reduce Operator Tree:
              Group By Operator
                aggregations: count(VALUE._col0)
                mode: mergepartial
                outputColumnNames: _col0
                Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE 
Column stats: NONE
                Select Operator
                  expressions: _col0 (type: bigint)
                  outputColumnNames: _col0
                  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE 
Column stats: NONE
                  File Output Operator
                    compressed: false
                    GlobalTableId: 0
                    directory: 
hdfs://ip-172-31-9-161.ec2.internal:8020/tmp/hive/hadoop/146738ed-6f94-40fb-9741-f097f254c1ba/hive_2016-06-24_23-28-17_314_5714360880755224101-1/-ext-10001
                    NumFilesPerFileSink: 1
                    Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE 
Column stats: NONE
                    Stats Publishing Key Prefix: 
hdfs://ip-172-31-9-161.ec2.internal:8020/tmp/hive/hadoop/146738ed-6f94-40fb-9741-f097f254c1ba/hive_2016-06-24_23-28-17_314_5714360880755224101-1/-ext-10001/
                    table:
                        input format: org.apache.hadoop.mapred.TextInputFormat
                        output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                        properties:
                          columns _col0
                          columns.types bigint
                          escape.delim \
                          hive.serialization.extend.nesting.levels true
                          serialization.format 1
                          serialization.lib 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                        serde: 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                    TotalFiles: 1
                    GatherStats: false
                    MultiFileSpray: false

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

Time taken: 0.535 seconds, Fetched: 290 row(s)
hive> SELECT COUNT(*) FROM input_table a JOIN input_table b ON a.account_number 
= b.account_number;
Query ID = hadoop_20160624232929_381b2a05-de3f-4838-814e-72bb4eb6263a
Total jobs = 1
Launching Job 1 out of 1


Status: Running (Executing on YARN cluster with App id 
application_1466732779256_0016)

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1                RUNNING      1          0        1        0       0       0
Map 4                RUNNING      1          0        1        0       0       0
Reducer 2             INITED     19          0        0       19       0       0
Reducer 3             INITED      1          0        0        1       0       0
--------------------------------------------------------------------------------
VERTICES: 00/04  [>>--------------------------] 0%    ELAPSED TIME: 189.30 s
--------------------------------------------------------------------------------


Reply via email to