Thanks. I ran the EXPLAIN command and got the plan, but I am still confused. Below is the description of the two map-reduce stages.
It seems that the aggregation has already been done in Stage-1, so why does Stage-2 aggregate again?

==========================
STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        a:t1
          TableScan
            alias: t1
            Select Operator
              expressions:
                    expr: f
                    type: string
              outputColumnNames: _col0
              Transform Operator
                command: mymapper
                output info:
                    input format: org.apache.hadoop.mapred.TextInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                Select Operator
                  expressions:
                        expr: _col0
                        type: string
                        expr: _col1
                        type: string
                        expr: _col2
                        type: string
                  outputColumnNames: _col0, _col1, _col2
                  Group By Operator
                    aggregations:
                          expr: sum(_col0)
                          expr: sum(_col1)
                    bucketGroup: false
                    keys:
                          expr: _col2
                          type: string
                    mode: hash
                    outputColumnNames: _col0, _col1, _col2
                    Reduce Output Operator
                      key expressions:
                            expr: _col0
                            type: string
                      sort order: +
                      Map-reduce partition columns:
                            expr: rand()
                            type: double
                      tag: -1
                      value expressions:
                            expr: _col1
                            type: double
                            expr: _col2
                            type: double
      Reduce Operator Tree:
        Group By Operator
          aggregations:
                expr: sum(VALUE._col0)
                expr: sum(VALUE._col1)
          bucketGroup: false
          keys:
                expr: KEY._col0
                type: string
          mode: partials
          outputColumnNames: _col0, _col1, _col2
          File Output Operator
            compressed: false
            GlobalTableId: 0
            table:
                input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat

  Stage: Stage-2
    Map Reduce
      Alias -> Map Operator Tree:
        hdfs://hdpnn:9000/mydata/hive/hive_2013-01-23_13-46-09_628_5487089660360786955/10002
            Reduce Output Operator
              key expressions:
                    expr: _col0
                    type: string
              sort order: +
              Map-reduce partition columns:
                    expr: _col0
                    type: string
              tag: -1
              value expressions:
                    expr: _col1
                    type: double
                    expr: _col2
                    type: double
      Reduce Operator Tree:
        Group By Operator
          aggregations:
                expr: sum(VALUE._col0)
                expr: sum(VALUE._col1)
          bucketGroup: false
          keys:
                expr: KEY._col0
                type: string
          mode: final
          outputColumnNames: _col0, _col1, _col2
          Select Operator
            expressions:
                  expr: _col1
                  type: double
                  expr: _col2
                  type: double
                  expr: _col0
                  type: string
            outputColumnNames: _col0, _col1, _col2
            File Output Operator
              compressed: false
              GlobalTableId: 0
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
============================

At 2013-01-23 11:45:13, Richard <codemon...@163.com> wrote:

I am wondering how to determine the number of map-reduce jobs for a Hive query. For example, take the following query:

    select sum(c1), sum(c2), k1
    from { select transform(*) using 'mymapper' as c1, c2, k1 from t1 } a
    group by k1;

When I run this query, it takes two map-reduce jobs, but I expect it to take only one: in the map stage, use 'mymapper' as the mapper, then shuffle the mapper output by k1 and perform the sum in the reducer. So why does Hive take two map-reduce jobs?
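
For reference, the data flow that the plan above describes can be sketched outside Hive. In Stage-1 the reduce output is partitioned by rand() and the reducer runs the Group By in mode "partials"; in Stage-2 the output is partitioned by the key (_col0) and the Group By runs in mode "final". The Python below is only a simulation of that flow under stated assumptions, not Hive code: the function names, the number of reducers, and the sample rows are all hypothetical, and the map-side hash aggregation (mode "hash") is left out for brevity. One possible cause of such a plan, offered as an assumption rather than a confirmed diagnosis, is that hive.groupby.skewindata is enabled.

```python
import random
from collections import defaultdict


def sum_by_key(rows):
    """Sum c1 and c2 per key k1 for an iterable of (c1, c2, k1) tuples."""
    acc = defaultdict(lambda: [0.0, 0.0])
    for c1, c2, k1 in rows:
        acc[k1][0] += c1
        acc[k1][1] += c2
    # Emit in the same (sum_c1, sum_c2, k1) layout so partial results
    # can be fed back into this function for the final merge.
    return [(s1, s2, k) for k, (s1, s2) in acc.items()]


def two_stage(rows, num_reducers=3, seed=0):
    """Simulate the two jobs in the plan (hypothetical helper)."""
    rng = random.Random(seed)
    # Stage-1: rows go to a random reducer (partition column rand()),
    # and each reducer emits partial sums per key (mode: partials).
    buckets = [[] for _ in range(num_reducers)]
    for row in rows:
        buckets[rng.randrange(num_reducers)].append(row)
    partials = [p for bucket in buckets for p in sum_by_key(bucket)]
    # Stage-2: partials are regrouped by key and merged (mode: final).
    return sorted(sum_by_key(partials), key=lambda t: t[2])


def one_stage(rows):
    """The single-job plan the question expected: shuffle by k1, then sum."""
    return sorted(sum_by_key(rows), key=lambda t: t[2])


# Hypothetical mapper output: (c1, c2, k1)
rows = [
    (1.0, 10.0, "a"),
    (2.0, 20.0, "b"),
    (3.0, 30.0, "a"),
    (4.0, 40.0, "a"),
    (5.0, 50.0, "b"),
]

if __name__ == "__main__":
    print(two_stage(rows))
    print(one_stage(rows))
```

Both functions return the same per-key sums. The difference is that in the two-stage version a single key's rows are spread over several reducers before being merged, which is why the second aggregation is still needed after the first.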