If you look closely, in the first phase it executes your transform and in the second it does your sum operation.
On Wed, Jan 23, 2013 at 11:24 AM, Richard <codemon...@163.com> wrote:
> thanks. I used the explain command and got the plan, but I am still confused.
> Below is the description of the two map-reduce stages:
>
> It seems that in stage-1 the aggregation has already been done, so why does
> stage-2 aggregate again?
>
> ==========================
> STAGE PLANS:
>   Stage: Stage-1
>     Map Reduce
>       Alias -> Map Operator Tree:
>         a:t1
>           TableScan
>             alias: t1
>             Select Operator
>               expressions:
>                     expr: f
>                     type: string
>               outputColumnNames: _col0
>               Transform Operator
>                 command: mymapper
>                 output info:
>                     input format: org.apache.hadoop.mapred.TextInputFormat
>                     output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                 Select Operator
>                   expressions:
>                         expr: _col0
>                         type: string
>                         expr: _col1
>                         type: string
>                         expr: _col2
>                         type: string
>                   outputColumnNames: _col0, _col1, _col2
>                   Group By Operator
>                     aggregations:
>                           expr: sum(_col0)
>                           expr: sum(_col1)
>                     bucketGroup: false
>                     keys:
>                           expr: _col2
>                           type: string
>                     mode: hash
>                     outputColumnNames: _col0, _col1, _col2
>                     Reduce Output Operator
>                       key expressions:
>                             expr: _col0
>                             type: string
>                       sort order: +
>                       Map-reduce partition columns:
>                             expr: rand()
>                             type: double
>                       tag: -1
>                       value expressions:
>                             expr: _col1
>                             type: double
>                             expr: _col2
>                             type: double
>       Reduce Operator Tree:
>         Group By Operator
>           aggregations:
>                 expr: sum(VALUE._col0)
>                 expr: sum(VALUE._col1)
>           bucketGroup: false
>           keys:
>                 expr: KEY._col0
>                 type: string
>           mode: partials
>           outputColumnNames: _col0, _col1, _col2
>           File Output Operator
>             compressed: false
>             GlobalTableId: 0
>             table:
>                 input format: org.apache.hadoop.mapred.SequenceFileInputFormat
>                 output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>
>   Stage: Stage-2
>     Map Reduce
>       Alias -> Map Operator Tree:
>         hdfs://hdpnn:9000/mydata/hive/hive_2013-01-23_13-46-09_628_5487089660360786955/10002
>           Reduce Output Operator
>             key expressions:
>                   expr: _col0
>                   type: string
>             sort order: +
>             Map-reduce partition columns:
>                   expr: _col0
>                   type: string
>             tag: -1
>             value expressions:
>                   expr: _col1
>                   type: double
>                   expr: _col2
>                   type: double
>       Reduce Operator Tree:
>         Group By Operator
>           aggregations:
>                 expr: sum(VALUE._col0)
>                 expr: sum(VALUE._col1)
>           bucketGroup: false
>           keys:
>                 expr: KEY._col0
>                 type: string
>           mode: final
>           outputColumnNames: _col0, _col1, _col2
>           Select Operator
>             expressions:
>                   expr: _col1
>                   type: double
>                   expr: _col2
>                   type: double
>                   expr: _col0
>                   type: string
>             outputColumnNames: _col0, _col1, _col2
>             File Output Operator
>               compressed: false
>               GlobalTableId: 0
>               table:
>                   input format: org.apache.hadoop.mapred.TextInputFormat
>                   output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> ============================
>
> At 2013-01-23 11:45:13, Richard <codemon...@163.com> wrote:
> > I am wondering how Hive determines the number of map-reduce jobs for a query.
> > For example, the following query:
> >
> > select
> >   sum(c1),
> >   sum(c2),
> >   k1
> > from
> > (
> >   select transform(*) using 'mymapper' as c1, c2, k1
> >   from t1
> > ) a group by k1;
> >
> > When I run this query, it takes two map-reduce jobs, but I expected it to
> > take only one: in the map stage, use 'mymapper' as the mapper, then shuffle
> > the mapper output by k1 and perform the sum in the reducer.
> >
> > So why does Hive take two map-reduce jobs?

--
Nitin Pawar
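A note on the plan above: the aggregation appears in both stages because Stage-1 computes *partial* sums on reducers chosen by `rand()` (note `mode: partials`), and Stage-2 re-shuffles those partials by the real key and merges them (`mode: final`). This partials-then-final split over random partitions is the pattern Hive typically emits when `hive.groupby.skewindata` is enabled. The following toy Python sketch (not Hive code; the data and reducer count are made up for illustration) shows why two rounds of summing give the same answer as one:

```python
import random
from collections import defaultdict

# Toy model of Hive's two-job group-by: mode "partials", then mode "final".
# Rows are (c1, c2, k1) tuples, as produced by the transform step.
rows = [(1, 10, "a"), (2, 20, "a"), (3, 30, "b"), (4, 40, "b")]

# Stage-1: rows go to arbitrary reducers (partitioned by rand()), so each
# reducer holds only a slice of every group and emits partial sums per key.
num_reducers = 2
partitions = defaultdict(list)
for row in rows:
    partitions[random.randrange(num_reducers)].append(row)

partials = []  # (k1, partial_sum_c1, partial_sum_c2); possibly several per key
for part in partitions.values():
    acc = defaultdict(lambda: [0, 0])
    for c1, c2, k1 in part:
        acc[k1][0] += c1
        acc[k1][1] += c2
    partials.extend((k1, s[0], s[1]) for k1, s in acc.items())

# Stage-2: partials are re-shuffled by the real key k1 and merged ("final").
final = defaultdict(lambda: [0, 0])
for k1, p1, p2 in partials:
    final[k1][0] += p1
    final[k1][1] += p2

print(dict(final))  # same result as a single-stage sum(c1), sum(c2) group by k1
```

Because summation is associative, the result is identical no matter how Stage-1 scatters the rows; the benefit is that a heavily skewed key is spread across many reducers in the first job instead of overloading one.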