Yes, it's in my hive-default.xml, and Hive decided to use only one reducer, so I thought increasing it to 5 might help, which it didn't. Anyway, scanning the largest table 6 times isn't efficient, hence my question.
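One thing I'm considering is map joins for the small code/dimension tables, so those joins run map-side without a reduce phase. This is just a sketch, assuming osor_code and ocrm_vt_area are small enough to fit in memory (the table and column names below are from my query; whether the auto-conversion kicks in depends on hive.mapjoin.smalltable.filesize):

```sql
-- Let Hive convert joins against small tables to map joins automatically
set hive.auto.convert.join=true;

-- Or hint it explicitly (older Hive versions); shown here for two of the
-- six joins only, as an illustration:
SELECT /*+ MAPJOIN(d, e) */ a.*, e.code_name AS is_internet_flg
FROM prc_idap_pi_root a
LEFT OUTER JOIN ocrm_vt_area d ON a.dev_area_id = d.area_id
LEFT OUTER JOIN osor_code e ON a.data_internet_flg = e.code_val
                           AND e.code_tp = 'IS_INTERNET_FLG';
```

If that works, the large table would be scanned once on the map side instead of being shuffled for each join, though I haven't confirmed it on this cluster.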
On Wed, Mar 14, 2012 at 12:37 AM, Jagat <jagatsi...@gmail.com> wrote:
>
> Hello Weidong Bian
>
> Did you see the following configuration properties in the conf directory?
>
> <property>
>   <name>mapred.reduce.tasks</name>
>   <value>-1</value>
>   <description>The default number of reduce tasks per job. Typically set
>   to a prime close to the number of available hosts. Ignored when
>   mapred.job.tracker is "local". Hadoop sets this to 1 by default, whereas
>   Hive uses -1 as its default value. By setting this property to -1, Hive
>   will automatically figure out the number of reducers.
>   </description>
> </property>
>
> <property>
>   <name>hive.exec.reducers.max</name>
>   <value>999</value>
>   <description>The maximum number of reducers that will be used. If the
>   value specified in the configuration parameter mapred.reduce.tasks is
>   negative, Hive will use this as the maximum number of reducers when
>   automatically determining the number of reducers.</description>
> </property>
>
> Thanks and Regards
>
> Jagat
>
>
> On Tue, Mar 13, 2012 at 9:54 PM, Bruce Bian <weidong....@gmail.com> wrote:
>>
>> Hi there,
>> When I run the following query in Hive, 6 map/reduce jobs are launched,
>> one for each join, and it processes ~460M of data in ~950 seconds, which
>> I think is way too slow for a cluster with 5 slaves, each with 24GB of
>> memory and 12 disks.
>> set mapred.reduce.tasks=5;
>> SELECT a.*, e.code_name as is_internet_flg, f.code_name as wb_access_tp_desc, g.code_name as free_tp_desc,
>>        b.acnt_no, b.addr_id, b.postcode, b.acnt_rmnd_tp, b.print_tp, b.media_type,
>>        c.cust_code, c.root_cust_code,
>>        d.mdf_name, d.sub_bureau_code, d.bureau_cd, d.adm_sub_bureau_name, d.bureau_name
>> FROM prc_idap_pi_root a
>> LEFT OUTER JOIN idap_pi_root_acnt b ON a.acnt_id=b.acnt_id
>> LEFT OUTER JOIN idap_pi_root_cust c ON a.cust_id=c.cust_id
>> LEFT OUTER JOIN ocrm_vt_area d ON a.dev_area_id=d.area_id
>> LEFT OUTER JOIN osor_code e ON a.data_internet_flg=e.code_val and e.code_tp='IS_INTERNET_FLG'
>> LEFT OUTER JOIN osor_code f ON a.wb_access_tp=f.code_val and f.code_tp='WEB_ACCESS_TP'
>> LEFT OUTER JOIN osor_code g ON a.free_tp=g.code_val and g.code_tp='FREE_TP';
>>
>> For each job, most of the time is consumed by the reduce phase. As
>> idap_pi_root is very large, scanning it 6 times is quite inefficient.
>> Is it possible to reduce this to a single map/reduce job?
>> Thanks,
>> Weidong Bian