Re: Reduce the number of map/reduce jobs during join
Do the joins share the same key?

2012/3/13 Bruce Bian:

> Yes, it's in my hive-default.xml, and Hive figured out that it should use
> only one reducer, so I thought increasing it to 5 might help, which it
> didn't. Anyway, scanning the largest table 6 times isn't efficient, hence
> my question.
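For context on why that question matters: Hive merges consecutive joins into a single map/reduce job only when every join clause uses the same join key; joins on different keys each get their own job. A minimal sketch with hypothetical tables:

```sql
-- Both joins use a.key, so Hive compiles this into ONE map/reduce job:
SELECT a.val, b.val, c.val
FROM a
JOIN b ON a.key = b.key
JOIN c ON a.key = c.key;

-- These joins use different keys (a.key1 vs. a.key2), so Hive needs
-- TWO map/reduce jobs, one per distinct join key:
SELECT a.val, b.val, c.val
FROM a
JOIN b ON a.key1 = b.key
JOIN c ON a.key2 = c.key;
```

The query in this thread joins on six different keys of the fact table, which is why six jobs are launched.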
Re: Reduce the number of map/reduce jobs during join
Yes, it's in my hive-default.xml, and Hive figured out that it should use
only one reducer, so I thought increasing it to 5 might help, which it
didn't. Anyway, scanning the largest table 6 times isn't efficient, hence
my question.

On Wed, Mar 14, 2012 at 12:37 AM, Jagat wrote:

> Hello Weidong Bian
>
> Did you see the following configuration properties in the conf directory?
>
> <property>
>   <name>mapred.reduce.tasks</name>
>   <value>-1</value>
>   <description>The default number of reduce tasks per job. Typically set
>   to a prime close to the number of available hosts. Ignored when
>   mapred.job.tracker is "local". Hadoop sets this to 1 by default,
>   whereas Hive uses -1 as its default value. By setting this property to
>   -1, Hive will automatically figure out what the number of reducers
>   should be.</description>
> </property>
>
> <property>
>   <name>hive.exec.reducers.max</name>
>   <value>999</value>
>   <description>Max number of reducers that will be used. If the one
>   specified in the configuration parameter mapred.reduce.tasks is
>   negative, Hive will use this as the max number of reducers when
>   automatically determining the number of reducers.</description>
> </property>
>
> Thanks and Regards
>
> Jagat
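One approach worth noting here (a sketch, not something proposed in this thread): if the lookup tables (the three osor_code joins, and possibly ocrm_vt_area) are small enough to fit in memory, map-side joins remove the reduce phase entirely, so the large table is streamed through the mappers instead of being shuffled six times:

```sql
-- Let Hive convert joins against small tables into map-side joins
-- (each small table is loaded into memory on every mapper):
set hive.auto.convert.join=true;

-- Or hint it explicitly; MAPJOIN is only appropriate when the listed
-- tables fit in memory. Shown for one of the six joins:
SELECT /*+ MAPJOIN(e) */ a.*, e.code_name AS is_internet_flg
FROM prc_idap_pi_root a
LEFT OUTER JOIN osor_code e
  ON a.data_internet_flg = e.code_val AND e.code_tp = 'IS_INTERNET_FLG';
```

Whether this helps depends on the actual sizes of the dimension tables, which the thread does not state.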
Re: Reduce the number of map/reduce jobs during join
Hello Weidong Bian

Did you see the following configuration properties in the conf directory?

<property>
  <name>mapred.reduce.tasks</name>
  <value>-1</value>
  <description>The default number of reduce tasks per job. Typically set
  to a prime close to the number of available hosts. Ignored when
  mapred.job.tracker is "local". Hadoop sets this to 1 by default, whereas
  Hive uses -1 as its default value. By setting this property to -1, Hive
  will automatically figure out what the number of reducers should
  be.</description>
</property>

<property>
  <name>hive.exec.reducers.max</name>
  <value>999</value>
  <description>Max number of reducers that will be used. If the one
  specified in the configuration parameter mapred.reduce.tasks is
  negative, Hive will use this as the max number of reducers when
  automatically determining the number of reducers.</description>
</property>

Thanks and Regards

Jagat

On Tue, Mar 13, 2012 at 9:54 PM, Bruce Bian wrote:

> Hi there,
> When I use Hive to run the following query, 6 map/reduce jobs are
> launched, one for each join, and it processes ~460M of data in ~950
> seconds, which I think is way too slow for a cluster with 5 slaves and
> 24GB memory / 12 disks each.
>
> set mapred.reduce.tasks=5;
> SELECT a.*, e.code_name AS is_internet_flg, f.code_name AS wb_access_tp_desc,
>        g.code_name AS free_tp_desc,
>        b.acnt_no, b.addr_id, b.postcode, b.acnt_rmnd_tp, b.print_tp, b.media_type,
>        c.cust_code, c.root_cust_code,
>        d.mdf_name, d.sub_bureau_code, d.bureau_cd, d.adm_sub_bureau_name, d.bureau_name
> FROM prc_idap_pi_root a
> LEFT OUTER JOIN idap_pi_root_acnt b ON a.acnt_id = b.acnt_id
> LEFT OUTER JOIN idap_pi_root_cust c ON a.cust_id = c.cust_id
> LEFT OUTER JOIN ocrm_vt_area d ON a.dev_area_id = d.area_id
> LEFT OUTER JOIN osor_code e ON a.data_internet_flg = e.code_val AND e.code_tp = 'IS_INTERNET_FLG'
> LEFT OUTER JOIN osor_code f ON a.wb_access_tp = f.code_val AND f.code_tp = 'WEB_ACCESS_TP'
> LEFT OUTER JOIN osor_code g ON a.free_tp = g.code_val AND g.code_tp = 'FREE_TP';
>
> For each job, most of the time is consumed by the reduce phase. As
> idap_pi_root is very large, scanning it 6 times is quite inefficient.
> Is it possible to reduce the map/reduce jobs to only one?
>
> Thanks,
> Weidong Bian
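The interplay between the two properties Jagat quotes can be summarized as a session-level sketch (values are the defaults quoted above):

```sql
-- With a negative value, Hive estimates the reducer count per job
-- from the job's input size instead of using a fixed number:
set mapred.reduce.tasks=-1;

-- The estimate is roughly input_bytes / hive.exec.reducers.bytes.per.reducer,
-- capped at hive.exec.reducers.max:
set hive.exec.reducers.max=999;
```

Note this only tunes how many reducers each job gets; it does not change the number of jobs, which is determined by the join keys in the query plan.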