Yes, it's in my hive-default.xml, and Hive decided to use only one reducer, so I thought increasing it to 5 might help, which it didn't. Anyway, scanning the largest table 6 times isn't efficient, hence my question.
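One thing I'm considering is map joins for the small code/dimension tables, so those joins run map-side without a reduce phase. This is just a sketch, assuming osor_code and ocrm_vt_area are small enough to fit in memory (the table and column names below are from my query; whether the auto-conversion kicks in depends on hive.mapjoin.smalltable.filesize):

```sql
-- Let Hive convert joins against small tables to map joins automatically
set hive.auto.convert.join=true;

-- Or hint it explicitly (older Hive versions); shown here for two of the
-- six joins only, as an illustration:
SELECT /*+ MAPJOIN(d, e) */ a.*, e.code_name AS is_internet_flg
FROM prc_idap_pi_root a
LEFT OUTER JOIN ocrm_vt_area d ON a.dev_area_id = d.area_id
LEFT OUTER JOIN osor_code e ON a.data_internet_flg = e.code_val
                           AND e.code_tp = 'IS_INTERNET_FLG';
```

If that works, the large table would be scanned once on the map side instead of being shuffled for each join, though I haven't confirmed it on this cluster.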
On Wed, Mar 14, 2012 at 12:37 AM, Jagat <jagatsi...@gmail.com> wrote:
>
> Hello Weidong Bian
>
> Did you see the following configuration properties in the conf directory?
>
> <property>
>   <name>mapred.reduce.tasks</name>
>   <value>-1</value>
>   <description>The default number of reduce tasks per job. Typically set
>   to a prime close to the number of available hosts. Ignored when
>   mapred.job.tracker is "local". Hadoop sets this to 1 by default, whereas
>   Hive uses -1 as its default value. By setting this property to -1, Hive
>   will automatically figure out the number of reducers.
>   </description>
> </property>
>
> <property>
>   <name>hive.exec.reducers.max</name>
>   <value>999</value>
>   <description>The maximum number of reducers that will be used. If the
>   value specified in the configuration parameter mapred.reduce.tasks is
>   negative, Hive will use this as the maximum number of reducers when
>   automatically determining the number of reducers.</description>
> </property>
>
> Thanks and Regards
>
> Jagat
>
>
> On Tue, Mar 13, 2012 at 9:54 PM, Bruce Bian <weidong....@gmail.com> wrote:
>>
>> Hi there,
>> When I run the following query in Hive, 6 map/reduce jobs are launched,
>> one for each join, and it processes ~460M of data in ~950 seconds, which
>> I think is way too slow for a cluster with 5 slaves, each with 24GB of
>> memory and 12 disks.
>> set mapred.reduce.tasks=5;
>> SELECT a.*, e.code_name as is_internet_flg, f.code_name as wb_access_tp_desc, g.code_name as free_tp_desc,
>>        b.acnt_no, b.addr_id, b.postcode, b.acnt_rmnd_tp, b.print_tp, b.media_type,
>>        c.cust_code, c.root_cust_code,
>>        d.mdf_name, d.sub_bureau_code, d.bureau_cd, d.adm_sub_bureau_name, d.bureau_name
>> FROM prc_idap_pi_root a
>> LEFT OUTER JOIN idap_pi_root_acnt b ON a.acnt_id=b.acnt_id
>> LEFT OUTER JOIN idap_pi_root_cust c ON a.cust_id=c.cust_id
>> LEFT OUTER JOIN ocrm_vt_area d ON a.dev_area_id=d.area_id
>> LEFT OUTER JOIN osor_code e ON a.data_internet_flg=e.code_val and e.code_tp='IS_INTERNET_FLG'
>> LEFT OUTER JOIN osor_code f ON a.wb_access_tp=f.code_val and f.code_tp='WEB_ACCESS_TP'
>> LEFT OUTER JOIN osor_code g ON a.free_tp=g.code_val and g.code_tp='FREE_TP';
>>
>> For each job, most of the time is consumed by the reduce phase. As
>> idap_pi_root is very large, scanning it 6 times is quite inefficient.
>> Is it possible to reduce this to a single map/reduce job?
>> Thanks,
>> Weidong Bian