Re: Reduce the number of map/reduce jobs during join

2012-03-13 Thread shule ney
Do the joins share the same key?
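
This question matters because of a general Hive planning rule (not spelled out in the thread): Hive merges a chain of joins into a single map/reduce job only when every join condition uses the same join key; joins on different keys each need their own shuffle stage. A minimal sketch with hypothetical tables t1..t3:

```sql
-- Hypothetical tables; all joins share a.key, so Hive can plan
-- this as ONE map/reduce job instead of two.
SELECT a.id, b.val, c.val
FROM t1 a
  JOIN t2 b ON a.key = b.key
  JOIN t3 c ON a.key = c.key;
```

In the query under discussion, each LEFT OUTER JOIN uses a different column of table `a`, which is why one job is launched per join.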

2012/3/13 Bruce Bian 

> Yes, it's in my hive-default.xml and Hive figured to use one reducer only,
> so I thought increasing it to 5 might help, which it doesn't.
> Anyway, scanning the largest table 6 times isn't efficient, hence my
> question.
> [earlier quoted messages trimmed; they appear in full below in this thread]


Reduce the number of map/reduce jobs during join

2012-03-13 Thread Bruce Bian
Yes, it's in my hive-default.xml and Hive figured to use one reducer only,
so I thought increasing it to 5 might help, which it doesn't.
Anyway, scanning the largest table 6 times isn't efficient, hence my question.

On Wed, Mar 14, 2012 at 12:37 AM, Jagat  wrote:
> [quoted message trimmed; Jagat's full message appears below in this thread]


Re: Reduce the number of map/reduce jobs during join

2012-03-13 Thread Jagat
Hello Weidong Bian

Did you see the following configuration properties in the conf directory?



<property>
  <name>mapred.reduce.tasks</name>
  <value>-1</value>
  <description>The default number of reduce tasks per job.  Typically set
  to a prime close to the number of available hosts.  Ignored when
  mapred.job.tracker is "local". Hadoop set this to 1 by default, whereas
  hive uses -1 as its default value.
  By setting this property to -1, Hive will automatically figure out what
  should be the number of reducers.
  </description>
</property>



<property>
  <name>hive.exec.reducers.max</name>
  <value>999</value>
  <description>max number of reducers will be used. If the one
  specified in the configuration parameter mapred.reduce.tasks is
  negative, hive will use this one as the max number of reducers when
  automatically determine number of reducers.
  </description>
</property>

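The interplay of the two properties can be tried from the Hive CLI; the values below are illustrative, not recommendations, and `some_table` is a placeholder name:

```sql
-- Illustrative only: let Hive infer the reducer count, but cap it.
SET mapred.reduce.tasks=-1;       -- -1 = Hive estimates reducers per job
SET hive.exec.reducers.max=64;    -- upper bound on the estimated count
SELECT COUNT(*) FROM some_table;  -- any query that triggers a reduce stage
```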
Thanks and Regards

Jagat


On Tue, Mar 13, 2012 at 9:54 PM, Bruce Bian  wrote:

> Hi there,
> when I'm using Hive to do a query as follows, 6 Map/Reduce jobs are
> launched, one for each join, and it deals with ~460M of data in ~950 seconds,
> which I think is way too slow for a cluster with 5 slaves and 24GB of
> memory/12 disks each.
>
> set mapred.reduce.tasks=5;
> SELECT a.*,e.code_name as is_internet_flg, f.code_name as
> wb_access_tp_desc, g.code_name as free_tp_desc,
> b.acnt_no,b.addr_id,b.postcode,b.acnt_rmnd_tp,b.print_tp,b.media_type,
> c.cust_code,c.root_cust_code,
>
> d.mdf_name,d.sub_bureau_code,d.bureau_cd,d.adm_sub_bureau_name,d.bureau_name
> FROM prc_idap_pi_root a
>  LEFT OUTER JOIN idap_pi_root_acnt b ON a.acnt_id=b.acnt_id
>  LEFT OUTER JOIN idap_pi_root_cust c ON a.cust_id=c.cust_id
>  LEFT OUTER JOIN ocrm_vt_area d ON a.dev_area_id=d.area_id
>  LEFT OUTER JOIN osor_code e ON a.data_internet_flg=e.code_val and
> e.code_tp='IS_INTERNET_FLG'
>  LEFT OUTER JOIN osor_code f ON a.wb_access_tp=f.code_val and
> f.code_tp='WEB_ACCESS_TP'
>  LEFT OUTER JOIN osor_code g ON a.free_tp=g.code_val and
> g.code_tp='FREE_TP';
>
> For each job, most of the time is consumed by the reduce phase. As
> idap_pi_root is very large, scanning it 6 times is quite
> inefficient. Is it possible to reduce the number of map/reduce jobs to only one?
>
> Thanks,
> Weidong Bian
>
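
One direction suggested by the query's shape (a sketch, not a fix confirmed in the thread): the lookups cannot collapse into one shuffle because each join uses a different key, but if the dimension tables (`osor_code`, `ocrm_vt_area`) are small enough to fit in memory, Hive's map-side joins avoid launching a reduce job for them at all. Both forms below use standard Hive options of that era; whether the tables actually fit in memory is an assumption:

```sql
-- Let Hive convert eligible joins to map-side joins automatically
-- (assumes the small side of each join fits in memory):
SET hive.auto.convert.join=true;

-- Equivalent explicit form for a single join, using the MAPJOIN hint:
SELECT /*+ MAPJOIN(e) */ a.*, e.code_name AS is_internet_flg
FROM prc_idap_pi_root a
  LEFT OUTER JOIN osor_code e
    ON a.data_internet_flg = e.code_val AND e.code_tp = 'IS_INTERNET_FLG';
```

With map joins, the large table is streamed through map-only stages, so the per-join reduce phases that dominate the runtime here disappear.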