Re: Reduce the number of map/reduce jobs during join

2012-03-13 Thread Jagat
Hello Weidong Bian

Did you see the following configuration properties in the conf directory?


<property>
  <name>mapred.reduce.tasks</name>
  <value>-1</value>
  <description>The default number of reduce tasks per job. Typically set
  to a prime close to the number of available hosts. Ignored when
  mapred.job.tracker is "local". Hadoop sets this to 1 by default, whereas
  Hive uses -1 as its default value. By setting this property to -1,
  Hive will automatically figure out the number of reducers.
  </description>
</property>


<property>
  <name>hive.exec.reducers.max</name>
  <value>999</value>
  <description>The maximum number of reducers that will be used. If the value
  specified in the configuration parameter mapred.reduce.tasks is
  negative, Hive will use this as the maximum number of reducers when
  automatically determining the number of reducers.</description>
</property>
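(For reference, these properties can also be overridden per session from the Hive CLI instead of editing hive-default.xml. A sketch; the specific values below are illustrative, not recommendations:)

```sql
-- Let Hive estimate the reducer count itself (its default behavior),
-- capped by hive.exec.reducers.max:
set mapred.reduce.tasks=-1;
set hive.exec.reducers.max=999;

-- Hive's estimate is roughly (total input size / bytes per reducer),
-- so lowering this value launches more reducers for the same input:
set hive.exec.reducers.bytes.per.reducer=256000000;
```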

Thanks and Regards

Jagat


On Tue, Mar 13, 2012 at 9:54 PM, Bruce Bian weidong@gmail.com wrote:

 Hi there,
 when I run the following query in Hive, 6 map/reduce jobs are
 launched, one for each join, and it processes ~460MB of data in ~950 seconds,
 which I think is way too slow for a cluster with 5 slaves, each with 24GB
 memory and 12 disks.

 set mapred.reduce.tasks=5;
 SELECT a.*,e.code_name as is_internet_flg, f.code_name as
 wb_access_tp_desc, g.code_name as free_tp_desc,
 b.acnt_no,b.addr_id,b.postcode,b.acnt_rmnd_tp,b.print_tp,b.media_type,
 c.cust_code,c.root_cust_code,

 d.mdf_name,d.sub_bureau_code,d.bureau_cd,d.adm_sub_bureau_name,d.bureau_name
 FROM prc_idap_pi_root a
  LEFT OUTER JOIN idap_pi_root_acnt b ON a.acnt_id=b.acnt_id
  LEFT OUTER JOIN idap_pi_root_cust c ON a.cust_id=c.cust_id
  LEFT OUTER JOIN ocrm_vt_area d ON a.dev_area_id=d.area_id
  LEFT OUTER JOIN osor_code e ON a.data_internet_flg=e.code_val and
 e.code_tp='IS_INTERNET_FLG'
  LEFT OUTER JOIN osor_code f ON a.wb_access_tp=f.code_val and
 f.code_tp='WEB_ACCESS_TP'
  LEFT OUTER JOIN osor_code g ON a.free_tp=g.code_val and
 g.code_tp='FREE_TP';

 For each job, most of the time is consumed by the reduce phase. As
 idap_pi_root is very large, scanning it 6 times is quite
 inefficient. Is it possible to reduce the map/reduce jobs to only one?

 Thanks,
 Weidong Bian



Reduce the number of map/reduce jobs during join

2012-03-13 Thread Bruce Bian
Yes, it's in my hive-default.xml, and Hive figured out that it should use
only one reducer, so I thought increasing it to 5 might help, which it didn't.
Anyway, scanning the largest table 6 times isn't efficient, hence my question.
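(One way to cut down the reduce phases, assuming the four lookup tables in this query, ocrm_vt_area and the three osor_code joins, are small enough to fit in memory: a MAPJOIN hint replicates the small tables to every mapper, so those joins run map-side with no shuffle or reduce step. A sketch only, not tested against this schema, and abbreviated to the joins that would benefit:)

```sql
SELECT /*+ MAPJOIN(d, e, f, g) */
  a.*, e.code_name AS is_internet_flg,
  f.code_name AS wb_access_tp_desc, g.code_name AS free_tp_desc,
  d.mdf_name, d.bureau_name
FROM prc_idap_pi_root a
  LEFT OUTER JOIN ocrm_vt_area d ON a.dev_area_id = d.area_id
  LEFT OUTER JOIN osor_code e ON a.data_internet_flg = e.code_val
    AND e.code_tp = 'IS_INTERNET_FLG'
  LEFT OUTER JOIN osor_code f ON a.wb_access_tp = f.code_val
    AND f.code_tp = 'WEB_ACCESS_TP'
  LEFT OUTER JOIN osor_code g ON a.free_tp = g.code_val
    AND g.code_tp = 'FREE_TP';
```

Alternatively, setting hive.auto.convert.join=true lets Hive decide map-join conversion automatically based on table sizes.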

On Wed, Mar 14, 2012 at 12:37 AM, Jagat jagatsi...@gmail.com wrote:


Re: Reduce the number of map/reduce jobs during join

2012-03-13 Thread shule ney
Do the joins share the same key?
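(For context: Hive's optimizer merges consecutive joins into a single map/reduce job only when they join on the same key. A hypothetical sketch with illustrative table names:)

```sql
-- Both joins use a.key, so Hive compiles them into ONE map/reduce job:
SELECT a.val, b.val, c.val
FROM a
  JOIN b ON a.key = b.key
  JOIN c ON a.key = c.key;

-- Here the second join uses a different key (b.key2), so Hive
-- needs TWO map/reduce jobs, one per distinct join key:
SELECT a.val, b.val, c.val
FROM a
  JOIN b ON a.key = b.key
  JOIN c ON b.key2 = c.key;
```

In the query under discussion, the six joins each use a different column of the driving table, which is why Hive cannot merge them into fewer jobs.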

2012/3/13 Bruce Bian weidong@gmail.com