Regarding FIFO scheduler

2011-09-22 Thread Praveen Sripati
Hi,

Let's assume there are two jobs, J1 (100 map tasks) and J2 (200 map
tasks), and that the cluster has a capacity of 150 map tasks (15 nodes with 10 map
tasks per node), with Hadoop using the default FIFO scheduler. If I submit
J1 first and then J2, will the jobs run in parallel, or does J1 have to
complete before J2 starts?
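
A quick back-of-the-envelope sketch of the numbers (my own illustration of how FIFO fills slots in submission order, not actual Hadoop code):

```python
# Hypothetical sketch of FIFO map-slot assignment: the scheduler satisfies
# the earliest-submitted job's demand first, then hands leftover slots to
# the next job in the queue.

def fifo_assign(capacity, jobs):
    """jobs: list of (name, pending_tasks) in submission order."""
    assignment = {}
    for name, pending in jobs:
        granted = min(pending, capacity)  # earlier jobs take what they need
        assignment[name] = granted
        capacity -= granted               # later jobs get the remainder
    return assignment

# 15 nodes x 10 map slots = 150 slots; J1 needs 100, J2 needs 200.
print(fifo_assign(150, [("J1", 100), ("J2", 200)]))
# J1 runs all 100 of its tasks at once; J2 runs 50 at a time until J1 finishes.
```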

I was reading 'Hadoop - The Definitive Guide' and it says: "Early versions
of Hadoop had a very simple approach to scheduling users' jobs: they ran in
order of submission, using a FIFO scheduler. Typically, each job would use
the whole cluster, so jobs had to wait their turn."

Thanks,
Praveen


Re: Regarding FIFO scheduler

2011-09-22 Thread Joey Echeverria
In most cases, your job will have more map tasks than map slots. You
want the reducers to spin up at some point before all your maps
complete, so that the shuffle and sort can work in parallel with some
of your map tasks. I usually set slow start to 80%, sometimes higher
if I know the maps are slow and they do a lot of filtering, so there
isn't too much intermediate data.
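
For reference, a slow-start threshold like the 80% mentioned here can be set in the job configuration. A minimal sketch, assuming the newer property name (older releases spell it mapred.reduce.slowstart.completed.maps); 0.80 is just the example value:

```xml
<!-- mapred-site.xml or per-job configuration -->
<property>
  <name>mapreduce.job.reduce.slowstart.completedmaps</name>
  <value>0.80</value>
  <description>Launch reducers once 80% of the job's maps have completed.</description>
</property>
```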

-Joey

On Thu, Sep 22, 2011 at 6:38 AM, Praveen Sripati
praveensrip...@gmail.com wrote:
 Joey,

 Thanks for the response.

 'mapreduce.job.reduce.slowstart.completedmaps' defaults to 0.05 and is
 described as 'Fraction of the number of maps in the job which should be complete
 before reduces are scheduled for the job.'

 Shouldn't the map tasks be completed before the reduce tasks are kicked off
 for a particular job?

 Praveen

 On Thu, Sep 22, 2011 at 6:53 PM, Joey Echeverria j...@cloudera.com wrote:

 The jobs would run in parallel since J1 doesn't use all of your map
 slots. Things get more interesting with reduce slots. If J1 is an
 overall slower job, and you haven't configured
 mapred.reduce.slowstart.completed.maps, then J1 could launch a bunch
 of idle reduce tasks which would starve J2.

 In general, it's best to configure the slow start property and to use
 the fair scheduler or capacity scheduler.

 -Joey


-- 
Joseph Echeverria
Cloudera, Inc.
443.305.9434


Re: Regarding FIFO scheduler

2011-09-22 Thread Praveen Sripati
Thanks, got the point. So the shuffle and sort can happen in parallel even
before all the map tasks have completed, but the reduce phase itself runs only
after all the map tasks are complete.
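
A toy sketch of the two rules being summarized (my own illustration, not Hadoop source code): reducers may be scheduled, and begin shuffling, once the completed-map fraction reaches the slow-start threshold, but the reduce() calls begin only when every map has finished.

```python
# Toy model of reduce slow start (illustrative only, not Hadoop source code).

def can_schedule_reducers(completed_maps, total_maps, slowstart=0.05):
    """Reducers are launched (and begin shuffling map output)
    once this fraction of the job's maps has completed."""
    return completed_maps / total_maps >= slowstart

def can_run_reduce_phase(completed_maps, total_maps):
    """The reduce() function itself runs only after every map is done."""
    return completed_maps == total_maps

# With the 0.05 default, 5 of 100 finished maps is enough to start shuffling...
assert can_schedule_reducers(5, 100)
# ...but reduce() still waits for all 100 maps.
assert not can_run_reduce_phase(99, 100)
assert can_run_reduce_phase(100, 100)
```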

Praveen

On Thu, Sep 22, 2011 at 7:13 PM, Joey Echeverria j...@cloudera.com wrote:

 In most cases, your job will have more map tasks than map slots. You
 want the reducers to spin up at some point before all your maps
 complete, so that the shuffle and sort can work in parallel with some
 of your map tasks. I usually set slow start to 80%, sometimes higher
 if I know the maps are slow and they do a lot of filtering, so there
 isn't too much intermediate data.

 -Joey
