Regarding FIFO scheduler
Hi,

Let's assume there are two jobs, J1 (100 map tasks) and J2 (200 map tasks), and the cluster has a capacity of 150 map slots (15 nodes with 10 map slots per node), with Hadoop using the default FIFO scheduler. If I submit J1 first and then J2, will the jobs run in parallel, or does J1 have to complete before J2 starts?

I was reading 'Hadoop - The Definitive Guide' and it says: "Early versions of Hadoop had a very simple approach to scheduling users' jobs: they ran in order of submission, using a FIFO scheduler. Typically, each job would use the whole cluster, so jobs had to wait their turn."

Thanks,
Praveen
Re: Regarding FIFO scheduler
In most cases, your job will have more map tasks than map slots. You want the reducers to spin up at some point before all of your maps complete, so that the shuffle and sort can work in parallel with some of your map tasks. I usually set slow start to 80%, sometimes higher if I know the maps are slow and do a lot of filtering, so there isn't too much intermediate data.

-Joey

On Thu, Sep 22, 2011 at 6:38 AM, Praveen Sripati praveensrip...@gmail.com wrote:

Joey,

Thanks for the response. 'mapreduce.job.reduce.slowstart.completedmaps' defaults to 0.05 and is described as 'Fraction of the number of maps in the job which should be complete before reduces are scheduled for the job.' Shouldn't the map tasks be completed before the reduce tasks are kicked off for a particular job?

Praveen

On Thu, Sep 22, 2011 at 6:53 PM, Joey Echeverria j...@cloudera.com wrote:

The jobs would run in parallel, since J1 doesn't use all of your map slots. Things get more interesting with reduce slots. If J1 is an overall slower job and you haven't configured mapred.reduce.slowstart.completed.maps, then J1 could launch a bunch of idle reduce tasks, which would starve J2. In general, it's best to configure the slow start property and to use the fair scheduler or the capacity scheduler.

-Joey
--
Joseph Echeverria
Cloudera, Inc.
443.305.9434
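Joey's 80% suggestion would look something like this in mapred-site.xml (a sketch only; the property is named mapreduce.job.reduce.slowstart.completedmaps in newer MapReduce releases and mapred.reduce.slowstart.completed.maps in older ones, and both names appear in this thread):

```xml
<!-- mapred-site.xml: don't schedule reducers until 80% of maps finish -->
<property>
  <name>mapreduce.job.reduce.slowstart.completedmaps</name>
  <value>0.80</value>
</property>
```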
Re: Regarding FIFO scheduler
Thanks, got the point. So the shuffle and sort can happen in parallel even before all the map tasks are completed, but the reduce itself happens only after all the map tasks are complete.

Praveen
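The slot arithmetic behind Joey's first answer can be sketched in a few lines of toy Python (illustrative only; fifo_assign is a made-up helper, not Hadoop code):

```python
# Toy illustration of FIFO map-slot assignment for the scenario in this
# thread: 15 nodes x 10 slots = 150 map slots, J1 (100 map tasks)
# submitted before J2 (200 map tasks).

def fifo_assign(total_slots, jobs):
    """Grant slots to jobs in submission order; each job takes as many
    slots as it has pending tasks, up to however many remain."""
    assignments = {}
    free = total_slots
    for name, pending_tasks in jobs:
        granted = min(pending_tasks, free)
        assignments[name] = granted
        free -= granted
    return assignments

slots = fifo_assign(150, [("J1", 100), ("J2", 200)])
print(slots)  # {'J1': 100, 'J2': 50} -- J2 runs on the 50 leftover slots
```

So with FIFO, J1 gets every slot it asks for, and J2 still starts immediately on the 50 slots J1 leaves free, which is why the two jobs overlap.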