Most people use either the fair or capacity schedulers. If you read those links I sent earlier, you can decide which better fits your use cases.
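For reference, switching schedulers comes down to a single property in mapred-site.xml. This is a sketch based on the property and class names quoted in the thread below; the exact config file location and restart procedure vary by installation:

```xml
<!-- mapred-site.xml: switch the JobTracker from the default FIFO
     scheduler to the fair scheduler. Use
     org.apache.hadoop.mapred.CapacityTaskScheduler instead for the
     capacity scheduler. Restart the JobTracker after editing. -->
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>
```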
-Joey

Sent from my iPhone

On Mar 4, 2012, at 14:44, Mohit Anchlia <mohitanch...@gmail.com> wrote:

>
> On Sun, Mar 4, 2012 at 4:15 AM, Joey Echeverria <j...@cloudera.com> wrote:
> I misspoke in my previous e-mail. The default scheduler does do data-local
> scheduling, but it's not perfect. When using the default scheduler, tasks
> are assigned to TaskTrackers on every heartbeat. When a TaskTracker checks
> in, the JobTracker will assign any tasks that are node-local or rack-local.
> When you run a job with a single map task, it's very likely that a
> rack-local TaskTracker will become available before a node-local one does.
> This means that for jobs with a small task count, you're less likely to get
> data locality. For jobs with a task count close to or greater than the
> number of TaskTrackers, you're much more likely to get node-local
> assignments.
>
> Thanks for the clarification. It helps a lot. I am learning things every
> day. In my case my input splits are somewhere between 200 and 300. Does it
> still make sense to use the FairScheduler? What do people generally use?
>
> -Joey
>
> On Sat, Mar 3, 2012 at 10:44 PM, Mohit Anchlia <mohitanch...@gmail.com> wrote:
> > On Sat, Mar 3, 2012 at 7:41 PM, Joey Echeverria <j...@cloudera.com> wrote:
> >>
> >> Sorry, I meant: have you set the mapred.jobtracker.taskScheduler
> >> property in your mapred-site.xml file? If not, you're using the
> >> standard FIFO scheduler. The default scheduler doesn't do data-local
> >> scheduling, but the fair scheduler and capacity scheduler do. You want
> >> to set mapred.jobtracker.taskScheduler to either
> >> org.apache.hadoop.mapred.FairScheduler (for the fair scheduler) or
> >> org.apache.hadoop.mapred.CapacityTaskScheduler (for the capacity
> >> scheduler) and then restart the JobTracker.
> >> You can read about the two schedulers here:
> >>
> >> http://hadoop.apache.org/common/docs/current/fair_scheduler.html
> >> http://hadoop.apache.org/common/docs/current/capacity_scheduler.html
> >>
> >
> > I thought that, by default, tasks are scheduled on the nodes that hold
> > the data blocks. I thought that was inherent. In the fair scheduler link
> > I don't see anything about data-local scheduling.
> >
> >> -Joey
> >>
> >> On Sat, Mar 3, 2012 at 6:32 PM, Hassen Riahi <hassen.ri...@cern.ch> wrote:
> >> > The JobTracker is running on another machine (node C).
> >> >
> >> > Hassen
> >> >
> >> >
> >> >> Which scheduler are you using?
> >> >>
> >> >> -Joey
> >> >>
> >> >> On Mar 3, 2012, at 18:52, Hassen Riahi <hassen.ri...@cern.ch> wrote:
> >> >>
> >> >>> Hi all,
> >> >>>
> >> >>> We tried using MapReduce to execute a simple map job which reads a
> >> >>> txt file stored in HDFS and then writes the output.
> >> >>> The file to read is a very small one. It was not split and was
> >> >>> written entirely to a single datanode (node A). This node is also
> >> >>> configured as a tasktracker node.
> >> >>> While we were expecting the location of the map execution to be
> >> >>> node A (since the input is stored there), from the log files we see
> >> >>> that the map was executed on another tasktracker (node B) of the
> >> >>> cluster.
> >> >>> Am I missing something?
> >> >>>
> >> >>> Thanks for the help!
> >> >>> Hassen
> >> >>>
> >> >
> >>
> >>
> >> --
> >> Joseph Echeverria
> >> Cloudera, Inc.
> >> 443.305.9434
> >
>
> --
> Joseph Echeverria
> Cloudera, Inc.
> 443.305.9434
>