Most people use either the fair or capacity schedulers. If you read those links I sent earlier, you can decide which better fits your use cases.
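For reference, switching schedulers comes down to a single property in mapred-site.xml. This is a sketch based on the property and class names quoted in the thread below; the exact config file location and restart procedure vary by installation:

```xml
<!-- mapred-site.xml: switch the JobTracker from the default FIFO
     scheduler to the fair scheduler. Use
     org.apache.hadoop.mapred.CapacityTaskScheduler instead for the
     capacity scheduler. Restart the JobTracker after editing. -->
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>
```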
-Joey

Sent from my iPhone

On Mar 4, 2012, at 14:44, Mohit Anchlia <mohitanch...@gmail.com> wrote:

>
> On Sun, Mar 4, 2012 at 4:15 AM, Joey Echeverria <j...@cloudera.com> wrote:
> I misspoke in my previous e-mail. The default scheduler does do data-local
> scheduling, but it's not perfect. When using the default scheduler, tasks
> are assigned to TaskTrackers on every heartbeat. When a TaskTracker checks
> in, the JobTracker will assign any tasks that are node-local or rack-local.
> When you run a job with a single map task, it's very likely that a
> rack-local TaskTracker will become available before a node-local one does.
> This means that for jobs with a small task count, you're less likely to get
> data locality. For jobs with a task count close to or greater than the
> number of TaskTrackers, you're much more likely to get node-local
> assignments.
>
> Thanks for the clarification. It helps a lot. I am learning things every
> day. In my case my input splits are somewhere between 200 and 300. Does it
> still make sense to use the FairScheduler? What do people generally use?
>
> -Joey
>
> On Sat, Mar 3, 2012 at 10:44 PM, Mohit Anchlia <mohitanch...@gmail.com> wrote:
> > On Sat, Mar 3, 2012 at 7:41 PM, Joey Echeverria <j...@cloudera.com> wrote:
> >>
> >> Sorry, I meant: have you set the mapred.jobtracker.taskScheduler
> >> property in your mapred-site.xml file? If not, you're using the
> >> standard FIFO scheduler. The default scheduler doesn't do data-local
> >> scheduling, but the fair scheduler and capacity scheduler do. You want
> >> to set mapred.jobtracker.taskScheduler to either
> >> org.apache.hadoop.mapred.FairScheduler (for the fair scheduler) or
> >> org.apache.hadoop.mapred.CapacityTaskScheduler (for the capacity
> >> scheduler) and then restart the JobTracker.
> >> You can read about the two schedulers here:
> >>
> >> http://hadoop.apache.org/common/docs/current/fair_scheduler.html
> >> http://hadoop.apache.org/common/docs/current/capacity_scheduler.html
> >>
> >
> > I thought that, by default, tasks are scheduled on the nodes that hold
> > the data blocks. I thought that was inherent. In the fair scheduler link
> > I don't see anything about data-local scheduling.
> >
> >> -Joey
> >>
> >> On Sat, Mar 3, 2012 at 6:32 PM, Hassen Riahi <hassen.ri...@cern.ch> wrote:
> >> > The JobTracker is running on another machine (node C).
> >> >
> >> > Hassen
> >> >
> >> >
> >> >> Which scheduler are you using?
> >> >>
> >> >> -Joey
> >> >>
> >> >> On Mar 3, 2012, at 18:52, Hassen Riahi <hassen.ri...@cern.ch> wrote:
> >> >>
> >> >>> Hi all,
> >> >>>
> >> >>> We tried using MapReduce to execute a simple map job which reads a
> >> >>> txt file stored in HDFS and then writes the output.
> >> >>> The file to read is a very small one. It was not split and was
> >> >>> written entirely to a single datanode (node A). This node is also
> >> >>> configured as a tasktracker node.
> >> >>> While we were expecting the location of the map execution to be
> >> >>> node A (since the input is stored there), from the log files we see
> >> >>> that the map was executed on another tasktracker (node B) of the
> >> >>> cluster.
> >> >>> Am I missing something?
> >> >>>
> >> >>> Thanks for the help!
> >> >>> Hassen
> >> >>>
> >> >
> >>
> >>
> >> --
> >> Joseph Echeverria
> >> Cloudera, Inc.
> >> 443.305.9434
> >
>
> --
> Joseph Echeverria
> Cloudera, Inc.
> 443.305.9434
>