Ok. Thanks for the clarification. It's to run an HBase job, so it will be one node restriction for me.

JM

2012/12/8, Harsh J <ha...@cloudera.com>:
> In case of HBase, the locality is bound to be restricted to one node
> (the node hosting the region asked for). Otherwise, replication
> affects locality (N options).
>
> On Sat, Dec 8, 2012 at 11:27 PM, Jean-Marc Spaggiari
> <jean-m...@spaggiari.org> wrote:
>> Hi Harsh,
>>
>> Thanks for your help.
>>
>> mapred.fairscheduler.locality.delay seems to be working very well for
>> me. I have set it to 60s and JobInProgress picked up only "Choosing
>> data-local task"... It seems to do the job for my use case. And as you
>> are saying, if I lose a node while the job is running, the task
>> will still run after 60 seconds on another node.
>>
>> I have not yet looked at CapacityScheduler, but will most probably later.
>>
>> One last thing. I have a replication factor set to 3. Does it mean 3
>> TaskTrackers might be able to take any of the tasks and run them
>> locally? Or only 1?
>>
>> Thanks,
>>
>> JM
>>
>> 2012/12/8, Harsh J <ha...@cloudera.com>:
>>> Answer depends on a couple of features to be present in your version
>>> of Hadoop, and is inline.
>>>
>>> On Fri, Dec 7, 2012 at 11:38 PM, Jean-Marc Spaggiari
>>> <jean-m...@spaggiari.org> wrote:
>>>> Hi,
>>>>
>>>> Is there a way to force the tasks from a MR job to run ONLY on the
>>>> taskservers where the input split location is?
>>>
>>> There is no severely strict way to do this, but there are
>>> improvements you could make to configuration to make conditions more
>>> favorable to have data-local tasks.
>>>
>>>> I mean, on the taskdetails UI, I can see all my tasks (25), and some
>>>> of them have Machine == Input split Location. But some don't.
>>>
>>> It is sometimes normal to see non-data-local tasks among mostly
>>> data-local tasks in MR - this is due to availability of
>>> slots/resources during job scheduling.
>>>
>>>> So I'm wondering if there is a way to force hadoop to run those tasks
>>>> "locally" or else discard them and wait for the local server to be
>>>> able to run them?
>>>
>>> You need a good scheduler that can address your needs.
>>>
>>> For FairScheduler, in 1.x or so, you can utilize
>>> mapred.fairscheduler.locality.delay, set in milliseconds in your
>>> mapred-site.xml, to indicate the maximum period of wait for a task to
>>> get scheduled with demanded locality. Ideally you'd want to set this
>>> to a period slightly greater than the average time between two
>>> heartbeats from a single tasktracker to the jobtracker. The 2.x one
>>> does it automatically, seems like.
>>>
>>> For CapacityScheduler, there isn't any form of delay factor in 1.x
>>> releases. In 2.x however, CapacityScheduler has the
>>> yarn.scheduler.capacity.node-locality-delay config property that can
>>> be set for a similar effect.
>>>
>>> Note that MR does not do absolutely strict scheduling for many
>>> reasons, one of them being to counter failure or unavailability of
>>> the target node for an assumed infinite period. Most users would not
>>> prefer their tasks to hang in wait forever due to any of such
>>> situations, and a few non-data-local tasks in the job don't hurt the
>>> overall execution time too much.
>>>
>>> --
>>> Harsh J
>>>
>
> --
> Harsh J
>
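
For reference, the FairScheduler setting discussed in this thread might look like the fragment below in mapred-site.xml on a Hadoop 1.x JobTracker. This is only a sketch under assumptions: it presumes FairScheduler is already the configured scheduler, and the value 60000 is simply the 60 s mentioned above expressed in milliseconds, not a recommended default. As noted in the thread, a value slightly above the average TaskTracker heartbeat interval is the usual starting point.

```xml
<!-- mapred-site.xml (Hadoop 1.x, FairScheduler assumed to be enabled) -->
<property>
  <name>mapred.fairscheduler.locality.delay</name>
  <!-- Maximum wait, in milliseconds, for a data-local slot before the
       task may be scheduled with weaker locality. 60000 = 60 s,
       the value used in this thread. -->
  <value>60000</value>
</property>
```

With a delay like this, a task whose preferred node is unavailable should still be scheduled on another node once the wait expires, which matches the behaviour described above.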