Hi Andre,

Try setting yarn.scheduler.capacity.node-locality-delay to a number between 0 and 1. This will turn on delay scheduling. Here's the doc on how this works:

For applications that request containers on particular nodes, the number of scheduling opportunities since the last container assignment to wait before accepting a placement on another node. Expressed as a float between 0 and 1, which, as a fraction of the cluster size, is the number of scheduling opportunities to pass up. The default value of -1.0 means don't pass up any scheduling opportunities.

-Sandy

On Thu, Oct 3, 2013 at 9:57 AM, André Hacker <andrephac...@gmail.com> wrote:
> Hi,
>
> I have a 25 node cluster, running hadoop 2.1.0-beta, with capacity
> scheduler (default settings for scheduler) and replication factor 3.
>
> I have exclusive access to the cluster to run a benchmark job and I wonder
> why there are so few data-local and so many rack-local maps.
>
> The input format calculates 44 input splits and 44 map tasks, however, it
> seems to be random how many of them are processed data locally. Here the
> counters of my last tries:
>
> data-local / rack-local:
> Test 1: data-local:15 rack-local: 29
> Test 2: data-local:18 rack-local: 26
>
> I don't understand why there is not always 100% data local. This should
> not be a problem since the blocks of my input file are distributed over all
> nodes.
>
> Maybe someone can give me a hint.
>
> Thanks,
> André Hacker, TU Berlin
>
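
For reference, the delay-scheduling setting discussed in the reply could be written roughly as below. This is only a sketch, and the values are assumptions chosen for the 25-node cluster in the question, not tested recommendations. One caveat worth checking against the docs for your Hadoop version: the "float between 0 and 1" wording quoted above appears to match the Fair Scheduler property yarn.scheduler.fair.locality.threshold.node, while the Capacity Scheduler documentation describes yarn.scheduler.capacity.node-locality-delay as an integer count of missed scheduling opportunities, typically set to about the number of nodes.

```xml
<!-- capacity-scheduler.xml (Capacity Scheduler) -->
<!-- Assumption: the value is an integer number of missed scheduling
     opportunities; 25 is illustrative, taken from the cluster size
     mentioned in the question. -->
<property>
  <name>yarn.scheduler.capacity.node-locality-delay</name>
  <value>25</value>
</property>

<!-- yarn-site.xml (only relevant if the Fair Scheduler is in use) -->
<!-- Assumption: 0.5 is a placeholder fraction of the cluster size,
     per the documentation text quoted in the reply. -->
<property>
  <name>yarn.scheduler.fair.locality.threshold.node</name>
  <value>0.5</value>
</property>
```

Only the fragment for the scheduler actually configured on the ResourceManager applies; the other one would be ignored.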