Hi Brad,
YARN scheduling does take data locality into account. In YARN, tasks are not
assigned based on capacity alone. Instead, a certain number of containers is
allocated on every node based on that node's capacity, and tasks are executed
in those containers. When scheduling tasks onto containers, the YARN scheduler
takes data locality requirements into account. I am not very familiar with the
Fair Scheduler, but if you check the source of FifoScheduler you will find a
function 'assignContainersOnNode' which looks like the following:

  private int assignContainersOnNode(FiCaSchedulerNode node,
      FiCaSchedulerApp application, Priority priority
  ) {
    // Data-local
    int nodeLocalContainers =
        assignNodeLocalContainers(node, application, priority);

    // Rack-local
    int rackLocalContainers =
        assignRackLocalContainers(node, application, priority);

    // Off-switch
    int offSwitchContainers =
        assignOffSwitchContainers(node, application, priority);

    LOG.debug("assignContainersOnNode:" +
        " node=" + node.getRMNode().getNodeAddress() +
        " application=" + application.getApplicationId().getId() +
        " priority=" + priority.getPriority() +
        " #assigned=" +
        (nodeLocalContainers + rackLocalContainers + offSwitchContainers));

    return (nodeLocalContainers + rackLocalContainers +
        offSwitchContainers);
  }

In this routine you will find that data-local tasks are scheduled first,
then rack-local, and then off-switch.
With this in mind, you may find a similar function in the Fair Scheduler too.
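
To make the ordering concrete, here is a minimal standalone sketch of the
same idea. It is not the Hadoop code: the class LocalitySketch, the Request
type, and the host/rack names are all hypothetical, and the real scheduler
tracks resource requests and capacity in much more detail. It only shows a
node's free containers being filled by node-local requests first, then
rack-local, then off-switch.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration only; none of these types exist in Hadoop.
public class LocalitySketch {

    // Locality levels in the order they are tried.
    enum Locality { NODE_LOCAL, RACK_LOCAL, OFF_SWITCH }

    // A pending task request, described by where its input data lives.
    static final class Request {
        final String host;
        final String rack;

        Request(String host, String rack) {
            this.host = host;
            this.rack = rack;
        }
    }

    // How local would this request be if it ran on the given node?
    static Locality classify(Request r, String nodeHost, String nodeRack) {
        if (r.host.equals(nodeHost)) {
            return Locality.NODE_LOCAL;
        }
        if (r.rack.equals(nodeRack)) {
            return Locality.RACK_LOCAL;
        }
        return Locality.OFF_SWITCH;
    }

    // Fill up to freeContainers on one node, one locality level at a time.
    // Assigned requests are removed from the pending list.
    static List<Locality> assignOnNode(String nodeHost, String nodeRack,
            int freeContainers, List<Request> pending) {
        List<Locality> assigned = new ArrayList<>();
        for (Locality level : Locality.values()) {
            Iterator<Request> it = pending.iterator();
            while (it.hasNext() && assigned.size() < freeContainers) {
                Request r = it.next();
                if (classify(r, nodeHost, nodeRack) == level) {
                    it.remove();
                    assigned.add(level);
                }
            }
        }
        return assigned;
    }

    public static void main(String[] args) {
        List<Request> pending = new ArrayList<>();
        pending.add(new Request("host3", "rack2")); // data on another rack
        pending.add(new Request("host2", "rack1")); // data on the same rack
        pending.add(new Request("host1", "rack1")); // data on this node
        // The node host1 on rack1 has two free containers.
        System.out.println(assignOnNode("host1", "rack1", 2, pending));
    }
}
```

Even though the off-rack request is first in the pending list, it is the one
left waiting: the two free containers go to the node-local and rack-local
requests.
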
I hope this helps. Let me know if you have more questions or if something is
wrong in my reasoning.
Regards,
Shekhar
On Thu, Apr 3, 2014 at 10:56 AM, Brad Childs <[email protected]> wrote:
> Sorry if this is the wrong list, I am looking for deep technical/hadoop
> source help :)
>
> How does job scheduling work on yarn framework for map reduce jobs? I see
> the yarn scheduler discussed here:
> https://hadoop.apache.org/docs/r2.2.0/hadoop-yarn/hadoop-yarn-site/YARN.html
> which leads me to believe tasks are scheduled based on node capacity and
> not data locality. I've sifted through the fair scheduler and can't find
> anything about data location or locality.
>
> Where does data locality play into the scheduling of map/reduce tasks on
> yarn? Can someone point me to the hadoop 2.x source where the data block
> location is used to calculate node/container/task assignment (if that's
> still happening).
>
>
>
> -bc
>
>