Re: Queries on next gen MR architecture

Arun C Murthy Sat, 07 Jan 2012 10:39:17 -0800

On Jan 5, 2012, at 8:29 AM, Praveen Sripati wrote:

> Hi,
> 
> I had been going through the MRv2 documentation and have the following queries
> 
> 1) Let's say that an InputSplit is on Node1 and Node2.
> 
> Can the ApplicationMaster ask the ResourceManager for a container either on 
> Node1 or Node2 with an OR condition?
>


Not explicitly; the OR condition is implied by the hierarchy of requests (node, rack, *).

In this case, assuming topology is node1/rack1 node2/rack1, requests would be:
node1 -> 1
node2 -> 1
rack1 -> 1
* -> 1

OTOH, if the topology is node1/rack1, node2/rack2, requests would be:
node1 -> 1
node2 -> 1
rack1 -> 1
rack2 -> 1
* -> 1

In both cases, * would limit the #allocated-containers to '1'.

In the first case rack1 itself (independent of *) would limit 
#allocated-containers to 1.

More details here: 
http://developer.yahoo.com/blogs/hadoop/posts/2011/03/mapreduce-nextgen-scheduler/.
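
To make the request tables above concrete, here is a minimal sketch in Python. It is a toy model, not the actual scheduler code: the function names (build_requests, try_allocate) and the plain-dict topology are made up for illustration, and only node-local grants are modelled.

```python
def build_requests(split_hosts, topology, containers=1):
    """Build the {resource-name: count} table for one InputSplit.

    split_hosts: hosts holding a replica of the split.
    topology:    hypothetical host -> rack mapping.
    """
    requests = {}
    # One entry per host that holds the split.
    for host in split_hosts:
        requests[host] = requests.get(host, 0) + containers
    # One entry per distinct rack containing those hosts.
    for rack in {topology[h] for h in split_hosts}:
        requests[rack] = requests.get(rack, 0) + containers
    # The '*' entry caps the total number of containers wanted.
    requests["*"] = requests.get("*", 0) + containers
    return requests


def try_allocate(requests, node, topology):
    """Grant one node-local container on `node` if every level allows it.

    A grant decrements the node, its rack and '*', so a later request
    on the alternative node is refused once '*' (or the rack) hits 0.
    """
    keys = (node, topology.get(node), "*")
    if any(requests.get(k, 0) <= 0 for k in keys):
        return False
    for k in keys:
        requests[k] -= 1
    return True


if __name__ == "__main__":
    topo = {"node1": "rack1", "node2": "rack2"}
    table = build_requests(["node1", "node2"], topo)
    print(table)
    print(try_allocate(table, "node1", topo))  # first grant
    print(try_allocate(table, "node2", topo))  # refused: '*' is exhausted
```

The "OR" falls out of the bookkeeping: either node can satisfy the request, but whichever one is granted first exhausts '*', so the other is never allocated.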
 

I'll work on incorporating this into our docs on hadoop.apache.org.

> 2) > The Scheduler receives periodic information about the resource usages on 
> allocated resources from the NodeManagers. The Scheduler also makes available 
> status of completed Containers to the appropriate ApplicationMaster.
> 
> What's the use of NM sending the resource usages to the scheduler?
> 
> Why can't the NM directly talk to the AM about the completed containers? Does 
> any information pass from NM to AM?
> 

The NM sends resource usages to the scheduler to allow it to track resource 
utilization on each node and, in future, make smarter decisions about 
allocating extra containers on under-utilized nodes etc.
 
The NM doesn't make any 'out' calls to anyone but the RM, else it would be a 
severe scalability bottleneck.

> 3) >The Map-Reduce ApplicationMaster has the following components:
> > TaskUmbilical – The component responsible for receiving heartbeats and 
> > status updates from the map and reduce tasks.
> 
> Does the communication happen directly between the container and the AM? If 
> yes, the task completion status could also be sent from the container to the 
> AM.
> 

Yes, it is already sent to the AM.

> 4) > The Hadoop Map-Reduce JobClient polls the ASM to obtain information 
> about the MR AM and then directly talks to the AM for status, counters etc.
> 
> Once the Job is completed the AM goes down, what happens to the Counters? 
> What is the flow of the Counter (Container -> NM -> AM)?
> 

Once a job completes, the Counters etc. are stored in the JobHistory file (one 
per job), which is served up, if necessary, by the JobHistoryServer.

> 5) If a new YARN application is created. How can the NM trust the request 
> from AM?
> 

All interactions (RPCs) are authenticated. Also, there is a container token 
provided by the RM (during allocation) which is verified by the NM during 
container launch.
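
As a rough illustration of the container-token idea, here is a minimal sketch in Python. It assumes the token is an HMAC over the container id and target node, keyed with a secret known to the RM and NM but not to the AM. The function names and token layout are hypothetical; the real token format and key management are not described in this thread.

```python
import hashlib
import hmac


def issue_token(secret: bytes, container_id: str, node: str) -> str:
    """RM side (assumed): sign the allocation so it cannot be forged."""
    message = f"{container_id}@{node}".encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify_token(secret: bytes, container_id: str, node: str, token: str) -> bool:
    """NM side (assumed): recompute the signature and compare in constant time."""
    expected = issue_token(secret, container_id, node)
    return hmac.compare_digest(expected, token)


if __name__ == "__main__":
    secret = b"rm-nm-shared-secret"  # hypothetical shared key
    token = issue_token(secret, "container_01", "node1")
    print(verify_token(secret, "container_01", "node1", token))  # genuine request
    print(verify_token(secret, "container_99", "node1", token))  # forged container id
```

Under these assumptions the AM only relays the token from the RM to the NM; it cannot mint a valid token for a container it was never allocated.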

> 6) > MapReduce NextGen uses wire-compatible protocols to allow different 
> versions of servers and clients to communicate with each other.
> 
> What is meant by the `wire-compatible protocols` and how is it implemented?
> 

We use Protocol Buffers (PB) everywhere.

> 7) > The computation framework (ResourceManager and NodeManager) is 
> completely generic and is free of MapReduce specificities.
> 
> Is this the reason for adding auxiliary services for shuffling to the NM?
> 

Yes.

hth,
Arun
