[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15159319#comment-15159319
 ] 

Junping Du commented on MAPREDUCE-6608:
---------------------------------------

Thanks [~srikanth.sampath] for updating the design doc and uploading an 
outstanding demo patch!
Sorry for the late reply; I just came back from vacation and have finally had 
a chance to review the latest document and the demo patch.

+1 on Vinod's proposal of separating the write and read paths. This solution is 
even better than my proposal above (the HDFS way), since having no single point 
of access means better scalability. The only problem is that the implementation 
is more complicated: it involves a new RPC service in the NM (with the task as 
the client) and more payload in the NM-RM heartbeat, so we should split it out 
into a dedicated YARN JIRA to track the work.

Other quick comments on the design doc:

bq.  The work preserving feature of an MR Application can be set at an 
application level, when the application is submitted.
Sounds good. We can introduce a new MR config to switch this feature on/off 
(off by default). However, I didn't see any implementation of this in the demo 
patch, and I think we should add it from the beginning, as we want to keep the 
old behavior (code path) unchanged when the feature is off.
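
A rough sketch of the kind of switch I mean. The key name below is 
hypothetical (nothing in the demo patch defines it), and a plain Map stands in 
for the job configuration so the snippet is self-contained; in the real code 
this would presumably be a Configuration.getBoolean(key, false) lookup.

```java
import java.util.HashMap;
import java.util.Map;

public class WorkPreservingSwitch {
  // Hypothetical key name, for illustration only; not an existing MR property.
  static final String KEY = "mapreduce.job.am.work-preserving.enabled";
  // Off by default, so existing jobs keep the old code path.
  static final boolean DEFAULT = false;

  /** Returns true only if the job explicitly opted in. */
  static boolean isEnabled(Map<String, String> conf) {
    String value = conf.get(KEY);
    return value == null ? DEFAULT : Boolean.parseBoolean(value.trim());
  }

  /** Picks the code path; the old ("vanilla") one is used unless enabled. */
  static String chooseUmbilical(Map<String, String> conf) {
    return isEnabled(conf) ? "retrying" : "vanilla";
  }

  public static void main(String[] args) {
    Map<String, String> conf = new HashMap<>();
    System.out.println(chooseUmbilical(conf)); // prints "vanilla"
    conf.put(KEY, "true");
    System.out.println(chooseUmbilical(conf)); // prints "retrying"
  }
}
```

The point is only that every new code path is guarded by this single check, so 
a job that never sets the key behaves exactly as it does today.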

bq. When the AM starts up, the registry operations is started as a service. An 
AM creates a service record id being the JobId and persistence being at the 
application level. It then stores the address(host, port) as an internal 
endpoint.
Besides the need to replace the read path of the registry service, another 
point is that we don't necessarily need to keep the first-attempt AM info, 
which could save most of the overhead we are adding here, since most 
applications are expected to finish with a single attempt. Right?

bq. Currently, YarnChild uses positional arguments as parameters. This will be 
enhanced to use named arguments as parameters. For work preserving jobs, the 
path to the service record is passed as the parameter to determine the address 
of the AM.
Agreed that named arguments sound better. However, the positional approach has 
worked for a long time in the MapReduce project, and we would prefer not to 
change it unless we find an issue or bug. As for the path to the service 
record, we need to stay consistent with our decision on the read path.

bq. Thus UmbilicalWithRetries is a wrapper over Umbilical with retries 
implemented. Depending on whether the AM is workpreserving or not, a factory 
method creates either a vanilla umbilical or one with retries.
UmbilicalWithRetries should follow the existing practice for RPC client retry 
during service downtime: create a RetryProxy with a FailoverProxyProvider 
(maybe call it MRAMProxy) that the task attempt uses to contact the new AM 
attempt.
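
To make the shape of that concrete, here is a minimal sketch of the pattern 
using only the JDK's dynamic proxies. It is not Hadoop's RetryProxy or 
FailoverProxyProvider API: the Umbilical interface, FailoverProvider, 
ListProvider and the create() helper are all stand-ins invented for 
illustration, and the real provider would look up the new AM address rather 
than walk a fixed list.

```java
import java.io.IOException;
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.util.Arrays;
import java.util.List;

public class MRAMProxySketch {
  /** Tiny stand-in for the task umbilical; only one call is modelled. */
  interface Umbilical {
    String ping() throws IOException;
  }

  /** Stand-in for the role a failover proxy provider plays. */
  interface FailoverProvider<T> {
    T getProxy();
    void performFailover(T current);
  }

  /** Hypothetical provider that cycles through known AM attempts. */
  static class ListProvider implements FailoverProvider<Umbilical> {
    private final List<Umbilical> attempts;
    private int index = 0;
    ListProvider(List<Umbilical> attempts) { this.attempts = attempts; }
    public Umbilical getProxy() { return attempts.get(index); }
    public void performFailover(Umbilical current) {
      index = (index + 1) % attempts.size();
    }
  }

  /** Wraps the provider so that an IOException triggers failover and a retry. */
  static Umbilical create(FailoverProvider<Umbilical> provider, int maxFailovers) {
    InvocationHandler handler = (proxy, method, args) -> {
      int failovers = 0;
      while (true) {
        Umbilical target = provider.getProxy();
        try {
          return method.invoke(target, args);
        } catch (InvocationTargetException e) {
          Throwable cause = e.getCause();
          // Give up on non-I/O errors or once the failover budget is spent.
          if (!(cause instanceof IOException) || failovers >= maxFailovers) {
            throw cause;
          }
          failovers++;
          provider.performFailover(target);
        }
      }
    };
    return (Umbilical) Proxy.newProxyInstance(
        Umbilical.class.getClassLoader(),
        new Class<?>[] {Umbilical.class}, handler);
  }

  public static void main(String[] args) throws IOException {
    Umbilical dead = () -> { throw new IOException("AM attempt 1 is gone"); };
    Umbilical live = () -> "pong from AM attempt 2";
    Umbilical umbilical = create(new ListProvider(Arrays.asList(dead, live)), 3);
    System.out.println(umbilical.ping()); // prints "pong from AM attempt 2"
  }
}
```

The task code only ever sees the Umbilical interface, so the retry and 
failover logic stays out of YarnChild itself.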

The TaskManagement part looks good to me.

> Work Preserving AM Restart for MapReduce
> ----------------------------------------
>
>                 Key: MAPREDUCE-6608
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6608
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>            Reporter: Srikanth Sampath
>            Assignee: Srikanth Sampath
>         Attachments: Patch1.patch, WorkPreservingMRAppMaster-1.pdf, 
> WorkPreservingMRAppMaster-2.pdf, WorkPreservingMRAppMaster.pdf
>
>
> Providing a framework for work preserving AM is achieved in 
> [YARN-1489|https://issues.apache.org/jira/browse/YARN-1489].  We would like 
> to take advantage of this for MapReduce(MR) applications.  There are some 
> challenges which have been described in the attached document and few options 
> discussed.  We solicit feedback from the community.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
