[ https://issues.apache.org/jira/browse/MAPREDUCE-6608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15159319#comment-15159319 ]
Junping Du commented on MAPREDUCE-6608:
---------------------------------------

Thanks [~srikanth.sampath] for updating the design doc and uploading an outstanding demo patch! Sorry for the late reply; I just got back from vacation and have finally had a chance to review the latest document and the demo patch.

+1 on Vinod's proposal of separating the write and read paths. This solution is even better than my proposal above (the HDFS way), because having no single point of access means better scalability. The only problem is that the implementation is more complicated: it involves a new RPC service in the NM (with the task as the client) and more payload in the NM-RM heartbeat, so we should split it out into a dedicated YARN JIRA to track the work.

Other quick comments on the design doc:

bq. The work preserving feature of an MR Application can be set at an application level, when the application is submitted.

Sounds good. We can introduce a new MR config to switch this feature on or off (off by default). However, I didn't see any implementation of this in the demo patch, and I think we should add it from the beginning, since we want to keep the old behavior (code path) unchanged when the feature is off.

bq. When the AM starts up, the registry operations is started as a service. An AM creates a service record id being the JobId and persistence being at the application level. It then stores the address(host, port) as an internal endpoint.

Besides needing to replace the read path of the registry service, another point is that we don't necessarily need to keep the first-attempt AM info. Skipping it could save most of the overhead we are adding here, as most applications are expected to finish within a single attempt. Isn't that so?

bq. Currently, YarnChild uses positional arguments as parameters. This will be enhanced to use named arguments as parameters. For work preserving jobs, the path to the service record is passed as the parameter to determine the address of the AM.

Agree that named arguments sound better.
However, the current approach has worked for a long time in the MapReduce project, and we would prefer not to change it unless we find an issue or bug. For the path to the service record, we need to stay consistent with our decision on the read path.

bq. Thus UmbilicalWithRetries is a wrapper over Umbilical with retries implemented. Depending on whether the AM is workpreserving or not, a factory method creates either a vanilla umbilical or one with retries.

UmbilicalWithRetries should follow the existing practice for RPC client retry during service downtime: create a RetryProxy with a FailoverProxyProvider (maybe call it MRAMProxy) so that a task attempt can contact the new AM attempt instance.

The TaskManagement part looks good to me.

> Work Preserving AM Restart for MapReduce
> ----------------------------------------
>
> Key: MAPREDUCE-6608
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6608
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Reporter: Srikanth Sampath
> Assignee: Srikanth Sampath
> Attachments: Patch1.patch, WorkPreservingMRAppMaster-1.pdf, WorkPreservingMRAppMaster-2.pdf, WorkPreservingMRAppMaster.pdf
>
> Providing a framework for work preserving AM is achieved in [YARN-1489|https://issues.apache.org/jira/browse/YARN-1489]. We would like to take advantage of this for MapReduce(MR) applications. There are some challenges which have been described in the attached document and few options discussed. We solicit feedback from the community.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
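
For reference, the retry-with-failover idea suggested in the comment for UmbilicalWithRetries can be sketched in plain Java. This is only an illustration of the pattern, not the demo patch and not Hadoop's own `RetryProxy`/`FailoverProxyProvider` classes from `org.apache.hadoop.io.retry`: the `Umbilical` interface, `TargetProvider`, `roundRobin`, `withFailover`, and the attempt names are all hypothetical, and a real provider would presumably re-resolve the address of the new AM attempt instead of cycling through a fixed list.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.net.ConnectException;
import java.util.Arrays;
import java.util.List;

public class FailoverUmbilicalSketch {

    /** Hypothetical, pared-down stand-in for the task-to-AM umbilical protocol. */
    public interface Umbilical {
        String ping() throws ConnectException;
    }

    /** Hypothetical provider: hands out the current target and can move on to the next one. */
    public interface TargetProvider<T> {
        T current();
        void failover();
    }

    /** Cycles through a fixed list of targets (assumption: stands in for AM address lookup). */
    public static <T> TargetProvider<T> roundRobin(final List<T> targets) {
        return new TargetProvider<T>() {
            private int index = 0;
            @Override public T current() { return targets.get(index); }
            @Override public void failover() { index = (index + 1) % targets.size(); }
        };
    }

    /** Dynamic proxy that fails over and retries the call when a ConnectException is thrown. */
    @SuppressWarnings("unchecked")
    public static <T> T withFailover(Class<T> iface, TargetProvider<T> provider, int maxFailovers) {
        InvocationHandler handler = (proxy, method, args) -> {
            int failovers = 0;
            while (true) {
                try {
                    return method.invoke(provider.current(), args);
                } catch (InvocationTargetException e) {
                    Throwable cause = e.getCause();
                    // Only connection failures trigger failover; anything else is rethrown.
                    if (!(cause instanceof ConnectException) || failovers >= maxFailovers) {
                        throw cause;
                    }
                    failovers++;
                    provider.failover();
                }
            }
        };
        return (T) Proxy.newProxyInstance(iface.getClassLoader(), new Class<?>[] {iface}, handler);
    }

    /** Demo: the first AM attempt is unreachable, the second one answers. */
    public static String demo() {
        Umbilical dead = () -> { throw new ConnectException("am-attempt-1 is gone"); };
        Umbilical live = () -> "pong from am-attempt-2";
        Umbilical proxy = withFailover(Umbilical.class, roundRobin(Arrays.asList(dead, live)), 3);
        try {
            return proxy.ping();
        } catch (ConnectException e) {
            throw new IllegalStateException("all attempts unreachable", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "pong from am-attempt-2"
    }
}
```

The task code only ever sees the `Umbilical` interface, so whether the proxy is the vanilla one or the retrying one can be decided in a single factory method, which matches the factory approach quoted from the design doc.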