[jira] [Commented] (MAPREDUCE-6608) Work Preserving AM Restart for MapReduce
[ https://issues.apache.org/jira/browse/MAPREDUCE-6608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15318652#comment-15318652 ] Junping Du commented on MAPREDUCE-6608: --- I have created the branch MAPREDUCE-6608 for this development work. > Work Preserving AM Restart for MapReduce > > > Key: MAPREDUCE-6608 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-6608 > Project: Hadoop Map/Reduce > Issue Type: Bug >Reporter: Srikanth Sampath >Assignee: Srikanth Sampath > Attachments: Patch1.patch, WorkPreservingMRAppMaster-1.pdf, > WorkPreservingMRAppMaster-2.pdf, WorkPreservingMRAppMaster.pdf > > > Providing a framework for work preserving AM is achieved in > [YARN-1489|https://issues.apache.org/jira/browse/YARN-1489]. We would like > to take advantage of this for MapReduce(MR) applications. There are some > challenges which have been described in the attached document and few options > discussed. We solicit feedback from the community. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org
[jira] [Commented] (MAPREDUCE-6608) Work Preserving AM Restart for MapReduce
[ https://issues.apache.org/jira/browse/MAPREDUCE-6608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15248151#comment-15248151 ] Junping Du commented on MAPREDUCE-6608: ---
[~vinodkv], thanks for the review and comments. I think most of your points are solid; however, the comment about "Output Commit of previous tasks" is a bit stale.
bq. The new AM needs to make sure that output of previously running containers can be safely committed. IIRC, with today's FileOutputCommitter, new AM will only promote task-outputs that are present in $jobOutput/_temporary/$currentAttemptID/
This was true before MAPREDUCE-4815. After MAPREDUCE-4815, with "mapreduce.fileoutputcommitter.algorithm.version" set to 2, most of the promotion of task output to the final job output is handled by {{FileOutputCommitter.commitTask()}} instead of {{FileOutputCommitter.commitJob()}}, so commitJob() is left with only the cleanup of $jobOutput/_temporary. There is therefore nothing more to do here as long as we make sure "mapreduce.fileoutputcommitter.algorithm.version" is set to 2. That setting is also assumed by the work in MAPREDUCE-5485, which is a prerequisite for this feature; otherwise the AM will fail outright if the previous AM ended while the job was committing. I am investigating the rest of the issues and will propose some possible solutions later.
bq. I'd suggest spending more time on the design, atleast on some of the areas I pointed above and then create a branch, create sub-tasks, do some prototypes etc.
+1. This feature is turning out to be more work than I expected. I agree we may need a separate branch to develop it in parallel. I will create a branch once we finalize the design work.
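The difference between the two commit algorithms discussed above can be sketched with a toy model. This is only an illustration under stated assumptions: `ToyCommitter` is a hypothetical class, not Hadoop's `FileOutputCommitter`; the "file system" is an in-memory set of paths; and the model deliberately ignores the task recovery that a real AM performs from JobHistory. It shows only why output committed under a previous AM attempt's directory is not promoted by a later attempt's job commit in the version 1 layout, while in the version 2 layout it is already in the job output.

```java
import java.util.ArrayList;
import java.util.TreeSet;

/** Toy committer over an in-memory "file system" (a sorted set of paths). Illustrative only. */
class ToyCommitter {
    private final TreeSet<String> fs;   // shared across AM attempts, like HDFS
    private final String out;           // stands in for $jobOutput
    private final int algorithm;        // 1 or 2
    private final int appAttemptId;     // AM attempt this committer belongs to

    ToyCommitter(TreeSet<String> fs, String out, int algorithm, int appAttemptId) {
        this.fs = fs;
        this.out = out;
        this.algorithm = algorithm;
        this.appAttemptId = appAttemptId;
    }

    private String attemptRoot() {
        return out + "/_temporary/" + appAttemptId + "/";
    }

    /** A task attempt writes into its own scratch directory. */
    void write(String taskAttemptId, String fileName) {
        fs.add(attemptRoot() + "_temporary/" + taskAttemptId + "/" + fileName);
    }

    /** v1 parks output under the AM attempt directory; v2 moves it straight to the job output. */
    void commitTask(String taskId, String taskAttemptId) {
        String from = attemptRoot() + "_temporary/" + taskAttemptId + "/";
        String to = (algorithm == 2) ? out + "/" : attemptRoot() + taskId + "/";
        for (String p : new ArrayList<>(fs)) {
            if (p.startsWith(from)) {
                fs.remove(p);
                fs.add(to + p.substring(from.length()));
            }
        }
    }

    /** v1 promotes committed tasks of THIS AM attempt only; both versions then clean up. */
    void commitJob() {
        if (algorithm == 1) {
            String root = attemptRoot();
            for (String p : new ArrayList<>(fs)) {
                if (p.startsWith(root) && !p.startsWith(root + "_temporary/")) {
                    String rest = p.substring(root.length());   // taskId/fileName
                    fs.add(out + "/" + rest.substring(rest.indexOf('/') + 1));
                }
            }
        }
        fs.removeIf(p -> p.startsWith(out + "/_temporary/"));
    }

    boolean exists(String path) {
        return fs.contains(path);
    }
}
```

In this model, a task that commits under AM attempt 1 followed by a job commit from AM attempt 2 loses its output with algorithm 1 and keeps it with algorithm 2, which is the situation the comment describes.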
[jira] [Commented] (MAPREDUCE-6608) Work Preserving AM Restart for MapReduce
[ https://issues.apache.org/jira/browse/MAPREDUCE-6608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15244727#comment-15244727 ] Srikanth Sampath commented on MAPREDUCE-6608: - Thanks [~vinodkv] for your comments. I will dig deeper into these and respond.
[jira] [Commented] (MAPREDUCE-6608) Work Preserving AM Restart for MapReduce
[ https://issues.apache.org/jira/browse/MAPREDUCE-6608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15243948#comment-15243948 ] Vinod Kumar Vavilapalli commented on MAPREDUCE-6608:
[~srikanth.sampath] / [~djp], I got around to reading the attached design doc. There are a few important details that aren't covered in the doc, besides the AM discovery problem itself.
h4. Output Commit of previous tasks
The new AM needs to make sure that output of previously running containers can be safely committed. IIRC, with today's FileOutputCommitter, new AM will only promote task-outputs that are present in $jobOutput/_temporary/$currentAttemptID/. Similar changes may be needed for other OutputCommitters out there.
h4. Task Output Commit races
It doesn't look like we record task-commit in JobHistory, so it is possible that the previous AM gave a commit go-ahead to a taskAttempt which is either (a) in the process of committing output or (b) has committed the output but failed to report to either of the AMs. In this case, two taskAttempts can be committing at the same time! Along the same lines, unless we record the success of a commit after a task finishes committing, we will run into issues.
h4. Conflicting TaskAttemptIDs
Today, we launch containers first and then record them in JobHistory. Because of this, if the previous AM started a TaskAttempt but crashed before recording it in JobHistory, and this old TaskAttempt somehow cannot get reconnected to the new AM due to network issues, the new AM generates the same TaskAttemptID for a newer attempt. The two will then collide on HDFS and/or on the local NM output directories if they happen to run on the same machine. The problem will be worse when speculative tasks are involved.
h4. Security
The AM should use the same job-token as the previous incarnation; otherwise the old running tasks will get authentication failures. I quickly checked and it seems like the AM itself generates the token, which means the second AM will generate a different one and all running tasks will fail to sync back!
h4. Others
bq. In the WP case, upon a loss of connection to the AM the tasks will try and reestablish the connection with the new AM.
This will not suffice. Even today it is possible, when a network partition occurs, for two AMs to end up running at the same time and give commit-go permission to two TaskAttempts of the same task, which will then collide on the output-commit.
h4. General comments
This stuff is hard. Even if we forget about the AM discovery problem, I am sure others will find a bunch of other design considerations you may be missing now. I'd suggest spending more time on the design, at least on some of the areas I pointed out above, and then creating a branch, creating sub-tasks, doing some prototypes, etc.
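One way to picture the commit race described above is a ledger that records the commit go-ahead durably before granting it, so that a restarted AM sees the earlier grant. The following is a minimal sketch under assumptions, not something proposed in this thread or present in Hadoop: `CommitLedger` is a hypothetical name, and the shared `Map` merely stands in for a durable store such as JobHistory.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical commit arbiter: at most one attempt per task is ever given the go-ahead. */
class CommitLedger {
    // taskId -> attemptId holding commit permission.
    // The map stands in for a durable log that survives an AM restart.
    private final Map<String, String> granted;

    CommitLedger(Map<String, String> durableStore) {
        this.granted = durableStore;
    }

    /** Records the grant first, then answers; repeated asks by the holder stay true. */
    boolean canCommit(String taskId, String attemptId) {
        String holder = granted.putIfAbsent(taskId, attemptId);
        return holder == null || holder.equals(attemptId);
    }

    /** Convenience factory for an empty, thread-safe store. */
    static Map<String, String> newStore() {
        return new ConcurrentHashMap<>();
    }
}
```

A new AM attempt that builds its `CommitLedger` over the same store refuses a second attempt of a task whose first attempt was already granted permission. The sketch does not address the network-partition case with two live AMs unless the store itself arbitrates writes.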
[jira] [Commented] (MAPREDUCE-6608) Work Preserving AM Restart for MapReduce
[ https://issues.apache.org/jira/browse/MAPREDUCE-6608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15167140#comment-15167140 ] Srikanth Sampath commented on MAPREDUCE-6608: -
Thanks [~djp]. A few points:
1. The reads will only be for the in-flight tasks of the large MR job. That said, the number of reads can still be large, for example when multiple AMs fail.
2. The read path in this case is required for communication between the task containers and the AM (not between task containers). So it is indeed a subset of the cases addressed in [YARN-4602|https://issues.apache.org/jira/browse/YARN-4602].
3. We need more details on how [YARN-4602|https://issues.apache.org/jira/browse/YARN-4602] will be addressed. What is the latency for the payload to make it from the new AM to the registry (RM) and then to the NMs? How will the task containers fetch the new address? Should we still have the registry-based read path work as a fallback?
I will be very happy to work with you in parallel on this.
[jira] [Commented] (MAPREDUCE-6608) Work Preserving AM Restart for MapReduce
[ https://issues.apache.org/jira/browse/MAPREDUCE-6608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15163297#comment-15163297 ] Junping Du commented on MAPREDUCE-6608: ---
bq. Please let me know your recommendation on Issue 1
Even with the optimization, a ZK-based read path still sounds risky when the AM fails on a large MR job (possibly 10 to 100 thousand tasks). So I think we need a separate JIRA to track the YARN work, since this one is a MAPREDUCE JIRA that tracks changes to the MR project. Per my comments in YARN-1489, we already have YARN-4602 to track a generic message-passing problem between containers in YARN. Please check whether that one fits our case here. If so, we can work in parallel on this (based on a hacked/faked read path at first, until we have a real one). Thoughts?
[jira] [Commented] (MAPREDUCE-6608) Work Preserving AM Restart for MapReduce
[ https://issues.apache.org/jira/browse/MAPREDUCE-6608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15162887#comment-15162887 ] Srikanth Sampath commented on MAPREDUCE-6608: -
Thanks much [~djp] for your review and comments. Appreciate it very much.
*Issue 1*
{quote}+1 on Vinod's proposal of separating write and read path.{quote}
I agree and will log a separate YARN JIRA. Do you think that effort should be linked to this work, or can it be done separately and incorporated later? Given your suggestion for optimizing (using the service record only for attempts other than the first), there will be far fewer reads.
*Issue 2*
{quote} We can involve a new MR config to switch on/off this feature (off by default). However, I didn't see any implementation on this in demo patch {quote}
Yes, it is not in the demo patch. I will add it in the next version and also maintain the old code path when the configuration is off (the default).
{quote} Beside we need to replace the read path of registry service, another point is we don't necessary to keep the first attempt AM info which could saving most of overhead we are adding here as most applications are expected to end with single attempt. Isn't it? {quote}
Yes. That's correct. Very good suggestion.
{quote}Agree that named argument sounds better. However, this way has work for a long time for MapReduce project and we won't prefer to change unless we find some issue or bug. For path to service record, we need keep consistent with our decision on read path. {quote}
I think named arguments are better. If we end up changing the interface of YarnChild, I think we should do it. It depends on what we decide on *Issue 1*.
{quote}UmbilicalWithRetries should follow other existing practice (for RPC client retry during service down time) that to create a RetryProxy with FailoverProxyProvider (may be call it as MRAMProxy) for task attempt to contact with new attempt instance for AM.{quote}
Thanks much for this very useful suggestion. I will incorporate it. Please let me know your recommendation on *Issue 1*.
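To make the named-argument idea concrete, here is a small sketch of parsing `--name=value` pairs into a map. It is only an illustration under assumptions: `ChildArgs` and the argument names used in the usage note (`attempt-id`, `registry-path`) are hypothetical and do not describe YarnChild's actual interface or the demo patch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical parser for named child-process arguments of the form --name=value. */
class ChildArgs {
    static Map<String, String> parse(String[] args) {
        Map<String, String> parsed = new LinkedHashMap<>();
        for (String arg : args) {
            int eq = arg.indexOf('=');
            // Require the "--" prefix and at least one character of name before '='.
            if (!arg.startsWith("--") || eq < 3) {
                throw new IllegalArgumentException("expected --name=value, got: " + arg);
            }
            parsed.put(arg.substring(2, eq), arg.substring(eq + 1));
        }
        return parsed;
    }
}
```

For example, `ChildArgs.parse(new String[] {"--attempt-id=attempt_1", "--registry-path=/some/record"})` yields a map from which a work-preserving child could read the service record path, while a non-work-preserving launch would simply omit that argument.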
[jira] [Commented] (MAPREDUCE-6608) Work Preserving AM Restart for MapReduce
[ https://issues.apache.org/jira/browse/MAPREDUCE-6608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15159319#comment-15159319 ] Junping Du commented on MAPREDUCE-6608: ---
Thanks [~srikanth.sampath] for updating the design doc and uploading an outstanding demo patch! Sorry for the late reply; I just came back from a vacation. I finally got a chance to review the latest document and the demo patch.
+1 on Vinod's proposal of separating the write and read paths. This solution is even better than my proposal above (the HDFS way), as having no single point of access means better scalability. The only problem is that the implementation is more complicated: it involves a new RPC service in the NM (with the task as the client) and more payload in the NM-RM heartbeat, so we should split it out into a dedicated YARN JIRA to track the work.
Other quick comments on the design doc:
bq. The work preserving feature of an MR Application can be set at an application level, when the application is submitted.
Sounds good. We can introduce a new MR config to switch this feature on/off (off by default). However, I didn't see any implementation of this in the demo patch, and I think we should add it from the beginning, since we want to keep the old behavior (code path) unchanged when the feature is off.
bq. When the AM starts up, the registry operations is started as a service. An AM creates a service record id being the JobId and persistence being at the application level. It then stores the address(host, port) as an internal endpoint.
Besides replacing the read path of the registry service, another point: we don't necessarily need to keep the first-attempt AM info, which would save most of the overhead we are adding here, since most applications are expected to finish within a single attempt. Right?
bq. Currently, YarnChild uses positional arguments as parameters. This will be enhanced to use named arguments as parameters. For work preserving jobs, the path to the service record is passed as the parameter to determine the address of the AM.
Agreed that named arguments sound better. However, the current way has worked for a long time in the MapReduce project, and we would prefer not to change it unless we find some issue or bug. The path to the service record needs to stay consistent with our decision on the read path.
bq. Thus UmbilicalWithRetries is a wrapper over Umbilical with retries implemented. Depending on whether the AM is workpreserving or not, a factory method creates either a vanilla umbilical or one with retries.
UmbilicalWithRetries should follow the existing practice for RPC client retries during service downtime: create a RetryProxy with a FailoverProxyProvider (maybe call it MRAMProxy) that the task attempt uses to contact the new AM attempt.
The TaskManagement part looks good to me.
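The retry-with-failover pattern suggested above can be sketched with the JDK's dynamic proxies. This is a minimal sketch under assumptions and does not use Hadoop's own RetryProxy or FailoverProxyProvider classes: `ToyUmbilical` and `FailoverUmbilical` are hypothetical names, the `Supplier` stands in for whatever mechanism resolves the current AM address, and a real client would also back off between attempts.

```java
import java.io.IOException;
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.util.function.Supplier;

/** Hypothetical stand-in for the task-to-AM protocol. */
interface ToyUmbilical {
    String ping() throws IOException;
}

/** Builds a proxy that re-resolves the AM and retries when a call fails with IOException. */
class FailoverUmbilical {
    static ToyUmbilical create(Supplier<ToyUmbilical> currentAm, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        InvocationHandler handler = (proxy, method, args) -> {
            IOException last = null;
            for (int i = 0; i < maxAttempts; i++) {
                // Look up the AM again on every attempt so a new AM attempt is picked up.
                ToyUmbilical target = currentAm.get();
                try {
                    return method.invoke(target, args);
                } catch (InvocationTargetException e) {
                    Throwable cause = e.getCause();
                    if (!(cause instanceof IOException)) {
                        throw cause;               // not a connectivity problem: do not retry
                    }
                    last = (IOException) cause;    // remember and try the next resolution
                }
            }
            throw last;
        };
        return (ToyUmbilical) Proxy.newProxyInstance(
                ToyUmbilical.class.getClassLoader(),
                new Class<?>[] {ToyUmbilical.class},
                handler);
    }
}
```

The design choice illustrated here is that callers keep using the plain protocol interface, while address re-resolution and retries live entirely in the proxy, so the non-work-preserving code path can keep using a direct implementation.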
[jira] [Commented] (MAPREDUCE-6608) Work Preserving AM Restart for MapReduce
[ https://issues.apache.org/jira/browse/MAPREDUCE-6608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15143761#comment-15143761 ] Vinod Kumar Vavilapalli commented on MAPREDUCE-6608:
bq. I agree that storing state in zookeeper may have scalability issues. I am just thinking that will it be ended up having too many small files in hdfs if we are planning to store AM information in HDFS.
A solution for this is already given at YARN-1489 by [~bikassaha]. See this comment: https://issues.apache.org/jira/browse/YARN-1489?focusedCommentId=13862359&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13862359. The solution is essentially a combination of registry with YARN acting as a distributed readers solution: Registry owns the write path and storage, RM/NMs take care of providing scalable reads.
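The split between a registry-owned write path and RM/NM-served reads can be illustrated with a toy model. Everything here is an assumption made for illustration: `ToyRegistry`, `ToyNodeManager`, and `ToyResourceManager` are hypothetical classes, not YARN APIs, and the "heartbeat" is a plain method call. The point of the sketch is only that task lookups hit a node-local cache, so the number of registry reads does not grow with the number of tasks.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy registry: owns the write path and storage, and counts how often it is read. */
class ToyRegistry {
    private final Map<String, String> amAddress = new HashMap<>();
    int reads = 0;

    void publish(String jobId, String hostPort) {
        amAddress.put(jobId, hostPort);
    }

    String read(String jobId) {
        reads++;
        return amAddress.get(jobId);
    }
}

/** Toy node manager: serves task lookups from a local cache filled by heartbeats. */
class ToyNodeManager {
    private final Map<String, String> cache = new HashMap<>();

    void onHeartbeat(String jobId, String hostPort) {
        cache.put(jobId, hostPort);
    }

    String lookup(String jobId) {
        return cache.get(jobId);
    }
}

/** Toy resource manager: reads the registry once per round and fans out to the nodes. */
class ToyResourceManager {
    private final ToyRegistry registry;
    private final List<ToyNodeManager> nodes = new ArrayList<>();

    ToyResourceManager(ToyRegistry registry) {
        this.registry = registry;
    }

    void addNode(ToyNodeManager nm) {
        nodes.add(nm);
    }

    void heartbeatRound(String jobId) {
        String address = registry.read(jobId);   // single registry read per round
        for (ToyNodeManager nm : nodes) {
            nm.onHeartbeat(jobId, address);
        }
    }
}
```

The model also makes the latency trade-off visible: between a new AM publishing its address and the next heartbeat round, node-local lookups still return the old address.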
[jira] [Commented] (MAPREDUCE-6608) Work Preserving AM Restart for MapReduce
[ https://issues.apache.org/jira/browse/MAPREDUCE-6608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15136279#comment-15136279 ] Srikanth Sampath commented on MAPREDUCE-6608: -
I have attached a design patch, [Patch1|https://issues.apache.org/jira/secure/attachment/12786705/Patch1.patch], that gives a high level approach to the implementation. The [Design|https://issues.apache.org/jira/secure/attachment/12786706/WorkPreservingMRAppMaster-2.pdf] document gives the high level design.
*Notes:*
1. This is a patch against Apache 2.6.1.
2. It works for the example hadoop sleep job, where I have killed the AM randomly and the in-flight tasks continue.
3. SS_DEBUG in the patch indicates a debug statement that helps me. Some of these will be removed eventually.
4. SS_FIXME in the patch is a tag for me to fix some known issues that I have commented on. I will clean these up before the next submission.
I solicit comments on the high level design and the approach I have taken in the patch.
*Next Steps:*
1. I will iron out the known issues (all SS_FIXME), clean up the interfaces, make the code compliant with Apache coding standards, rebase the code against trunk, and test it thoroughly. I will factor in the comments and suggestions made on the design doc and design patch.
2. Identify the components and issues involved and raise sub-tasks.
[jira] [Commented] (MAPREDUCE-6608) Work Preserving AM Restart for MapReduce
[ https://issues.apache.org/jira/browse/MAPREDUCE-6608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15128785#comment-15128785 ] Srikanth Sampath commented on MAPREDUCE-6608: - I have attached the high level approach used in the solution that is in the works. I will update with a design patch and a detailed design soon. > Work Preserving AM Restart for MapReduce > > > Key: MAPREDUCE-6608 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-6608 > Project: Hadoop Map/Reduce > Issue Type: Bug >Reporter: Srikanth Sampath >Assignee: Srikanth Sampath > Attachments: WorkPreservingMRAppMaster-1.pdf, > WorkPreservingMRAppMaster.pdf > > > Providing a framework for work preserving AM is achieved in > [YARN-1489|https://issues.apache.org/jira/browse/YARN-1489]. We would like > to take advantage of this for MapReduce(MR) applications. There are some > challenges which have been described in the attached document and few options > discussed. We solicit feedback from the community. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAPREDUCE-6608) Work Preserving AM Restart for MapReduce
[ https://issues.apache.org/jira/browse/MAPREDUCE-6608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15106254#comment-15106254 ] Srikanth Sampath commented on MAPREDUCE-6608: - Thanks [~djp] for your comments. Yes, there are indeed more issues with it. I have a working version that uses the registry, and I have been in discussion with [~vvasudev]. There are some loose ends that I will tie up before uploading a patch and an updated design next week. > Work Preserving AM Restart for MapReduce > > > Key: MAPREDUCE-6608 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-6608 > Project: Hadoop Map/Reduce > Issue Type: Bug >Reporter: Srikanth Sampath >Assignee: Srikanth Sampath > Attachments: WorkPreservingMRAppMaster.pdf > > > Providing a framework for work preserving AM is achieved in > [YARN-1489|https://issues.apache.org/jira/browse/YARN-1489]. We would like > to take advantage of this for MapReduce(MR) applications. There are some > challenges which have been described in the attached document and few options > discussed. We solicit feedback from the community. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAPREDUCE-6608) Work Preserving AM Restart for MapReduce
[ https://issues.apache.org/jira/browse/MAPREDUCE-6608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15106139#comment-15106139 ] Junping Du commented on MAPREDUCE-6608: --- bq. I am just thinking that will it be ended up having too many small files in hdfs if we are planning to store AM information in HDFS. It shouldn't be a problem if we clean up this file when the job completes (for example, with a hook in the job commit stage). [~srikanth.sampath], I think there are still many details that need to be taken care of besides the new AM attempt's address. I noticed this JIRA hasn't had any patch updates for a while. Is this work a high priority for you in the short term? If not, do you mind if I take it and move it forward? Thanks! > Work Preserving AM Restart for MapReduce > > > Key: MAPREDUCE-6608 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-6608 > Project: Hadoop Map/Reduce > Issue Type: Bug >Reporter: Srikanth Sampath >Assignee: Srikanth Sampath > Attachments: WorkPreservingMRAppMaster.pdf > > > Providing a framework for work preserving AM is achieved in > [YARN-1489|https://issues.apache.org/jira/browse/YARN-1489]. We would like > to take advantage of this for MapReduce(MR) applications. There are some > challenges which have been described in the attached document and few options > discussed. We solicit feedback from the community. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAPREDUCE-6608) Work Preserving AM Restart for MapReduce
[ https://issues.apache.org/jira/browse/MAPREDUCE-6608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15106086#comment-15106086 ] Raju Bairishetti commented on MAPREDUCE-6608: - I agree that storing state in ZooKeeper may have scalability issues. I am just wondering whether we will end up with too many small files in HDFS if we plan to store the AM information there. > Work Preserving AM Restart for MapReduce > > > Key: MAPREDUCE-6608 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-6608 > Project: Hadoop Map/Reduce > Issue Type: Bug >Reporter: Srikanth Sampath >Assignee: Srikanth Sampath > Attachments: WorkPreservingMRAppMaster.pdf > > > Providing a framework for work preserving AM is achieved in > [YARN-1489|https://issues.apache.org/jira/browse/YARN-1489]. We would like > to take advantage of this for MapReduce(MR) applications. There are some > challenges which have been described in the attached document and few options > discussed. We solicit feedback from the community. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAPREDUCE-6608) Work Preserving AM Restart for MapReduce
[ https://issues.apache.org/jira/browse/MAPREDUCE-6608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15106081#comment-15106081 ] Raju Bairishetti commented on MAPREDUCE-6608: - [~djp] Thanks for your views on this issue. > Work Preserving AM Restart for MapReduce > > > Key: MAPREDUCE-6608 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-6608 > Project: Hadoop Map/Reduce > Issue Type: Bug >Reporter: Srikanth Sampath >Assignee: Srikanth Sampath > Attachments: WorkPreservingMRAppMaster.pdf > > > Providing a framework for work preserving AM is achieved in > [YARN-1489|https://issues.apache.org/jira/browse/YARN-1489]. We would like > to take advantage of this for MapReduce(MR) applications. There are some > challenges which have been described in the attached document and few options > discussed. We solicit feedback from the community. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAPREDUCE-6608) Work Preserving AM Restart for MapReduce
[ https://issues.apache.org/jira/browse/MAPREDUCE-6608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15105267#comment-15105267 ] Junping Du commented on MAPREDUCE-6608: --- Thanks [~srikanth.sampath] and [~raju.bairishetti] for proposing this JIRA and uploading a design document. This work could be a significant improvement to the reliability of our MapReduce framework. Having gone through the current design doc, I think storing the new MR AM attempt's address in ZooKeeper could have scalability issues when an MR job has a massive number of running tasks (tens of thousands or more). I think it would be better to store and fetch the new MR AM location in HDFS, which scales better. Also, in my understanding, the YARN Service Registry may not be the best fit for this case. CC [~ste...@apache.org], who is the author of YSR. I could propose another version of the design with more details in the next few days, in case we haven't started the development work yet. > Work Preserving AM Restart for MapReduce > > > Key: MAPREDUCE-6608 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-6608 > Project: Hadoop Map/Reduce > Issue Type: Bug >Reporter: Srikanth Sampath >Assignee: Raju Bairishetti > Attachments: WorkPreservingMRAppMaster.pdf > > > Providing a framework for work preserving AM is achieved in > [YARN-1489|https://issues.apache.org/jira/browse/YARN-1489]. We would like > to take advantage of this for MapReduce(MR) applications. There are some > challenges which have been described in the attached document and few options > discussed. We solicit feedback from the community. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAPREDUCE-6608) Work Preserving AM Restart for MapReduce
[ https://issues.apache.org/jira/browse/MAPREDUCE-6608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15105251#comment-15105251 ] Junping Du commented on MAPREDUCE-6608: --- Move it to MAPREDUCE project given most work in YARN (YARN-1489) has been done. > Work Preserving AM Restart for MapReduce > > > Key: MAPREDUCE-6608 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-6608 > Project: Hadoop Map/Reduce > Issue Type: Bug >Reporter: Srikanth Sampath >Assignee: Raju Bairishetti > Attachments: WorkPreservingMRAppMaster.pdf > > > Providing a framework for work preserving AM is achieved in > [YARN-1489|https://issues.apache.org/jira/browse/YARN-1489]. We would like > to take advantage of this for MapReduce(MR) applications. There are some > challenges which have been described in the attached document and few options > discussed. We solicit feedback from the community. -- This message was sent by Atlassian JIRA (v6.3.4#6332)