[jira] [Commented] (MAPREDUCE-6608) Work Preserving AM Restart for MapReduce

2016-06-07 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15318652#comment-15318652
 ] 

Junping Du commented on MAPREDUCE-6608:
---

I have create branch MAPREDUCE-6608 for this development work.

> Work Preserving AM Restart for MapReduce
> 
>
> Key: MAPREDUCE-6608
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6608
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Srikanth Sampath
>Assignee: Srikanth Sampath
> Attachments: Patch1.patch, WorkPreservingMRAppMaster-1.pdf, 
> WorkPreservingMRAppMaster-2.pdf, WorkPreservingMRAppMaster.pdf
>
>
> Providing a framework for work preserving AM is achieved in 
> [YARN-1489|https://issues.apache.org/jira/browse/YARN-1489].  We would like 
> to take advantage of this for MapReduce(MR) applications.  There are some 
> challenges which have been described in the attached document and few options 
> discussed.  We solicit feedback from the community.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org



[jira] [Commented] (MAPREDUCE-6608) Work Preserving AM Restart for MapReduce

2016-04-19 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15248151#comment-15248151
 ] 

Junping Du commented on MAPREDUCE-6608:
---

[~vinodkv], thanks for review and comments. I think most your points here are 
solid, however, the comments about "Output Commit of previous tasks" is a bit 
stale.

bq. The new AM needs to make sure that output of previously running containers 
can be safely committed. IIRC, with today's FileOutputCommitter, new AM will 
only promote task-outputs that are present in 
$jobOutput/_temporary/$currentAttemptID/
This is true before YARN-4815. However, after YARN-4815, most task-output 
commit to job final output is handled by {{FileOutputCommitter.commitTask()}} 
instead of {{FileOutputCommitter.commitJob()}}. So the commitJob() only left 
work of cleanup $jobOutput/_temporary. So there is nothing need to do here 
unless we make sure "mapreduce.fileoutputcommitter.algorithm.version" is set to 
2. 
This is also an assumption setting for work of MAPREDUCE-5485 which is a 
prerequisite for feature here - or AM will failed directly in case previous AM 
ends in job committing.

Investigating on rest of issues and will propose some possible solutions later. 
 


bq. I'd suggest spending more time on the design, atleast on some of the areas 
I pointed above and then create a branch, create sub-tasks, do some prototypes 
etc.
+1. This feature work could be a bit over my expectation before. I agree we 
could need a separated branch for developing this in parallel. Will create a 
branch once we finalize our design work. 


> Work Preserving AM Restart for MapReduce
> 
>
> Key: MAPREDUCE-6608
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6608
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Srikanth Sampath
>Assignee: Srikanth Sampath
> Attachments: Patch1.patch, WorkPreservingMRAppMaster-1.pdf, 
> WorkPreservingMRAppMaster-2.pdf, WorkPreservingMRAppMaster.pdf
>
>
> Providing a framework for work preserving AM is achieved in 
> [YARN-1489|https://issues.apache.org/jira/browse/YARN-1489].  We would like 
> to take advantage of this for MapReduce(MR) applications.  There are some 
> challenges which have been described in the attached document and few options 
> discussed.  We solicit feedback from the community.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6608) Work Preserving AM Restart for MapReduce

2016-04-17 Thread Srikanth Sampath (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15244727#comment-15244727
 ] 

Srikanth Sampath commented on MAPREDUCE-6608:
-

Thanks [~vinodkv] for your comments.  I will dig deeper into these and respond. 
 

> Work Preserving AM Restart for MapReduce
> 
>
> Key: MAPREDUCE-6608
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6608
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Srikanth Sampath
>Assignee: Srikanth Sampath
> Attachments: Patch1.patch, WorkPreservingMRAppMaster-1.pdf, 
> WorkPreservingMRAppMaster-2.pdf, WorkPreservingMRAppMaster.pdf
>
>
> Providing a framework for work preserving AM is achieved in 
> [YARN-1489|https://issues.apache.org/jira/browse/YARN-1489].  We would like 
> to take advantage of this for MapReduce(MR) applications.  There are some 
> challenges which have been described in the attached document and few options 
> discussed.  We solicit feedback from the community.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6608) Work Preserving AM Restart for MapReduce

2016-04-15 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15243948#comment-15243948
 ] 

Vinod Kumar Vavilapalli commented on MAPREDUCE-6608:


[~srikanth.sampath] / [~djp],

Got around to reading the design doc attached - there are a few important 
details that aren't covered in the doc, besides the AM discovery problem itself

h4. Output Commit of previous tasks
The new AM needs to make sure that output of previously running containers can 
be safely committed. IIRC, with today's FileOutputCommitter, new AM will only 
promote task-outputs that are present in 
$jobOutput/_temporary/$currentAttemptID/

Similar changes may be needed for other OutputCommitters out there.

h4. Task Output Commit races

It doesn't look like we record task-commit in JobHistory, so it is possible 
that the previous AM gave a commit go-ahead to a taskAttempt which is either 
(a) in the process of committing output or (b) committed the output but fails 
to report to either of the AMs. In this case, two taskAttempts can be 
committing at the same time!

In the same line, without recording the success of a commit after a task 
finishes committing, we will run into issues.

h4. Conflicting TaskAttemptIDs

Today, we launch containers first and then record it in JobHistory. Because of 
this, if the previous AM started a TaskAttempt but crashed before recording it 
in JobHistory, and this oldTaskAttempt somehow cannot get reconnected to the 
new AM due to network issues, the new AM generates the same TaskAttemptID for a 
newer attempt and they both will collide on HDFS and/or the local NM output 
directories if they both happen to run on the same machine.

The above problem will be worse when speculative tasks are involved.

h4. Security
AM should use the same job-token as the previous incarnation otherwise the old 
running tasks will get authentication failures. I quickly checked and it seems 
like the AM itself generates the token, which means the second AM will generate 
a different one and all running tasks will fail to sync back!

h4. Others
bq. In the WP case, upon a loss of connection to the AM the tasks will try and 
reestablish the connection with the new AM.
This will not suffice. It is possible even today, but when a network partition 
occurs and two AMs end up running at the same time and give commit-go 
permission to two TaskAttempts of the same task, they will collide on the 
output-commit.

h4. General comments
This stuff is hard. Even if we forget about the AM discovery problem, I am sure 
others will find a bunch of other design considerations you may be missing now.

I'd suggest spending more time on the design, atleast on some of the areas I 
pointed above and then create a branch, create sub-tasks, do some prototypes 
etc.

> Work Preserving AM Restart for MapReduce
> 
>
> Key: MAPREDUCE-6608
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6608
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Srikanth Sampath
>Assignee: Srikanth Sampath
> Attachments: Patch1.patch, WorkPreservingMRAppMaster-1.pdf, 
> WorkPreservingMRAppMaster-2.pdf, WorkPreservingMRAppMaster.pdf
>
>
> Providing a framework for work preserving AM is achieved in 
> [YARN-1489|https://issues.apache.org/jira/browse/YARN-1489].  We would like 
> to take advantage of this for MapReduce(MR) applications.  There are some 
> challenges which have been described in the attached document and few options 
> discussed.  We solicit feedback from the community.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6608) Work Preserving AM Restart for MapReduce

2016-02-25 Thread Srikanth Sampath (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15167140#comment-15167140
 ] 

Srikanth Sampath commented on MAPREDUCE-6608:
-

Thanks [~djp].  Few points:
1. The reads will only be for the inflight tasks out of the large MR job.  That 
said, it is possible for it to be large - for example multiple AMs fail.

2. The read path in this case is required for communication between the task 
containers and the AM (not between task containers).  So indeed it is a subset 
of the cases addressed in 
[YARN-4602|https://issues.apache.org/jira/browse/YARN-4602].

3. We need more details on how 
[YARN-4602|https://issues.apache.org/jira/browse/YARN-4602] will be addressed.  
What's the latency for the payload to make it from the new AM to the registry 
(RM) and then to the NMs.  How will the task containers fetch the new address. 
Should we still have the registry based read path work as a fallback.

I will be very happy to work with you in parallel on this.  

> Work Preserving AM Restart for MapReduce
> 
>
> Key: MAPREDUCE-6608
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6608
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Srikanth Sampath
>Assignee: Srikanth Sampath
> Attachments: Patch1.patch, WorkPreservingMRAppMaster-1.pdf, 
> WorkPreservingMRAppMaster-2.pdf, WorkPreservingMRAppMaster.pdf
>
>
> Providing a framework for work preserving AM is achieved in 
> [YARN-1489|https://issues.apache.org/jira/browse/YARN-1489].  We would like 
> to take advantage of this for MapReduce(MR) applications.  There are some 
> challenges which have been described in the attached document and few options 
> discussed.  We solicit feedback from the community.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6608) Work Preserving AM Restart for MapReduce

2016-02-24 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15163297#comment-15163297
 ] 

Junping Du commented on MAPREDUCE-6608:
---

bq. Please let me know your recommendation on Issue 1
Even with optimization, it still sounds risky for AM failure on a large MR job 
(with 10-100 thousands of tasks could be) for ZK based reader way. So I think 
we need a separate JIRA to track YARN issue as this one is MAPREDUCE jira which 
track changes for MR project.
Per my comments in YARN-1489, we already have YARN-4602 to track a generic 
message passing-by problem between containers for YARN. Please check if that 
one fit into our cases here. If so, we can think to work in parallel on this 
(based on some hacked/faked read path first until we have a real one later). 
Thoughts?

> Work Preserving AM Restart for MapReduce
> 
>
> Key: MAPREDUCE-6608
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6608
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Srikanth Sampath
>Assignee: Srikanth Sampath
> Attachments: Patch1.patch, WorkPreservingMRAppMaster-1.pdf, 
> WorkPreservingMRAppMaster-2.pdf, WorkPreservingMRAppMaster.pdf
>
>
> Providing a framework for work preserving AM is achieved in 
> [YARN-1489|https://issues.apache.org/jira/browse/YARN-1489].  We would like 
> to take advantage of this for MapReduce(MR) applications.  There are some 
> challenges which have been described in the attached document and few options 
> discussed.  We solicit feedback from the community.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6608) Work Preserving AM Restart for MapReduce

2016-02-24 Thread Srikanth Sampath (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15162887#comment-15162887
 ] 

Srikanth Sampath commented on MAPREDUCE-6608:
-

Thanks much [~djp] for your review and comments.  Appreciate it very much. 

*Issue 1*
{quote}+1 on Vinod's proposal of separating write and read path.{quote}
I agree and will log a separate YARN JIRA.  Do you think that effort should be 
linked to this work or can be done separately and later incorporated.  Given 
your suggestion for optimizing - using the service record for other attempts 
(not the first one) the read paths will be much fewer.  

*Issue 2*
{quote} We can involve a new MR config to switch on/off this feature (off by 
default). However, I didn't see any implementation on this in demo patch {quote}
Yes, not in the demo patch.  I will add it in the next version and also 
maintain the old code path when the configuration is off (the default).

{quote} Beside we need to replace the read path of registry service, another 
point is we don't necessary to keep the first attempt AM info which could 
saving most of overhead we are adding here as most applications are expected to 
end with single attempt. Isn't it? {quote}
Yes.  That's correct.  Very good suggestion.

{quote}Agree that named argument sounds better. However, this way has work for 
a long time for MapReduce project and we won't prefer to change unless we find 
some issue or bug. For path to service record, we need keep consistent with our 
decision on read path. {quote}
I think named arguments are better.  If we end up changing the interface of 
YarnChild, I think we should do it.  It depends on what we decide on *Issue 1*

{quote}UmbilicalWithRetries should follow other existing practice (for RPC 
client retry during service down time) that to create a RetryProxy with 
FailoverProxyProvider (may be call it as MRAMProxy) for task attempt to contact 
with new attempt instance for AM.{quote}
Thanks much for this very useful suggestion.  I will incorporate it.

Please let me know your recommendation on *Issue 1*

> Work Preserving AM Restart for MapReduce
> 
>
> Key: MAPREDUCE-6608
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6608
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Srikanth Sampath
>Assignee: Srikanth Sampath
> Attachments: Patch1.patch, WorkPreservingMRAppMaster-1.pdf, 
> WorkPreservingMRAppMaster-2.pdf, WorkPreservingMRAppMaster.pdf
>
>
> Providing a framework for work preserving AM is achieved in 
> [YARN-1489|https://issues.apache.org/jira/browse/YARN-1489].  We would like 
> to take advantage of this for MapReduce(MR) applications.  There are some 
> challenges which have been described in the attached document and few options 
> discussed.  We solicit feedback from the community.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6608) Work Preserving AM Restart for MapReduce

2016-02-23 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15159319#comment-15159319
 ] 

Junping Du commented on MAPREDUCE-6608:
---

Thanks [~srikanth.sampath] for updating the design doc and uploading an 
outstanding demo patch!
Sorry for reply a little late as just come back from a vacation... Finally, I 
got chance to review the latest document and the demo patch. 

+1 on Vinod's proposal of separating write and read path. This solution is even 
better than my proposal (HDFS way) above as no single point access means better 
scalability. The only problem here is the implementation is more complicated as 
it involves new RPC service in NM (client side is task) and more payload 
between NM-RM heartbeat, so we should separate it out a dedicated YARN JIRA to 
track the work.

Other quick comments on the design doc:

bq.  The work preserving feature of an MR Application can be set at an 
application level, when the application is submitted.
Sounds good. We can involve a new MR config to switch on/off this feature (off 
by default). However, I didn't see any implementation on this in demo patch and 
I think we should add it in the beginning as we want to keep old behavior (code 
path) unchanged in case feature is off.

bq. When the AM starts up, the registry operations is started as a service. An 
AM creates a service record id being the JobId and persistence being at the 
application level. It then stores the address(host, port) as an internal 
endpoint.
Beside we need to replace the read path of registry service, another point is 
we don't necessary to keep the first attempt AM info which could saving most of 
overhead we are adding here as most applications are expected to end with 
single attempt. Isn't it?

bq. Currently, YarnChild uses positional arguments as parameters. This will be 
enhanced to use named arguments as parameters. For work preserving jobs, the 
path to the service record is passed as the parameter to determine the address 
of the AM.
Agree that named argument sounds better. However, this way has work for a long 
time for MapReduce project and we won't prefer to change unless we find some 
issue or bug. For path to service record, we need keep consistent with our 
decision on read path.

bq. Thus UmbilicalWithRetries is a wrapper over Umbilical with retries 
implemented. Depending on whether the AM is workpreserving or not, a factory 
method creates either a vanilla umbilical or one with retries.
UmbilicalWithRetries should follow other existing practice (for RPC client 
retry during service down time) that to create a RetryProxy with 
FailoverProxyProvider (may be call it as MRAMProxy) for task attempt to contact 
with new attempt instance for AM. 

TaskManagement part look good to me.

> Work Preserving AM Restart for MapReduce
> 
>
> Key: MAPREDUCE-6608
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6608
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Srikanth Sampath
>Assignee: Srikanth Sampath
> Attachments: Patch1.patch, WorkPreservingMRAppMaster-1.pdf, 
> WorkPreservingMRAppMaster-2.pdf, WorkPreservingMRAppMaster.pdf
>
>
> Providing a framework for work preserving AM is achieved in 
> [YARN-1489|https://issues.apache.org/jira/browse/YARN-1489].  We would like 
> to take advantage of this for MapReduce(MR) applications.  There are some 
> challenges which have been described in the attached document and few options 
> discussed.  We solicit feedback from the community.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6608) Work Preserving AM Restart for MapReduce

2016-02-11 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15143761#comment-15143761
 ] 

Vinod Kumar Vavilapalli commented on MAPREDUCE-6608:


bq. I agree that storing state in zookeeper may have scalability issues. I am 
just thinking that will it be ended up having too many small files in hdfs if 
we are planning to store AM information in HDFS.
A solution for this is already given at YARN-1489 by [~bikassaha]. See this 
comment: 
https://issues.apache.org/jira/browse/YARN-1489?focusedCommentId=13862359&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13862359.

The solution is essentially a combination of registry with YARN acting as a 
distributed readers solution: Registry owns the write path and storage, RM/NMs 
take care of providing scalable reads.

> Work Preserving AM Restart for MapReduce
> 
>
> Key: MAPREDUCE-6608
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6608
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Srikanth Sampath
>Assignee: Srikanth Sampath
> Attachments: Patch1.patch, WorkPreservingMRAppMaster-1.pdf, 
> WorkPreservingMRAppMaster-2.pdf, WorkPreservingMRAppMaster.pdf
>
>
> Providing a framework for work preserving AM is achieved in 
> [YARN-1489|https://issues.apache.org/jira/browse/YARN-1489].  We would like 
> to take advantage of this for MapReduce(MR) applications.  There are some 
> challenges which have been described in the attached document and few options 
> discussed.  We solicit feedback from the community.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6608) Work Preserving AM Restart for MapReduce

2016-02-07 Thread Srikanth Sampath (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15136279#comment-15136279
 ] 

Srikanth Sampath commented on MAPREDUCE-6608:
-

I have attached a design patch - 
[Patch1|https://issues.apache.org/jira/secure/attachment/12786705/Patch1.patch] 
that gives a high level approach on the implementation.  The 
[Design|https://issues.apache.org/jira/secure/attachment/12786706/WorkPreservingMRAppMaster-2.pdf]
 document gives the high level design.

*Notes:*
1. This is a patch against Apache 2.6.1
2. It works for the example hadoop sleep job - where I have killed the  AM 
randomly and the inflight tasks continue.
3. SS_DEBUG in the patch indicates a debug statement that helps me. Some of 
these will be removed eventually.
4. SS_FIXME in the patch is a tag for me to fix some known issues that I have 
commented on.  I will clean these up before the next submission.

I solicit comments on the high level design and the approach I have taken in 
the patch.

*Next Steps:*
1. I will iron out the known issues (all SS_FIXME), clean up the interfaces,  
make the code compliant with apache coding standards, rebase the code against 
trunk, and test it thoroughly.  I will factor in the comments and suggestions 
that are made with the design doc and design patch.
2. Identify the components and issues involved and raise sub tasks.  

> Work Preserving AM Restart for MapReduce
> 
>
> Key: MAPREDUCE-6608
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6608
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Srikanth Sampath
>Assignee: Srikanth Sampath
> Attachments: Patch1.patch, WorkPreservingMRAppMaster-1.pdf, 
> WorkPreservingMRAppMaster-2.pdf, WorkPreservingMRAppMaster.pdf
>
>
> Providing a framework for work preserving AM is achieved in 
> [YARN-1489|https://issues.apache.org/jira/browse/YARN-1489].  We would like 
> to take advantage of this for MapReduce(MR) applications.  There are some 
> challenges which have been described in the attached document and few options 
> discussed.  We solicit feedback from the community.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6608) Work Preserving AM Restart for MapReduce

2016-02-02 Thread Srikanth Sampath (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15128785#comment-15128785
 ] 

Srikanth Sampath commented on MAPREDUCE-6608:
-

I have attached the high level approach used in the solution that is in the 
works.  I will update with a design patch and a detailed design soon.

> Work Preserving AM Restart for MapReduce
> 
>
> Key: MAPREDUCE-6608
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6608
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Srikanth Sampath
>Assignee: Srikanth Sampath
> Attachments: WorkPreservingMRAppMaster-1.pdf, 
> WorkPreservingMRAppMaster.pdf
>
>
> Providing a framework for work preserving AM is achieved in 
> [YARN-1489|https://issues.apache.org/jira/browse/YARN-1489].  We would like 
> to take advantage of this for MapReduce(MR) applications.  There are some 
> challenges which have been described in the attached document and few options 
> discussed.  We solicit feedback from the community.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6608) Work Preserving AM Restart for MapReduce

2016-01-18 Thread Srikanth Sampath (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15106254#comment-15106254
 ] 

Srikanth Sampath commented on MAPREDUCE-6608:
-

Thanks [~djp] for your comments.  Yes indeed there are more issues with it, I 
have a working version of it which uses the registry.  I have been in 
discussion with [~vvasudev].  There are some loose ends, which I will tie up 
and upload a patch and an updated design next week.


> Work Preserving AM Restart for MapReduce
> 
>
> Key: MAPREDUCE-6608
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6608
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Srikanth Sampath
>Assignee: Srikanth Sampath
> Attachments: WorkPreservingMRAppMaster.pdf
>
>
> Providing a framework for work preserving AM is achieved in 
> [YARN-1489|https://issues.apache.org/jira/browse/YARN-1489].  We would like 
> to take advantage of this for MapReduce(MR) applications.  There are some 
> challenges which have been described in the attached document and few options 
> discussed.  We solicit feedback from the community.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6608) Work Preserving AM Restart for MapReduce

2016-01-18 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15106139#comment-15106139
 ] 

Junping Du commented on MAPREDUCE-6608:
---

bq. I am just thinking that will it be ended up having too many small files in 
hdfs if we are planning to store AM information in HDFS.
It shouldn't be a problem if we cleanup this file when job get completed (like 
hookup in job commit stage).

[~srikanth.sampath], I think there are still many details need to be taken care 
besides new attempt AM address. I noticed this JIRA hasn't been updated for any 
patch work for a while. Is this work a high priority for you in the short term? 
If not, do you mind I take it and move it forward? Thanks!

> Work Preserving AM Restart for MapReduce
> 
>
> Key: MAPREDUCE-6608
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6608
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Srikanth Sampath
>Assignee: Srikanth Sampath
> Attachments: WorkPreservingMRAppMaster.pdf
>
>
> Providing a framework for work preserving AM is achieved in 
> [YARN-1489|https://issues.apache.org/jira/browse/YARN-1489].  We would like 
> to take advantage of this for MapReduce(MR) applications.  There are some 
> challenges which have been described in the attached document and few options 
> discussed.  We solicit feedback from the community.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6608) Work Preserving AM Restart for MapReduce

2016-01-18 Thread Raju Bairishetti (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15106086#comment-15106086
 ] 

Raju Bairishetti commented on MAPREDUCE-6608:
-

I agree that storing state in zookeeper may have scalability issues. I am just 
thinking that will it be ended up having too many small files in hdfs if we are 
planning to store AM information in HDFS.

> Work Preserving AM Restart for MapReduce
> 
>
> Key: MAPREDUCE-6608
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6608
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Srikanth Sampath
>Assignee: Srikanth Sampath
> Attachments: WorkPreservingMRAppMaster.pdf
>
>
> Providing a framework for work preserving AM is achieved in 
> [YARN-1489|https://issues.apache.org/jira/browse/YARN-1489].  We would like 
> to take advantage of this for MapReduce(MR) applications.  There are some 
> challenges which have been described in the attached document and few options 
> discussed.  We solicit feedback from the community.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6608) Work Preserving AM Restart for MapReduce

2016-01-18 Thread Raju Bairishetti (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15106081#comment-15106081
 ] 

Raju Bairishetti commented on MAPREDUCE-6608:
-

[~djp] Thanks for your views on this issue.

> Work Preserving AM Restart for MapReduce
> 
>
> Key: MAPREDUCE-6608
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6608
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Srikanth Sampath
>Assignee: Srikanth Sampath
> Attachments: WorkPreservingMRAppMaster.pdf
>
>
> Providing a framework for work preserving AM is achieved in 
> [YARN-1489|https://issues.apache.org/jira/browse/YARN-1489].  We would like 
> to take advantage of this for MapReduce(MR) applications.  There are some 
> challenges which have been described in the attached document and few options 
> discussed.  We solicit feedback from the community.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6608) Work Preserving AM Restart for MapReduce

2016-01-18 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15105267#comment-15105267
 ] 

Junping Du commented on MAPREDUCE-6608:
---

Thanks [~srikanth.sampath] and [~raju.bairishetti] for proposing this JIRA and 
upload a design document. This work could be a significant improvement to our 
MapReduce framework reliability. 
Go through the current design doc, I think store new attempt address for MR AM 
in zookeeper could have scalability issues in case MR job has massive running 
tasks (ten thousands or more). I think it could be better to store/get new MR 
AM location from HDFS which has better scalability. 
Also, in my understanding, Yarn Service Registry may not best fit for this 
case. CC [~ste...@apache.org] who is author of YSR.
I could propose another version of design with more details in next few days in 
case we haven't started the development work yet.

> Work Preserving AM Restart for MapReduce
> 
>
> Key: MAPREDUCE-6608
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6608
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Srikanth Sampath
>Assignee: Raju Bairishetti
> Attachments: WorkPreservingMRAppMaster.pdf
>
>
> Providing a framework for work preserving AM is achieved in 
> [YARN-1489|https://issues.apache.org/jira/browse/YARN-1489].  We would like 
> to take advantage of this for MapReduce(MR) applications.  There are some 
> challenges which have been described in the attached document and few options 
> discussed.  We solicit feedback from the community.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6608) Work Preserving AM Restart for MapReduce

2016-01-18 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15105251#comment-15105251
 ] 

Junping Du commented on MAPREDUCE-6608:
---

Move it to MAPREDUCE project given most work in YARN (YARN-1489) has been done.

> Work Preserving AM Restart for MapReduce
> 
>
> Key: MAPREDUCE-6608
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6608
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Srikanth Sampath
>Assignee: Raju Bairishetti
> Attachments: WorkPreservingMRAppMaster.pdf
>
>
> Providing a framework for work preserving AM is achieved in 
> [YARN-1489|https://issues.apache.org/jira/browse/YARN-1489].  We would like 
> to take advantage of this for MapReduce(MR) applications.  There are some 
> challenges which have been described in the attached document and few options 
> discussed.  We solicit feedback from the community.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)