[jira] [Commented] (MAPREDUCE-5891) Improved shuffle error handling across NM restarts

2014-09-19 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14140343#comment-14140343
 ] 

Hudson commented on MAPREDUCE-5891:
---

SUCCESS: Integrated in Hadoop-Yarn-trunk #685 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/685/])
MAPREDUCE-5891. Improved shuffle error handling across NM restarts. Contributed 
by Junping Du (jlowe: rev 2c3da25fd718b3a9c1ed67f05b577975ae613f4e)
* 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapreduce/task/reduce/TestFetcher.java
* 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/ShuffleSchedulerImpl.java
* 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/Fetcher.java
* 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/MRJobConfig.java
* 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/resources/mapred-default.xml
* hadoop-mapreduce-project/CHANGES.txt


 Improved shuffle error handling across NM restarts
 --

 Key: MAPREDUCE-5891
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5891
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Affects Versions: 2.5.0
Reporter: Jason Lowe
Assignee: Junping Du
 Fix For: 2.6.0

 Attachments: MAPREDUCE-5891-demo.patch, MAPREDUCE-5891-v2.patch, 
 MAPREDUCE-5891-v3.patch, MAPREDUCE-5891-v4.patch, MAPREDUCE-5891-v5.patch, 
 MAPREDUCE-5891-v6.patch, MAPREDUCE-5891.patch


 To minimize the number of map fetch failures reported by reducers across an 
 NM restart it would be nice if reducers only reported a fetch failure after 
 trying for at specified period of time to retrieve the data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-5891) Improved shuffle error handling across NM restarts

2014-09-19 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14140495#comment-14140495
 ] 

Hudson commented on MAPREDUCE-5891:
---

FAILURE: Integrated in Hadoop-Mapreduce-trunk #1901 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1901/])
MAPREDUCE-5891. Improved shuffle error handling across NM restarts. Contributed 
by Junping Du (jlowe: rev 2c3da25fd718b3a9c1ed67f05b577975ae613f4e)
* 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapreduce/task/reduce/TestFetcher.java
* hadoop-mapreduce-project/CHANGES.txt
* 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/Fetcher.java
* 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/resources/mapred-default.xml
* 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/ShuffleSchedulerImpl.java
* 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/MRJobConfig.java


 Improved shuffle error handling across NM restarts
 --

 Key: MAPREDUCE-5891
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5891
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Affects Versions: 2.5.0
Reporter: Jason Lowe
Assignee: Junping Du
 Fix For: 2.6.0

 Attachments: MAPREDUCE-5891-demo.patch, MAPREDUCE-5891-v2.patch, 
 MAPREDUCE-5891-v3.patch, MAPREDUCE-5891-v4.patch, MAPREDUCE-5891-v5.patch, 
 MAPREDUCE-5891-v6.patch, MAPREDUCE-5891.patch


 To minimize the number of map fetch failures reported by reducers across an 
 NM restart it would be nice if reducers only reported a fetch failure after 
 trying for at specified period of time to retrieve the data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-5891) Improved shuffle error handling across NM restarts

2014-09-19 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14140549#comment-14140549
 ] 

Hudson commented on MAPREDUCE-5891:
---

FAILURE: Integrated in Hadoop-Hdfs-trunk #1876 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1876/])
MAPREDUCE-5891. Improved shuffle error handling across NM restarts. Contributed 
by Junping Du (jlowe: rev 2c3da25fd718b3a9c1ed67f05b577975ae613f4e)
* 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/resources/mapred-default.xml
* 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/MRJobConfig.java
* 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapreduce/task/reduce/TestFetcher.java
* 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/ShuffleSchedulerImpl.java
* 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/Fetcher.java
* hadoop-mapreduce-project/CHANGES.txt


 Improved shuffle error handling across NM restarts
 --

 Key: MAPREDUCE-5891
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5891
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Affects Versions: 2.5.0
Reporter: Jason Lowe
Assignee: Junping Du
 Fix For: 2.6.0

 Attachments: MAPREDUCE-5891-demo.patch, MAPREDUCE-5891-v2.patch, 
 MAPREDUCE-5891-v3.patch, MAPREDUCE-5891-v4.patch, MAPREDUCE-5891-v5.patch, 
 MAPREDUCE-5891-v6.patch, MAPREDUCE-5891.patch


 To minimize the number of map fetch failures reported by reducers across an 
 NM restart it would be nice if reducers only reported a fetch failure after 
 trying for at specified period of time to retrieve the data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-5891) Improved shuffle error handling across NM restarts

2014-09-18 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139364#comment-14139364
 ] 

Jason Lowe commented on MAPREDUCE-5891:
---

Thanks for updating the patch, Junping.

The property in mapred-default.xml should be updated to use 
$\{yarn.nodemanager.recovery.enabled}, otherwise in practice we'll never 
fallback to the code default that tries to lookup from the NM property.

Nit: huffleFetchEnabledDefault should be shuffleFetchEnabledDefault

[~mingma] do you have additional comments or concerns?

 Improved shuffle error handling across NM restarts
 --

 Key: MAPREDUCE-5891
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5891
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Affects Versions: 2.5.0
Reporter: Jason Lowe
Assignee: Junping Du
 Attachments: MAPREDUCE-5891-demo.patch, MAPREDUCE-5891-v2.patch, 
 MAPREDUCE-5891-v3.patch, MAPREDUCE-5891-v4.patch, MAPREDUCE-5891-v5.patch, 
 MAPREDUCE-5891.patch


 To minimize the number of map fetch failures reported by reducers across an 
 NM restart it would be nice if reducers only reported a fetch failure after 
 trying for at specified period of time to retrieve the data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-5891) Improved shuffle error handling across NM restarts

2014-09-18 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139511#comment-14139511
 ] 

Junping Du commented on MAPREDUCE-5891:
---

Thanks [~jlowe] for reviewing the patch! Fix it as your comments above in v6 
patch.

 Improved shuffle error handling across NM restarts
 --

 Key: MAPREDUCE-5891
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5891
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Affects Versions: 2.5.0
Reporter: Jason Lowe
Assignee: Junping Du
 Attachments: MAPREDUCE-5891-demo.patch, MAPREDUCE-5891-v2.patch, 
 MAPREDUCE-5891-v3.patch, MAPREDUCE-5891-v4.patch, MAPREDUCE-5891-v5.patch, 
 MAPREDUCE-5891-v6.patch, MAPREDUCE-5891.patch


 To minimize the number of map fetch failures reported by reducers across an 
 NM restart it would be nice if reducers only reported a fetch failure after 
 trying for at specified period of time to retrieve the data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-5891) Improved shuffle error handling across NM restarts

2014-09-18 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139518#comment-14139518
 ] 

Ming Ma commented on MAPREDUCE-5891:


Junping, Jason, the patch looks good to me.

 Improved shuffle error handling across NM restarts
 --

 Key: MAPREDUCE-5891
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5891
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Affects Versions: 2.5.0
Reporter: Jason Lowe
Assignee: Junping Du
 Attachments: MAPREDUCE-5891-demo.patch, MAPREDUCE-5891-v2.patch, 
 MAPREDUCE-5891-v3.patch, MAPREDUCE-5891-v4.patch, MAPREDUCE-5891-v5.patch, 
 MAPREDUCE-5891-v6.patch, MAPREDUCE-5891.patch


 To minimize the number of map fetch failures reported by reducers across an 
 NM restart it would be nice if reducers only reported a fetch failure after 
 trying for at specified period of time to retrieve the data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-5891) Improved shuffle error handling across NM restarts

2014-09-18 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139565#comment-14139565
 ] 

Hadoop QA commented on MAPREDUCE-5891:
--

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12669804/MAPREDUCE-5891-v6.patch
  against trunk revision 1cf3198.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4895//testReport/
Console output: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4895//console

This message is automatically generated.

 Improved shuffle error handling across NM restarts
 --

 Key: MAPREDUCE-5891
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5891
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Affects Versions: 2.5.0
Reporter: Jason Lowe
Assignee: Junping Du
 Attachments: MAPREDUCE-5891-demo.patch, MAPREDUCE-5891-v2.patch, 
 MAPREDUCE-5891-v3.patch, MAPREDUCE-5891-v4.patch, MAPREDUCE-5891-v5.patch, 
 MAPREDUCE-5891-v6.patch, MAPREDUCE-5891.patch


 To minimize the number of map fetch failures reported by reducers across an 
 NM restart it would be nice if reducers only reported a fetch failure after 
 trying for at specified period of time to retrieve the data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-5891) Improved shuffle error handling across NM restarts

2014-09-18 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139577#comment-14139577
 ] 

Jason Lowe commented on MAPREDUCE-5891:
---

+1 lgtm.  Committing this.

 Improved shuffle error handling across NM restarts
 --

 Key: MAPREDUCE-5891
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5891
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Affects Versions: 2.5.0
Reporter: Jason Lowe
Assignee: Junping Du
 Attachments: MAPREDUCE-5891-demo.patch, MAPREDUCE-5891-v2.patch, 
 MAPREDUCE-5891-v3.patch, MAPREDUCE-5891-v4.patch, MAPREDUCE-5891-v5.patch, 
 MAPREDUCE-5891-v6.patch, MAPREDUCE-5891.patch


 To minimize the number of map fetch failures reported by reducers across an 
 NM restart it would be nice if reducers only reported a fetch failure after 
 trying for at specified period of time to retrieve the data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-5891) Improved shuffle error handling across NM restarts

2014-09-17 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137615#comment-14137615
 ] 

Junping Du commented on MAPREDUCE-5891:
---

Sorry for coming a little late as busy in travel recently. Will deliver a quick 
fix for default value - prefer to use $
{yarn.nodemanager.recovery.enabled} as default as this seems to be the easiest 
way for short term. Thanks [~jlowe] and [~mingma] for review and good comments 
here!

 Improved shuffle error handling across NM restarts
 --

 Key: MAPREDUCE-5891
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5891
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Affects Versions: 2.5.0
Reporter: Jason Lowe
Assignee: Junping Du
 Attachments: MAPREDUCE-5891-demo.patch, MAPREDUCE-5891-v2.patch, 
 MAPREDUCE-5891-v3.patch, MAPREDUCE-5891-v4.patch, MAPREDUCE-5891.patch


 To minimize the number of map fetch failures reported by reducers across an 
 NM restart it would be nice if reducers only reported a fetch failure after 
 trying for at specified period of time to retrieve the data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-5891) Improved shuffle error handling across NM restarts

2014-09-17 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137753#comment-14137753
 ] 

Hadoop QA commented on MAPREDUCE-5891:
--

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12669470/MAPREDUCE-5891-v5.patch
  against trunk revision ea4e2e8.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4890//testReport/
Console output: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4890//console

This message is automatically generated.

 Improved shuffle error handling across NM restarts
 --

 Key: MAPREDUCE-5891
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5891
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Affects Versions: 2.5.0
Reporter: Jason Lowe
Assignee: Junping Du
 Attachments: MAPREDUCE-5891-demo.patch, MAPREDUCE-5891-v2.patch, 
 MAPREDUCE-5891-v3.patch, MAPREDUCE-5891-v4.patch, MAPREDUCE-5891-v5.patch, 
 MAPREDUCE-5891.patch


 To minimize the number of map fetch failures reported by reducers across an 
 NM restart it would be nice if reducers only reported a fetch failure after 
 trying for at specified period of time to retrieve the data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-5891) Improved shuffle error handling across NM restarts

2014-09-09 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14127043#comment-14127043
 ] 

Jason Lowe commented on MAPREDUCE-5891:
---

Thanks for updating the patch, Junping, and sorry for the delay in re-review.   
 The fixes all look fine.

I agree with Ming that we should be consistent about the default state of this 
feature and NM restart, although I'm not a fan of adding a YARN API to query NM 
restart.  Task containers currently don't talk with the NM, and IMHO this is 
not a good enough reason to change that.  I'm OK with adding it to the shuffle 
protocol if we can do it in a backwards-compatible way, although I don't know 
offhand how that would be accomplished.  Another approach is to try to tie the 
two properties together and have the default value of 
mapreduce.reduce.shuffle.fetch.retry.enabled in mapred-default.xml be 
$\{yarn.nodemanager.recovery.enabled\}, so they could still be set 
independently but by default the NM restart setting drives the fetch retry 
setting.

 Improved shuffle error handling across NM restarts
 --

 Key: MAPREDUCE-5891
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5891
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Affects Versions: 2.5.0
Reporter: Jason Lowe
Assignee: Junping Du
 Attachments: MAPREDUCE-5891-demo.patch, MAPREDUCE-5891-v2.patch, 
 MAPREDUCE-5891-v3.patch, MAPREDUCE-5891-v4.patch, MAPREDUCE-5891.patch


 To minimize the number of map fetch failures reported by reducers across an 
 NM restart it would be nice if reducers only reported a fetch failure after 
 trying for at specified period of time to retrieve the data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-5891) Improved shuffle error handling across NM restarts

2014-09-09 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14127212#comment-14127212
 ] 

Ming Ma commented on MAPREDUCE-5891:


The patch looks good. I like Jason's idea to have 
mapreduce.reduce.shuffle.fetch.retry.enabled use 
${yarn.nodemanager.recovery.enabled} as default value. As for the other 
approaches,

a) dynamic MR to YARN query, given NM recovery flag is a global cluster level 
setting ( although it is possible to config it on per NM basis ), can we derive 
the value of mapreduce.reduce.shuffle.fetch.retry.enabled at job submission 
time from some YARN API call to RM?

b) shuffle protocol change. It seems Fetcher and ShuffleHandler check http 
header via property key names. So if we add a new property to indicate if 
recovery is supported and continue to keep the same http version property, 
new version of fetcher might be able to work with old version of 
shufflehandler, and vise versa.

 Improved shuffle error handling across NM restarts
 --

 Key: MAPREDUCE-5891
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5891
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Affects Versions: 2.5.0
Reporter: Jason Lowe
Assignee: Junping Du
 Attachments: MAPREDUCE-5891-demo.patch, MAPREDUCE-5891-v2.patch, 
 MAPREDUCE-5891-v3.patch, MAPREDUCE-5891-v4.patch, MAPREDUCE-5891.patch


 To minimize the number of map fetch failures reported by reducers across an 
 NM restart it would be nice if reducers only reported a fetch failure after 
 trying for at specified period of time to retrieve the data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-5891) Improved shuffle error handling across NM restarts

2014-09-09 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14127309#comment-14127309
 ] 

Jason Lowe commented on MAPREDUCE-5891:
---

bq. a) dynamic MR to YARN query, given NM recovery flag is a global cluster 
level setting ( although it is possible to config it on per NM basis ), can we 
derive the value of mapreduce.reduce.shuffle.fetch.retry.enabled at job 
submission time from some YARN API call to RM?

The RM is unaware of whether the NM supports work-preserving restart, and I'd 
rather not add that coupling just for this.

bq. b) shuffle protocol change. It seems Fetcher and ShuffleHandler check http 
header via property key names. So if we add a new property to indicate if 
recovery is supported and continue to keep the same http version property, 
new version of fetcher might be able to work with old version of 
shufflehandler, and vise versa.

True, we could add a new HTTP header that new Fetchers could query.

 Improved shuffle error handling across NM restarts
 --

 Key: MAPREDUCE-5891
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5891
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Affects Versions: 2.5.0
Reporter: Jason Lowe
Assignee: Junping Du
 Attachments: MAPREDUCE-5891-demo.patch, MAPREDUCE-5891-v2.patch, 
 MAPREDUCE-5891-v3.patch, MAPREDUCE-5891-v4.patch, MAPREDUCE-5891.patch


 To minimize the number of map fetch failures reported by reducers across an 
 NM restart it would be nice if reducers only reported a fetch failure after 
 trying for at specified period of time to retrieve the data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-5891) Improved shuffle error handling across NM restarts

2014-09-08 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14126643#comment-14126643
 ] 

Ming Ma commented on MAPREDUCE-5891:


Thanks, Junping. Regarding the default value, 
mapreduce.reduce.shuffle.fetch.retry.enabled is set to true by default; while 
NM recovery is set to false by default. That means by default, fetcher will 
retry even though shufflehandler won't be able to serve mapper outputs after 
restart. It doesn't seem like a big deal. Just want to call out if that is 
intentional. Do we foresee other scenarios where fetch retry will be useful? If 
not, reducers can ask YARN if NM recovery is enabled or reducers can ask 
shufflehandler if recovery is enabled without defining this retry property.

 Improved shuffle error handling across NM restarts
 --

 Key: MAPREDUCE-5891
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5891
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Affects Versions: 2.5.0
Reporter: Jason Lowe
Assignee: Junping Du
 Attachments: MAPREDUCE-5891-demo.patch, MAPREDUCE-5891-v2.patch, 
 MAPREDUCE-5891-v3.patch, MAPREDUCE-5891-v4.patch, MAPREDUCE-5891.patch


 To minimize the number of map fetch failures reported by reducers across an 
 NM restart it would be nice if reducers only reported a fetch failure after 
 trying for at specified period of time to retrieve the data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-5891) Improved shuffle error handling across NM restarts

2014-09-07 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14125114#comment-14125114
 ] 

Junping Du commented on MAPREDUCE-5891:
---

Hi [~jlowe] and [~mingma], any additional comments on latest patch?

 Improved shuffle error handling across NM restarts
 --

 Key: MAPREDUCE-5891
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5891
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Affects Versions: 2.5.0
Reporter: Jason Lowe
Assignee: Junping Du
 Attachments: MAPREDUCE-5891-demo.patch, MAPREDUCE-5891-v2.patch, 
 MAPREDUCE-5891-v3.patch, MAPREDUCE-5891-v4.patch, MAPREDUCE-5891.patch


 To minimize the number of map fetch failures reported by reducers across an 
 NM restart it would be nice if reducers only reported a fetch failure after 
 trying for at specified period of time to retrieve the data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-5891) Improved shuffle error handling across NM restarts

2014-09-03 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14120850#comment-14120850
 ] 

Junping Du commented on MAPREDUCE-5891:
---

Thanks for comments, [~mingma] and [~jlowe]!
bq. In the case slowstart is set to some small value, the reducer will fetch 
some mapper output and wait for the rest. Is it possible Fetcher.retryStartTime 
is set to some old value due to early NM host A restart, and thus mark fetcher 
retry timed out when it later tries to handle NM host B restart?
Nice catch! Fixed as Jason's suggestion below.

bq. so I think this just adds to the log length without adding a lot of 
valuable information
Agree. remove the log.

bq. Nit: The following code should simply be retryStartTime = 0;
Fixed.

bq. setupConnectionsWithRetry is now inconsistent when it comes to calling 
abortConnect() when stopped is true.
Good point. Fixed.

Also, agree above comments from [~jlowe] on YARN-914 and YARN-1593. 

 Improved shuffle error handling across NM restarts
 --

 Key: MAPREDUCE-5891
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5891
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Affects Versions: 2.5.0
Reporter: Jason Lowe
Assignee: Junping Du
 Attachments: MAPREDUCE-5891-demo.patch, MAPREDUCE-5891-v2.patch, 
 MAPREDUCE-5891-v3.patch, MAPREDUCE-5891-v4.patch, MAPREDUCE-5891.patch


 To minimize the number of map fetch failures reported by reducers across an 
 NM restart it would be nice if reducers only reported a fetch failure after 
 trying for at specified period of time to retrieve the data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-5891) Improved shuffle error handling across NM restarts

2014-09-03 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14120879#comment-14120879
 ] 

Hadoop QA commented on MAPREDUCE-5891:
--

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12666395/MAPREDUCE-5891-v4.patch
  against trunk revision 3a0142b.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4846//testReport/
Console output: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4846//console

This message is automatically generated.

 Improved shuffle error handling across NM restarts
 --

 Key: MAPREDUCE-5891
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5891
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Affects Versions: 2.5.0
Reporter: Jason Lowe
Assignee: Junping Du
 Attachments: MAPREDUCE-5891-demo.patch, MAPREDUCE-5891-v2.patch, 
 MAPREDUCE-5891-v3.patch, MAPREDUCE-5891-v4.patch, MAPREDUCE-5891.patch


 To minimize the number of map fetch failures reported by reducers across an 
 NM restart it would be nice if reducers only reported a fetch failure after 
 trying for at specified period of time to retrieve the data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-5891) Improved shuffle error handling across NM restarts

2014-09-02 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118290#comment-14118290
 ] 

Ming Ma commented on MAPREDUCE-5891:


Thanks, Junping, Jason for the useful patch.

In the case slowstart is set to some small value, the reducer will fetch some 
mapper output and wait for the rest. Is it possible Fetcher.retryStartTime is 
set to some old value due to early NM host A restart, and thus mark fetcher 
retry timed out when it later tries to handle NM host B restart?

To make sure fetcher doesn't unnecessarily retry for the decommission scenario, 
it seems the assumption is we will have some sort of graceful decommission 
support so that during decommission process the fetcher will still be able to 
get mapper output. Is it true?

If we get time to do YARN-1593, that will further reduce the chance of shuffle 
handler restart. Any opinion on that?

 Improved shuffle error handling across NM restarts
 --

 Key: MAPREDUCE-5891
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5891
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Affects Versions: 2.5.0
Reporter: Jason Lowe
Assignee: Junping Du
 Attachments: MAPREDUCE-5891-demo.patch, MAPREDUCE-5891-v2.patch, 
 MAPREDUCE-5891-v3.patch, MAPREDUCE-5891.patch


 To minimize the number of map fetch failures reported by reducers across an 
 NM restart it would be nice if reducers only reported a fetch failure after 
 trying for at specified period of time to retrieve the data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-5891) Improved shuffle error handling across NM restarts

2014-09-02 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118727#comment-14118727
 ] 

Jason Lowe commented on MAPREDUCE-5891:
---

Thanks for updating the patch, Junping.  Comments:

I'm not sure the following log message adds that much value given the exception 
is going to be logged just a bit later.  A different log message appears when 
retry is enabled, so I think this just adds to the log length without adding a 
lot of valuable information.
{code}
+if (!fetchRetryEnabled) {
+   LOG.warn(Failed to connect to host:  + url + 
+ when retry is not enabled);
+   throw e;
+}
{code}

setupConnectionsWithRetry is now inconsistent when it comes to calling 
abortConnect() when {{stopped}} is true.

Nit:  The following code should simply be {{retryStartTime = 0;}}
{code}
+  if (retryStartTime != 0) {
+retryStartTime = 0;
+  }
{code}


And to address some of Ming's comments:

bq. Is it possible Fetcher.retryStartTime is set to some old value due to early 
NM host A restart, and thus mark fetcher retry timed out when it later tries to 
handle NM host B restart?

Yes, that is totally possible.  Nice catch, Ming!
@Junping, the code needs to set retryStartTime to zero at the beginning of 
copyFromHost as well to make sure remnants from previous failures don't leak 
into new, fresh attempts.

bq. To make sure fetcher doesn't unnecessarily retry for the decommission 
scenario, it seems the assumption is we will have some sort of graceful 
decommission support so that during decommission process the fetcher will still 
be able to get mapper output.

Yes, see YARN-914.  If the decommission still takes place even though the 
application is still running and output still needs to be fetched from that 
node, the RM will inform the AM of the node being removed from the cluster.  
The AM will then inform the reducers that the map outputs on that node are 
obsolete and will reschedule map tasks on other nodes rather than have the 
reducers keep trying against a decomm'd node.

bq. If we get time to do YARN-1593, that will further reduce the chance of 
shuffle handler restart. Any opinion on that?

Yes that would be quite nice, although that seems like a very significant 
feature to implement in the 2.6 timeframe.  It creates new security, 
reacquisition, and failure issues, and while it does significantly reduce the 
scenarios where the ShuffleHandler will need to be restarted it doesn't, by 
itself, preclude the need to do so.  For example, if the ShuffleHandler has a 
bugfix that needs to be deployed we either need the ability for MapReduce to 
request different versions of an aux service and have multiple versions running 
simultaneously or we need to restart the ShuffleHandler service.  The latter 
requires workarounds like this JIRA to avoid potential fetch failure storms 
when the ShuffleHandler service is down temporarily.

 Improved shuffle error handling across NM restarts
 --

 Key: MAPREDUCE-5891
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5891
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Affects Versions: 2.5.0
Reporter: Jason Lowe
Assignee: Junping Du
 Attachments: MAPREDUCE-5891-demo.patch, MAPREDUCE-5891-v2.patch, 
 MAPREDUCE-5891-v3.patch, MAPREDUCE-5891.patch


 To minimize the number of map fetch failures reported by reducers across an 
 NM restart it would be nice if reducers only reported a fetch failure after 
 trying for at specified period of time to retrieve the data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-5891) Improved shuffle error handling across NM restarts

2014-08-30 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14116252#comment-14116252
 ] 

Hadoop QA commented on MAPREDUCE-5891:
--

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12665542/MAPREDUCE-5891-v3.patch
  against trunk revision 270a271.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:red}-1 javac{color:red}.  The patch appears to cause the build to 
fail.

Console output: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4838//console

This message is automatically generated.

 Improved shuffle error handling across NM restarts
 --

 Key: MAPREDUCE-5891
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5891
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Affects Versions: 2.5.0
Reporter: Jason Lowe
Assignee: Junping Du
 Attachments: MAPREDUCE-5891-demo.patch, MAPREDUCE-5891-v2.patch, 
 MAPREDUCE-5891-v3.patch, MAPREDUCE-5891.patch


 To minimize the number of map fetch failures reported by reducers across an 
 NM restart it would be nice if reducers only reported a fetch failure after 
 trying for at specified period of time to retrieve the data.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-5891) Improved shuffle error handling across NM restarts

2014-08-30 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14116641#comment-14116641
 ] 

Hadoop QA commented on MAPREDUCE-5891:
--

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12665542/MAPREDUCE-5891-v3.patch
  against trunk revision 258c7d0.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4839//testReport/
Console output: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4839//console

This message is automatically generated.

 Improved shuffle error handling across NM restarts
 --

 Key: MAPREDUCE-5891
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5891
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Affects Versions: 2.5.0
Reporter: Jason Lowe
Assignee: Junping Du
 Attachments: MAPREDUCE-5891-demo.patch, MAPREDUCE-5891-v2.patch, 
 MAPREDUCE-5891-v3.patch, MAPREDUCE-5891.patch


 To minimize the number of map fetch failures reported by reducers across an 
 NM restart it would be nice if reducers only reported a fetch failure after 
 trying for at specified period of time to retrieve the data.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-5891) Improved shuffle error handling across NM restarts

2014-08-29 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14115958#comment-14115958
 ] 

Jason Lowe commented on MAPREDUCE-5891:
---

Thanks for updating the patch!  Comments:

SHUFFLE_FETCH_TIMEOUT_MS = mapreduce.reduce.shuffle.fetch.timeout-ms but it 
should be mapreduce.reduce.shuffle.fetch.retry.timeout-ms

openConnectionWithRetry calls abortConnect if stopped, but the one caller of 
this function does the same thing when it returns.  Maybe 
openConnectionWithRetry should just return if stopped?

Nit: The code block in copyMapOutput's catch of IOException is getting really 
long.  It would be good to refactor some of this code into methods

Minor nit: get failed should be failed.

openConnectionWithRetry is being called and retries even if fetch retry is 
disabled

Shouldn't we be setting retryStartTime back to zero instead of endTime below?  
Otherwise the next error could timeout without any retry if the transfer before 
the error took longer than the timeout interval.
{code}
  // Refresh retryStartTime as map task make progress if retried before.
  if (retryStartTime != 0) {
retryStartTime = endTime;
  }
{code}
Also wondering if we should reset it after each successful transfer (e.g.: 
after a successful header parse and successful shuffle)?


 Improved shuffle error handling across NM restarts
 --

 Key: MAPREDUCE-5891
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5891
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Affects Versions: 2.5.0
Reporter: Jason Lowe
Assignee: Junping Du
 Attachments: MAPREDUCE-5891-demo.patch, MAPREDUCE-5891-v2.patch, 
 MAPREDUCE-5891.patch


 To minimize the number of map fetch failures reported by reducers across an 
 NM restart it would be nice if reducers only reported a fetch failure after 
 trying for at specified period of time to retrieve the data.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-5891) Improved shuffle error handling across NM restarts

2014-08-29 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14116244#comment-14116244
 ] 

Junping Du commented on MAPREDUCE-5891:
---

Thanks [~jlowe] for comments!
bq. SHUFFLE_FETCH_TIMEOUT_MS should be 
mapreduce.reduce.shuffle.fetch.retry.timeout-ms
Nice catch, done.

bq. openConnectionWithRetry calls abortConnect if stopped, but the one caller 
of this function does the same thing when it returns. Maybe 
openConnectionWithRetry should just return if stopped?
Yes. Even caller can return directly as caller from upper layer already address 
it. Fixed.

bq. Nit: The code block in copyMapOutput's catch of IOException is getting 
really long. It would be good to refactor some of this code into methods. Minor 
nit: get failed should be failed.
Done.

bq. openConnectionWithRetry is being called and retries even if fetch retry is 
disabled
Good point, fixed.

bq. Shouldn't we be setting retryStartTime back to zero instead of endTime 
below?
Also good one, fixed it. 

bq. Also wondering if we should reset it after each successful transfer (e.g.: 
after a successful header parse and successful shuffle)?
May not be necessary. If retryStartTime is not 0, which means this fetcher 
haven't successfully make any progress since last failure of getMapOutput, it 
should keep trying and wait time aggregation until timeout. 

 Improved shuffle error handling across NM restarts
 --

 Key: MAPREDUCE-5891
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5891
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Affects Versions: 2.5.0
Reporter: Jason Lowe
Assignee: Junping Du
 Attachments: MAPREDUCE-5891-demo.patch, MAPREDUCE-5891-v2.patch, 
 MAPREDUCE-5891.patch


 To minimize the number of map fetch failures reported by reducers across an 
 NM restart it would be nice if reducers only reported a fetch failure after 
 trying for at specified period of time to retrieve the data.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-5891) Improved shuffle error handling across NM restarts

2014-08-27 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14113203#comment-14113203
 ] 

Hadoop QA commented on MAPREDUCE-5891:
--

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12664795/MAPREDUCE-5891-v2.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4829//testReport/
Console output: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4829//console

This message is automatically generated.

 Improved shuffle error handling across NM restarts
 --

 Key: MAPREDUCE-5891
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5891
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Affects Versions: 2.5.0
Reporter: Jason Lowe
Assignee: Junping Du
 Attachments: MAPREDUCE-5891-demo.patch, MAPREDUCE-5891-v2.patch, 
 MAPREDUCE-5891.patch


 To minimize the number of map fetch failures reported by reducers across an 
 NM restart it would be nice if reducers only reported a fetch failure after 
 trying for at specified period of time to retrieve the data.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-5891) Improved shuffle error handling across NM restarts

2014-08-26 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14110947#comment-14110947
 ] 

Junping Du commented on MAPREDUCE-5891:
---

Thanks [~jlowe] for review! I just updated the patch with addressing most of 
your previous comments and add unit test. Please help to review it again, Thx!

bq. I was under the impression that the copyMapOutput retry could cause a 
reconnect which itself would have retries. If that's not the case then there's 
no issue with nested retries.
If copyMapOutput throw exception during NM restart, then it will firstly go to 
reconnect with retry as in most cases the connect will get failed except we 
tolerant sometime to wait NM get recovered. We can also give up retry in 
connect, but the logic will be more complexity as something like following, 
which may not be necessary?
{code}
while (...) {
  try {
 failedTasks = copyMapOutput(...);
  } catch (IOException e) {
 try {
 connect(...);
  } catch {
// do nothing, back to the loop.
  }
   }
{code}

 Improved shuffle error handling across NM restarts
 --

 Key: MAPREDUCE-5891
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5891
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Affects Versions: 2.5.0
Reporter: Jason Lowe
Assignee: Junping Du
 Attachments: MAPREDUCE-5891-demo.patch, MAPREDUCE-5891.patch


 To minimize the number of map fetch failures reported by reducers across an 
 NM restart it would be nice if reducers only reported a fetch failure after 
 trying for at specified period of time to retrieve the data.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-5891) Improved shuffle error handling across NM restarts

2014-08-26 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14111009#comment-14111009
 ] 

Hadoop QA commented on MAPREDUCE-5891:
--

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12664422/MAPREDUCE-5891.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4825//testReport/
Console output: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4825//console

This message is automatically generated.

 Improved shuffle error handling across NM restarts
 --

 Key: MAPREDUCE-5891
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5891
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Affects Versions: 2.5.0
Reporter: Jason Lowe
Assignee: Junping Du
 Attachments: MAPREDUCE-5891-demo.patch, MAPREDUCE-5891.patch


 To minimize the number of map fetch failures reported by reducers across an 
 NM restart it would be nice if reducers only reported a fetch failure after 
 trying for at specified period of time to retrieve the data.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-5891) Improved shuffle error handling across NM restarts

2014-08-26 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14111538#comment-14111538
 ] 

Jason Lowe commented on MAPREDUCE-5891:
---

Thanks for updating the patch, Junping.  Comments:

DEFAULT_SHUFFLE_FETCH_* should be in MRJobConfig

openConnectionWithRetry should not ignore InterruptedException.  Fetchers are 
shutdown by being interrupted, so I think minimally we should check for 
stopped==true if one occurs and act accordingly.

We log a WARN when we can retry but only an INFO when we failed to read a map 
header and are not retrying. That seems backwards.  Also the message logged 
when we can't retry is a lot more informative than the one when we can.

We are retrying one more time when we're past the retry timeout which could 
result in a significantly longer time to discover fetch failures that aren't NM 
restart-related.  This is also inconsistent with how openConnectionWithRetry 
behaves.

Only the retry enabled property was added to mapred-default.xml.  We should 
also add the other two properties with their defaults and appropriate 
descriptions for documentation.

There should be a unit test to verify fetch errors can still be reported even 
with retry enabled, as it's important that we don't break the ability to 
recover from errors not related to NM restart.

Nit: mapreduce.reduce.shuffle.fetch.interval-ms should be 
mapreduce.reduce.shuffle.fetch.retry.interval-ms to clearly indicate this is an 
interval only applicable for fetch retry.  Similarly 
mapreduce.reduce.shuffle.fetch.timeout-ms should be 
mapreduce.reduce.shuffle.fetch.retry.timeout-ms.

Nit: which means it haven't retried yet. should be which means it hasn't 
retried yet.

 Improved shuffle error handling across NM restarts
 --

 Key: MAPREDUCE-5891
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5891
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Affects Versions: 2.5.0
Reporter: Jason Lowe
Assignee: Junping Du
 Attachments: MAPREDUCE-5891-demo.patch, MAPREDUCE-5891.patch


 To minimize the number of map fetch failures reported by reducers across an 
 NM restart it would be nice if reducers only reported a fetch failure after 
 trying for at specified period of time to retrieve the data.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-5891) Improved shuffle error handling across NM restarts

2014-08-22 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107011#comment-14107011
 ] 

Jason Lowe commented on MAPREDUCE-5891:
---

bq. The reason I choose to do two things separately here is because the retry 
connection has some IOExceptions (like things get failed in verifyConnection() 
is not because of NM restart) that we don't need to retry anymore. Wrap this 
kind of exceptions within retry may sounds unnecessary. Thoughts?

Ah, yes good point.  We shouldn't retry exceptions that we know aren't related 
to the NM bouncing, like secret hash verification failures.

I was under the impression that the copyMapOutput retry could cause a reconnect 
which itself would have retries.  If that's not the case then there's no issue 
with nested retries.

 Improved shuffle error handling across NM restarts
 --

 Key: MAPREDUCE-5891
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5891
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Affects Versions: 2.5.0
Reporter: Jason Lowe
Assignee: Junping Du
 Attachments: MAPREDUCE-5891-demo.patch


 To minimize the number of map fetch failures reported by reducers across an 
 NM restart it would be nice if reducers only reported a fetch failure after 
 trying for at specified period of time to retrieve the data.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-5891) Improved shuffle error handling across NM restarts

2014-08-21 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14106119#comment-14106119
 ] 

Jason Lowe commented on MAPREDUCE-5891:
---

Thanks for posting a prototype, Junping!  It's similar to what I was thinking.  
Some initial comments from a brief look at the patch:

It could be very inefficient to re-shuffle a large map output that we already 
have.  If we get an error and need to reconnect then we should rebuild the URL 
based on the remaining maps rather than keep using the original URL and 
throwing away data for maps we already processed.  In essence we should never 
reconnect and ask for what we already have.

It seems like we should only need one-level of retry rather than two.  I think 
we only need to retry at the top-level where we try to copy map outputs.  If 
copying map outputs fails either because of a severed connection, a connection 
reset/timeout, or whatever and we have not spent too much time trying to 
recover from errors for the same map output then we retry.   As soon as we 
successfully shuffle a map then we reset the time (because things are working 
and we're making progress) and move onto the next map output.  I don't think we 
need to retry at this level and also at the socket connection level.  The upper 
level retry will cause us to automatically retry the lower-level socket 
connection, unless I'm missing something.

We probably should use Time.monotonicNow or something similar that won't break 
the desired delay behavior if the clock changes.

miliseconds should be milliseconds


 Improved shuffle error handling across NM restarts
 --

 Key: MAPREDUCE-5891
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5891
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Affects Versions: 2.5.0
Reporter: Jason Lowe
Assignee: Junping Du
 Attachments: MAPREDUCE-5891-demo.patch


 To minimize the number of map fetch failures reported by reducers across an 
 NM restart it would be nice if reducers only reported a fetch failure after 
 trying for at specified period of time to retrieve the data.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-5891) Improved shuffle error handling across NM restarts

2014-08-21 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14106341#comment-14106341
 ] 

Junping Du commented on MAPREDUCE-5891:
---

Thanks [~jlowe] for quickly feedback!
bq. If we get an error and need to reconnect then we should rebuild the URL 
based on the remaining maps rather than keep using the original URL and 
throwing away data for maps we already processed.
That's very good point, will fix it in patch latter.

bq. It seems like we should only need one-level of retry rather than two. I 
think we only need to retry at the top-level where we try to copy map outputs. 
If copying map outputs fails either because of a severed connection, a 
connection reset/timeout, or whatever and we have not spent too much time 
trying to recover from errors for the same map output then we retry. 
One-level of retry sounds good. Actually, the current implementation is mostly 
a one-level retry, but just retry connection and copyMapOutput separately. The 
reason I choose to do two things separately here is because the retry 
connection has some IOExceptions (like things get failed in verifyConnection() 
is not because of NM restart) that we don't need to retry anymore. Wrap this 
kind of exceptions within retry may sounds unnecessary. Thoughts?

bq. We probably should use Time.monotonicNow or something similar that won't 
break the desired delay behavior if the clock changes. miliseconds should be 
milliseconds
Also good points, will fix it in latter patch.


 Improved shuffle error handling across NM restarts
 --

 Key: MAPREDUCE-5891
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5891
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Affects Versions: 2.5.0
Reporter: Jason Lowe
Assignee: Junping Du
 Attachments: MAPREDUCE-5891-demo.patch


 To minimize the number of map fetch failures reported by reducers across an 
 NM restart it would be nice if reducers only reported a fetch failure after 
 trying for at specified period of time to retrieve the data.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-5891) Improved shuffle error handling across NM restarts

2014-08-18 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14101731#comment-14101731
 ] 

Junping Du commented on MAPREDUCE-5891:
---

Hi [~jlowe], I like to work on this if you haven't started to work on it.

 Improved shuffle error handling across NM restarts
 --

 Key: MAPREDUCE-5891
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5891
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Affects Versions: 2.5.0
Reporter: Jason Lowe

 To minimize the number of map fetch failures reported by reducers across an 
 NM restart it would be nice if reducers only reported a fetch failure after 
 trying for at specified period of time to retrieve the data.



--
This message was sent by Atlassian JIRA
(v6.2#6252)