[jira] [Commented] (MAPREDUCE-6024) java.net.SocketTimeoutException in Fetcher caused jobs stuck for more than 1 hour

2014-08-12 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14093835#comment-14093835
 ] 

Zhijie Shen commented on MAPREDUCE-6024:


bq. 1. For MAX_FETCH_FAILURES_NOTIFICATIONS, if change to proportional to the 
number of reducers, it will be same as MAX_ALLOWED_FETCH_FAILURES_FRACTION, so 
I deleted it. I do believe 

Sounds good to me. Under the existing defaults, the only case where a failure 
would have been triggered before the patch but not after it is fetchFailures = 
2 and shufflingReduceTasks = 3. Given the problem described in this JIRA, it 
makes sense to give fewer chances when there are fewer reduce tasks. And if 
users really want to give the fetcher enough chances, they can tune 
MAX_ALLOWED_FETCH_FAILURES_FRACTION, and even push it beyond 1.0.
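For illustration, a fraction-based check of this kind might look like the 
following sketch; the class and method names are illustrative, not the actual 
Hadoop implementation.

```java
// Hypothetical sketch of a fraction-based fetch-failure check; names and
// signature are illustrative, not the actual Hadoop code.
public class FetchFailureThreshold {

    // Report failure once observed fetch failures reach
    // maxAllowedFraction * shufflingReduceTasks; a fraction above 1.0
    // simply grants the fetcher more attempts than there are reducers.
    public static boolean shouldReport(int fetchFailures,
                                       int shufflingReduceTasks,
                                       float maxAllowedFraction) {
        return fetchFailures >= maxAllowedFraction * shufflingReduceTasks;
    }
}
```

With fewer reducers the absolute threshold shrinks automatically, which is the 
behavior discussed above.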

bq. 4. Sometimes fetcher can get data successfully after retry from 
SocketTimeoutException, so I think let fetcher retry some times is OK.

Sounds reasonable. In addition, I linked back to the previous comments in 
[MAPREDUCE-4772|https://issues.apache.org/jira/browse/MAPREDUCE-4772?focusedCommentId=13492593&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13492593],
 which said that a connect exception is more severe than a timeout.

[~venkateshrin], do you have any further comments?

Some more comments:

1. maxfetchfailuresfraction -> max-fetch-failures-fraction? And maxhostfailures 
-> max-host-failures?
{code}
+  public static final String MAX_ALLOWED_FETCH_FAILURES_FRACTION =
+      "mapreduce.reduce.shuffle.maxfetchfailuresfraction";
{code}
{code}
+  public static final String MAX_SHUFFLE_FETCH_HOST_FAILURES =
+      "mapreduce.reduce.shuffle.maxhostfailures";
{code}

2. Is it necessary to multiply the failures by numMaps? copyFailed is in a loop 
and invoked for each remaining/failed task, right?
{code}
+    // report failure if already retried maxHostFailures times
+    boolean hostFail = hostFailures.get(hostname).get() > this.maxHostFailures
+        * numMaps ? true : false;
{code}

BTW, you may want to click Submit Patch to ask Jenkins to verify your patch.

 java.net.SocketTimeoutException in Fetcher caused jobs stuck for more than 1 
 hour
 -

 Key: MAPREDUCE-6024
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6024
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: mr-am, task
Reporter: zhaoyunjiong
Assignee: zhaoyunjiong
Priority: Critical
 Attachments: MAPREDUCE-6024.1.patch, MAPREDUCE-6024.patch


 2014-08-04 21:09:42,356 WARN fetcher#33 
 org.apache.hadoop.mapreduce.task.reduce.Fetcher: Failed to connect to 
 fake.host.name:13562 with 2 map outputs
 java.net.SocketTimeoutException: Read timed out
 at java.net.SocketInputStream.socketRead0(Native Method)
 at java.net.SocketInputStream.read(SocketInputStream.java:129)
 at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
 at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
 at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
 at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:697)
 at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:640)
 at 
 sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1195)
 at 
 org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:289)
 at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:165)
 2014-08-04 21:09:42,360 INFO fetcher#33 
 org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: 
 fake.host.name:13562 freed by fetcher#33 in 180024ms
 2014-08-04 21:09:55,360 INFO fetcher#33 
 org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: Assigning 
 fake.host.name:13562 with 3 to fetcher#33
 2014-08-04 21:09:55,360 INFO fetcher#33 
 org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: assigned 3 of 3 
 to fake.host.name:13562 to fetcher#33
 2014-08-04 21:12:55,463 WARN fetcher#33 
 org.apache.hadoop.mapreduce.task.reduce.Fetcher: Failed to connect to 
 fake.host.name:13562 with 3 map outputs
 java.net.SocketTimeoutException: Read timed out
 at java.net.SocketInputStream.socketRead0(Native Method)
 at java.net.SocketInputStream.read(SocketInputStream.java:129)
 at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
 at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
 at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
 at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:697)
 at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:640)
 at 
 sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1195)
 at 
 org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:289)
 at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:165)
 ...
 2014-08-04 22:03:13,416 INFO fetcher#33 
 org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: 
 fake.host.name:13562 freed by fetcher#33 in 271081ms
 2014-08-04 22:04:13,417 INFO fetcher#33 
 org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: Assigning 
 fake.host.name:13562 with 3 to fetcher#33
 2014-08-04 22:04:13,417 INFO fetcher#33 
 org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: assigned 3 of 3 
 to fake.host.name:13562 to fetcher#33
 2014-08-04 22:07:13,449 WARN fetcher#33 
 org.apache.hadoop.mapreduce.task.reduce.Fetcher: Failed to connect to 
 fake.host.name:13562 with 3 map outputs
 java.net.SocketTimeoutException: Read timed out
 at java.net.SocketInputStream.socketRead0(Native Method)
 at java.net.SocketInputStream.read(SocketInputStream.java:129)
 at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
 at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
 at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
 at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:697)
 at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:640)
 at 
 sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1195)
 at 
 org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:289)
 at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:165)

[jira] [Updated] (MAPREDUCE-6024) java.net.SocketTimeoutException in Fetcher caused jobs stuck for more than 1 hour

2014-08-12 Thread zhaoyunjiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhaoyunjiong updated MAPREDUCE-6024:


Status: Patch Available  (was: Open)




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAPREDUCE-6024) java.net.SocketTimeoutException in Fetcher caused jobs stuck for more than 1 hour

2014-08-12 Thread zhaoyunjiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhaoyunjiong updated MAPREDUCE-6024:


Attachment: MAPREDUCE-6024.2.patch

Updated the patch with the renames: maxfetchfailuresfraction -> 
max-fetch-failures-fraction and maxhostfailures -> max-host-failures.

It's necessary to multiply the failures by numMaps, because when a 
SocketTimeoutException happens, copyFailed will add numMaps to hostFailures.
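That accounting can be illustrated with a small sketch, based on the thread's 
description rather than the actual ShuffleSchedulerImpl code: a whole-host 
failure invokes copyFailed once per pending map on that host, so the per-host 
counter grows by numMaps per failed attempt and must be compared against 
maxHostFailures * numMaps. All names here are illustrative.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch (not the actual Hadoop code) of per-host failure
// accounting where one failed connection charges the host numMaps failures.
public class HostFailureSketch {
    private final Map<String, AtomicInteger> hostFailures = new HashMap<>();
    private final int maxHostFailures;

    public HostFailureSketch(int maxHostFailures) {
        this.maxHostFailures = maxHostFailures;
    }

    // Called once per map output that was being fetched when the host failed.
    public void copyFailed(String host) {
        hostFailures.computeIfAbsent(host, h -> new AtomicInteger())
                    .incrementAndGet();
    }

    // A host is considered bad only after more than maxHostFailures whole-host
    // attempts, i.e. more than maxHostFailures * numMaps copyFailed calls.
    public boolean hostFailed(String host, int numMaps) {
        AtomicInteger count = hostFailures.get(host);
        return count != null && count.get() > maxHostFailures * numMaps;
    }
}
```

Without the multiplication, a single timed-out attempt against a host serving 
many maps would immediately exhaust the per-host budget.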






[jira] [Moved] (MAPREDUCE-6033) Users are not allowed to view their own jobs, denied by JobACLsManager

2014-08-12 Thread Devaraj K (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Devaraj K moved YARN-2407 to MAPREDUCE-6033:


  Component/s: (was: applications)
Affects Version/s: (was: 2.4.1)
   2.4.1
  Key: MAPREDUCE-6033  (was: YARN-2407)
  Project: Hadoop Map/Reduce  (was: Hadoop YARN)

 Users are not allowed to view their own jobs, denied by JobACLsManager
 --

 Key: MAPREDUCE-6033
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6033
 Project: Hadoop Map/Reduce
  Issue Type: Bug
Affects Versions: 2.4.1
Reporter: Yu Gao
Assignee: Yu Gao
 Attachments: YARN-2407.patch


 I have a Hadoop 2.4.1 cluster with YARN ACLs enabled, and tried to submit 
 jobs as a non-admin user, user1. The job finished successfully, but the 
 running progress was not displayed correctly on the command line, and I got 
 the following in the corresponding ApplicationMaster log:
 INFO [IPC Server handler 0 on 56717] org.apache.hadoop.ipc.Server: IPC Server 
 handler 0 on 56717, call 
 org.apache.hadoop.mapreduce.v2.api.MRClientProtocolPB.getJobReport from 
 9.30.95.26:61024 Call#59 Retry#0
 org.apache.hadoop.security.AccessControlException: User user1 cannot perform 
 operation VIEW_JOB on job_1407456690588_0003
   at 
 org.apache.hadoop.mapreduce.v2.app.client.MRClientService$MRClientProtocolHandler.verifyAndGetJob(MRClientService.java:191)
   at 
 org.apache.hadoop.mapreduce.v2.app.client.MRClientService$MRClientProtocolHandler.getJobReport(MRClientService.java:233)
   at 
 org.apache.hadoop.mapreduce.v2.api.impl.pb.service.MRClientProtocolPBServiceImpl.getJobReport(MRClientProtocolPBServiceImpl.java:122)
   at 
 org.apache.hadoop.yarn.proto.MRClientProtocol$MRClientProtocolService$2.callBlockingMethod(MRClientProtocol.java:275)
   at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
   at 
 java.security.AccessController.doPrivileged(AccessController.java:366)
   at javax.security.auth.Subject.doAs(Subject.java:572)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1567)
   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-6024) java.net.SocketTimeoutException in Fetcher caused jobs stuck for more than 1 hour

2014-08-12 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14093995#comment-14093995
 ] 

Hadoop QA commented on MAPREDUCE-6024:
--

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12661164/MAPREDUCE-6024.2.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4799//testReport/
Console output: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4799//console

This message is automatically generated.


[jira] [Updated] (MAPREDUCE-4815) FileOutputCommitter.commitJob can be very slow for jobs with many output files

2014-08-12 Thread Siqi Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-4815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siqi Li updated MAPREDUCE-4815:
---

Attachment: (was: MAPREDUCE-4815.v2.patch)

 FileOutputCommitter.commitJob can be very slow for jobs with many output files
 --

 Key: MAPREDUCE-4815
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4815
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: mrv2
Affects Versions: 0.23.3, 2.0.1-alpha, 2.4.1
Reporter: Jason Lowe
Assignee: Siqi Li

 If a job generates many files to commit then the commitJob method call at the 
 end of the job can take minutes.  This is a performance regression from 1.x, 
 as 1.x had the tasks commit directly to the final output directory as they 
 were completing and commitJob had very little to do.  The commit work was 
 processed in parallel and overlapped the processing of outstanding tasks.  In 
 0.23/2.x, the commit is single-threaded and waits until all tasks have 
 completed before commencing.





[jira] [Updated] (MAPREDUCE-4815) FileOutputCommitter.commitJob can be very slow for jobs with many output files

2014-08-12 Thread Siqi Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-4815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siqi Li updated MAPREDUCE-4815:
---

Attachment: (was: MAPREDUCE-4815.v1.patch)






[jira] [Commented] (MAPREDUCE-6032) Unable to check mapreduce job status if submitted using a non-default namenode

2014-08-12 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14094692#comment-14094692
 ] 

Zhijie Shen commented on MAPREDUCE-6032:


[~benjzh], I agree with the solution in general. Here are some comments on the 
patch.

1. There's no need for the "JobHistoryUtils: " prefix; the logger will take 
care of recording the source of the log record.
{code}
+  LOG.info("JobHistoryUtils: default file system is set solely " +
+      "by core-default.xml therefore - ignoring");
{code}

2. Maybe you want to make the logic here clearer. And 
path.toUri().getAuthority() != null || path.toUri().getScheme() != null -> ?
{code}
+    if (fc == null ||
+        fc.getDefaultFileSystem().getUri().toString().equals(
+            conf.get(CommonConfigurationKeysPublic.FS_DEFAULT_NAME_KEY, "")) ||
+        path.toUri().getAuthority() != null ||
+        path.toUri().getScheme() != null) {
{code}
Change it to:
{code}
boolean solelyInCoreDefault = fc == null;
boolean sameFS = fc.getDefaultFileSystem().getUri().toString().equals(
    conf.get(CommonConfigurationKeysPublic.FS_DEFAULT_NAME_KEY, ""));
boolean qualified = path.toUri().getAuthority() != null &&
    path.toUri().getScheme() != null;
if (solelyInCoreDefault && sameFS && qualified) {
...
{code}

3. Is it possible to add a test case in TestJobHistoryEventHandler to verify 
that the JobHistoryEventHandler will write to the default FS as well?

4. You may want to fix the indents (for those sentences spanning multiple 
lines).

5. Unlike makeQualified, the following method will work when stagingDirPath is 
on a different FS than the configured one, right?
{code}
stagingDirFS = FileSystem.get(stagingDirPath.toUri(), conf);
{code}
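The resolution idea behind that call can be sketched outside Hadoop with plain 
java.net.URI: the file system is chosen from the path's own scheme/authority 
when present, falling back to the configured default otherwise. The class and 
method names below are illustrative, not Hadoop's API.

```java
import java.net.URI;

// Sketch of scheme/authority-based file system selection, standing in for
// FileSystem.get(path.toUri(), conf); names are illustrative.
public class FsResolver {
    public static String resolveFsUri(URI path, String defaultFsUri) {
        // A fully qualified path carries its own file system identity.
        if (path.getScheme() != null && path.getAuthority() != null) {
            return path.getScheme() + "://" + path.getAuthority();
        }
        // A bare path like /mapred/history resolves against the default FS.
        return defaultFsUri;
    }
}
```

This is why the call works even when stagingDirPath lives on a different 
namenode than fs.defaultFS: the path's URI wins.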

 Unable to check mapreduce job status if submitted using a non-default namenode
 --

 Key: MAPREDUCE-6032
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6032
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: jobhistoryserver
Affects Versions: 2.0.5-alpha, 2.1.1-beta, 2.0.6-alpha, 2.2.0, 2.3.0, 
 2.2.1, 2.4.1
 Environment: Any
Reporter: Benjamin Zhitomirsky
Assignee: Benjamin Zhitomirsky
 Fix For: trunk

 Attachments: MAPREDUCE-6032.patch

   Original Estimate: 24h
  Time Spent: 24h
  Remaining Estimate: 0h

 When an MRv2 job container runs in the context of a non-default file system, 
 JobHistoryUtils.java obtains mapreduce.jobhistory.done-dir and 
 mapreduce.jobhistory.intermediate-done-dir as non-qualified paths (e.g. 
 /mapred/history). Such a path is considered to belong to the current 
 container's context. As a result, the application history is written to 
 another file system, and the job history server is unable to pick it up 
 because it expects it to be found on the default file system. Currently, 
 providing a fully qualified path for those parameters is not supported 
 either, because of a bug in JobHistoryEventHandler.
 After this fix, two scenarios will be supported:
 - mapreduce.jobhistory.done-dir and mapreduce.jobhistory.intermediate-done-dir 
 (and the staging directory, BTW) will support a fully qualified path
 - If a non-qualified path is configured, it will always default to the 
 default file system (core-site.xml). That's how consistency of the history 
 location will be achieved.
 Implementation notes:
 - FileSystem#makeQualified throws an exception if the specified path belongs 
 to another file system. However, FileContext#makeQualified works properly in 
 this case, and that is the point of the fix in JobHistoryEventHandler. I was 
 not ready to change the behavior of FileSystem#makeQualified because much 
 more thought is required; I'm afraid that many users expect the current 
 behavior, and fixing it would break their code.
 - The fix in JobHistoryUtils detects a non-default namenode configuration 
 only if it comes from some real configuration: core-default.xml is ignored. 
 This is done primarily as a kind of test hook, because otherwise setting the 
 fs.defaultFS value during test executions would always be recognized by 
 JobHistoryUtils as a non-default namenode against the 'file:///' specified in 
 core-default.xml.
 (Remark: note that makeQualified doesn't behave properly with the file:/// 
 file system, for example:
 new Path("file:///dir/subdir").makeQualified(new URI("hdfs://server:8020"), 
 new Path("/dir"))
 returns file://server:8020/dir/subdir, which doesn't make sense.
 However, I don't believe it's worth fixing, since nobody really cares about 
 the local file system besides tests. My fix just ensures that all tests run 
 smoothly by ignoring the core-default.xml file system in the logic.)





[jira] [Commented] (MAPREDUCE-4815) FileOutputCommitter.commitJob can be very slow for jobs with many output files

2014-08-12 Thread Siqi Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14094819#comment-14094819
 ] 

Siqi Li commented on MAPREDUCE-4815:


The approach I took is merging the output of each task into a temporary 
directory as the task finishes.
Assuming the output directory is $parentDir/$outputDir:

{code}
setupJob() will create
$parentDir/$outputDir_temporary/$attemptID
and
$parentDir/$outputDir_temporary/$attemptID_temporary

setupTask() or on-demand file creation by task will create
$parentDir/$outputDir_temporary/$attemptID_temporary/$taskAttemptID

commitTask() will move everything inside
$parentDir/$outputDir_temporary/$attemptID_temporary/$taskAttemptID
to
$parentDir/$outputDir_temporary/$attemptID

recoverJob() also will move
$parentDir/$outputDir_temporary/$previous_attemptID
to
$parentDir/$outputDir_temporary/$recovering_attemptID

if output directory doesn't exist, commitJob() will simply move 
$parentDir/$outputDir_temporary/$attemptID to $parentDir/$outputDir

if output directory does exist, copy all files from 
$parentDir/$outputDir_temporary/$attemptID to $parentDir/$outputDir
{code}
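The merge-on-commitTask idea above can be sketched with local renames, using 
java.nio.file as a stand-in for HDFS renames; the directory names mirror the 
$outputDir_temporary/$attemptID layout but are illustrative, not the 
committer's actual constants.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the commit flow described above (not the actual
// FileOutputCommitter code).
public class CommitSketch {

    // commitTask: move every file from the task-attempt directory into the
    // per-job merge directory, so commitJob has almost nothing left to do.
    public static void commitTask(Path taskAttemptDir, Path jobMergeDir)
            throws IOException {
        Files.createDirectories(jobMergeDir);
        List<Path> files = new ArrayList<>();
        try (DirectoryStream<Path> ds = Files.newDirectoryStream(taskAttemptDir)) {
            for (Path f : ds) {
                files.add(f);
            }
        }
        for (Path f : files) {
            Files.move(f, jobMergeDir.resolve(f.getFileName()));
        }
    }

    // commitJob: if the final output dir does not exist, a single rename
    // suffices; merging into an existing dir (per-file copies) is omitted.
    public static void commitJob(Path jobMergeDir, Path outputDir)
            throws IOException {
        if (!Files.exists(outputDir)) {
            Files.move(jobMergeDir, outputDir);
        } else {
            throw new UnsupportedOperationException(
                "merge into existing output dir not sketched");
        }
    }
}
```

The win is that the expensive per-file moves happen in parallel across tasks 
as they finish, leaving commitJob a single directory rename in the common case.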






[jira] [Commented] (MAPREDUCE-6033) Users are not allowed to view their own jobs, denied by JobACLsManager

2014-08-12 Thread Yu Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14094817#comment-14094817
 ] 

Yu Gao commented on MAPREDUCE-6033:
---

Thank you Devaraj.

The test failed because the test code passed a null value for the userName 
parameter when instantiating a JobImpl object. Attaching a new patch to fix 
this.






[jira] [Updated] (MAPREDUCE-6033) Users are not allowed to view their own jobs, denied by JobACLsManager

2014-08-12 Thread Yu Gao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Gao updated MAPREDUCE-6033:
--

Attachment: MAPREDUCE-6033.patch



[jira] [Updated] (MAPREDUCE-6033) Users are not allowed to view their own jobs, denied by JobACLsManager

2014-08-12 Thread Yu Gao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Gao updated MAPREDUCE-6033:
--

Status: Patch Available  (was: Open)



[jira] [Updated] (MAPREDUCE-4815) FileOutputCommitter.commitJob can be very slow for jobs with many output files

2014-08-12 Thread Siqi Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-4815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siqi Li updated MAPREDUCE-4815:
---

Status: Open  (was: Patch Available)



[jira] [Updated] (MAPREDUCE-4815) FileOutputCommitter.commitJob can be very slow for jobs with many output files

2014-08-12 Thread Siqi Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-4815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siqi Li updated MAPREDUCE-4815:
---

Attachment: MAPREDUCE-4815.v3.patch



[jira] [Updated] (MAPREDUCE-4815) FileOutputCommitter.commitJob can be very slow for jobs with many output files

2014-08-12 Thread Siqi Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-4815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siqi Li updated MAPREDUCE-4815:
---

Status: Patch Available  (was: Open)



[jira] [Commented] (MAPREDUCE-6033) Users are not allowed to view their own jobs, denied by JobACLsManager

2014-08-12 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14094919#comment-14094919
 ] 

Hadoop QA commented on MAPREDUCE-6033:
--

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12661322/MAPREDUCE-6033.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4800//testReport/
Console output: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4800//console

This message is automatically generated.



[jira] [Commented] (MAPREDUCE-4815) FileOutputCommitter.commitJob can be very slow for jobs with many output files

2014-08-12 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14094931#comment-14094931
 ] 

Hadoop QA commented on MAPREDUCE-4815:
--

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12661324/MAPREDUCE-4815.v3.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 3 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4801//testReport/
Console output: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4801//console

This message is automatically generated.



[jira] [Commented] (MAPREDUCE-5969) Private non-Archive Files' size add twice in Distributed Cache directory size calculation.

2014-08-12 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14095019#comment-14095019
 ] 

zhihai xu commented on MAPREDUCE-5969:
--

[~kasha] - I checked the MR2 (trunk/branch-2) source code; the implementation is 
totally different from MR1 (branch-1).
MR2 (trunk/branch-2) uses LocalizedResource to manage the cache size. The 
LocalizedResource is created in LocalResourcesTrackerImpl after receiving a 
ContainerLocalizationRequestEvent (ContainerInitEvent), to request the 
LocalResource from the ContainerLaunchContext (container.launchContext). The 
ContainerLaunchContext is created in TaskAttemptImpl.java 
(createContainerLaunchContext) and YARNRunner.java 
(createApplicationSubmissionContext).
The LocalResource in the ContainerLaunchContext is created by {code}MRApps.setupDistributedCache(conf, localResources){code}.
So MR2 (trunk/branch-2) doesn't have this issue.

The following is the size calculation after receiving a ResourceLocalizedEvent, 
in LocalizedResource.java:
{code}
private static class FetchSuccessTransition extends ResourceTransition {
  @Override
  public void transition(LocalizedResource rsrc, ResourceEvent event) {
    ResourceLocalizedEvent locEvent = (ResourceLocalizedEvent) event;
    rsrc.localPath =
        Path.getPathWithoutSchemeAndAuthority(locEvent.getLocation());
    rsrc.size = locEvent.getSize();
    for (ContainerId container : rsrc.ref) {
      rsrc.dispatcher.getEventHandler().handle(
          new ContainerResourceLocalizedEvent(
              container, rsrc.rsrc, rsrc.localPath));
    }
  }
}
{code}

The size in the ResourceLocalizedEvent is set in the following code 
(ResourceLocalizationService.java):
For a public resource:
{code}
  publicRsrc.handle(new ResourceLocalizedEvent(key, local, FileUtil
      .getDU(new File(local.toUri()))));
{code}
For a private resource:
{code}
  getLocalResourcesTracker(req.getVisibility(), user, applicationId)
      .handle(new ResourceLocalizedEvent(req,
          ConverterUtils.getPathFromYarnURL(stat.getLocalPath()),
          stat.getLocalSize()));
{code}

The cache cleanup happens in the following code:
{code}
// from ResourceLocalizationService.java
private void handleCacheCleanup(LocalizationEvent event) {
  ResourceRetentionSet retain =
      new ResourceRetentionSet(delService, cacheTargetSize);
  retain.addResources(publicRsrc);
  LOG.debug("Resource cleanup (public) " + retain);
  for (LocalResourcesTracker t : privateRsrc.values()) {
    retain.addResources(t);
    LOG.debug("Resource cleanup " + t.getUser() + ":" + retain);
  }
  //TODO Check if appRsrcs should also be added to the retention set.
}

// from ResourceRetentionSet.java
public void addResources(LocalResourcesTracker newTracker) {
  for (LocalizedResource resource : newTracker) {
    currentSize += resource.getSize();
    if (resource.getRefCount() > 0) {
      // always retain resources in use
      continue;
    }
    retain.put(resource, newTracker);
  }
  for (Iterator<Map.Entry<LocalizedResource,LocalResourcesTracker>> i =
         retain.entrySet().iterator();
       currentSize - delSize > targetSize && i.hasNext();) {
    Map.Entry<LocalizedResource,LocalResourcesTracker> rsrc = i.next();
    LocalizedResource resource = rsrc.getKey();
    LocalResourcesTracker tracker = rsrc.getValue();
    if (tracker.remove(resource, delService)) {
      delSize += resource.getSize();
      i.remove();
    }
  }
}
{code}

And only one copy of the LocalizedResource for each LocalResourceRequest is 
saved in publicRsrc or privateRsrc.
So this issue should only happen in MR1 (branch-1).
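The retention logic above can be condensed into a small standalone sketch (simplified types, not the actual YARN classes): count every cached resource toward currentSize, then evict unreferenced entries until the cache fits under targetSize:

```java
import java.util.ArrayList;
import java.util.List;

public class RetentionSketch {

    // Simplified stand-in for a LocalizedResource entry in the cache.
    record Cached(String name, long size, int refCount) {}

    // Mirrors the shape of ResourceRetentionSet.addResources: every entry
    // counts toward currentSize, but only entries with no references are
    // eligible for eviction, and eviction stops once under targetSize.
    static List<String> evict(List<Cached> cache, long targetSize) {
        long currentSize = cache.stream().mapToLong(Cached::size).sum();
        List<String> removed = new ArrayList<>();
        for (Cached c : cache) {
            if (currentSize <= targetSize) {
                break; // cache now fits
            }
            if (c.refCount() > 0) {
                continue; // always retain resources in use
            }
            currentSize -= c.size();
            removed.add(c.name());
        }
        return removed;
    }

    public static void main(String[] args) {
        List<Cached> cache = List.of(
            new Cached("a", 40, 1),  // in use, never evicted
            new Cached("b", 40, 0),
            new Cached("c", 40, 0));
        // total = 120, target = 60: must evict both unreferenced entries
        System.out.println(evict(cache, 60));
    }
}
```

The key design point visible in both the real code and the sketch is that double-counting a resource's size (the MR1 bug) would make currentSize appear larger than it is and trigger eviction far too early.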

 Private non-Archive Files' size add twice in Distributed Cache directory size 
 calculation.
 --

 Key: MAPREDUCE-5969
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5969
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: mrv1
Reporter: zhihai xu
Assignee: zhihai xu
 Attachments: MAPREDUCE-5969.branch1.patch


 The size of private non-archive files is added twice in the Distributed Cache 
 directory size calculation. The private non-archive file list is passed in via 
 the -files command-line option. The Distributed Cache directory size is used 
 to check whether the total cache file size exceeds the cache size limit; the 
 default limit is 10G.
 I added logging in addCacheInfoUpdate and setSize in 
 TrackerDistributedCacheManager.java.
 I used the following command to test:
 hadoop jar ./wordcount.jar org.apache.hadoop.examples.WordCount -files 
 hdfs://host:8022/tmp/zxu/WordCount.java,hdfs://host:8022/tmp/zxu/wordcount.jar
  /tmp/zxu/test_in/ /tmp/zxu/test_out
 to add two files into