[jira] [Commented] (MAPREDUCE-6024) java.net.SocketTimeoutException in Fetcher caused jobs stuck for more than 1 hour
[ https://issues.apache.org/jira/browse/MAPREDUCE-6024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14093835#comment-14093835 ] Zhijie Shen commented on MAPREDUCE-6024: bq. 1. For MAX_FETCH_FAILURES_NOTIFICATIONS, if change to proportional to the number of reducers, it will be same as MAX_ALLOWED_FETCH_FAILURES_FRACTION, so I deleted it. I do believe Sounds good to me. Under existing defaults, the only cases that failure will be triggered previously but not after the patch is fetchFailures = 2 and shufflingReduceTasks =3. According to the problem described in this jira, it makes sense to give fewer chances to the smaller number of reducer tasks. And if users really want to give the fetcher enough chance, it can tune MAX_ALLOWED_FETCH_FAILURES_FRACTION, and even make it go beyond 1.0. bq. 4. Sometimes fetcher can get data successfully after retry from SocketTimeoutException, so I think let fetcher retry some times is OK. Sounds reasonable. In addition, I linked back to the previous comments in [MAPREDUCE-4772|https://issues.apache.org/jira/browse/MAPREDUCE-4772?focusedCommentId=13492593page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13492593], which said connect exception more severe than timeout. [~venkateshrin], do you have any further comments? Some more comments: 1. maxfetchfailuresfraction - max-fetch-failures-fraction? and maxhostfailures - max-host-failures? {code} + public static final String MAX_ALLOWED_FETCH_FAILURES_FRACTION = mapreduce.reduce.shuffle.maxfetchfailuresfraction; {code} {code} + public static final String MAX_SHUFFLE_FETCH_HOST_FAILURES = mapreduce.reduce.shuffle.maxhostfailures; {code} 2. Is it necessary to multiply the failures by numMaps? copyFailed is in a loop and invoked for each remaining/failed task, right? {code} +//report failure if already retried maxHostFailures times +boolean hostFail = hostFailures.get(hostname).get() this.maxHostFailures +* numMaps ? true : false; {code} BTW, you may want to click Submit Patch to ask Jenkins to verify your patch. java.net.SocketTimeoutException in Fetcher caused jobs stuck for more than 1 hour - Key: MAPREDUCE-6024 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6024 Project: Hadoop Map/Reduce Issue Type: Improvement Components: mr-am, task Reporter: zhaoyunjiong Assignee: zhaoyunjiong Priority: Critical Attachments: MAPREDUCE-6024.1.patch, MAPREDUCE-6024.patch 2014-08-04 21:09:42,356 WARN fetcher#33 org.apache.hadoop.mapreduce.task.reduce.Fetcher: Failed to connect to fake.host.name:13562 with 2 map outputs java.net.SocketTimeoutException: Read timed out at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.read(SocketInputStream.java:129) at java.io.BufferedInputStream.fill(BufferedInputStream.java:218) at java.io.BufferedInputStream.read1(BufferedInputStream.java:258) at java.io.BufferedInputStream.read(BufferedInputStream.java:317) at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:697) at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:640) at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1195) at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:289) at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:165) 2014-08-04 21:09:42,360 INFO fetcher#33 org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: fake.host.name:13562 freed by fetcher#33 in 180024ms 2014-08-04 21:09:55,360 INFO fetcher#33 org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: Assigning fake.host.name:13562 with 3 to fetcher#33 2014-08-04 21:09:55,360 INFO fetcher#33 org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: assigned 3 of 3 to fake.host.name:13562 to fetcher#33 2014-08-04 21:12:55,463 WARN fetcher#33 org.apache.hadoop.mapreduce.task.reduce.Fetcher: Failed to connect to fake.host.name:13562 with 3 map outputs java.net.SocketTimeoutException: Read timed out at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.read(SocketInputStream.java:129) at java.io.BufferedInputStream.fill(BufferedInputStream.java:218) at java.io.BufferedInputStream.read1(BufferedInputStream.java:258) at java.io.BufferedInputStream.read(BufferedInputStream.java:317) at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:697) at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:640) at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1195) at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:289) at
[jira] [Updated] (MAPREDUCE-6024) java.net.SocketTimeoutException in Fetcher caused jobs stuck for more than 1 hour
[ https://issues.apache.org/jira/browse/MAPREDUCE-6024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated MAPREDUCE-6024: Status: Patch Available (was: Open) java.net.SocketTimeoutException in Fetcher caused jobs stuck for more than 1 hour - Key: MAPREDUCE-6024 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6024 Project: Hadoop Map/Reduce Issue Type: Improvement Components: mr-am, task Reporter: zhaoyunjiong Assignee: zhaoyunjiong Priority: Critical Attachments: MAPREDUCE-6024.1.patch, MAPREDUCE-6024.2.patch, MAPREDUCE-6024.patch 2014-08-04 21:09:42,356 WARN fetcher#33 org.apache.hadoop.mapreduce.task.reduce.Fetcher: Failed to connect to fake.host.name:13562 with 2 map outputs java.net.SocketTimeoutException: Read timed out at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.read(SocketInputStream.java:129) at java.io.BufferedInputStream.fill(BufferedInputStream.java:218) at java.io.BufferedInputStream.read1(BufferedInputStream.java:258) at java.io.BufferedInputStream.read(BufferedInputStream.java:317) at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:697) at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:640) at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1195) at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:289) at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:165) 2014-08-04 21:09:42,360 INFO fetcher#33 org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: fake.host.name:13562 freed by fetcher#33 in 180024ms 2014-08-04 21:09:55,360 INFO fetcher#33 org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: Assigning fake.host.name:13562 with 3 to fetcher#33 2014-08-04 21:09:55,360 INFO fetcher#33 org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: assigned 3 of 3 to fake.host.name:13562 to fetcher#33 2014-08-04 21:12:55,463 WARN fetcher#33 org.apache.hadoop.mapreduce.task.reduce.Fetcher: Failed to connect to fake.host.name:13562 with 3 map outputs java.net.SocketTimeoutException: Read timed out at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.read(SocketInputStream.java:129) at java.io.BufferedInputStream.fill(BufferedInputStream.java:218) at java.io.BufferedInputStream.read1(BufferedInputStream.java:258) at java.io.BufferedInputStream.read(BufferedInputStream.java:317) at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:697) at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:640) at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1195) at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:289) at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:165) ... 2014-08-04 22:03:13,416 INFO fetcher#33 org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: fake.host.name:13562 freed by fetcher#33 in 271081ms 2014-08-04 22:04:13,417 INFO fetcher#33 org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: Assigning fake.host.name:13562 with 3 to fetcher#33 2014-08-04 22:04:13,417 INFO fetcher#33 org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: assigned 3 of 3 to fake.host.name:13562 to fetcher#33 2014-08-04 22:07:13,449 WARN fetcher#33 org.apache.hadoop.mapreduce.task.reduce.Fetcher: Failed to connect to fake.host.name:13562 with 3 map outputs java.net.SocketTimeoutException: Read timed out at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.read(SocketInputStream.java:129) at java.io.BufferedInputStream.fill(BufferedInputStream.java:218) at java.io.BufferedInputStream.read1(BufferedInputStream.java:258) at java.io.BufferedInputStream.read(BufferedInputStream.java:317) at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:697) at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:640) at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1195) at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:289) at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:165) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAPREDUCE-6024) java.net.SocketTimeoutException in Fetcher caused jobs stuck for more than 1 hour
[ https://issues.apache.org/jira/browse/MAPREDUCE-6024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated MAPREDUCE-6024: Attachment: MAPREDUCE-6024.2.patch Update patches changed: maxfetchfailuresfraction - max-fetch-failures-fraction and maxhostfailures - max-host-failures It's necessary to multiply the failures by numMaps. Because when SocketTimeoutException happened, copyFailed will add numMaps to hostFailures. java.net.SocketTimeoutException in Fetcher caused jobs stuck for more than 1 hour - Key: MAPREDUCE-6024 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6024 Project: Hadoop Map/Reduce Issue Type: Improvement Components: mr-am, task Reporter: zhaoyunjiong Assignee: zhaoyunjiong Priority: Critical Attachments: MAPREDUCE-6024.1.patch, MAPREDUCE-6024.2.patch, MAPREDUCE-6024.patch 2014-08-04 21:09:42,356 WARN fetcher#33 org.apache.hadoop.mapreduce.task.reduce.Fetcher: Failed to connect to fake.host.name:13562 with 2 map outputs java.net.SocketTimeoutException: Read timed out at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.read(SocketInputStream.java:129) at java.io.BufferedInputStream.fill(BufferedInputStream.java:218) at java.io.BufferedInputStream.read1(BufferedInputStream.java:258) at java.io.BufferedInputStream.read(BufferedInputStream.java:317) at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:697) at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:640) at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1195) at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:289) at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:165) 2014-08-04 21:09:42,360 INFO fetcher#33 org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: fake.host.name:13562 freed by fetcher#33 in 180024ms 2014-08-04 21:09:55,360 INFO fetcher#33 org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: Assigning fake.host.name:13562 with 3 to fetcher#33 2014-08-04 21:09:55,360 INFO fetcher#33 org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: assigned 3 of 3 to fake.host.name:13562 to fetcher#33 2014-08-04 21:12:55,463 WARN fetcher#33 org.apache.hadoop.mapreduce.task.reduce.Fetcher: Failed to connect to fake.host.name:13562 with 3 map outputs java.net.SocketTimeoutException: Read timed out at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.read(SocketInputStream.java:129) at java.io.BufferedInputStream.fill(BufferedInputStream.java:218) at java.io.BufferedInputStream.read1(BufferedInputStream.java:258) at java.io.BufferedInputStream.read(BufferedInputStream.java:317) at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:697) at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:640) at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1195) at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:289) at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:165) ... 2014-08-04 22:03:13,416 INFO fetcher#33 org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: fake.host.name:13562 freed by fetcher#33 in 271081ms 2014-08-04 22:04:13,417 INFO fetcher#33 org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: Assigning fake.host.name:13562 with 3 to fetcher#33 2014-08-04 22:04:13,417 INFO fetcher#33 org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: assigned 3 of 3 to fake.host.name:13562 to fetcher#33 2014-08-04 22:07:13,449 WARN fetcher#33 org.apache.hadoop.mapreduce.task.reduce.Fetcher: Failed to connect to fake.host.name:13562 with 3 map outputs java.net.SocketTimeoutException: Read timed out at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.read(SocketInputStream.java:129) at java.io.BufferedInputStream.fill(BufferedInputStream.java:218) at java.io.BufferedInputStream.read1(BufferedInputStream.java:258) at java.io.BufferedInputStream.read(BufferedInputStream.java:317) at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:697) at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:640) at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1195) at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:289) at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:165) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Moved] (MAPREDUCE-6033) Users are not allowed to view their own jobs, denied by JobACLsManager
[ https://issues.apache.org/jira/browse/MAPREDUCE-6033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Devaraj K moved YARN-2407 to MAPREDUCE-6033: Component/s: (was: applications) Affects Version/s: (was: 2.4.1) 2.4.1 Key: MAPREDUCE-6033 (was: YARN-2407) Project: Hadoop Map/Reduce (was: Hadoop YARN) Users are not allowed to view their own jobs, denied by JobACLsManager -- Key: MAPREDUCE-6033 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6033 Project: Hadoop Map/Reduce Issue Type: Bug Affects Versions: 2.4.1 Reporter: Yu Gao Assignee: Yu Gao Attachments: YARN-2407.patch Have a Hadoop 2.4.1 cluster with Yarn ACL enabled, and try to submit jobs as a non-admin user user1. The job could be finished successfully, but the running progress was not displayed correctly on the command-line, and I got following in the corresponding ApplicationMaster log: INFO [IPC Server handler 0 on 56717] org.apache.hadoop.ipc.Server: IPC Server handler 0 on 56717, call org.apache.hadoop.mapreduce.v2.api.MRClientProtocolPB.getJobReport from 9.30.95.26:61024 Call#59 Retry#0 org.apache.hadoop.security.AccessControlException: User user1 cannot perform operation VIEW_JOB on job_1407456690588_0003 at org.apache.hadoop.mapreduce.v2.app.client.MRClientService$MRClientProtocolHandler.verifyAndGetJob(MRClientService.java:191) at org.apache.hadoop.mapreduce.v2.app.client.MRClientService$MRClientProtocolHandler.getJobReport(MRClientService.java:233) at org.apache.hadoop.mapreduce.v2.api.impl.pb.service.MRClientProtocolPBServiceImpl.getJobReport(MRClientProtocolPBServiceImpl.java:122) at org.apache.hadoop.yarn.proto.MRClientProtocol$MRClientProtocolService$2.callBlockingMethod(MRClientProtocol.java:275) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at java.security.AccessController.doPrivileged(AccessController.java:366) at javax.security.auth.Subject.doAs(Subject.java:572) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1567) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAPREDUCE-6024) java.net.SocketTimeoutException in Fetcher caused jobs stuck for more than 1 hour
[ https://issues.apache.org/jira/browse/MAPREDUCE-6024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14093995#comment-14093995 ] Hadoop QA commented on MAPREDUCE-6024: -- {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12661164/MAPREDUCE-6024.2.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4799//testReport/ Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4799//console This message is automatically generated. java.net.SocketTimeoutException in Fetcher caused jobs stuck for more than 1 hour - Key: MAPREDUCE-6024 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6024 Project: Hadoop Map/Reduce Issue Type: Improvement Components: mr-am, task Reporter: zhaoyunjiong Assignee: zhaoyunjiong Priority: Critical Attachments: MAPREDUCE-6024.1.patch, MAPREDUCE-6024.2.patch, MAPREDUCE-6024.patch 2014-08-04 21:09:42,356 WARN fetcher#33 org.apache.hadoop.mapreduce.task.reduce.Fetcher: Failed to connect to fake.host.name:13562 with 2 map outputs java.net.SocketTimeoutException: Read timed out at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.read(SocketInputStream.java:129) at java.io.BufferedInputStream.fill(BufferedInputStream.java:218) at java.io.BufferedInputStream.read1(BufferedInputStream.java:258) at java.io.BufferedInputStream.read(BufferedInputStream.java:317) at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:697) at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:640) at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1195) at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:289) at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:165) 2014-08-04 21:09:42,360 INFO fetcher#33 org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: fake.host.name:13562 freed by fetcher#33 in 180024ms 2014-08-04 21:09:55,360 INFO fetcher#33 org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: Assigning fake.host.name:13562 with 3 to fetcher#33 2014-08-04 21:09:55,360 INFO fetcher#33 org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: assigned 3 of 3 to fake.host.name:13562 to fetcher#33 2014-08-04 21:12:55,463 WARN fetcher#33 org.apache.hadoop.mapreduce.task.reduce.Fetcher: Failed to connect to fake.host.name:13562 with 3 map outputs java.net.SocketTimeoutException: Read timed out at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.read(SocketInputStream.java:129) at java.io.BufferedInputStream.fill(BufferedInputStream.java:218) at java.io.BufferedInputStream.read1(BufferedInputStream.java:258) at java.io.BufferedInputStream.read(BufferedInputStream.java:317) at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:697) at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:640) at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1195) at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:289) at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:165) ... 2014-08-04 22:03:13,416 INFO fetcher#33 org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: fake.host.name:13562 freed by fetcher#33 in 271081ms 2014-08-04 22:04:13,417 INFO fetcher#33 org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: Assigning fake.host.name:13562 with 3 to fetcher#33 2014-08-04 22:04:13,417 INFO fetcher#33
[jira] [Updated] (MAPREDUCE-4815) FileOutputCommitter.commitJob can be very slow for jobs with many output files
[ https://issues.apache.org/jira/browse/MAPREDUCE-4815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siqi Li updated MAPREDUCE-4815: --- Attachment: (was: MAPREDUCE-4815.v2.patch) FileOutputCommitter.commitJob can be very slow for jobs with many output files -- Key: MAPREDUCE-4815 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4815 Project: Hadoop Map/Reduce Issue Type: Bug Components: mrv2 Affects Versions: 0.23.3, 2.0.1-alpha, 2.4.1 Reporter: Jason Lowe Assignee: Siqi Li If a job generates many files to commit then the commitJob method call at the end of the job can take minutes. This is a performance regression from 1.x, as 1.x had the tasks commit directly to the final output directory as they were completing and commitJob had very little to do. The commit work was processed in parallel and overlapped the processing of outstanding tasks. In 0.23/2.x, the commit is single-threaded and waits until all tasks have completed before commencing. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAPREDUCE-4815) FileOutputCommitter.commitJob can be very slow for jobs with many output files
[ https://issues.apache.org/jira/browse/MAPREDUCE-4815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siqi Li updated MAPREDUCE-4815: --- Attachment: (was: MAPREDUCE-4815.v1.patch) FileOutputCommitter.commitJob can be very slow for jobs with many output files -- Key: MAPREDUCE-4815 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4815 Project: Hadoop Map/Reduce Issue Type: Bug Components: mrv2 Affects Versions: 0.23.3, 2.0.1-alpha, 2.4.1 Reporter: Jason Lowe Assignee: Siqi Li If a job generates many files to commit then the commitJob method call at the end of the job can take minutes. This is a performance regression from 1.x, as 1.x had the tasks commit directly to the final output directory as they were completing and commitJob had very little to do. The commit work was processed in parallel and overlapped the processing of outstanding tasks. In 0.23/2.x, the commit is single-threaded and waits until all tasks have completed before commencing. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAPREDUCE-6032) Unable to check mapreduce job status if submitted using a non-default namenode
[ https://issues.apache.org/jira/browse/MAPREDUCE-6032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14094692#comment-14094692 ] Zhijie Shen commented on MAPREDUCE-6032: [~benjzh], agree with the solution in general. Here're some comments with the patch. 1. Doesn't need JobHistoryUtils: , the logger will take care of the source where the log record is generated. {code} + LOG.info(JobHistoryUtils: default file system is set solely + + by core-default.xml therefore - ignoring); {code} 2. Maybe you want to make the logic here clearer. And path.toUri().getAuthority() \!= null || path.toUri().getScheme()!= null - ? {code} +if (fc == null || +fc.getDefaultFileSystem().getUri().toString().equals( +conf.get(CommonConfigurationKeysPublic.FS_DEFAULT_NAME_KEY, )) || +path.toUri().getAuthority() != null || +path.toUri().getScheme()!= null) { {code} Change it to: {code} boolean solyInCoreDefault = fc == null; boolean sameFS = fc.getDefaultFileSystem().getUri().toString().equals(conf.get(CommonConfigurationKeysPublic.FS_DEFAULT_NAME_KEY, )); boolean qualified = path.toUri().getAuthority() != null path.toUri().getScheme()!= null; if (solyInCoreDefault sameFS qualified) { ... {code} 3. Is it possible to add a test case in TestJobHistoryEventHandler to verify the JobHistoryHandler will write to the default FS as well? 4. You may wan to fix the indents (for those sentences in multiple lines). 5. Unlike makeQualified, the following method will work when stagingDirPath is on a different FS than the configure one, right? {code} stagingDirFS = FileSystem.get(stagingDirPath.toUri(), conf); {code} Unable to check mapreduce job status if submitted using a non-default namenode -- Key: MAPREDUCE-6032 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6032 Project: Hadoop Map/Reduce Issue Type: Bug Components: jobhistoryserver Affects Versions: 2.0.5-alpha, 2.1.1-beta, 2.0.6-alpha, 2.2.0, 2.3.0, 2.2.1, 2.4.1 Environment: Any Reporter: Benjamin Zhitomirsky Assignee: Benjamin Zhitomirsky Fix For: trunk Attachments: MAPREDUCE-6032.patch Original Estimate: 24h Time Spent: 24h Remaining Estimate: 0h When MRv2 job container runs in a context of non-default file system JobHistoryUtils.java obtains mapreduce.jobhistory.done-dir and mapreduce.jobhistory.intermediate-done-dir as a non-qualified paths (e.g. /mapred/history). This path is considered to belong to the current container's context. As result the application history is being written to another file system and job history server is unable to pick it up, because it expects it to be found on the default file system. Currently providing fully qualified path to those parameters is not supported as well, because of a bug in JobHistoryEventHandler. After this fix two scenarios will be supported: - mapreduce.jobhistory.done-dir and mapreduce.jobhistory.intermediate-done-dir (and the staging directory BTW) will support a fully qualified path - If a non-qualified path is configured then it will always be defaulted to the default file system (core-site.xml). That's how consistency of history location will be archived Implementation notes: - FileSystem#makeQualified throws exception if specified path belongs to another file system. However FileContext#makeQualified work properly in this case, and this is the meaning of the fix in JobHistoryEventHandler. I was not ready to change behavior FileSystem#makeQualified because much more thought is required. I afraid that many users expect such behavior, and fixing it would break their code. - The fix in JobHistoryUtils detects non-default namenode configuration only if it comes from some real configuration: core-default.xml is ignored. This is done primary as a kind of test hook, because otherwise setting fs.defaultFS value during test executions would be always recognized by JobHistoryUtils as a non-default namenode against 'file:///' specified in core-default.xml. (Remark. Note that makeQualified doesn't behave properly with file:/// filesystem, for example: new Path(file:///dir/subdir).makeQualified(new URI(hdfs://server:8020), new Path(/dir)) Returns: file://server:8020/dir/subdir which doesn't make sense. However I don't believe it worth fixing, since nobody really case about local file system besides tests. My fix just ensures that all tests run smoothly by ignoring core-default.xml file system in the logic.) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAPREDUCE-4815) FileOutputCommitter.commitJob can be very slow for jobs with many output files
[ https://issues.apache.org/jira/browse/MAPREDUCE-4815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14094819#comment-14094819 ] Siqi Li commented on MAPREDUCE-4815: The approach I took is merging the output of each to a temporary directory whenever a task is finished Assuming output directory is $parentDir/$outputDir {code} setupJob() will create $parentDir/$outputDir_temporary/$attemptID and $parentDir/$outputDir_temporary/$attemptID_temporary setupTask() or on-demand file creation by task will create $parentDir/$outputDir_temporary/$attemptID_temporary/$taskAttemptID commitTask() will move everything inside $parentDir/$outputDir_temporary/$attemptID_temporary/$taskAttemptID to $parentDir/$outputDir_temporary/$attemptID recoverJob() also will move $parentDir/$outputDir_temporary/$previous_attemptID to $parentDir/$outputDir_temporary/$recovering_attemptID if output directory doesn't exist, commitJob() will simply move $parentDir/$outputDir_temporary/$attemptID to $parentDir/$outputDir if output directory does exist, copy all files from $parentDir/$outputDir_temporary/$attemptID to $parentDir/$outputDir {code} FileOutputCommitter.commitJob can be very slow for jobs with many output files -- Key: MAPREDUCE-4815 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4815 Project: Hadoop Map/Reduce Issue Type: Bug Components: mrv2 Affects Versions: 0.23.3, 2.0.1-alpha, 2.4.1 Reporter: Jason Lowe Assignee: Siqi Li If a job generates many files to commit then the commitJob method call at the end of the job can take minutes. This is a performance regression from 1.x, as 1.x had the tasks commit directly to the final output directory as they were completing and commitJob had very little to do. The commit work was processed in parallel and overlapped the processing of outstanding tasks. In 0.23/2.x, the commit is single-threaded and waits until all tasks have completed before commencing. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAPREDUCE-6033) Users are not allowed to view their own jobs, denied by JobACLsManager
[ https://issues.apache.org/jira/browse/MAPREDUCE-6033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14094817#comment-14094817 ] Yu Gao commented on MAPREDUCE-6033: --- Thank you Devaraj. The test failed due to the test code passing in null value for userName parameter when initiating a JobImpl object. Attaching a new patch to fix this. Users are not allowed to view their own jobs, denied by JobACLsManager -- Key: MAPREDUCE-6033 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6033 Project: Hadoop Map/Reduce Issue Type: Bug Affects Versions: 2.4.1 Reporter: Yu Gao Assignee: Yu Gao Attachments: MAPREDUCE-6033.patch, YARN-2407.patch Have a Hadoop 2.4.1 cluster with Yarn ACL enabled, and try to submit jobs as a non-admin user user1. The job could be finished successfully, but the running progress was not displayed correctly on the command-line, and I got following in the corresponding ApplicationMaster log: INFO [IPC Server handler 0 on 56717] org.apache.hadoop.ipc.Server: IPC Server handler 0 on 56717, call org.apache.hadoop.mapreduce.v2.api.MRClientProtocolPB.getJobReport from 9.30.95.26:61024 Call#59 Retry#0 org.apache.hadoop.security.AccessControlException: User user1 cannot perform operation VIEW_JOB on job_1407456690588_0003 at org.apache.hadoop.mapreduce.v2.app.client.MRClientService$MRClientProtocolHandler.verifyAndGetJob(MRClientService.java:191) at org.apache.hadoop.mapreduce.v2.app.client.MRClientService$MRClientProtocolHandler.getJobReport(MRClientService.java:233) at org.apache.hadoop.mapreduce.v2.api.impl.pb.service.MRClientProtocolPBServiceImpl.getJobReport(MRClientProtocolPBServiceImpl.java:122) at org.apache.hadoop.yarn.proto.MRClientProtocol$MRClientProtocolService$2.callBlockingMethod(MRClientProtocol.java:275) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at java.security.AccessController.doPrivileged(AccessController.java:366) at javax.security.auth.Subject.doAs(Subject.java:572) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1567) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAPREDUCE-6033) Users are not allowed to view their own jobs, denied by JobACLsManager
[ https://issues.apache.org/jira/browse/MAPREDUCE-6033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yu Gao updated MAPREDUCE-6033: -- Attachment: MAPREDUCE-6033.patch Users are not allowed to view their own jobs, denied by JobACLsManager -- Key: MAPREDUCE-6033 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6033 Project: Hadoop Map/Reduce Issue Type: Bug Affects Versions: 2.4.1 Reporter: Yu Gao Assignee: Yu Gao Attachments: MAPREDUCE-6033.patch, YARN-2407.patch Have a Hadoop 2.4.1 cluster with Yarn ACL enabled, and try to submit jobs as a non-admin user user1. The job could be finished successfully, but the running progress was not displayed correctly on the command-line, and I got following in the corresponding ApplicationMaster log: INFO [IPC Server handler 0 on 56717] org.apache.hadoop.ipc.Server: IPC Server handler 0 on 56717, call org.apache.hadoop.mapreduce.v2.api.MRClientProtocolPB.getJobReport from 9.30.95.26:61024 Call#59 Retry#0 org.apache.hadoop.security.AccessControlException: User user1 cannot perform operation VIEW_JOB on job_1407456690588_0003 at org.apache.hadoop.mapreduce.v2.app.client.MRClientService$MRClientProtocolHandler.verifyAndGetJob(MRClientService.java:191) at org.apache.hadoop.mapreduce.v2.app.client.MRClientService$MRClientProtocolHandler.getJobReport(MRClientService.java:233) at org.apache.hadoop.mapreduce.v2.api.impl.pb.service.MRClientProtocolPBServiceImpl.getJobReport(MRClientProtocolPBServiceImpl.java:122) at org.apache.hadoop.yarn.proto.MRClientProtocol$MRClientProtocolService$2.callBlockingMethod(MRClientProtocol.java:275) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at java.security.AccessController.doPrivileged(AccessController.java:366) at javax.security.auth.Subject.doAs(Subject.java:572) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1567) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAPREDUCE-6033) Users are not allowed to view their own jobs, denied by JobACLsManager
[ https://issues.apache.org/jira/browse/MAPREDUCE-6033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yu Gao updated MAPREDUCE-6033: -- Status: Patch Available (was: Open) Users are not allowed to view their own jobs, denied by JobACLsManager -- Key: MAPREDUCE-6033 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6033 Project: Hadoop Map/Reduce Issue Type: Bug Affects Versions: 2.4.1 Reporter: Yu Gao Assignee: Yu Gao Attachments: MAPREDUCE-6033.patch, YARN-2407.patch Have a Hadoop 2.4.1 cluster with Yarn ACL enabled, and try to submit jobs as a non-admin user user1. The job could be finished successfully, but the running progress was not displayed correctly on the command-line, and I got following in the corresponding ApplicationMaster log: INFO [IPC Server handler 0 on 56717] org.apache.hadoop.ipc.Server: IPC Server handler 0 on 56717, call org.apache.hadoop.mapreduce.v2.api.MRClientProtocolPB.getJobReport from 9.30.95.26:61024 Call#59 Retry#0 org.apache.hadoop.security.AccessControlException: User user1 cannot perform operation VIEW_JOB on job_1407456690588_0003 at org.apache.hadoop.mapreduce.v2.app.client.MRClientService$MRClientProtocolHandler.verifyAndGetJob(MRClientService.java:191) at org.apache.hadoop.mapreduce.v2.app.client.MRClientService$MRClientProtocolHandler.getJobReport(MRClientService.java:233) at org.apache.hadoop.mapreduce.v2.api.impl.pb.service.MRClientProtocolPBServiceImpl.getJobReport(MRClientProtocolPBServiceImpl.java:122) at org.apache.hadoop.yarn.proto.MRClientProtocol$MRClientProtocolService$2.callBlockingMethod(MRClientProtocol.java:275) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at java.security.AccessController.doPrivileged(AccessController.java:366) at javax.security.auth.Subject.doAs(Subject.java:572) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1567) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAPREDUCE-4815) FileOutputCommitter.commitJob can be very slow for jobs with many output files
[ https://issues.apache.org/jira/browse/MAPREDUCE-4815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siqi Li updated MAPREDUCE-4815: --- Status: Open (was: Patch Available) FileOutputCommitter.commitJob can be very slow for jobs with many output files -- Key: MAPREDUCE-4815 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4815 Project: Hadoop Map/Reduce Issue Type: Bug Components: mrv2 Affects Versions: 2.4.1, 2.0.1-alpha, 0.23.3 Reporter: Jason Lowe Assignee: Siqi Li Attachments: MAPREDUCE-4815.v3.patch If a job generates many files to commit then the commitJob method call at the end of the job can take minutes. This is a performance regression from 1.x, as 1.x had the tasks commit directly to the final output directory as they were completing and commitJob had very little to do. The commit work was processed in parallel and overlapped the processing of outstanding tasks. In 0.23/2.x, the commit is single-threaded and waits until all tasks have completed before commencing. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAPREDUCE-4815) FileOutputCommitter.commitJob can be very slow for jobs with many output files
[ https://issues.apache.org/jira/browse/MAPREDUCE-4815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siqi Li updated MAPREDUCE-4815: --- Attachment: MAPREDUCE-4815.v3.patch FileOutputCommitter.commitJob can be very slow for jobs with many output files -- Key: MAPREDUCE-4815 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4815 Project: Hadoop Map/Reduce Issue Type: Bug Components: mrv2 Affects Versions: 0.23.3, 2.0.1-alpha, 2.4.1 Reporter: Jason Lowe Assignee: Siqi Li Attachments: MAPREDUCE-4815.v3.patch If a job generates many files to commit then the commitJob method call at the end of the job can take minutes. This is a performance regression from 1.x, as 1.x had the tasks commit directly to the final output directory as they were completing and commitJob had very little to do. The commit work was processed in parallel and overlapped the processing of outstanding tasks. In 0.23/2.x, the commit is single-threaded and waits until all tasks have completed before commencing. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAPREDUCE-4815) FileOutputCommitter.commitJob can be very slow for jobs with many output files
[ https://issues.apache.org/jira/browse/MAPREDUCE-4815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siqi Li updated MAPREDUCE-4815: --- Status: Patch Available (was: Open) FileOutputCommitter.commitJob can be very slow for jobs with many output files -- Key: MAPREDUCE-4815 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4815 Project: Hadoop Map/Reduce Issue Type: Bug Components: mrv2 Affects Versions: 2.4.1, 2.0.1-alpha, 0.23.3 Reporter: Jason Lowe Assignee: Siqi Li Attachments: MAPREDUCE-4815.v3.patch If a job generates many files to commit then the commitJob method call at the end of the job can take minutes. This is a performance regression from 1.x, as 1.x had the tasks commit directly to the final output directory as they were completing and commitJob had very little to do. The commit work was processed in parallel and overlapped the processing of outstanding tasks. In 0.23/2.x, the commit is single-threaded and waits until all tasks have completed before commencing. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAPREDUCE-6033) Users are not allowed to view their own jobs, denied by JobACLsManager
[ https://issues.apache.org/jira/browse/MAPREDUCE-6033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14094919#comment-14094919 ] Hadoop QA commented on MAPREDUCE-6033: -- {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12661322/MAPREDUCE-6033.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4800//testReport/ Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4800//console This message is automatically generated. Users are not allowed to view their own jobs, denied by JobACLsManager -- Key: MAPREDUCE-6033 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6033 Project: Hadoop Map/Reduce Issue Type: Bug Affects Versions: 2.4.1 Reporter: Yu Gao Assignee: Yu Gao Attachments: MAPREDUCE-6033.patch, YARN-2407.patch Have a Hadoop 2.4.1 cluster with Yarn ACL enabled, and try to submit jobs as a non-admin user user1. The job could be finished successfully, but the running progress was not displayed correctly on the command-line, and I got following in the corresponding ApplicationMaster log: INFO [IPC Server handler 0 on 56717] org.apache.hadoop.ipc.Server: IPC Server handler 0 on 56717, call org.apache.hadoop.mapreduce.v2.api.MRClientProtocolPB.getJobReport from 9.30.95.26:61024 Call#59 Retry#0 org.apache.hadoop.security.AccessControlException: User user1 cannot perform operation VIEW_JOB on job_1407456690588_0003 at org.apache.hadoop.mapreduce.v2.app.client.MRClientService$MRClientProtocolHandler.verifyAndGetJob(MRClientService.java:191) at org.apache.hadoop.mapreduce.v2.app.client.MRClientService$MRClientProtocolHandler.getJobReport(MRClientService.java:233) at org.apache.hadoop.mapreduce.v2.api.impl.pb.service.MRClientProtocolPBServiceImpl.getJobReport(MRClientProtocolPBServiceImpl.java:122) at org.apache.hadoop.yarn.proto.MRClientProtocol$MRClientProtocolService$2.callBlockingMethod(MRClientProtocol.java:275) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at java.security.AccessController.doPrivileged(AccessController.java:366) at javax.security.auth.Subject.doAs(Subject.java:572) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1567) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAPREDUCE-4815) FileOutputCommitter.commitJob can be very slow for jobs with many output files
[ https://issues.apache.org/jira/browse/MAPREDUCE-4815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14094931#comment-14094931 ] Hadoop QA commented on MAPREDUCE-4815: -- {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12661324/MAPREDUCE-4815.v3.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4801//testReport/ Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4801//console This message is automatically generated. FileOutputCommitter.commitJob can be very slow for jobs with many output files -- Key: MAPREDUCE-4815 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4815 Project: Hadoop Map/Reduce Issue Type: Bug Components: mrv2 Affects Versions: 0.23.3, 2.0.1-alpha, 2.4.1 Reporter: Jason Lowe Assignee: Siqi Li Attachments: MAPREDUCE-4815.v3.patch If a job generates many files to commit then the commitJob method call at the end of the job can take minutes. This is a performance regression from 1.x, as 1.x had the tasks commit directly to the final output directory as they were completing and commitJob had very little to do. The commit work was processed in parallel and overlapped the processing of outstanding tasks. In 0.23/2.x, the commit is single-threaded and waits until all tasks have completed before commencing. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAPREDUCE-5969) Private non-Archive Files' size add twice in Distributed Cache directory size calculation.
[ https://issues.apache.org/jira/browse/MAPREDUCE-5969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14095019#comment-14095019 ] zhihai xu commented on MAPREDUCE-5969: -- [~kasha] - I checked the MR2(trunk/branch-2) source code, the implementation is totally different from MR1(branch-1). The MR2(trunk/branch-2) use LocalizedResource to manage the cache size. The LocalizedResource is created in LocalResourcesTrackerImpl after receive ContainerLocalizationRequestEvent(ContainerInitEvent) to request LocalResource from ContainerLaunchContext (container.launchContext). ContainerLaunchContext is created in TaskAttemptImpl.java(createContainerLaunchContext) and YARNRunner.java(createApplicationSubmissionContext). LocalResource in ContainerLaunchContext is created by {code} MRApps.setupDistributedCache(conf, localResources) {code}. So MR2(trunk/branch-2) doesn't have this issue. The following is the size calculation after received ResourceLocalizedEvent in LocalizedResource.java: {code} private static class FetchSuccessTransition extends ResourceTransition { @Override public void transition(LocalizedResource rsrc, ResourceEvent event) { ResourceLocalizedEvent locEvent = (ResourceLocalizedEvent) event; rsrc.localPath = Path.getPathWithoutSchemeAndAuthority(locEvent.getLocation()); rsrc.size = locEvent.getSize(); for (ContainerId container : rsrc.ref) { rsrc.dispatcher.getEventHandler().handle( new ContainerResourceLocalizedEvent( container, rsrc.rsrc, rsrc.localPath)); } } } {code} the size in ResourceLocalizedEvent is in the following code(ResourceLocalizationService.java): For public resource: {code} publicRsrc.handle(new ResourceLocalizedEvent(key, local, FileUtil .getDU(new File(local.toUri(); {code} For private resource: {code} getLocalResourcesTracker(req.getVisibility(), user, applicationId) .handle( new ResourceLocalizedEvent(req, ConverterUtils .getPathFromYarnURL(stat.getLocalPath()), stat.getLocalSize())); {code} The cache cleanup is at the following code: {code} // from ResourceLocalizationService.java private void handleCacheCleanup(LocalizationEvent event) { ResourceRetentionSet retain = new ResourceRetentionSet(delService, cacheTargetSize); retain.addResources(publicRsrc); LOG.debug(Resource cleanup (public) + retain); for (LocalResourcesTracker t : privateRsrc.values()) { retain.addResources(t); LOG.debug(Resource cleanup + t.getUser() + : + retain); } //TODO Check if appRsrcs should also be added to the retention set. } // from ResourceRetentionSet.java public void addResources(LocalResourcesTracker newTracker) { for (LocalizedResource resource : newTracker) { currentSize += resource.getSize(); if (resource.getRefCount() 0) { // always retain resources in use continue; } retain.put(resource, newTracker); } for (IteratorMap.EntryLocalizedResource,LocalResourcesTracker i = retain.entrySet().iterator(); currentSize - delSize targetSize i.hasNext();) { Map.EntryLocalizedResource,LocalResourcesTracker rsrc = i.next(); LocalizedResource resource = rsrc.getKey(); LocalResourcesTracker tracker = rsrc.getValue(); if (tracker.remove(resource, delService)) { delSize += resource.getSize(); i.remove(); } } } {code} And It should be only one copy of LocalizedResource for each LocalResourceRequest is saved in publicRsrc or privateRsrc. So this issue should only happen for MR1(branch-1). Private non-Archive Files' size add twice in Distributed Cache directory size calculation. -- Key: MAPREDUCE-5969 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5969 Project: Hadoop Map/Reduce Issue Type: Bug Components: mrv1 Reporter: zhihai xu Assignee: zhihai xu Attachments: MAPREDUCE-5969.branch1.patch Private non-Archive Files' size add twice in Distributed Cache directory size calculation. Private non-Archive Files list is passed in by -files command line option. The Distributed Cache directory size is used to check whether the total cache files size exceed the cache size limitation, the default cache size limitation is 10G. I add log in addCacheInfoUpdate and setSize in TrackerDistributedCacheManager.java. I use the following command to test: hadoop jar ./wordcount.jar org.apache.hadoop.examples.WordCount -files hdfs://host:8022/tmp/zxu/WordCount.java,hdfs://host:8022/tmp/zxu/wordcount.jar /tmp/zxu/test_in/ /tmp/zxu/test_out to add two files into