[jira] [Updated] (YARN-11112) Avoid renewing delegation token when app is first submitted to RM
[ https://issues.apache.org/jira/browse/YARN-11112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuanbo Liu updated YARN-11112:
------------------------------
    Attachment: image-2022-04-19-10-38-01-194.png

> Avoid renewing delegation token when app is first submitted to RM
> -----------------------------------------------------------------
>
>                 Key: YARN-11112
>                 URL: https://issues.apache.org/jira/browse/YARN-11112
>             Project: Hadoop YARN
>          Issue Type: Improvement
>            Reporter: Yuanbo Liu
>            Priority: Major
>         Attachments: image-2022-04-19-10-34-59-573.png, image-2022-04-19-10-38-01-194.png
>
>
> When authentication is enabled on the NameNode, a delegation token is required whenever an application needs to access files or directories. We find that when an app is first submitted to the RM, the RM's token renewer renews the app's token regardless of whether the token is anywhere near expiry. Renewing a token is fairly heavy, since it takes a global write lock. Below is the result when delegation tokens are required in a very busy cluster.
> !image-2022-04-19-10-34-59-573.png|width=515,height=302!


--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
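The improvement described above (skip the heavy renewal when a freshly issued token is nowhere near expiry) can be sketched as follows. This is an illustrative model only, not the actual DelegationTokenRenewer code; the method name, the millisecond-based parameters, and the safety-ratio threshold are all assumptions for the sake of the example.

```java
import java.time.Duration;

/**
 * Sketch of a "renew only when needed" check. A token submitted together
 * with a brand-new application was typically issued moments earlier, so an
 * immediate renewal (which takes the NameNode's global write lock) buys
 * nothing. All names here are hypothetical.
 */
public class TokenRenewalCheck {

    /**
     * Decide whether a token needs an immediate renewal.
     *
     * @param issueTimeMs   when the token was issued (epoch millis)
     * @param renewIntervalMs configured renew interval for the token
     * @param nowMs         current time (epoch millis)
     * @param safetyRatio   renew once this fraction of the interval elapsed
     */
    public static boolean needsRenewal(long issueTimeMs, long renewIntervalMs,
                                       long nowMs, double safetyRatio) {
        long elapsed = nowMs - issueTimeMs;
        return elapsed >= (long) (renewIntervalMs * safetyRatio);
    }

    public static void main(String[] args) {
        long issued = 0L;
        long interval = Duration.ofHours(24).toMillis();
        // Token issued just before submission: renewal can be skipped.
        System.out.println(needsRenewal(issued, interval, 1_000L, 0.75));
        // Token close to the end of its interval: renew now.
        System.out.println(needsRenewal(issued, interval, interval - 1, 0.75));
    }
}
```

On a busy cluster the point is not correctness but contention: every skipped renewal is one less acquisition of the NameNode's global write lock on the app-submission path.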
[jira] [Updated] (YARN-11112) Avoid renewing delegation token when app is first submitted to RM
[ https://issues.apache.org/jira/browse/YARN-11112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuanbo Liu updated YARN-11112:
------------------------------
    Description:
When authentication is enabled on the NameNode, a delegation token is required whenever an application needs to access files or directories. We find that when an app is first submitted to the RM, the RM's token renewer renews the app's token regardless of whether the token is anywhere near expiry. Renewing a token is fairly heavy, since it takes a global write lock. Below is the result when delegation tokens are required in a very busy cluster.
!image-2022-04-19-10-34-59-573.png|width=515,height=302!
!image-2022-04-19-10-38-01-194.png|width=490,height=290!

  was:
When authentication is enabled on the NameNode, a delegation token is required whenever an application needs to access files or directories. We find that when an app is first submitted to the RM, the RM's token renewer renews the app's token regardless of whether the token is anywhere near expiry. Renewing a token is fairly heavy, since it takes a global write lock. Below is the result when delegation tokens are required in a very busy cluster.
!image-2022-04-19-10-34-59-573.png|width=515,height=302!

> Avoid renewing delegation token when app is first submitted to RM
> -----------------------------------------------------------------
>
>                 Key: YARN-11112
>                 URL: https://issues.apache.org/jira/browse/YARN-11112
>             Project: Hadoop YARN
>          Issue Type: Improvement
>            Reporter: Yuanbo Liu
>            Priority: Major
>         Attachments: image-2022-04-19-10-34-59-573.png, image-2022-04-19-10-38-01-194.png
>
>
> When authentication is enabled on the NameNode, a delegation token is required whenever an application needs to access files or directories. We find that when an app is first submitted to the RM, the RM's token renewer renews the app's token regardless of whether the token is anywhere near expiry. Renewing a token is fairly heavy, since it takes a global write lock. Below is the result when delegation tokens are required in a very busy cluster.
> !image-2022-04-19-10-34-59-573.png|width=515,height=302!
> !image-2022-04-19-10-38-01-194.png|width=490,height=290!
[jira] [Updated] (YARN-11112) Avoid renewing delegation token when app is first submitted to RM
[ https://issues.apache.org/jira/browse/YARN-11112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuanbo Liu updated YARN-11112:
------------------------------
    Issue Type: Improvement  (was: Bug)

> Avoid renewing delegation token when app is first submitted to RM
> -----------------------------------------------------------------
>
>                 Key: YARN-11112
>                 URL: https://issues.apache.org/jira/browse/YARN-11112
>             Project: Hadoop YARN
>          Issue Type: Improvement
>            Reporter: Yuanbo Liu
>            Priority: Major
>         Attachments: image-2022-04-19-10-34-59-573.png
>
>
> When authentication is enabled on the NameNode, a delegation token is required whenever an application needs to access files or directories. We find that when an app is first submitted to the RM, the RM's token renewer renews the app's token regardless of whether the token is anywhere near expiry. Renewing a token is fairly heavy, since it takes a global write lock. Below is the result when delegation tokens are required in a very busy cluster.
> !image-2022-04-19-10-34-59-573.png|width=515,height=302!
[jira] [Updated] (YARN-11112) Avoid renewing delegation token when app is first submitted to RM
[ https://issues.apache.org/jira/browse/YARN-11112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuanbo Liu updated YARN-11112:
------------------------------
    Description:
When authentication is enabled on the NameNode, a delegation token is required whenever an application needs to access files or directories. We find that when an app is first submitted to the RM, the RM's token renewer renews the app's token regardless of whether the token is anywhere near expiry. Renewing a token is fairly heavy, since it takes a global write lock. Below is the result when delegation tokens are required in a very busy cluster.
!image-2022-04-19-10-34-59-573.png|width=515,height=302!

  was:
When authentication is enabled on the NameNode, a delegation token is required whenever an application needs to access files or directories. We find that when an app is first submitted to the RM, the RM's token renewer renews the app's token regardless of whether the token is anywhere near expiry. Renewing a token is fairly heavy, since it takes a global write lock. Below is the result when delegation tokens are required in a very busy cluster.
!image-2022-04-19-10-34-59-573.png!

> Avoid renewing delegation token when app is first submitted to RM
> -----------------------------------------------------------------
>
>                 Key: YARN-11112
>                 URL: https://issues.apache.org/jira/browse/YARN-11112
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Yuanbo Liu
>            Priority: Major
>         Attachments: image-2022-04-19-10-34-59-573.png
>
>
> When authentication is enabled on the NameNode, a delegation token is required whenever an application needs to access files or directories. We find that when an app is first submitted to the RM, the RM's token renewer renews the app's token regardless of whether the token is anywhere near expiry. Renewing a token is fairly heavy, since it takes a global write lock. Below is the result when delegation tokens are required in a very busy cluster.
> !image-2022-04-19-10-34-59-573.png|width=515,height=302!
[jira] [Created] (YARN-11112) Avoid renewing delegation token when app is first submitted to RM
Yuanbo Liu created YARN-11112:
---------------------------------

             Summary: Avoid renewing delegation token when app is first submitted to RM
                 Key: YARN-11112
                 URL: https://issues.apache.org/jira/browse/YARN-11112
             Project: Hadoop YARN
          Issue Type: Bug
            Reporter: Yuanbo Liu
         Attachments: image-2022-04-19-10-34-59-573.png

When authentication is enabled on the NameNode, a delegation token is required whenever an application needs to access files or directories. We find that when an app is first submitted to the RM, the RM's token renewer renews the app's token regardless of whether the token is anywhere near expiry. Renewing a token is fairly heavy, since it takes a global write lock. Below is the result when delegation tokens are required in a very busy cluster.
!image-2022-04-19-10-34-59-573.png!
[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM
[ https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17207806#comment-17207806 ]

Yuanbo Liu commented on YARN-10393:
-----------------------------------

+1

> MR job live lock caused by completed state container leak in heartbeat
> between node manager and RM
> ----------------------------------------------------------------------
>
>                 Key: YARN-10393
>                 URL: https://issues.apache.org/jira/browse/YARN-10393
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager, yarn
>    Affects Versions: 2.6.1, 2.7.2, 2.6.2, 3.0.0, 2.9.2, 3.3.0, 3.2.1, 3.1.3, 3.4.0
>            Reporter: zhenzhao wang
>            Assignee: Jim Brennan
>            Priority: Major
>         Attachments: YARN-10393.001.patch, YARN-10393.002.patch, YARN-10393.draft.2.patch, YARN-10393.draft.patch
>
>
> This was a bug we had seen multiple times on Hadoop 2.6.2, and the following analysis is based on the core dump, logs, and code from 2017 with Hadoop 2.6.2. We hadn't seen it after 2.9 in our environment; however, that was because of the RPC retry policy change and other changes. There is still a possibility even with the current code, if I didn't miss anything.
> *High-level description:*
> We had seen a starving-mapper issue several times. The MR job got stuck in a live-lock state and couldn't make any progress. The queue was full, so the pending mapper couldn't get any resource to continue, and the application master failed to preempt the reducer, causing the job to be stuck. The reason the application master didn't preempt the reducer was that there was a leaked container in its assigned mappers: the node manager failed to report the completed container to the resource manager.
> *Detailed steps:*
> # Container_1501226097332_249991_01_000199 was assigned to attempt_1501226097332_249991_m_95_0 on 2017-08-08 16:00:00,417.
> {code:java}
> appmaster.log:6464:2017-08-08 16:00:00,417 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned container container_1501226097332_249991_01_000199 to attempt_1501226097332_249991_m_95_0
> {code}
> # The container finished on 2017-08-08 16:02:53,313.
> {code:java}
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1501226097332_249991_01_000199 transitioned from RUNNING to EXITED_WITH_SUCCESS
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Cleaning up container container_1501226097332_249991_01_000199
> {code}
> # The NodeStatusUpdater got an exception in the heartbeat on 2017-08-08 16:07:04,238. In fact, the heartbeat request was actually handled by the resource manager; however, the node manager failed to receive the response. Let's assume heartBeatResponseId=$hid in the node manager. According to our current configuration, the next heartbeat will be 10s later.
> {code:java}
> 2017-08-08 16:07:04,238 ERROR org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught exception in status-updater
> java.io.IOException: Failed on local exception: java.io.IOException: Connection reset by peer; Host Details : local host is: ; destination host is: XXX
> 	at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
> 	at org.apache.hadoop.ipc.Client.call(Client.java:1472)
> 	at org.apache.hadoop.ipc.Client.call(Client.java:1399)
> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
> 	at com.sun.proxy.$Proxy33.nodeHeartbeat(Unknown Source)
> 	at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
> 	at sun.reflect.GeneratedMethodAccessor61.invoke(Unknown Source)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 	at java.lang.reflect.Method.invoke(Method.java:606)
> 	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> 	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> 	at com.sun.proxy.$Proxy34.nodeHeartbeat(Unknown Source)
> 	at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:597)
> 	at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Connection reset by peer
> 	at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
> 	at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
> 	at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
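The failure mode in the steps above hinges on the responseId handshake between the NM and the RM. The following is a deliberately minimal model of that handshake (it is not the real ResourceTrackerService code, and the exact duplicate-detection rule is simplified): when the NM misses a response and retries with the stale responseId, the RM classifies the heartbeat as a duplicate and replays its cached response, so any container status that only appears in the retried heartbeat is silently dropped.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Toy model of the NM/RM heartbeat responseId handshake that produces the
 * completed-container leak. Illustrative only; names and logic are
 * simplified from the actual protocol.
 */
public class HeartbeatModel {
    private int lastResponseId = 0;
    private final List<String> processed = new ArrayList<>();

    /** Handle one heartbeat; returns the responseId for the NM to use next. */
    public int heartbeat(int responseId, List<String> completedContainers) {
        if (responseId == lastResponseId - 1) {
            // Stale id: treat as a duplicate of the previous heartbeat and
            // replay the cached response WITHOUT processing the new statuses.
            return lastResponseId;
        }
        processed.addAll(completedContainers);
        lastResponseId = responseId + 1;
        return lastResponseId;
    }

    public List<String> processedContainers() {
        return processed;
    }

    public static void main(String[] args) {
        HeartbeatModel rm = new HeartbeatModel();
        rm.heartbeat(0, List.of("container_1"));                  // processed
        // The NM never saw the response, so it retries with responseId 0,
        // now also carrying a newly completed container_2:
        rm.heartbeat(0, List.of("container_1", "container_2"));   // dropped
        System.out.println(rm.processedContainers());             // no container_2
    }
}
```

In the model, `container_2` is never processed: the RM believes it already answered that heartbeat, and the NM believes the RM received the report. That mismatch is exactly the leak the issue describes.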
[jira] [Comment Edited] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM
[ https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17194592#comment-17194592 ]

Yuanbo Liu edited comment on YARN-10393 at 9/12/20, 3:48 AM:
-------------------------------------------------------------

[~Jim_Brennan] Thanks for your comments.
{quote}My one issue with it is that it will not return the most up to date status for containers on the heartbeat after the missed one...
{quote}
Great point. Your comments remind me that I was wrong; I was thinking about this situation:
{quote}
[heartbeat:1, container:1] -> RM
NM does not get the response.
[heartbeat:1, container:1, container:2] -> RM
RM thinks it is a duplicated heartbeat and never processes container 2.
{quote}
I thought container 2 would never get updated if the heartbeat response went missing, which is not correct. Your patch makes sense to me.
{quote}Another question is how would these two proposals behave if the NM misses multiple heartbeats in a row?
{quote}
I believe that if that happens, the cluster or the NN may be in an unusable state; resending completed containers continuously does not seem like a bad idea (or we could introduce a maximum number of retries)?

was (Author: yuanbo):
[~Jim_Brennan] Thanks for your comments.
{quote}My one issue with it is that it will not return the most up to date status for containers on the heartbeat after the missed one...
{quote}
Great point. Your comments remind me that I was wrong; I was thinking about this situation:
{quote}
[heartbeat:1, container:1] -> RM
NM does not get the response.
[heartbeat:1, container:1, container:2] -> RM
RM thinks it is a duplicated heartbeat and never processes container 2.
{quote}
I thought container 2 would never get updated if the heartbeat response went missing, which is not correct. Your patch makes sense to me.
{quote}Another question is how would these two proposals behave if the NM misses multiple heartbeats in a row?
{quote}
I believe that if that happens, the cluster or the NN may be in an unusable state; resending completed containers continuously does not seem like a bad idea?

> MR job live lock caused by completed state container leak in heartbeat
> between node manager and RM
> ----------------------------------------------------------------------
>
>                 Key: YARN-10393
>                 URL: https://issues.apache.org/jira/browse/YARN-10393
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager, yarn
>    Affects Versions: 2.6.1, 2.7.2, 2.6.2, 3.0.0, 2.9.2, 3.3.0, 3.2.1, 3.1.3, 3.4.0
>            Reporter: zhenzhao wang
>            Assignee: zhenzhao wang
>            Priority: Major
>         Attachments: YARN-10393.draft.2.patch, YARN-10393.draft.patch
>
>
> This was a bug we had seen multiple times on Hadoop 2.6.2, and the following analysis is based on the core dump, logs, and code from 2017 with Hadoop 2.6.2. We hadn't seen it after 2.9 in our environment; however, that was because of the RPC retry policy change and other changes. There is still a possibility even with the current code, if I didn't miss anything.
> *High-level description:*
> We had seen a starving-mapper issue several times. The MR job got stuck in a live-lock state and couldn't make any progress. The queue was full, so the pending mapper couldn't get any resource to continue, and the application master failed to preempt the reducer, causing the job to be stuck. The reason the application master didn't preempt the reducer was that there was a leaked container in its assigned mappers: the node manager failed to report the completed container to the resource manager.
> *Detailed steps:*
> # Container_1501226097332_249991_01_000199 was assigned to attempt_1501226097332_249991_m_95_0 on 2017-08-08 16:00:00,417.
> {code:java}
> appmaster.log:6464:2017-08-08 16:00:00,417 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned container container_1501226097332_249991_01_000199 to attempt_1501226097332_249991_m_95_0
> {code}
> # The container finished on 2017-08-08 16:02:53,313.
> {code:java}
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1501226097332_249991_01_000199 transitioned from RUNNING to EXITED_WITH_SUCCESS
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Cleaning up container container_1501226097332_249991_01_000199
> {code}
> # The NodeStatusUpdater got an exception in the heartbeat on 2017-08-08 16:07:04,238. In fact, the heartbeat request was actually handled by the resource manager; however, the node manager failed to receive the response. Let's assume heartBeatResponseId=$hid in the node manager. According to our current configuration, the next heartbeat will be 10s later.
> {code:java}
>
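The fix direction discussed in the comment above (keep resending completed containers until the RM acknowledges them) can be sketched as a small NM-side buffer. This is illustrative only and not the actual YARN-10393 patch; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/**
 * Sketch of "resend until acked": the NM keeps completed containers in a
 * pending set and includes all of them in every heartbeat until a response
 * acknowledges them, so a lost heartbeat response no longer loses the
 * report. Hypothetical names; not the real NodeStatusUpdater code.
 */
public class PendingCompletedContainers {
    private final Set<String> pending = new LinkedHashSet<>();

    /** Record a locally completed container. */
    public void containerCompleted(String containerId) {
        pending.add(containerId);
    }

    /** Everything still unacknowledged goes into the next heartbeat. */
    public List<String> buildHeartbeatPayload() {
        return new ArrayList<>(pending);
    }

    /** Called with the containers the RM's response confirms it has seen. */
    public void onResponse(List<String> ackedContainers) {
        ackedContainers.forEach(pending::remove);
    }

    public static void main(String[] args) {
        PendingCompletedContainers nm = new PendingCompletedContainers();
        nm.containerCompleted("container_1");
        System.out.println(nm.buildHeartbeatPayload()); // sent, response lost
        nm.containerCompleted("container_2");
        System.out.println(nm.buildHeartbeatPayload()); // both re-sent
        nm.onResponse(List.of("container_1", "container_2"));
        System.out.println(nm.buildHeartbeatPayload()); // empty once acked
    }
}
```

The multiple-missed-heartbeats concern raised in the comment maps to the size of the pending set: it only grows until a response finally arrives, which is why a retry cap was floated as a refinement.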
[jira] [Comment Edited] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM
[ https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17194592#comment-17194592 ]

Yuanbo Liu edited comment on YARN-10393 at 9/12/20, 3:26 AM:
-------------------------------------------------------------

[~Jim_Brennan] Thanks for your comments.
{quote}My one issue with it is that it will not return the most up to date status for containers on the heartbeat after the missed one...
{quote}
Great point. Your comments remind me that I was wrong; I was thinking about this situation:
{quote}
[heartbeat:1, container:1] -> RM
NM does not get the response.
[heartbeat:1, container:1, container:2] -> RM
RM thinks it is a duplicated heartbeat and never processes container 2.
{quote}
I thought container 2 would never get updated if the heartbeat response went missing, which is not correct. Your patch makes sense to me.
{quote}Another question is how would these two proposals behave if the NM misses multiple heartbeats in a row?
{quote}
I believe that if that happens, the cluster or the NN may be in an unusable state; resending completed containers continuously does not seem like a bad idea?

was (Author: yuanbo):
[~Jim_Brennan] Thanks for your comments.
{quote}My one issue with it is that it will not return the most up to date status for containers on the heartbeat after the missed one...
{quote}
Great point. Your comments remind me that I was wrong; I was thinking about this situation:
{quote}
[heartbeat:1, container:1] -> RM
NM does not get the response.
[heartbeat:1, container:1, container:2] -> RM
RM thinks it is a duplicated heartbeat and never processes container 2.
{quote}
I thought container 2 would never get updated if the heartbeat response went missing, which is not correct. Your patch makes sense to me.
{quote}Another question is how would these two proposals behave if the NM misses multiple heartbeats in a row?
{quote}
I believe that if that happens, the cluster or the NN may be in an unusable state; resending completed containers continuously does not seem like a bad idea?

> MR job live lock caused by completed state container leak in heartbeat
> between node manager and RM
> ----------------------------------------------------------------------
>
>                 Key: YARN-10393
>                 URL: https://issues.apache.org/jira/browse/YARN-10393
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager, yarn
>    Affects Versions: 2.6.1, 2.7.2, 2.6.2, 3.0.0, 2.9.2, 3.3.0, 3.2.1, 3.1.3, 3.4.0
>            Reporter: zhenzhao wang
>            Assignee: zhenzhao wang
>            Priority: Major
>         Attachments: YARN-10393.draft.2.patch, YARN-10393.draft.patch
>
>
> This was a bug we had seen multiple times on Hadoop 2.6.2, and the following analysis is based on the core dump, logs, and code from 2017 with Hadoop 2.6.2. We hadn't seen it after 2.9 in our environment; however, that was because of the RPC retry policy change and other changes. There is still a possibility even with the current code, if I didn't miss anything.
> *High-level description:*
> We had seen a starving-mapper issue several times. The MR job got stuck in a live-lock state and couldn't make any progress. The queue was full, so the pending mapper couldn't get any resource to continue, and the application master failed to preempt the reducer, causing the job to be stuck. The reason the application master didn't preempt the reducer was that there was a leaked container in its assigned mappers: the node manager failed to report the completed container to the resource manager.
> *Detailed steps:*
> # Container_1501226097332_249991_01_000199 was assigned to attempt_1501226097332_249991_m_95_0 on 2017-08-08 16:00:00,417.
> {code:java}
> appmaster.log:6464:2017-08-08 16:00:00,417 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned container container_1501226097332_249991_01_000199 to attempt_1501226097332_249991_m_95_0
> {code}
> # The container finished on 2017-08-08 16:02:53,313.
> {code:java}
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1501226097332_249991_01_000199 transitioned from RUNNING to EXITED_WITH_SUCCESS
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Cleaning up container container_1501226097332_249991_01_000199
> {code}
> # The NodeStatusUpdater got an exception in the heartbeat on 2017-08-08 16:07:04,238. In fact, the heartbeat request was actually handled by the resource manager; however, the node manager failed to receive the response. Let's assume heartBeatResponseId=$hid in the node manager. According to our current configuration, the next heartbeat will be 10s later.
> {code:java}
> 2017-08-08 16:07:04,238 ERROR
>
[jira] [Comment Edited] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM
[ https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17194592#comment-17194592 ]

Yuanbo Liu edited comment on YARN-10393 at 9/12/20, 3:25 AM:
-------------------------------------------------------------

[~Jim_Brennan] Thanks for your comments.
{quote}My one issue with it is that it will not return the most up to date status for containers on the heartbeat after the missed one...
{quote}
Great point. Your comments remind me that I was wrong; I was thinking about this situation:
{quote}
[heartbeat:1, container:1] -> RM
NM does not get the response.
[heartbeat:1, container:1, container:2] -> RM
RM thinks it is a duplicated heartbeat and never processes container 2.
{quote}
I thought container 2 would never get updated if the heartbeat response went missing, which is not correct. Your patch makes sense to me.
{quote}Another question is how would these two proposals behave if the NM misses multiple heartbeats in a row?
{quote}
I believe that if that happens, the cluster or the NN may be in an unusable state; resending completed containers continuously does not seem like a bad idea?

was (Author: yuanbo):
[~Jim_Brennan] Thanks for your comments.
{quote}My one issue with it is that it will not return the most up to date status for containers on the heartbeat after the missed one...
{quote}
Great point. Your comments remind me that I was wrong; I was thinking about this situation:
{quote}
[heartbeat:1, container:1] -> RM
NM does not get the response.
[heartbeat:1, container:1, container:2] -> RM
RM thinks it is a duplicated heartbeat and never processes container 2.
{quote}
I thought container 2 would never get updated if the heartbeat response went missing. Your patch makes sense to me.
{quote}Another question is how would these two proposals behave if the NM misses multiple heartbeats in a row?
{quote}
I believe that if that happens, the cluster or the NN may be in an unusable state; resending completed containers continuously does not seem like a bad idea?

> MR job live lock caused by completed state container leak in heartbeat
> between node manager and RM
> ----------------------------------------------------------------------
>
>                 Key: YARN-10393
>                 URL: https://issues.apache.org/jira/browse/YARN-10393
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager, yarn
>    Affects Versions: 2.6.1, 2.7.2, 2.6.2, 3.0.0, 2.9.2, 3.3.0, 3.2.1, 3.1.3, 3.4.0
>            Reporter: zhenzhao wang
>            Assignee: zhenzhao wang
>            Priority: Major
>         Attachments: YARN-10393.draft.2.patch, YARN-10393.draft.patch
>
>
> This was a bug we had seen multiple times on Hadoop 2.6.2, and the following analysis is based on the core dump, logs, and code from 2017 with Hadoop 2.6.2. We hadn't seen it after 2.9 in our environment; however, that was because of the RPC retry policy change and other changes. There is still a possibility even with the current code, if I didn't miss anything.
> *High-level description:*
> We had seen a starving-mapper issue several times. The MR job got stuck in a live-lock state and couldn't make any progress. The queue was full, so the pending mapper couldn't get any resource to continue, and the application master failed to preempt the reducer, causing the job to be stuck. The reason the application master didn't preempt the reducer was that there was a leaked container in its assigned mappers: the node manager failed to report the completed container to the resource manager.
> *Detailed steps:*
> # Container_1501226097332_249991_01_000199 was assigned to attempt_1501226097332_249991_m_95_0 on 2017-08-08 16:00:00,417.
> {code:java}
> appmaster.log:6464:2017-08-08 16:00:00,417 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned container container_1501226097332_249991_01_000199 to attempt_1501226097332_249991_m_95_0
> {code}
> # The container finished on 2017-08-08 16:02:53,313.
> {code:java}
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1501226097332_249991_01_000199 transitioned from RUNNING to EXITED_WITH_SUCCESS
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Cleaning up container container_1501226097332_249991_01_000199
> {code}
> # The NodeStatusUpdater got an exception in the heartbeat on 2017-08-08 16:07:04,238. In fact, the heartbeat request was actually handled by the resource manager; however, the node manager failed to receive the response. Let's assume heartBeatResponseId=$hid in the node manager. According to our current configuration, the next heartbeat will be 10s later.
> {code:java}
> 2017-08-08 16:07:04,238 ERROR
>
[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM
[ https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17194592#comment-17194592 ]

Yuanbo Liu commented on YARN-10393:
-----------------------------------

[~Jim_Brennan] Thanks for your comments.
{quote}My one issue with it is that it will not return the most up to date status for containers on the heartbeat after the missed one...
{quote}
Great point. Your comments remind me that I was wrong; I was thinking about this situation:
{quote}
[heartbeat:1, container:1] -> RM
NM does not get the response.
[heartbeat:1, container:1, container:2] -> RM
RM thinks it is a duplicated heartbeat and never processes container 2.
{quote}
I thought container 2 would never get updated if the heartbeat response went missing. Your patch makes sense to me.
{quote}Another question is how would these two proposals behave if the NM misses multiple heartbeats in a row?
{quote}
I believe that if that happens, the cluster or the NN may be in an unusable state; resending completed containers continuously does not seem like a bad idea?

> MR job live lock caused by completed state container leak in heartbeat
> between node manager and RM
> ----------------------------------------------------------------------
>
>                 Key: YARN-10393
>                 URL: https://issues.apache.org/jira/browse/YARN-10393
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager, yarn
>    Affects Versions: 2.6.1, 2.7.2, 2.6.2, 3.0.0, 2.9.2, 3.3.0, 3.2.1, 3.1.3, 3.4.0
>            Reporter: zhenzhao wang
>            Assignee: zhenzhao wang
>            Priority: Major
>         Attachments: YARN-10393.draft.2.patch, YARN-10393.draft.patch
>
>
> This was a bug we had seen multiple times on Hadoop 2.6.2, and the following analysis is based on the core dump, logs, and code from 2017 with Hadoop 2.6.2. We hadn't seen it after 2.9 in our environment; however, that was because of the RPC retry policy change and other changes. There is still a possibility even with the current code, if I didn't miss anything.
> *High-level description:*
> We had seen a starving-mapper issue several times. The MR job got stuck in a live-lock state and couldn't make any progress. The queue was full, so the pending mapper couldn't get any resource to continue, and the application master failed to preempt the reducer, causing the job to be stuck. The reason the application master didn't preempt the reducer was that there was a leaked container in its assigned mappers: the node manager failed to report the completed container to the resource manager.
> *Detailed steps:*
> # Container_1501226097332_249991_01_000199 was assigned to attempt_1501226097332_249991_m_95_0 on 2017-08-08 16:00:00,417.
> {code:java}
> appmaster.log:6464:2017-08-08 16:00:00,417 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned container container_1501226097332_249991_01_000199 to attempt_1501226097332_249991_m_95_0
> {code}
> # The container finished on 2017-08-08 16:02:53,313.
> {code:java}
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1501226097332_249991_01_000199 transitioned from RUNNING to EXITED_WITH_SUCCESS
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Cleaning up container container_1501226097332_249991_01_000199
> {code}
> # The NodeStatusUpdater got an exception in the heartbeat on 2017-08-08 16:07:04,238. In fact, the heartbeat request was actually handled by the resource manager; however, the node manager failed to receive the response. Let's assume heartBeatResponseId=$hid in the node manager. According to our current configuration, the next heartbeat will be 10s later.
> {code:java}
> 2017-08-08 16:07:04,238 ERROR org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught exception in status-updater
> java.io.IOException: Failed on local exception: java.io.IOException: Connection reset by peer; Host Details : local host is: ; destination host is: XXX
> 	at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
> 	at org.apache.hadoop.ipc.Client.call(Client.java:1472)
> 	at org.apache.hadoop.ipc.Client.call(Client.java:1399)
> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
> 	at com.sun.proxy.$Proxy33.nodeHeartbeat(Unknown Source)
> 	at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
> 	at sun.reflect.GeneratedMethodAccessor61.invoke(Unknown Source)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
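The high-level description explains that the AM never preempted a reducer because a completed-but-unreported container still looked "assigned". A toy decision function makes the live lock concrete. This is an illustrative simplification, not the actual RMContainerAllocator preemption logic; the parameters and the exact conditions are assumptions.

```java
/**
 * Toy model of the AM's reducer-preemption decision. Illustrative only:
 * the point is that one leaked "assigned" map container keeps the AM
 * waiting for resources that will never be freed.
 */
public class PreemptionDecision {

    public static boolean shouldPreemptReducer(int pendingMaps,
                                               int assignedMaps,
                                               long headroomMb,
                                               long mapResourceMb) {
        if (pendingMaps == 0) {
            return false; // nothing waiting to be scheduled
        }
        if (assignedMaps > 0) {
            // A map appears to hold a container, so the AM assumes resources
            // will free up on their own. With a leaked container this
            // assumption never becomes true: the live lock.
            return false;
        }
        // No maps running and not enough headroom: take back a reducer.
        return headroomMb < mapResourceMb;
    }

    public static void main(String[] args) {
        // Healthy cluster under pressure: preempt.
        System.out.println(shouldPreemptReducer(1, 0, 0L, 2048L));
        // Leaked container keeps assignedMaps at 1: never preempts.
        System.out.println(shouldPreemptReducer(1, 1, 0L, 2048L));
    }
}
```

With the leaked container counted as assigned, every scheduling pass takes the "wait" branch while the queue stays full, which matches the starving-mapper symptom in the issue description.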
[jira] [Updated] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM
[ https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuanbo Liu updated YARN-10393: -- Attachment: YARN-10393.draft.patch > MR job live lock caused by completed state container leak in heartbeat > between node manager and RM > -- > > Key: YARN-10393 > URL: https://issues.apache.org/jira/browse/YARN-10393 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager, yarn >Affects Versions: 2.6.1, 2.7.2, 2.6.2, 3.0.0, 2.9.2, 3.3.0, 3.2.1, 3.1.3, > 3.4.0 >Reporter: zhenzhao wang >Assignee: zhenzhao wang >Priority: Major > Attachments: YARN-10393.draft.patch > > > This was a bug we had seen multiple times on Hadoop 2.6.2. And the following > analysis is based on the core dump, logs, and code in 2017 with Hadoop 2.6.2. > We hadn't seen it after 2.9 in our env. However, it was because of the RPC > retry policy change and other changes. There's still a possibility even with > the current code if I didn't miss anything. > *High-level description:* > We had seen a starving mapper issue several times. The MR job stuck in a > live lock state and couldn't make any progress. The queue is full so the > pending mapper can’t get any resource to continue, and the application master > failed to preempt the reducer, thus causing the job to be stuck. The reason > why the application master didn’t preempt the reducer was that there was a > leaked container in assigned mappers. The node manager failed to report the > completed container to the resource manager. > *Detailed steps:* > > # Container_1501226097332_249991_01_000199 was assigned to > attempt_1501226097332_249991_m_95_0 on 2017-08-08 16:00:00,417. 
> {code:java} > appmaster.log:6464:2017-08-08 16:00:00,417 INFO [RMCommunicator Allocator] > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned > container container_1501226097332_249991_01_000199 to > attempt_1501226097332_249991_m_95_0 > {code} > # The container finished on 2017-08-08 16:02:53,313. > {code:java} > yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: > Container container_1501226097332_249991_01_000199 transitioned from RUNNING > to EXITED_WITH_SUCCESS > yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: > Cleaning up container container_1501226097332_249991_01_000199 > {code} > # The NodeStatusUpdater got an exception in the heartbeat on 2017-08-08 > 16:07:04,238. In fact, the heartbeat request was handled by the resource > manager; however, the node manager failed to receive the response. Let's > assume heartBeatResponseId=$hid in the node manager. According to our current > configuration, the next heartbeat will be sent 10s later.
> {code:java} > 2017-08-08 16:07:04,238 ERROR > org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught > exception in status-updater > java.io.IOException: Failed on local exception: java.io.IOException: > Connection reset by peer; Host Details : local host is: ; destination host > is: XXX > at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772) > at org.apache.hadoop.ipc.Client.call(Client.java:1472) > at org.apache.hadoop.ipc.Client.call(Client.java:1399) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232) > at com.sun.proxy.$Proxy33.nodeHeartbeat(Unknown Source) > at > org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80) > at sun.reflect.GeneratedMethodAccessor61.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) > at com.sun.proxy.$Proxy34.nodeHeartbeat(Unknown Source) > at > org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:597) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: Connection reset by peer > at sun.nio.ch.FileDispatcherImpl.read0(Native Method) > at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) > at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) > at sun.nio.ch.IOUtil.read(IOUtil.java:197) > at
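The failure mode in the steps above (a heartbeat that the RM processes but whose response is lost, followed by a retry under the same heartbeat id) can be sketched as a toy model. All names below are illustrative stand-ins, not the real NodeStatusUpdaterImpl/ResourceTrackerService code; the key assumed behavior, discussed later in this thread, is that the RM answers a duplicate request id from its cached response without reading the duplicate's payload.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model of the NM<->RM heartbeat id handshake; names are illustrative.
public class HeartbeatLeakDemo {
    // --- RM side ---
    static int rmLastSeenId = 0;
    static Set<String> rmKnownCompleted = new HashSet<>();

    // A fresh request id is processed and answered with id + 1; a duplicate
    // id is answered from cache, so its payload is never read.
    static int rmHeartbeat(int requestId, List<String> completed) {
        if (requestId == rmLastSeenId) {
            rmKnownCompleted.addAll(completed);
            rmLastSeenId = requestId + 1;
        }
        return rmLastSeenId;
    }

    // --- NM side ---
    static int lastHeartbeatID = 0;
    static List<String> pendingCompleted = new ArrayList<>();

    static void nmHeartbeat(boolean responseLost) {
        int responseId = rmHeartbeat(lastHeartbeatID, new ArrayList<>(pendingCompleted));
        if (responseLost) {
            return; // step 3: "Connection reset by peer" while reading the reply
        }
        lastHeartbeatID = responseId;
        pendingCompleted.clear(); // the NM believes everything was delivered
    }

    public static void main(String[] args) {
        pendingCompleted.add("container_199");
        nmHeartbeat(true);                     // RM learns 199, but the ack is lost
        pendingCompleted.add("container_200"); // completes before the 10s retry
        nmHeartbeat(false);                    // duplicate id: payload is ignored
        // container_200 is leaked: the NM dropped it, the RM never saw it.
        System.out.println(rmKnownCompleted.contains("container_200")); // prints false
    }
}
```

In this model container_199 survives (the RM processed the first request even though the ack was lost), but anything added between the lost response and the retry is dropped on both sides, matching the leaked-container symptom above.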
[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM
[ https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17193975#comment-17193975 ] Yuanbo Liu commented on YARN-10393: --- Also, we should avoid adding a new container under the same heartbeat id, as [~wzzdreamer] has clarified in the description. Resending old containers is unavoidable, and changing the protocol is not a good idea, so we could use pendingCompletedContainers to fix it. I've attached a draft patch for this issue so that we can speed things up and conclude our ideas. [~wzzdreamer] feel free to attach a new PR if you have one. [~wzzdreamer] [~Jim_Brennan] [~adam.antal] Any comments are welcome.
[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM
[ https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188358#comment-17188358 ] Yuanbo Liu commented on YARN-10393: --- [~adam.antal] Thanks for your comments. I misunderstood [~Jim_Brennan]'s solution. +1 for that. BTW, the normal relation between the requestID and the responseID would be responseID == requestID + 1: {code:java} // code placeholder if (responseId != lastHeartbeatID + 1) { pendingCompletedContainers.clear(); } {code} Correct me if I'm wrong. [~wzzdreamer] What are your thoughts about the solution?
[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM
[ https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188276#comment-17188276 ] Yuanbo Liu commented on YARN-10393: --- [~adam.antal] Sorry to interrupt; any thoughts on this issue?
[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM
[ https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188270#comment-17188270 ] Yuanbo Liu commented on YARN-10393: --- [~Jim_Brennan] Thanks for the reply. Basically I'm open to these two options: 1. Change the heartbeat-sending code to send the heartbeat again if no response is received (like [~wzzdreamer] did in the PR), but I'd prefer to introduce some max-retry code to guard against a potential infinite loop. 2. Resend container ids from recentlyStoppedContainers periodically (maybe every 1 min?); once a response is received from the RM, the id is removed from recentlyStoppedContainers and never retried again.
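Option 1 above (retry the heartbeat, but with a cap) can be sketched as follows. This is only an illustration of the retry policy being proposed, not Hadoop code: `Sender` is a hypothetical stand-in for the `nodeHeartbeat` RPC.

```java
// Sketch of option 1 with a retry cap; Sender stands in for the real
// nodeHeartbeat RPC and is not part of Hadoop.
public class BoundedRetryDemo {
    interface Sender {
        boolean send(); // true = a response was received
    }

    // Resend until a response arrives, but with at most maxRetries extra
    // attempts, so a dead RM cannot trap the status updater in an
    // infinite loop.
    static boolean heartbeatWithRetry(Sender sender, int maxRetries) {
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            if (sender.send()) {
                return true; // response received: safe to clear pending state
            }
            // response lost: keep pending completed containers and try again
        }
        return false; // give up for now; state survives to the next interval
    }

    public static void main(String[] args) {
        final int[] lostResponses = {2}; // first two sends lose the response
        boolean acked = heartbeatWithRetry(() -> lostResponses[0]-- <= 0, 3);
        System.out.println(acked); // prints true
    }
}
```

The cap turns a lost response into a bounded delay rather than an unbounded loop; on `false`, the pending completed containers simply wait for the next regular heartbeat interval.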
[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM
[ https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17180959#comment-17180959 ] Yuanbo Liu commented on YARN-10393: --- [~Jim_Brennan] Thanks for the comments. > I think the concern is that if we remove that pendingCompletedContainers.clear() Removing "pendingCompletedContainers.clear()" would be a potential memory leak. I'd suggest that removing "!isContainerRecentlyStopped(containerId)" in NodeStatusUpdaterImpl.java [line: 613] would be a good fix for this issue: {code:java} if (!isContainerRecentlyStopped(containerId)) { pendingCompletedContainers.put(containerId, containerStatus); }{code} Completed containers will be cached for 10 minutes (the default value) until they time out or a response is received from the heartbeat. A 10-minute cache for completed containers is long enough to retry sending the reports through the heartbeat (default interval is 10s).
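The effect of dropping that guard can be shown in a small side-by-side sketch. The fields loosely mirror recentlyStoppedContainers and pendingCompletedContainers from NodeStatusUpdaterImpl, but this is a simplified model, not the real class: the 10-minute expiry of the cache is omitted, and the `clear()` call simulates a heartbeat whose response was lost.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Simplified model of the guard discussed above; the 10-minute expiry of
// recentlyStoppedContainers is omitted for brevity.
public class GuardRemovalDemo {
    static Set<String> recentlyStopped = new HashSet<>();
    static Map<String, String> pendingCompleted = new HashMap<>();

    // Guarded behavior: a container already tracked as recently stopped is
    // never queued for reporting again.
    static void addGuarded(String id) {
        if (!recentlyStopped.contains(id)) {
            pendingCompleted.put(id, "COMPLETE");
        }
        recentlyStopped.add(id);
    }

    // Proposed behavior: always (re)queue the report while the container is
    // still within the cache window, until the RM acknowledges it.
    static void addUnguarded(String id) {
        pendingCompleted.put(id, "COMPLETE");
        recentlyStopped.add(id);
    }

    public static void main(String[] args) {
        addGuarded("container_199");
        pendingCompleted.clear();      // heartbeat sent, but the response was lost
        addGuarded("container_199");   // retry: the guard blocks re-reporting
        System.out.println(pendingCompleted.containsKey("container_199")); // false: leaked

        addUnguarded("container_199"); // with the guard removed, the retry works
        System.out.println(pendingCompleted.containsKey("container_199")); // true
    }
}
```

With the guard, one lost response is enough to leak the report permanently; without it, the report is rebuilt on every heartbeat until the cache entry expires or the RM acknowledges it, which is the retry window described in the comment above.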
[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM
[ https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17176820#comment-17176820 ] Yuanbo Liu commented on YARN-10393: --- [~wzzdreamer] Thanks for your patch; I've made some comments on it.
[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM
[ https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17175148#comment-17175148 ] Yuanbo Liu commented on YARN-10393: --- Thanks for opening this issue; we happened to hit a similar situation on hadoop-2.7.0. The mapper lost its heartbeat and never finished. Currently we just use "mapred fail-task" to put those mappers into the failed state and re-execute them. Looking forward to your patch!
> {code:java} > appmaster.log:6464:2017-08-08 16:00:00,417 INFO [RMCommunicator Allocator] > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned > container container_1501226097332_249991_01_000199 to > attempt_1501226097332_249991_m_95_0 > {code} > # The container finished on 2017-08-08 16:02:53,313. > {code:java} > yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: > Container container_1501226097332_249991_01_000199 transitioned from RUNNING > to EXITED_WITH_SUCCESS > yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: > Cleaning up container container_1501226097332_249991_01_000199 > {code} > # The NodeStatusUpdater got an exception in the heartbeat on 2017-08-08 > 16:07:04,238. In fact, the heartbeat request was actually handled by the resource > manager; however, the node manager failed to receive the response. Let’s > assume the heartBeatResponseId=$hid in node manager. According to our current > configuration, the next heartbeat will be 10s later. 
> {code:java} > 2017-08-08 16:07:04,238 ERROR > org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught > exception in status-updater > java.io.IOException: Failed on local exception: java.io.IOException: > Connection reset by peer; Host Details : local host is: ; destination host > is: XXX > at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772) > at org.apache.hadoop.ipc.Client.call(Client.java:1472) > at org.apache.hadoop.ipc.Client.call(Client.java:1399) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232) > at com.sun.proxy.$Proxy33.nodeHeartbeat(Unknown Source) > at > org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80) > at sun.reflect.GeneratedMethodAccessor61.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) > at com.sun.proxy.$Proxy34.nodeHeartbeat(Unknown Source) > at > org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:597) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: Connection reset by peer > at sun.nio.ch.FileDispatcherImpl.read0(Native Method) > at
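The leak scenario in the steps above can be sketched as a toy model. This is a hedged simplification, not the actual YARN sources: the class names (`MiniRM`, `HeartbeatLeakDemo`) are hypothetical, and the assumption that the RM answers a repeated responseId purely from a cached response (ignoring the retry's payload) while the NM clears its whole pending list on any acknowledgment is one plausible mechanism consistent with the report, not a claim about the exact 2.6.2 code path.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Toy model (hypothetical names, NOT the real NodeStatusUpdaterImpl /
// ResourceTrackerService) of how a lost heartbeat response can leak a
// completed-container report.
public class HeartbeatLeakDemo {

    // The "RM": answers a repeated responseId from cache, without
    // re-processing the request payload.
    static class MiniRM {
        int lastRespondedId = 0;
        int cachedAck = 0;
        final Set<String> completedSeen = new LinkedHashSet<>();

        int heartbeat(int responseId, List<String> completed) {
            if (responseId == lastRespondedId - 1) {
                // Duplicate heartbeat: resend the cached ack, ignore payload.
                return cachedAck;
            }
            completedSeen.addAll(completed);
            lastRespondedId = responseId + 1;
            cachedAck = lastRespondedId;
            return cachedAck;
        }
    }

    public static void main(String[] args) {
        MiniRM rm = new MiniRM();
        int nmLastReceivedId = 0;
        List<String> pendingCompleted = new ArrayList<>();

        // A container finishes; the NM queues it for the next heartbeat.
        pendingCompleted.add("container_A");

        // Heartbeat is processed by the RM, but the response is lost
        // ("Connection reset by peer"), so the NM keeps the old id.
        rm.heartbeat(nmLastReceivedId, new ArrayList<>(pendingCompleted));

        // Another container finishes before the 10s retry.
        pendingCompleted.add("container_B");

        // The retry uses the SAME responseId; the RM treats it as a
        // duplicate and never reads the payload.
        int ack = rm.heartbeat(nmLastReceivedId, new ArrayList<>(pendingCompleted));
        nmLastReceivedId = ack;
        pendingCompleted.clear(); // NM believes everything was reported

        System.out.println(rm.completedSeen); // [container_A] -- container_B leaked
    }
}
```

Under these assumptions, `container_B` is never recorded at the RM even though the NM considers it reported, which matches the "completed state container leak" symptom.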
[jira] [Commented] (YARN-10380) Import logic of multi-node allocation in CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-10380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17174287#comment-17174287 ] Yuanbo Liu commented on YARN-10380: --- [~wangda] Thanks for opening this issue. Not sure whether you're working on it; I'd be glad to help with it. > Import logic of multi-node allocation in CapacityScheduler > -- > > Key: YARN-10380 > URL: https://issues.apache.org/jira/browse/YARN-10380 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Wangda Tan >Priority: Critical > > *1) Entry point:* > When we do multi-node allocation, we're using the same logic of async > scheduling: > {code:java} > // Allocate containers of node [start, end) > for (FiCaSchedulerNode node : nodes) { > if (current++ >= start) { > if (shouldSkipNodeSchedule(node, cs, printSkipedNodeLogging)) { > continue; > } > cs.allocateContainersToNode(node.getNodeID(), false); > } > } {code} > Is it the most effective way to do multi-node scheduling? Should we allocate > based on partitions? In the above logic, if we have thousands of nodes in one > partition, we will repeatedly access all nodes of the partition thousands of > times. > I would suggest looking at making the entry points for node-heartbeat, > async-scheduling (single node), and async-scheduling (multi-node) > different. > Node-heartbeat and async-scheduling (single node) can still be similar and > share most of the code. > async-scheduling (multi-node): should iterate partitions first, using pseudo > code like: > {code:java} > for (partition : all partitions) { > allocateContainersOnMultiNodes(getCandidate(partition)) > } {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
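The partition-first entry point sketched in the pseudocode above can be fleshed out as follows. This is an illustrative sketch only: the class and method names are hypothetical stand-ins, not the actual CapacityScheduler API. The point is that nodes are grouped by partition once, and each multi-node pass hands a whole partition's candidate set over in a single call instead of re-scanning every node per cycle.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical sketch of a partition-first scheduling entry point;
// not the real CapacityScheduler classes.
public class PartitionFirstScheduling {

    static class Node {
        final String id;
        final String partition;
        Node(String id, String partition) { this.id = id; this.partition = partition; }
    }

    // Group nodes by partition once, instead of iterating all nodes of the
    // partition on every scheduling cycle. TreeMap gives a stable order.
    static Map<String, List<Node>> byPartition(List<Node> nodes) {
        return nodes.stream().collect(
            Collectors.groupingBy(n -> n.partition, TreeMap::new, Collectors.toList()));
    }

    // One multi-node pass, mirroring the pseudocode:
    //   for (partition : all partitions)
    //     allocateContainersOnMultiNodes(getCandidate(partition))
    static List<String> schedulePass(Map<String, List<Node>> partitions) {
        List<String> handled = new ArrayList<>();
        for (Map.Entry<String, List<Node>> e : partitions.entrySet()) {
            // Stand-in for allocateContainersOnMultiNodes(candidates):
            // the entire candidate set is processed in one call.
            handled.add(e.getKey() + ":" + e.getValue().size());
        }
        return handled;
    }

    public static void main(String[] args) {
        List<Node> cluster = List.of(
            new Node("n1", "gpu"), new Node("n2", "gpu"), new Node("n3", "default"));
        System.out.println(schedulePass(byPartition(cluster))); // [default:1, gpu:2]
    }
}
```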
[jira] [Commented] (YARN-10380) Import logic of multi-node allocation in CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-10380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172930#comment-17172930 ] Yuanbo Liu commented on YARN-10380: --- It seems multi-node allocation only works on reserved container assignment. We can reorganize that code to improve assignment speed. > Import logic of multi-node allocation in CapacityScheduler > -- > > Key: YARN-10380 > URL: https://issues.apache.org/jira/browse/YARN-10380 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Wangda Tan >Priority: Critical > > *1) Entry point:* > When we do multi-node allocation, we're using the same logic of async > scheduling: > {code:java} > // Allocate containers of node [start, end) > for (FiCaSchedulerNode node : nodes) { > if (current++ >= start) { > if (shouldSkipNodeSchedule(node, cs, printSkipedNodeLogging)) { > continue; > } > cs.allocateContainersToNode(node.getNodeID(), false); > } > } {code} > Is it the most effective way to do multi-node scheduling? Should we allocate > based on partitions? In the above logic, if we have thousands of nodes in one > partition, we will repeatedly access all nodes of the partition thousands of > times. > I would suggest looking at making the entry points for node-heartbeat, > async-scheduling (single node), and async-scheduling (multi-node) > different. > Node-heartbeat and async-scheduling (single node) can still be similar and > share most of the code. > async-scheduling (multi-node): should iterate partitions first, using pseudo > code like: > {code:java} > for (partition : all partitions) { > allocateContainersOnMultiNodes(getCandidate(partition)) > } {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-6325) ParentQueue and LeafQueue with same name can cause queue name based operations to fail
[ https://issues.apache.org/jira/browse/YARN-6325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuanbo Liu reassigned YARN-6325: Assignee: Yuanbo Liu > ParentQueue and LeafQueue with same name can cause queue name based > operations to fail > -- > > Key: YARN-6325 > URL: https://issues.apache.org/jira/browse/YARN-6325 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Reporter: Jonathan Hung >Assignee: Yuanbo Liu >Priority: Major > Attachments: Screen Shot 2017-03-13 at 2.28.30 PM.png, > capacity-scheduler.xml > > > For example, configure capacity scheduler with two leaf queues: {{root.a.a1}} > and {{root.b.a}}, with {{yarn.scheduler.capacity.root.queues}} as {{b,a}} (in > that order). > Then add a mapping e.g. {{u:username:a}} to {{capacity-scheduler.xml}} and > call {{refreshQueues}}. Operation fails with {noformat}refreshQueues: > java.io.IOException: Failed to re-init queues : mapping contains invalid or > non-leaf queue a > at > org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38) > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.logAndWrapException(AdminService.java:866) > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:391) > at > org.apache.hadoop.yarn.server.api.impl.pb.service.ResourceManagerAdministrationProtocolPBServiceImpl.refreshQueues(ResourceManagerAdministrationProtocolPBServiceImpl.java:114) > at > org.apache.hadoop.yarn.proto.ResourceManagerAdministrationProtocol$ResourceManagerAdministrationProtocolService$2.callBlockingMethod(ResourceManagerAdministrationProtocol.java:271) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:522) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:867) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:813) > at java.security.AccessController.doPrivileged(Native Method) 
> at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1857) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2653) > Caused by: java.io.IOException: Failed to re-init queues : mapping contains > invalid or non-leaf queue a > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:404) > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:396) > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:386) > ... 10 more > Caused by: java.io.IOException: mapping contains invalid or non-leaf queue a > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.getUserGroupMappingPlacementRule(CapacityScheduler.java:547) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.updatePlacementRules(CapacityScheduler.java:571) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitializeQueues(CapacityScheduler.java:595) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:400) > ... 12 more > {noformat} > Part of the issue is that the {{queues}} map in > {{CapacitySchedulerQueueManager}} stores queues by queue name. We could do > one of a few things: > # Disallow ParentQueues and LeafQueues to have the same queue name. (this > breaks compatibility) > # Store queues by queue path instead of queue name. But this might require > changes in lots of places, e.g. in this case the queue-mappings would have to > map to a queue path instead of a queue name (which also breaks compatibility) > and possibly others. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
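The ambiguity behind the "mapping contains invalid or non-leaf queue a" failure above comes down to keying the queue map by short name, so a ParentQueue and a LeafQueue that share a name collapse onto one key. A minimal sketch of the two options (hypothetical `Queue`/`QueuePathDemo` names, not the real CapacitySchedulerQueueManager):

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of name-keyed vs. path-keyed queue maps; illustrative
// names only, not the real CapacityScheduler code.
public class QueuePathDemo {

    static class Queue {
        final String path;
        final boolean leaf;
        Queue(String path, boolean leaf) { this.path = path; this.leaf = leaf; }
        String shortName() { return path.substring(path.lastIndexOf('.') + 1); }
    }

    public static void main(String[] args) {
        Queue parentA = new Queue("root.a", false);  // ParentQueue named "a"
        Queue leafA   = new Queue("root.b.a", true); // LeafQueue named "a"

        // Keyed by short name: the second put silently shadows the first,
        // so a mapping like u:username:a may resolve to a non-leaf queue.
        Map<String, Queue> byName = new HashMap<>();
        byName.put(parentA.shortName(), parentA);
        byName.put(leafA.shortName(), leafA); // overwrites key "a"

        // Keyed by full path (option 2): both queues coexist unambiguously,
        // at the cost of queue-mappings needing full paths.
        Map<String, Queue> byPath = new HashMap<>();
        byPath.put(parentA.path, parentA);
        byPath.put(leafA.path, leafA);

        System.out.println(byName.size() + " vs " + byPath.size()); // 1 vs 2
    }
}
```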
[jira] [Commented] (YARN-6325) ParentQueue and LeafQueue with same name can cause queue name based operations to fail
[ https://issues.apache.org/jira/browse/YARN-6325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16825656#comment-16825656 ] Yuanbo Liu commented on YARN-6325: -- [~leftnoteasy] we have such kind of issue in our environment. I'd like to patch it. Any further comment will be welcome. > ParentQueue and LeafQueue with same name can cause queue name based > operations to fail > -- > > Key: YARN-6325 > URL: https://issues.apache.org/jira/browse/YARN-6325 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Reporter: Jonathan Hung >Priority: Major > Attachments: Screen Shot 2017-03-13 at 2.28.30 PM.png, > capacity-scheduler.xml > > > For example, configure capacity scheduler with two leaf queues: {{root.a.a1}} > and {{root.b.a}}, with {{yarn.scheduler.capacity.root.queues}} as {{b,a}} (in > that order). > Then add a mapping e.g. {{u:username:a}} to {{capacity-scheduler.xml}} and > call {{refreshQueues}}. Operation fails with {noformat}refreshQueues: > java.io.IOException: Failed to re-init queues : mapping contains invalid or > non-leaf queue a > at > org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38) > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.logAndWrapException(AdminService.java:866) > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:391) > at > org.apache.hadoop.yarn.server.api.impl.pb.service.ResourceManagerAdministrationProtocolPBServiceImpl.refreshQueues(ResourceManagerAdministrationProtocolPBServiceImpl.java:114) > at > org.apache.hadoop.yarn.proto.ResourceManagerAdministrationProtocol$ResourceManagerAdministrationProtocolService$2.callBlockingMethod(ResourceManagerAdministrationProtocol.java:271) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:522) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:867) > at 
org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:813) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1857) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2653) > Caused by: java.io.IOException: Failed to re-init queues : mapping contains > invalid or non-leaf queue a > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:404) > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:396) > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:386) > ... 10 more > Caused by: java.io.IOException: mapping contains invalid or non-leaf queue a > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.getUserGroupMappingPlacementRule(CapacityScheduler.java:547) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.updatePlacementRules(CapacityScheduler.java:571) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitializeQueues(CapacityScheduler.java:595) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:400) > ... 12 more > {noformat} > Part of the issue is that the {{queues}} map in > {{CapacitySchedulerQueueManager}} stores queues by queue name. We could do > one of a few things: > # Disallow ParentQueues and LeafQueues to have the same queue name. (this > breaks compatibility) > # Store queues by queue path instead of queue name. But this might require > changes in lots of places, e.g. in this case the queue-mappings would have to > map to a queue path instead of a queue name (which also breaks compatibility) > and possibly others. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8513) CapacityScheduler infinite loop when queue is near fully utilized
[ https://issues.apache.org/jira/browse/YARN-8513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16551551#comment-16551551 ] Yuanbo Liu commented on YARN-8513: -- Sorry for the late response. Quite busy this week. I will go through the dump files today > CapacityScheduler infinite loop when queue is near fully utilized > - > > Key: YARN-8513 > URL: https://issues.apache.org/jira/browse/YARN-8513 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, yarn >Affects Versions: 2.9.1 > Environment: Ubuntu 14.04.5 > YARN is configured with one label and 5 queues. >Reporter: Chen Yufei >Priority: Major > Attachments: jstack-1.log, jstack-2.log, jstack-3.log, jstack-4.log, > jstack-5.log, top-during-lock.log, top-when-normal.log > > > ResourceManager does not respond to any request when queue is near fully > utilized sometimes. Sending SIGTERM won't stop RM, only SIGKILL can. After RM > restart, it can recover running jobs and start accepting new ones. 
> > Seems like CapacityScheduler is in an infinite loop printing out the > following log messages (more than 25,000 lines in a second): > > {{2018-07-10 17:16:29,227 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: > assignedContainer queue=root usedCapacity=0.99816763 > absoluteUsedCapacity=0.99816763 used= > cluster=}} > {{2018-07-10 17:16:29,227 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: > Failed to accept allocation proposal}} > {{2018-07-10 17:16:29,227 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.AbstractContainerAllocator: > assignedContainer application attempt=appattempt_1530619767030_1652_01 > container=null > queue=org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator@14420943 > clusterResource= type=NODE_LOCAL > requestedPartition=}} > > I encounter this problem several times after upgrading to YARN 2.9.1, while > the same configuration works fine under version 2.7.3. > > YARN-4477 is an infinite loop bug in FairScheduler, not sure if this is a > similar problem. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-6482) TestSLSRunner runs but doesn't executed jobs (.json parsing issue)
[ https://issues.apache.org/jira/browse/YARN-6482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544074#comment-16544074 ] Yuanbo Liu edited comment on YARN-6482 at 7/14/18 6:09 AM: --- The nodes in the rumen file are not correct. Attaching a v1 patch to fix this issue. [~djp] / [~cheersyang] Please take a look if you get time. Thanks in advance. was (Author: yuanbo): The nodes in the rumen file are not correct. Attaching a v1 patch to fix this issue. [~djp] Please take a look if you get time. Thanks in advance. > TestSLSRunner runs but doesn't executed jobs (.json parsing issue) > -- > > Key: YARN-6482 > URL: https://issues.apache.org/jira/browse/YARN-6482 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Carlo Curino >Assignee: Yuanbo Liu >Priority: Minor > Attachments: YARN-6482.001.patch > > > The TestSLSRunner runs correctly bringing up an RM, but the parsing of the > rumen trace fails somehow silently, and no nodes nor jobs are loaded. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6482) TestSLSRunner runs but doesn't executed jobs (.json parsing issue)
[ https://issues.apache.org/jira/browse/YARN-6482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544074#comment-16544074 ] Yuanbo Liu commented on YARN-6482: -- The nodes in the rumen file are not correct. Attaching a v1 patch to fix this issue. [~djp] Please take a look if you get time. Thanks in advance. > TestSLSRunner runs but doesn't executed jobs (.json parsing issue) > -- > > Key: YARN-6482 > URL: https://issues.apache.org/jira/browse/YARN-6482 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Carlo Curino >Assignee: Yuanbo Liu >Priority: Minor > Attachments: YARN-6482.001.patch > > > The TestSLSRunner runs correctly bringing up an RM, but the parsing of the > rumen trace fails somehow silently, and no nodes nor jobs are loaded. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6482) TestSLSRunner runs but doesn't executed jobs (.json parsing issue)
[ https://issues.apache.org/jira/browse/YARN-6482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuanbo Liu updated YARN-6482: - Attachment: YARN-6482.001.patch > TestSLSRunner runs but doesn't executed jobs (.json parsing issue) > -- > > Key: YARN-6482 > URL: https://issues.apache.org/jira/browse/YARN-6482 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Carlo Curino >Assignee: Yuanbo Liu >Priority: Minor > Attachments: YARN-6482.001.patch > > > The TestSLSRunner runs correctly bringing up an RM, but the parsing of the > rumen trace fails somehow silently, and no nodes nor jobs are loaded. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8510) Timezone offset in YARN UI does not format minutes component correctly.
[ https://issues.apache.org/jira/browse/YARN-8510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16542729#comment-16542729 ] Yuanbo Liu commented on YARN-8510: -- [~tmoschou] Can you attach a screenshot of the YARN UI so that we can address the issue quickly? > Timezone offset in YARN UI does not format minutes component correctly. > --- > > Key: YARN-8510 > URL: https://issues.apache.org/jira/browse/YARN-8510 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.8.4 >Reporter: Terry Moschou >Priority: Trivial > > Offsets with a non-zero mm component like +9.5hr (ACST) are formatted as > +0950 rather than +0930 in the YARN UI. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
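The arithmetic behind the reported symptom: a +9.5-hour offset must render its fractional part as minutes (30), not as hundredths of an hour (50). The sketch below (the bug is in the YARN UI, so this is only an illustration of the arithmetic, not the actual UI code; `buggyFormat` is one plausible way the bug arises) contrasts the correct ±HHMM formatting with the hundredths-of-an-hour mistake:

```java
// Illustrative offset-formatting arithmetic; not the actual YARN UI code.
public class TzOffsetFormat {

    // Correct: split an offset in minutes into hour and minute fields.
    static String format(int offsetMinutes) {
        char sign = offsetMinutes < 0 ? '-' : '+';
        int abs = Math.abs(offsetMinutes);
        return String.format("%c%02d%02d", sign, abs / 60, abs % 60);
    }

    // One plausible buggy variant: the fractional hour is rendered as
    // hundredths, so 9.5h becomes 950 -> "+0950" instead of "+0930".
    static String buggyFormat(int offsetMinutes) {
        double hours = offsetMinutes / 60.0;
        int abs = (int) Math.abs(hours * 100);
        return String.format("%c%04d", hours < 0 ? '-' : '+', abs);
    }

    public static void main(String[] args) {
        System.out.println(format(570));      // ACST +9.5h -> +0930
        System.out.println(buggyFormat(570)); // -> +0950, the reported symptom
    }
}
```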
[jira] [Commented] (YARN-8513) CapacityScheduler infinite loop when queue is near fully utilized
[ https://issues.apache.org/jira/browse/YARN-8513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16542719#comment-16542719 ] Yuanbo Liu commented on YARN-8513: -- [~cyfdecyf] Can you reproduce this issue and capture the stack of the RM? # jstack -F pid # top -H -p pid Then attach that info in this Jira so that we can figure out the cause of the infinite loop here. > CapacityScheduler infinite loop when queue is near fully utilized > - > > Key: YARN-8513 > URL: https://issues.apache.org/jira/browse/YARN-8513 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, yarn >Affects Versions: 2.9.1 > Environment: Ubuntu 14.04.5 > YARN is configured with one label and 5 queues. >Reporter: Chen Yufei >Priority: Major > > ResourceManager does not respond to any request when queue is near fully > utilized sometimes. Sending SIGTERM won't stop RM, only SIGKILL can. After RM > restart, it can recover running jobs and start accepting new ones. > > Seems like CapacityScheduler is in an infinite loop printing out the > following log messages (more than 25,000 lines in a second): > > {{2018-07-10 17:16:29,227 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: > assignedContainer queue=root usedCapacity=0.99816763 > absoluteUsedCapacity=0.99816763 used= > cluster=}} > {{2018-07-10 17:16:29,227 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: > Failed to accept allocation proposal}} > {{2018-07-10 17:16:29,227 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.AbstractContainerAllocator: > assignedContainer application attempt=appattempt_1530619767030_1652_01 > container=null > queue=org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator@14420943 > clusterResource= type=NODE_LOCAL > requestedPartition=}} > > I encounter this problem several times after upgrading to YARN 2.9.1, while > the same 
configuration works fine under version 2.7.3. > > YARN-4477 is an infinite loop bug in FairScheduler, not sure if this is a > similar problem. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6482) TestSLSRunner runs but doesn't executed jobs (.json parsing issue)
[ https://issues.apache.org/jira/browse/YARN-6482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16536702#comment-16536702 ] Yuanbo Liu commented on YARN-6482: -- Sorry, I will update this Jira shortly. > TestSLSRunner runs but doesn't executed jobs (.json parsing issue) > -- > > Key: YARN-6482 > URL: https://issues.apache.org/jira/browse/YARN-6482 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Carlo Curino >Assignee: Yuanbo Liu >Priority: Minor > > The TestSLSRunner runs correctly bringing up an RM, but the parsing of the > rumen trace fails somehow silently, and no nodes nor jobs are loaded. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6265) yarn.resourcemanager.fail-fast is used inconsistently
[ https://issues.apache.org/jira/browse/YARN-6265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuanbo Liu updated YARN-6265: - Attachment: YARN-6265.003.patch > yarn.resourcemanager.fail-fast is used inconsistently > - > > Key: YARN-6265 > URL: https://issues.apache.org/jira/browse/YARN-6265 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.8.0 >Reporter: Daniel Templeton >Assignee: Yuanbo Liu >Priority: Major > Attachments: YARN-6265.001.patch, YARN-6265.002.patch, > YARN-6265.003.patch > > > In capacity scheduler, the property is used to control whether an app with > no/bad queue should be killed. In the state store, the property controls > whether a state store op failure should cause the RM to exit in non-HA mode. > Those are two very different things, and they should be separated. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6265) yarn.resourcemanager.fail-fast is used inconsistently
[ https://issues.apache.org/jira/browse/YARN-6265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16528987#comment-16528987 ] Yuanbo Liu commented on YARN-6265: -- rebase the patch. > yarn.resourcemanager.fail-fast is used inconsistently > - > > Key: YARN-6265 > URL: https://issues.apache.org/jira/browse/YARN-6265 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.8.0 >Reporter: Daniel Templeton >Assignee: Yuanbo Liu >Priority: Major > Attachments: YARN-6265.001.patch, YARN-6265.002.patch, > YARN-6265.003.patch > > > In capacity scheduler, the property is used to control whether an app with > no/bad queue should be killed. In the state store, the property controls > whether a state store op failure should cause the RM to exit in non-HA mode. > Those are two very different things, and they should be separated. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-6482) TestSLSRunner runs but doesn't executed jobs (.json parsing issue)
[ https://issues.apache.org/jira/browse/YARN-6482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuanbo Liu reassigned YARN-6482: Assignee: Yuanbo Liu > TestSLSRunner runs but doesn't executed jobs (.json parsing issue) > -- > > Key: YARN-6482 > URL: https://issues.apache.org/jira/browse/YARN-6482 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Carlo Curino >Assignee: Yuanbo Liu >Priority: Minor > > The TestSLSRunner runs correctly bringing up an RM, but the parsing of the > rumen trace fails somehow silently, and no nodes nor jobs are loaded. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6482) TestSLSRunner runs but doesn't executed jobs (.json parsing issue)
[ https://issues.apache.org/jira/browse/YARN-6482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15973987#comment-15973987 ] Yuanbo Liu commented on YARN-6482: -- Taking it over; this defect was introduced by YARN-4612. > TestSLSRunner runs but doesn't executed jobs (.json parsing issue) > -- > > Key: YARN-6482 > URL: https://issues.apache.org/jira/browse/YARN-6482 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Carlo Curino >Priority: Minor > > The TestSLSRunner runs correctly bringing up an RM, but the parsing of the > rumen trace fails somehow silently, and no nodes nor jobs are loaded. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6265) yarn.resourcemanager.fail-fast is used inconsistently
[ https://issues.apache.org/jira/browse/YARN-6265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967164#comment-15967164 ] Yuanbo Liu commented on YARN-6265: -- The test failures are not related. > yarn.resourcemanager.fail-fast is used inconsistently > - > > Key: YARN-6265 > URL: https://issues.apache.org/jira/browse/YARN-6265 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.8.0 >Reporter: Daniel Templeton >Assignee: Yuanbo Liu > Attachments: YARN-6265.001.patch, YARN-6265.002.patch > > > In capacity scheduler, the property is used to control whether an app with > no/bad queue should be killed. In the state store, the property controls > whether a state store op failure should cause the RM to exit in non-HA mode. > Those are two very different things, and they should be separated. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6265) yarn.resourcemanager.fail-fast is used inconsistently
[ https://issues.apache.org/jira/browse/YARN-6265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15958341#comment-15958341 ] Yuanbo Liu commented on YARN-6265: -- [~djp] and [~templedf] Thanks for your review. Upload v2 patch to address your comments. > yarn.resourcemanager.fail-fast is used inconsistently > - > > Key: YARN-6265 > URL: https://issues.apache.org/jira/browse/YARN-6265 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.8.0 >Reporter: Daniel Templeton >Assignee: Yuanbo Liu > Attachments: YARN-6265.001.patch, YARN-6265.002.patch > > > In capacity scheduler, the property is used to control whether an app with > no/bad queue should be killed. In the state store, the property controls > whether a state store op failure should cause the RM to exit in non-HA mode. > Those are two very different things, and they should be separated. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6265) yarn.resourcemanager.fail-fast is used inconsistently
[ https://issues.apache.org/jira/browse/YARN-6265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuanbo Liu updated YARN-6265: - Attachment: YARN-6265.002.patch > yarn.resourcemanager.fail-fast is used inconsistently > - > > Key: YARN-6265 > URL: https://issues.apache.org/jira/browse/YARN-6265 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.8.0 >Reporter: Daniel Templeton >Assignee: Yuanbo Liu > Attachments: YARN-6265.001.patch, YARN-6265.002.patch > > > In capacity scheduler, the property is used to control whether an app with > no/bad queue should be killed. In the state store, the property controls > whether a state store op failure should cause the RM to exit in non-HA mode. > Those are two very different things, and they should be separated. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6265) yarn.resourcemanager.fail-fast is used inconsistently
[ https://issues.apache.org/jira/browse/YARN-6265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15937643#comment-15937643 ] Yuanbo Liu commented on YARN-6265: -- Sorry to interrupt; it has been a while since the last update. It would be great if somebody could look into my patch and share some thoughts. Thanks in advance.
[jira] [Created] (YARN-6341) Redirected tracking UI of application is not correct if web policy is transformed from HTTP_ONLY to HTTPS_ONLY
Yuanbo Liu created YARN-6341:

Summary: Redirected tracking UI of application is not correct if web policy is transformed from HTTP_ONLY to HTTPS_ONLY
Key: YARN-6341
URL: https://issues.apache.org/jira/browse/YARN-6341
Project: Hadoop YARN
Issue Type: Bug
Reporter: Yuanbo Liu

Before enabling Hadoop HTTPS, users submit an MR job. After the job finishes and the web policy is configured as HTTPS_ONLY, users follow these steps: Resource Manager UI -> Applications -> Tracking UI. The address is then redirected to an HTTP address of the job history server instead of an HTTPS address. I think this behavior is related to {{WebAppProxyServlet#getTrackingUri}}.
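A possible shape of the fix hinted at above can be sketched in isolation. This is not {{WebAppProxyServlet}} code; the method and host names are hypothetical, and it ignores the fact that the HTTP and HTTPS endpoints may also use different ports. The idea is only to rebuild the scheme of a stored tracking URL from the current policy instead of trusting the scheme recorded when the job ran.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hedged sketch: derive the tracking URL's scheme from the *current*
// web-app policy rather than the scheme stored with the finished job.
public class TrackingUriDemo {
    static URI withCurrentScheme(URI stored, boolean httpsOnly) {
        String scheme = httpsOnly ? "https" : "http";
        try {
            return new URI(scheme, stored.getAuthority(), stored.getPath(),
                           stored.getQuery(), stored.getFragment());
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical job-history-server URL recorded while HTTP_ONLY was active.
        URI stored = URI.create("http://jhs.example.com:19888/jobhistory/job/job_1");
        System.out.println(withCurrentScheme(stored, true));
    }
}
```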
[jira] [Commented] (YARN-6265) yarn.resourcemanager.fail-fast is used inconsistently
[ https://issues.apache.org/jira/browse/YARN-6265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15925400#comment-15925400 ] Yuanbo Liu commented on YARN-6265: -- [~djp] Thanks
[jira] [Commented] (YARN-6265) yarn.resourcemanager.fail-fast is used inconsistently
[ https://issues.apache.org/jira/browse/YARN-6265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15925368#comment-15925368 ] Yuanbo Liu commented on YARN-6265: -- [~templedf] / [~djp] Is Jenkins having any issues? I submitted the patch but haven't received the result report.
[jira] [Updated] (YARN-6265) yarn.resourcemanager.fail-fast is used inconsistently
[ https://issues.apache.org/jira/browse/YARN-6265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuanbo Liu updated YARN-6265: - Attachment: YARN-6265.001.patch Uploaded the v1 patch for this JIRA.
[jira] [Assigned] (YARN-6265) yarn.resourcemanager.fail-fast is used inconsistently
[ https://issues.apache.org/jira/browse/YARN-6265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuanbo Liu reassigned YARN-6265: Assignee: Yuanbo Liu
[jira] [Commented] (YARN-6300) NULL_UPDATE_REQUESTS is redundant in TestFairScheduler
[ https://issues.apache.org/jira/browse/YARN-6300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15904329#comment-15904329 ]

Yuanbo Liu commented on YARN-6300:
--
[~templedf] Thanks for your commit. !Selection_124.png! I've seen this patch in branch-2, so I guess I don't need to provide a branch-2 patch any more, right?

> NULL_UPDATE_REQUESTS is redundant in TestFairScheduler
>
> Key: YARN-6300
> URL: https://issues.apache.org/jira/browse/YARN-6300
> Project: Hadoop YARN
> Issue Type: Improvement
> Affects Versions: 3.0.0-alpha2
> Reporter: Daniel Templeton
> Assignee: Yuanbo Liu
> Priority: Minor
> Labels: newbie
> Fix For: 3.0.0-alpha3
> Attachments: Selection_124.png, YARN-6300.001.patch
>
> The {{TestFairScheduler.NULL_UPDATE_REQUESTS}} field hides
> {{FairSchedulerTestBase.NULL_UPDATE_REQUESTS}}, which has the same value.
> The {{NULL_UPDATE_REQUESTS}} field should be removed from
> {{TestFairScheduler}}.
> While you're at it, maybe also remove the unused import.
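The field hiding described in the quoted issue can be reproduced with toy classes. {{Base}} and {{Sub}} below are stand-ins for {{FairSchedulerTestBase}} and {{TestFairScheduler}}, not the actual Hadoop test classes.

```java
// Toy illustration of Java field hiding: the subclass redeclares a field
// with the same name and value, so the redeclaration is redundant.
public class FieldHidingDemo {
    static class Base {
        static final String NULL_UPDATE_REQUESTS = "none";
    }

    // Sub.NULL_UPDATE_REQUESTS hides Base's copy without changing anything,
    // which is exactly why the patch simply removes the redeclaration.
    static class Sub extends Base {
        static final String NULL_UPDATE_REQUESTS = "none";
    }

    static boolean sameValue() {
        return Base.NULL_UPDATE_REQUESTS.equals(Sub.NULL_UPDATE_REQUESTS);
    }

    public static void main(String[] args) {
        System.out.println(sameValue());
    }
}
```

Deleting the field from {{Sub}} changes nothing observable; lookups just resolve to the inherited {{Base}} field.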
[jira] [Updated] (YARN-6300) NULL_UPDATE_REQUESTS is redundant in TestFairScheduler
[ https://issues.apache.org/jira/browse/YARN-6300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuanbo Liu updated YARN-6300: - Attachment: Selection_124.png
[jira] [Commented] (YARN-6300) NULL_UPDATE_REQUESTS is redundant in TestFairScheduler
[ https://issues.apache.org/jira/browse/YARN-6300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15902391#comment-15902391 ] Yuanbo Liu commented on YARN-6300: -- [~haibochen] Thanks for your review.
[jira] [Updated] (YARN-6300) NULL_UPDATE_REQUESTS is redundant in TestFairScheduler
[ https://issues.apache.org/jira/browse/YARN-6300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuanbo Liu updated YARN-6300: - Attachment: YARN-6300.001.patch Uploaded the v1 patch for this JIRA.
[jira] [Assigned] (YARN-6300) NULL_UPDATE_REQUESTS is redundant in TestFairScheduler
[ https://issues.apache.org/jira/browse/YARN-6300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuanbo Liu reassigned YARN-6300: Assignee: Yuanbo Liu
[jira] [Commented] (YARN-1728) Workaround guice3x-undecoded pathInfo in YARN WebApp
[ https://issues.apache.org/jira/browse/YARN-1728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889359#comment-15889359 ]

Yuanbo Liu commented on YARN-1728:
--
Thanks a lot!

> Workaround guice3x-undecoded pathInfo in YARN WebApp
>
> Key: YARN-1728
> URL: https://issues.apache.org/jira/browse/YARN-1728
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Abraham Elmahrek
> Assignee: Yuanbo Liu
> Fix For: 2.8.0, 2.7.4, 3.0.0-alpha3
> Attachments: test-case-for-trunk.patch, YARN-1728-branch-2.001.patch,
> YARN-1728-branch-2.002.patch, YARN-1728-branch-2.003.patch,
> YARN-1728-branch-2.004.patch, YARN-1728-branch-2.005.patch
>
> For example, going to the job history server page
> http://localhost:19888/jobhistory/logs/localhost%3A8041/container_1391466602060_0011_01_01/job_1391466602060_0011/admin/stderr
> results in the following error:
> {code}
> Cannot get container logs. Invalid nodeId:
> test-cdh5-hue.ent.cloudera.com%3A8041
> {code}
> Where the url-decoded version works:
> http://localhost:19888/jobhistory/logs/localhost:8041/container_1391466602060_0011_01_01/job_1391466602060_0011/admin/stderr
> It seems like both should be supported, as the former is simply percent
> encoding.
[jira] [Commented] (YARN-1728) History server doesn't understand percent encoded paths
[ https://issues.apache.org/jira/browse/YARN-1728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15887139#comment-15887139 ] Yuanbo Liu commented on YARN-1728: -- [~jira.shegalov] Thanks for your comments. Uploaded the v5 patch and a test case for trunk to address your comments.
[jira] [Updated] (YARN-1728) History server doesn't understand percent encoded paths
[ https://issues.apache.org/jira/browse/YARN-1728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuanbo Liu updated YARN-1728: - Attachment: test-case-for-trunk.patch
[jira] [Updated] (YARN-1728) History server doesn't understand percent encoded paths
[ https://issues.apache.org/jira/browse/YARN-1728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuanbo Liu updated YARN-1728: - Attachment: YARN-1728-branch-2.005.patch
[jira] [Updated] (YARN-1728) History server doesn't understand percent encoded paths
[ https://issues.apache.org/jira/browse/YARN-1728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuanbo Liu updated YARN-1728: - Attachment: YARN-1728-branch-2.004.patch
[jira] [Commented] (YARN-1728) History server doesn't understand percent encoded paths
[ https://issues.apache.org/jira/browse/YARN-1728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15885088#comment-15885088 ] Yuanbo Liu commented on YARN-1728: -- [~jira.shegalov] Thanks for your comments; I agree with you. Since the guice patch uses {{pathInfo = new URI(pathInfo).getPath()}} inside a try/catch block, and {{URI.create}} raises a run-time exception, I use {{new URI}} instead of {{URI.create}} for compatibility. Uploaded the v4 patch for this JIRA; please review it.
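The {{new URI(pathInfo).getPath()}} approach discussed above can be demonstrated in isolation. This snippet only illustrates the JDK behavior the patch relies on ({{URI#getPath}} returns the decoded path); it is not code from the patch itself.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Demonstrates that URI#getPath decodes percent-encoded octets,
// which is what turns "localhost%3A8041" into a valid nodeId.
public class PathDecodeDemo {
    static String decode(String rawPath) {
        try {
            // getPath returns the decoded form: %3A becomes ':'
            return new URI(rawPath).getPath();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(decode("/jobhistory/logs/localhost%3A8041"));
        // prints /jobhistory/logs/localhost:8041
    }
}
```

Note the constructor throws the checked {{URISyntaxException}}, while {{URI.create}} wraps it in an unchecked {{IllegalArgumentException}}; that difference is what the comment above is weighing.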
[jira] [Updated] (YARN-1728) History server doesn't understand percent encoded paths
[ https://issues.apache.org/jira/browse/YARN-1728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuanbo Liu updated YARN-1728: - Attachment: YARN-1728-branch-2.003.patch [~haibochen] Thanks a lot for your comments. Uploaded the v3 patch.
[jira] [Commented] (YARN-1728) History server doesn't understand percent encoded paths
[ https://issues.apache.org/jira/browse/YARN-1728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15879972#comment-15879972 ] Yuanbo Liu commented on YARN-1728: -- [~haibochen] Thanks for your review. Uploaded the v2 patch to address your comments.
[jira] [Updated] (YARN-1728) History server doesn't understand percent encoded paths
[ https://issues.apache.org/jira/browse/YARN-1728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuanbo Liu updated YARN-1728: - Attachment: YARN-1728-branch-2.002.patch
[jira] [Commented] (YARN-1728) History server doesn't understand percent encoded paths
[ https://issues.apache.org/jira/browse/YARN-1728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15871448#comment-15871448 ] Yuanbo Liu commented on YARN-1728: -- [~jira.shegalov] / [~rkanter] Would you mind having a look at my patch and sharing some thoughts? Thanks in advance!
[jira] [Updated] (YARN-1728) History server doesn't understand percent encoded paths
[ https://issues.apache.org/jira/browse/YARN-1728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuanbo Liu updated YARN-1728:
-
Attachment: YARN-1728-branch-2.001.patch

The root cause of this defect is that the third-party jar guice-3.0 doesn't obey the rule introduced by [~jira.shegalov]. With HADOOP-12064, this defect is no longer a problem in trunk and hadoop-3, but it still exists in branch-2. I strongly suggest that this defect be addressed in hadoop-2, because encoded paths in URLs are quite common and the method implemented in guice-3.0 doesn't follow the contract of the {{HttpServletRequest#getPathInfo}} interface.
[jira] [Commented] (YARN-1728) History server doesn't understand percent encoded paths
[ https://issues.apache.org/jira/browse/YARN-1728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15871131#comment-15871131 ] Yuanbo Liu commented on YARN-1728: -- I'd like to take this over and see why the path is not decoded. We've encountered this kind of defect in hadoop-2.7.3.
[jira] [Assigned] (YARN-1728) History server doesn't understand percent encoded paths
[ https://issues.apache.org/jira/browse/YARN-1728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuanbo Liu reassigned YARN-1728: Assignee: Yuanbo Liu
[jira] [Commented] (YARN-6073) Misuse of format specifier in Preconditions.checkArgument
[ https://issues.apache.org/jira/browse/YARN-6073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15814032#comment-15814032 ]

Yuanbo Liu commented on YARN-6073:
--
[~templedf] Thanks for your review.

> Misuse of format specifier in Preconditions.checkArgument
>
> Key: YARN-6073
> URL: https://issues.apache.org/jira/browse/YARN-6073
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Yongjun Zhang
> Assignee: Yuanbo Liu
> Priority: Trivial
> Attachments: YARN-6073.001.patch
>
> RMAdminCLI.java
> {code}
> int nLabels = map.get(nodeId).size();
> Preconditions.checkArgument(nLabels <= 1, "%d labels specified on host=%s"
>     + ", please note that we do not support specifying multiple"
>     + " labels on a single host for now.", nLabels, nodeIdStr);
> {code}
> The {{%d}} should be replaced with {{%s}}, per
> https://google.github.io/guava/releases/19.0/api/docs/com/google/common/base/Preconditions.html

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
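To see why the {{%d}} is wrong: Guava's precondition messages substitute only {{%s}} placeholders by simple string splicing, not via java.util.Formatter. The toy formatter below mimics that behavior to show the effect; it is not Guava's actual implementation (for one thing, real Guava appends leftover arguments to the message rather than dropping them).

```java
// Toy mimic of Guava-style lenient formatting: only "%s" is a placeholder,
// so a "%d" in the template survives verbatim and the arguments shift.
public class LenientFormatDemo {
    static String lenientFormat(String template, Object... args) {
        StringBuilder sb = new StringBuilder();
        int start = 0, argIdx = 0;
        while (argIdx < args.length) {
            int p = template.indexOf("%s", start);
            if (p < 0) {
                break;                       // no more placeholders
            }
            sb.append(template, start, p);   // text before the placeholder
            sb.append(args[argIdx++]);       // splice the next argument in
            start = p + 2;
        }
        sb.append(template.substring(start));
        return sb.toString();
    }

    public static void main(String[] args) {
        // The label count lands in the host slot, and "%d" stays literal.
        System.out.println(lenientFormat("%d labels specified on host=%s", 2, "node1"));
        // prints: %d labels specified on host=2
    }
}
```

This is exactly the garbled message the issue reports, and why replacing {{%d}} with {{%s}} fixes it.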
[jira] [Updated] (YARN-6073) Misuse of format specifier in Preconditions.checkArgument
[ https://issues.apache.org/jira/browse/YARN-6073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuanbo Liu updated YARN-6073: - Attachment: YARN-6073.001.patch Uploaded v1 patch for this JIRA. > Misuse of format specifier in Preconditions.checkArgument > - > > Key: YARN-6073 > URL: https://issues.apache.org/jira/browse/YARN-6073 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Yongjun Zhang >Assignee: Yuanbo Liu >Priority: Trivial > Attachments: YARN-6073.001.patch > > > RMAdminCLI.java > {code} > int nLabels = map.get(nodeId).size(); > Preconditions.checkArgument(nLabels <= 1, "%d labels specified on > host=%s" > + ", please note that we do not support specifying multiple" > + " labels on a single host for now.", nLabels, nodeIdStr); > {code} > The {{%d}} should be replaced with {{%s}}, per > https://google.github.io/guava/releases/19.0/api/docs/com/google/common/base/Preconditions.html
[jira] [Assigned] (YARN-6073) Misuse of format specifier in Preconditions.checkArgument
[ https://issues.apache.org/jira/browse/YARN-6073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuanbo Liu reassigned YARN-6073: Assignee: Yuanbo Liu > Misuse of format specifier in Preconditions.checkArgument > - > > Key: YARN-6073 > URL: https://issues.apache.org/jira/browse/YARN-6073 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Yongjun Zhang >Assignee: Yuanbo Liu >Priority: Trivial > > RMAdminCLI.java > {code} > int nLabels = map.get(nodeId).size(); > Preconditions.checkArgument(nLabels <= 1, "%d labels specified on > host=%s" > + ", please note that we do not support specifying multiple" > + " labels on a single host for now.", nLabels, nodeIdStr); > {code} > The {{%d}} should be replaced with {{%s}}, per > https://google.github.io/guava/releases/19.0/api/docs/com/google/common/base/Preconditions.html
[jira] [Commented] (YARN-6073) Misuse of format specifier in Preconditions.checkArgument
[ https://issues.apache.org/jira/browse/YARN-6073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15810780#comment-15810780 ] Yuanbo Liu commented on YARN-6073: -- This JIRA could be a good starting point for my first YARN patch. [~yzhangal] Would you mind assigning it to me? I don't have the privilege to assign YARN JIRAs to myself. > Misuse of format specifier in Preconditions.checkArgument > - > > Key: YARN-6073 > URL: https://issues.apache.org/jira/browse/YARN-6073 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Yongjun Zhang >Priority: Trivial > > RMAdminCLI.java > {code} > int nLabels = map.get(nodeId).size(); > Preconditions.checkArgument(nLabels <= 1, "%d labels specified on > host=%s" > + ", please note that we do not support specifying multiple" > + " labels on a single host for now.", nLabels, nodeIdStr); > {code} > The {{%d}} should be replaced with {{%s}}, per > https://google.github.io/guava/releases/19.0/api/docs/com/google/common/base/Preconditions.html
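To see why the {{%d}} in the RMAdminCLI message misbehaves: Guava's Preconditions uses a lenient template formatter that only recognizes {{%s}} placeholders; anything else is left in the message verbatim, and unmatched arguments are appended in square brackets. The sketch below is a simplified reimplementation of that behavior (assumption: the real Guava Strings.lenientFormat also handles null templates and misbehaving toString() calls, which this omits), showing the garbled message the buggy template produces and the correct one after the {{%d}} → {{%s}} fix.

```java
public class LenientFormatDemo {
    // Simplified sketch of Guava's lenient template formatting:
    // only "%s" is treated as a placeholder; leftover args go in brackets.
    static String lenientFormat(String template, Object... args) {
        StringBuilder sb = new StringBuilder();
        int templateStart = 0;
        int i = 0;
        while (i < args.length) {
            int placeholder = template.indexOf("%s", templateStart);
            if (placeholder == -1) {
                break; // no more %s placeholders; remaining args are leftovers
            }
            sb.append(template, templateStart, placeholder);
            sb.append(args[i++]);
            templateStart = placeholder + 2;
        }
        sb.append(template, templateStart, template.length());
        if (i < args.length) {
            // Unmatched arguments are appended like Guava does: " [a, b]".
            sb.append(" [").append(args[i++]);
            while (i < args.length) {
                sb.append(", ").append(args[i++]);
            }
            sb.append(']');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Buggy template: %d is not a placeholder, so the count fills the
        // only %s and the nodeId string is dumped in brackets.
        System.out.println(lenientFormat("%d labels specified on host=%s", 2, "node1"));
        // Fixed template: both arguments land where intended.
        System.out.println(lenientFormat("%s labels specified on host=%s", 2, "node1"));
    }
}
```

This is why the patch simply swaps {{%d}} for {{%s}}: with Preconditions there is no numeric specifier to reach for in the first place.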