[jira] [Updated] (YARN-11112) Avoid renewing delegation token when app is first submitted to RM

2022-04-18 Thread Yuanbo Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuanbo Liu updated YARN-11112:
--
Attachment: image-2022-04-19-10-38-01-194.png

> Avoid renewing delegation token when app is first submitted to RM
> -
>
> Key: YARN-11112
> URL: https://issues.apache.org/jira/browse/YARN-11112
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Yuanbo Liu
>Priority: Major
> Attachments: image-2022-04-19-10-34-59-573.png, 
> image-2022-04-19-10-38-01-194.png
>
>
> When authentication is enabled on the NameNode, a delegation token is 
> required whenever an application needs to access files/directories. We find 
> that when an app is first submitted to the RM, the RM's token renewer renews 
> the app's tokens regardless of whether they are close to expiry. Renewing a 
> token is fairly heavy since it takes a global write lock. Here is the result 
> when delegation tokens are required in a very busy cluster.
> !image-2022-04-19-10-34-59-573.png|width=515,height=302!
>  
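A minimal sketch of the improvement being described, assuming hypothetical names (SAFETY_MARGIN_MS, shouldRenewNow); the actual change would live in the RM's token renewer, whose API may differ:
{code:java}
// Hypothetical sketch: renew at submission time only when the token is
// actually close to expiry; otherwise schedule a later renewal and avoid
// taking the global write lock on the submission hot path.
public class EagerRenewCheck {
  // Illustrative margin, not an actual YARN configuration value.
  static final long SAFETY_MARGIN_MS = 60 * 60 * 1000L; // 1 hour

  static boolean shouldRenewNow(long expirationTimeMs, long nowMs) {
    return expirationTimeMs - nowMs <= SAFETY_MARGIN_MS;
  }

  public static void main(String[] args) {
    long now = System.currentTimeMillis();
    long expiresInTwoHours = now + 2 * 60 * 60 * 1000L;
    // Far from expiry: skip the heavy renew on first submission.
    System.out.println(shouldRenewNow(expiresInTwoHours, now)); // false
  }
}
{code}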






[jira] [Updated] (YARN-11112) Avoid renewing delegation token when app is first submitted to RM

2022-04-18 Thread Yuanbo Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuanbo Liu updated YARN-11112:
--
Description: 
When authentication is enabled on the NameNode, a delegation token is required 
whenever an application needs to access files/directories. We find that when an 
app is first submitted to the RM, the RM's token renewer renews the app's 
tokens regardless of whether they are close to expiry. Renewing a token is 
fairly heavy since it takes a global write lock. Here is the result when 
delegation tokens are required in a very busy cluster.

!image-2022-04-19-10-34-59-573.png|width=515,height=302!

!image-2022-04-19-10-38-01-194.png|width=490,height=290!

 

  was:
When authentication is enabled on the NameNode, a delegation token is required 
whenever an application needs to access files/directories. We find that when an 
app is first submitted to the RM, the RM's token renewer renews the app's 
tokens regardless of whether they are close to expiry. Renewing a token is 
fairly heavy since it takes a global write lock. Here is the result when 
delegation tokens are required in a very busy cluster.


!image-2022-04-19-10-34-59-573.png|width=515,height=302!

 


> Avoid renewing delegation token when app is first submitted to RM
> -
>
> Key: YARN-11112
> URL: https://issues.apache.org/jira/browse/YARN-11112
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Yuanbo Liu
>Priority: Major
> Attachments: image-2022-04-19-10-34-59-573.png, 
> image-2022-04-19-10-38-01-194.png
>
>
> When authentication is enabled on the NameNode, a delegation token is 
> required whenever an application needs to access files/directories. We find 
> that when an app is first submitted to the RM, the RM's token renewer renews 
> the app's tokens regardless of whether they are close to expiry. Renewing a 
> token is fairly heavy since it takes a global write lock. Here is the result 
> when delegation tokens are required in a very busy cluster.
> !image-2022-04-19-10-34-59-573.png|width=515,height=302!
> !image-2022-04-19-10-38-01-194.png|width=490,height=290!
>  






[jira] [Updated] (YARN-11112) Avoid renewing delegation token when app is first submitted to RM

2022-04-18 Thread Yuanbo Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuanbo Liu updated YARN-11112:
--
Issue Type: Improvement  (was: Bug)

> Avoid renewing delegation token when app is first submitted to RM
> -
>
> Key: YARN-11112
> URL: https://issues.apache.org/jira/browse/YARN-11112
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Yuanbo Liu
>Priority: Major
> Attachments: image-2022-04-19-10-34-59-573.png
>
>
> When authentication is enabled on the NameNode, a delegation token is 
> required whenever an application needs to access files/directories. We find 
> that when an app is first submitted to the RM, the RM's token renewer renews 
> the app's tokens regardless of whether they are close to expiry. Renewing a 
> token is fairly heavy since it takes a global write lock. Here is the result 
> when delegation tokens are required in a very busy cluster.
> !image-2022-04-19-10-34-59-573.png|width=515,height=302!
>  






[jira] [Updated] (YARN-11112) Avoid renewing delegation token when app is first submitted to RM

2022-04-18 Thread Yuanbo Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuanbo Liu updated YARN-11112:
--
Description: 
When authentication is enabled on the NameNode, a delegation token is required 
whenever an application needs to access files/directories. We find that when an 
app is first submitted to the RM, the RM's token renewer renews the app's 
tokens regardless of whether they are close to expiry. Renewing a token is 
fairly heavy since it takes a global write lock. Here is the result when 
delegation tokens are required in a very busy cluster.


!image-2022-04-19-10-34-59-573.png|width=515,height=302!

 

  was:
When authentication is enabled on the NameNode, a delegation token is required 
whenever an application needs to access files/directories. We find that when an 
app is first submitted to the RM, the RM's token renewer renews the app's 
tokens regardless of whether they are close to expiry. Renewing a token is 
fairly heavy since it takes a global write lock. Here is the result when 
delegation tokens are required in a very busy cluster.
!image-2022-04-19-10-34-59-573.png!

 


> Avoid renewing delegation token when app is first submitted to RM
> -
>
> Key: YARN-11112
> URL: https://issues.apache.org/jira/browse/YARN-11112
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Yuanbo Liu
>Priority: Major
> Attachments: image-2022-04-19-10-34-59-573.png
>
>
> When authentication is enabled on the NameNode, a delegation token is 
> required whenever an application needs to access files/directories. We find 
> that when an app is first submitted to the RM, the RM's token renewer renews 
> the app's tokens regardless of whether they are close to expiry. Renewing a 
> token is fairly heavy since it takes a global write lock. Here is the result 
> when delegation tokens are required in a very busy cluster.
> !image-2022-04-19-10-34-59-573.png|width=515,height=302!
>  






[jira] [Created] (YARN-11112) Avoid renewing delegation token when app is first submitted to RM

2022-04-18 Thread Yuanbo Liu (Jira)
Yuanbo Liu created YARN-11112:
-

 Summary: Avoid renewing delegation token when app is first 
submitted to RM
 Key: YARN-11112
 URL: https://issues.apache.org/jira/browse/YARN-11112
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Yuanbo Liu
 Attachments: image-2022-04-19-10-34-59-573.png

When authentication is enabled on the NameNode, a delegation token is required 
whenever an application needs to access files/directories. We find that when an 
app is first submitted to the RM, the RM's token renewer renews the app's 
tokens regardless of whether they are close to expiry. Renewing a token is 
fairly heavy since it takes a global write lock. Here is the result when 
delegation tokens are required in a very busy cluster.
!image-2022-04-19-10-34-59-573.png!

 






[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-10-04 Thread Yuanbo Liu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17207806#comment-17207806
 ] 

Yuanbo Liu commented on YARN-10393:
---

+1

> MR job live lock caused by completed state container leak in heartbeat 
> between node manager and RM
> --
>
> Key: YARN-10393
> URL: https://issues.apache.org/jira/browse/YARN-10393
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, yarn
>Affects Versions: 2.6.1, 2.7.2, 2.6.2, 3.0.0, 2.9.2, 3.3.0, 3.2.1, 3.1.3, 
> 3.4.0
>Reporter: zhenzhao wang
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-10393.001.patch, YARN-10393.002.patch, 
> YARN-10393.draft.2.patch, YARN-10393.draft.patch
>
>
> This was a bug we had seen multiple times on Hadoop 2.6.2, and the following 
> analysis is based on the core dump, logs, and code from 2017 with Hadoop 
> 2.6.2. We hadn't seen it after 2.9 in our env; however, that was because of 
> the RPC retry policy change and other changes. There's still a possibility 
> even with the current code, if I didn't miss anything.
> *High-level description:*
>  We had seen a starving-mapper issue several times. The MR job was stuck in a 
> live-lock state and couldn't make any progress. The queue was full, so the 
> pending mapper couldn't get any resources to continue, and the application 
> master failed to preempt the reducer, causing the job to be stuck. The reason 
> the application master didn't preempt the reducer was that there was a leaked 
> container in the assigned mappers: the node manager failed to report the 
> completed container to the resource manager.
> *Detailed steps:*
>  
>  # Container_1501226097332_249991_01_000199 was assigned to 
> attempt_1501226097332_249991_m_95_0 on 2017-08-08 16:00:00,417.
> {code:java}
> appmaster.log:6464:2017-08-08 16:00:00,417 INFO [RMCommunicator Allocator] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned 
> container container_1501226097332_249991_01_000199 to 
> attempt_1501226097332_249991_m_95_0
> {code}
>  # The container finished on 2017-08-08 16:02:53,313.
> {code:java}
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
>  Container container_1501226097332_249991_01_000199 transitioned from RUNNING 
> to EXITED_WITH_SUCCESS
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>  Cleaning up container container_1501226097332_249991_01_000199
> {code}
>  # The NodeStatusUpdater got an exception in the heartbeat on 2017-08-08 
> 16:07:04,238. The heartbeat request was actually handled by the resource 
> manager; however, the node manager failed to receive the response. Let's 
> assume heartBeatResponseId=$hid in the node manager. With our current 
> configuration, the next heartbeat will be 10s later.
> {code:java}
> 2017-08-08 16:07:04,238 ERROR 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught 
> exception in status-updater
> java.io.IOException: Failed on local exception: java.io.IOException: 
> Connection reset by peer; Host Details : local host is: ; destination host 
> is: XXX
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
> at org.apache.hadoop.ipc.Client.call(Client.java:1472)
> at org.apache.hadoop.ipc.Client.call(Client.java:1399)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
> at com.sun.proxy.$Proxy33.nodeHeartbeat(Unknown Source)
> at 
> org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
> at sun.reflect.GeneratedMethodAccessor61.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy34.nodeHeartbeat(Unknown Source)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:597)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Connection reset by peer
> at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)

[jira] [Comment Edited] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-09-11 Thread Yuanbo Liu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17194592#comment-17194592
 ] 

Yuanbo Liu edited comment on YARN-10393 at 9/12/20, 3:48 AM:
-

[~Jim_Brennan] Thanks for your comments.
{quote}My one issue with it is that it will not return the most up to date 
status for containers on the heartbeat after the missed one...
{quote}
Great point; your comments made me realize I was wrong. I was thinking about 
this situation:
{quote} [heartbeat:1, container:1]->rm

nm does not get the response

[heartbeat:1, container:1, container2]->rm

rm thinks it's a duplicated heartbeat and never processes container 2.
{quote}
I thought container 2 would never get updated if the heartbeat response went 
missing, which is not correct. Your patch makes sense to me.
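For readers skimming the thread, a toy rendering of the scenario in the quote above (hypothetical types, not the real protocol classes):
{code:java}
import java.util.List;

// Toy model of the worry quoted above: the NM repeats heartbeat id 1
// after the first response is lost, now also carrying container2.
public class MissedHeartbeatScenario {
  record Heartbeat(int id, List<String> completedContainers) {}

  public static void main(String[] args) {
    Heartbeat first = new Heartbeat(1, List.of("container1"));
    // The response to `first` is lost; the resend keeps id 1 but adds more:
    Heartbeat resend = new Heartbeat(1, List.of("container1", "container2"));
    // The (incorrect) worry was that the RM would see the duplicated id
    // and skip the whole payload, so container2 would never be processed.
    System.out.println(first.id() == resend.id()); // true: same id
  }
}
{code}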

 
{quote}Another question is how would these two proposals behave if the NM 
misses multiple heartbeats in a row?
{quote}
I believe if that happens, the cluster or NN may already be in an unusable 
state; continuously resending completed containers doesn't seem like a bad idea 
(or we could introduce a max number of retries)?

 


was (Author: yuanbo):
[~Jim_Brennan] Thanks for your comments.
{quote}My one issue with it is that it will not return the most up to date 
status for containers on the heartbeat after the missed one...
{quote}
Great point; your comments made me realize I was wrong. I was thinking about 
this situation:
{quote} [heartbeat:1, container:1]->rm

nm does not get the response

[heartbeat:1, container:1, container2]->rm

rm thinks it's a duplicated heartbeat and never processes container 2.
{quote}
I thought container 2 would never get updated if the heartbeat response went 
missing, which is not correct. Your patch makes sense to me.

 
{quote}Another question is how would these two proposals behave if the NM 
misses multiple heartbeats in a row?
{quote}
I believe if that happens, the cluster or NN may already be in an unusable 
state; continuously resending completed containers doesn't seem like a bad idea?

 

> MR job live lock caused by completed state container leak in heartbeat 
> between node manager and RM
> --
>
> Key: YARN-10393
> URL: https://issues.apache.org/jira/browse/YARN-10393
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, yarn
>Affects Versions: 2.6.1, 2.7.2, 2.6.2, 3.0.0, 2.9.2, 3.3.0, 3.2.1, 3.1.3, 
> 3.4.0
>Reporter: zhenzhao wang
>Assignee: zhenzhao wang
>Priority: Major
> Attachments: YARN-10393.draft.2.patch, YARN-10393.draft.patch
>
>
> This was a bug we had seen multiple times on Hadoop 2.6.2, and the following 
> analysis is based on the core dump, logs, and code from 2017 with Hadoop 
> 2.6.2. We hadn't seen it after 2.9 in our env; however, that was because of 
> the RPC retry policy change and other changes. There's still a possibility 
> even with the current code, if I didn't miss anything.
> *High-level description:*
>  We had seen a starving-mapper issue several times. The MR job was stuck in a 
> live-lock state and couldn't make any progress. The queue was full, so the 
> pending mapper couldn't get any resources to continue, and the application 
> master failed to preempt the reducer, causing the job to be stuck. The reason 
> the application master didn't preempt the reducer was that there was a leaked 
> container in the assigned mappers: the node manager failed to report the 
> completed container to the resource manager.
> *Detailed steps:*
>  
>  # Container_1501226097332_249991_01_000199 was assigned to 
> attempt_1501226097332_249991_m_95_0 on 2017-08-08 16:00:00,417.
> {code:java}
> appmaster.log:6464:2017-08-08 16:00:00,417 INFO [RMCommunicator Allocator] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned 
> container container_1501226097332_249991_01_000199 to 
> attempt_1501226097332_249991_m_95_0
> {code}
>  # The container finished on 2017-08-08 16:02:53,313.
> {code:java}
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
>  Container container_1501226097332_249991_01_000199 transitioned from RUNNING 
> to EXITED_WITH_SUCCESS
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>  Cleaning up container container_1501226097332_249991_01_000199
> {code}
>  # The NodeStatusUpdater got an exception in the heartbeat on 2017-08-08 
> 16:07:04,238. The heartbeat request was actually handled by the resource 
> manager; however, the node manager failed to receive the response. Let's 
> assume heartBeatResponseId=$hid in the node manager. With our current 
> configuration, the next heartbeat will be 10s later.
> {code:java}
> 

[jira] [Comment Edited] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-09-11 Thread Yuanbo Liu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17194592#comment-17194592
 ] 

Yuanbo Liu edited comment on YARN-10393 at 9/12/20, 3:26 AM:
-

[~Jim_Brennan] Thanks for your comments.
{quote}My one issue with it is that it will not return the most up to date 
status for containers on the heartbeat after the missed one...
{quote}
Great point; your comments made me realize I was wrong. I was thinking about 
this situation:
{quote} [heartbeat:1, container:1]->rm

nm does not get the response

[heartbeat:1, container:1, container2]->rm

rm thinks it's a duplicated heartbeat and never processes container 2.
{quote}
I thought container 2 would never get updated if the heartbeat response went 
missing, which is not correct. Your patch makes sense to me.

 
{quote}Another question is how would these two proposals behave if the NM 
misses multiple heartbeats in a row?
{quote}
I believe if that happens, the cluster or NN may already be in an unusable 
state; continuously resending completed containers doesn't seem like a bad idea?

 


was (Author: yuanbo):
[~Jim_Brennan] Thanks for your comments.
{quote}My one issue with it is that it will not return the most up to date 
status for containers on the heartbeat after the missed one...
{quote}
Great point; your comments made me realize I was wrong. I was thinking about 
this situation:
{quote} [heartbeat:1, container:1]->rm

nm does not get the response

[heartbeat:1, container:1, container2]->rm

rm thinks it's a duplicated heartbeat and never processes container 2.
{quote}
I thought container 2 would never get updated if the heartbeat response went 
missing, which is not correct. Your patch makes sense to me.

 
{quote}Another question is how would these two proposals behave if the NM 
misses multiple heartbeats in a row?
{quote}
I believe if that happens, the cluster or NN may already be in an unusable 
state; continuously resending completed containers doesn't seem like a bad idea?

 

> MR job live lock caused by completed state container leak in heartbeat 
> between node manager and RM
> --
>
> Key: YARN-10393
> URL: https://issues.apache.org/jira/browse/YARN-10393
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, yarn
>Affects Versions: 2.6.1, 2.7.2, 2.6.2, 3.0.0, 2.9.2, 3.3.0, 3.2.1, 3.1.3, 
> 3.4.0
>Reporter: zhenzhao wang
>Assignee: zhenzhao wang
>Priority: Major
> Attachments: YARN-10393.draft.2.patch, YARN-10393.draft.patch
>
>
> This was a bug we had seen multiple times on Hadoop 2.6.2, and the following 
> analysis is based on the core dump, logs, and code from 2017 with Hadoop 
> 2.6.2. We hadn't seen it after 2.9 in our env; however, that was because of 
> the RPC retry policy change and other changes. There's still a possibility 
> even with the current code, if I didn't miss anything.
> *High-level description:*
>  We had seen a starving-mapper issue several times. The MR job was stuck in a 
> live-lock state and couldn't make any progress. The queue was full, so the 
> pending mapper couldn't get any resources to continue, and the application 
> master failed to preempt the reducer, causing the job to be stuck. The reason 
> the application master didn't preempt the reducer was that there was a leaked 
> container in the assigned mappers: the node manager failed to report the 
> completed container to the resource manager.
> *Detailed steps:*
>  
>  # Container_1501226097332_249991_01_000199 was assigned to 
> attempt_1501226097332_249991_m_95_0 on 2017-08-08 16:00:00,417.
> {code:java}
> appmaster.log:6464:2017-08-08 16:00:00,417 INFO [RMCommunicator Allocator] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned 
> container container_1501226097332_249991_01_000199 to 
> attempt_1501226097332_249991_m_95_0
> {code}
>  # The container finished on 2017-08-08 16:02:53,313.
> {code:java}
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
>  Container container_1501226097332_249991_01_000199 transitioned from RUNNING 
> to EXITED_WITH_SUCCESS
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>  Cleaning up container container_1501226097332_249991_01_000199
> {code}
>  # The NodeStatusUpdater got an exception in the heartbeat on 2017-08-08 
> 16:07:04,238. The heartbeat request was actually handled by the resource 
> manager; however, the node manager failed to receive the response. Let's 
> assume heartBeatResponseId=$hid in the node manager. With our current 
> configuration, the next heartbeat will be 10s later.
> {code:java}
> 2017-08-08 16:07:04,238 ERROR 
> 

[jira] [Comment Edited] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-09-11 Thread Yuanbo Liu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17194592#comment-17194592
 ] 

Yuanbo Liu edited comment on YARN-10393 at 9/12/20, 3:25 AM:
-

[~Jim_Brennan] Thanks for your comments.
{quote}My one issue with it is that it will not return the most up to date 
status for containers on the heartbeat after the missed one...
{quote}
Great point; your comments made me realize I was wrong. I was thinking about 
this situation:
{quote} [heartbeat:1, container:1]->rm

nm does not get the response

[heartbeat:1, container:1, container2]->rm

rm thinks it's a duplicated heartbeat and never processes container 2.
{quote}
I thought container 2 would never get updated if the heartbeat response went 
missing, which is not correct. Your patch makes sense to me.

 
{quote}Another question is how would these two proposals behave if the NM 
misses multiple heartbeats in a row?
{quote}
I believe if that happens, the cluster or NN may already be in an unusable 
state; continuously resending completed containers doesn't seem like a bad idea?

 


was (Author: yuanbo):
[~Jim_Brennan] Thanks for your comments.
{quote}My one issue with it is that it will not return the most up to date 
status for containers on the heartbeat after the missed one...
{quote}
Great point; your comments made me realize I was wrong. I was thinking about 
this situation:
{quote} [heartbeat:1, container:1]->rm

nm does not get the response

[heartbeat:1, container:1, container2]->rm

rm thinks it's a duplicated heartbeat and never processes container 2.
{quote}
I thought container 2 would never get updated if the heartbeat response went 
missing. Your patch makes sense to me.

 
{quote}Another question is how would these two proposals behave if the NM 
misses multiple heartbeats in a row?
{quote}
I believe if that happens, the cluster or NN may already be in an unusable 
state; continuously resending completed containers doesn't seem like a bad idea?

 

> MR job live lock caused by completed state container leak in heartbeat 
> between node manager and RM
> --
>
> Key: YARN-10393
> URL: https://issues.apache.org/jira/browse/YARN-10393
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, yarn
>Affects Versions: 2.6.1, 2.7.2, 2.6.2, 3.0.0, 2.9.2, 3.3.0, 3.2.1, 3.1.3, 
> 3.4.0
>Reporter: zhenzhao wang
>Assignee: zhenzhao wang
>Priority: Major
> Attachments: YARN-10393.draft.2.patch, YARN-10393.draft.patch
>
>
> This was a bug we had seen multiple times on Hadoop 2.6.2, and the following 
> analysis is based on the core dump, logs, and code from 2017 with Hadoop 
> 2.6.2. We hadn't seen it after 2.9 in our env; however, that was because of 
> the RPC retry policy change and other changes. There's still a possibility 
> even with the current code, if I didn't miss anything.
> *High-level description:*
>  We had seen a starving-mapper issue several times. The MR job was stuck in a 
> live-lock state and couldn't make any progress. The queue was full, so the 
> pending mapper couldn't get any resources to continue, and the application 
> master failed to preempt the reducer, causing the job to be stuck. The reason 
> the application master didn't preempt the reducer was that there was a leaked 
> container in the assigned mappers: the node manager failed to report the 
> completed container to the resource manager.
> *Detailed steps:*
>  
>  # Container_1501226097332_249991_01_000199 was assigned to 
> attempt_1501226097332_249991_m_95_0 on 2017-08-08 16:00:00,417.
> {code:java}
> appmaster.log:6464:2017-08-08 16:00:00,417 INFO [RMCommunicator Allocator] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned 
> container container_1501226097332_249991_01_000199 to 
> attempt_1501226097332_249991_m_95_0
> {code}
>  # The container finished on 2017-08-08 16:02:53,313.
> {code:java}
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
>  Container container_1501226097332_249991_01_000199 transitioned from RUNNING 
> to EXITED_WITH_SUCCESS
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>  Cleaning up container container_1501226097332_249991_01_000199
> {code}
>  # The NodeStatusUpdater got an exception in the heartbeat on 2017-08-08 
> 16:07:04,238. The heartbeat request was actually handled by the resource 
> manager; however, the node manager failed to receive the response. Let's 
> assume heartBeatResponseId=$hid in the node manager. With our current 
> configuration, the next heartbeat will be 10s later.
> {code:java}
> 2017-08-08 16:07:04,238 ERROR 
> 

[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-09-11 Thread Yuanbo Liu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17194592#comment-17194592
 ] 

Yuanbo Liu commented on YARN-10393:
---

[~Jim_Brennan] Thanks for your comments.
{quote}My one issue with it is that it will not return the most up to date 
status for containers on the heartbeat after the missed one...
{quote}
Great point; your comments made me realize I was wrong. I was thinking about 
this situation:
{quote} [heartbeat:1, container:1]->rm

nm does not get the response

[heartbeat:1, container:1, container2]->rm

rm thinks it's a duplicated heartbeat and never processes container 2.
{quote}
I thought container 2 would never get updated if the heartbeat response went 
missing. Your patch makes sense to me.

 
{quote}Another question is how would these two proposals behave if the NM 
misses multiple heartbeats in a row?
{quote}
I believe if that happens, the cluster or NN may already be in an unusable 
state; continuously resending completed containers doesn't seem like a bad idea?

 

> MR job live lock caused by completed state container leak in heartbeat 
> between node manager and RM
> --
>
> Key: YARN-10393
> URL: https://issues.apache.org/jira/browse/YARN-10393
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, yarn
>Affects Versions: 2.6.1, 2.7.2, 2.6.2, 3.0.0, 2.9.2, 3.3.0, 3.2.1, 3.1.3, 
> 3.4.0
>Reporter: zhenzhao wang
>Assignee: zhenzhao wang
>Priority: Major
> Attachments: YARN-10393.draft.2.patch, YARN-10393.draft.patch
>
>
> This was a bug we had seen multiple times on Hadoop 2.6.2, and the following 
> analysis is based on the core dump, logs, and code from 2017 with Hadoop 
> 2.6.2. We hadn't seen it after 2.9 in our env; however, that was because of 
> the RPC retry policy change and other changes. There's still a possibility 
> even with the current code, if I didn't miss anything.
> *High-level description:*
>  We had seen a starving-mapper issue several times. The MR job was stuck in a 
> live-lock state and couldn't make any progress. The queue was full, so the 
> pending mapper couldn't get any resources to continue, and the application 
> master failed to preempt the reducer, causing the job to be stuck. The reason 
> the application master didn't preempt the reducer was that there was a leaked 
> container in the assigned mappers: the node manager failed to report the 
> completed container to the resource manager.
> *Detailed steps:*
>  
>  # Container_1501226097332_249991_01_000199 was assigned to 
> attempt_1501226097332_249991_m_95_0 on 2017-08-08 16:00:00,417.
> {code:java}
> appmaster.log:6464:2017-08-08 16:00:00,417 INFO [RMCommunicator Allocator] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned 
> container container_1501226097332_249991_01_000199 to 
> attempt_1501226097332_249991_m_95_0
> {code}
>  # The container finished on 2017-08-08 16:02:53,313.
> {code:java}
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
>  Container container_1501226097332_249991_01_000199 transitioned from RUNNING 
> to EXITED_WITH_SUCCESS
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>  Cleaning up container container_1501226097332_249991_01_000199
> {code}
>  # The NodeStatusUpdater got an exception in the heartbeat on 2017-08-08 
> 16:07:04,238. The heartbeat request was actually handled by the resource 
> manager; however, the node manager failed to receive the response. Let's 
> assume heartBeatResponseId=$hid in the node manager. With our current 
> configuration, the next heartbeat will be 10s later.
> {code:java}
> 2017-08-08 16:07:04,238 ERROR 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught 
> exception in status-updater
> java.io.IOException: Failed on local exception: java.io.IOException: 
> Connection reset by peer; Host Details : local host is: ; destination host 
> is: XXX
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
> at org.apache.hadoop.ipc.Client.call(Client.java:1472)
> at org.apache.hadoop.ipc.Client.call(Client.java:1399)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
> at com.sun.proxy.$Proxy33.nodeHeartbeat(Unknown Source)
> at 
> org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
> at sun.reflect.GeneratedMethodAccessor61.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

[jira] [Updated] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-09-10 Thread Yuanbo Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuanbo Liu updated YARN-10393:
--
Attachment: YARN-10393.draft.patch

> MR job live lock caused by completed state container leak in heartbeat 
> between node manager and RM
> --
>
> Key: YARN-10393
> URL: https://issues.apache.org/jira/browse/YARN-10393
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, yarn
>Affects Versions: 2.6.1, 2.7.2, 2.6.2, 3.0.0, 2.9.2, 3.3.0, 3.2.1, 3.1.3, 
> 3.4.0
>Reporter: zhenzhao wang
>Assignee: zhenzhao wang
>Priority: Major
> Attachments: YARN-10393.draft.patch
>
>
> This was a bug we had seen multiple times on Hadoop 2.6.2, and the following 
> analysis is based on the core dump, logs, and code from 2017 with Hadoop 
> 2.6.2. We hadn't seen it after 2.9 in our env; however, that was because of 
> the RPC retry policy change and other changes. There's still a possibility 
> even with the current code, if I didn't miss anything.
> *High-level description:*
>  We had seen a starving-mapper issue several times. The MR job was stuck in a 
> live-lock state and couldn't make any progress. The queue was full, so the 
> pending mapper couldn't get any resources to continue, and the application 
> master failed to preempt the reducer, causing the job to be stuck. The reason 
> the application master didn't preempt the reducer was that there was a leaked 
> container in the assigned mappers: the node manager failed to report the 
> completed container to the resource manager.
> *Detailed steps:*
>  
>  # Container_1501226097332_249991_01_000199 was assigned to 
> attempt_1501226097332_249991_m_95_0 on 2017-08-08 16:00:00,417.
> {code:java}
> appmaster.log:6464:2017-08-08 16:00:00,417 INFO [RMCommunicator Allocator] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned 
> container container_1501226097332_249991_01_000199 to 
> attempt_1501226097332_249991_m_95_0
> {code}
>  # The container finished on 2017-08-08 16:02:53,313.
> {code:java}
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
>  Container container_1501226097332_249991_01_000199 transitioned from RUNNING 
> to EXITED_WITH_SUCCESS
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>  Cleaning up container container_1501226097332_249991_01_000199
> {code}
>  # The NodeStatusUpdater got an exception in the heartbeat on 2017-08-08 
> 16:07:04,238. The heartbeat request was actually handled by the resource 
> manager; however, the node manager failed to receive the response. Let's 
> assume heartBeatResponseId=$hid in the node manager. With our current 
> configuration, the next heartbeat will be 10s later.
> {code:java}
> 2017-08-08 16:07:04,238 ERROR 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught 
> exception in status-updater
> java.io.IOException: Failed on local exception: java.io.IOException: 
> Connection reset by peer; Host Details : local host is: ; destination host 
> is: XXX
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
> at org.apache.hadoop.ipc.Client.call(Client.java:1472)
> at org.apache.hadoop.ipc.Client.call(Client.java:1399)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
> at com.sun.proxy.$Proxy33.nodeHeartbeat(Unknown Source)
> at 
> org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
> at sun.reflect.GeneratedMethodAccessor61.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy34.nodeHeartbeat(Unknown Source)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:597)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Connection reset by peer
> at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
> at sun.nio.ch.IOUtil.read(IOUtil.java:197)
> at 

[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-09-10 Thread Yuanbo Liu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17193975#comment-17193975
 ] 

Yuanbo Liu commented on YARN-10393:
---

Also, we should avoid adding new containers under the same heartbeat id, as 
[~wzzdreamer] has clarified in the description. Resending old containers is 
unavoidable, and changing the protocol is not a good idea, so we could use 
pendingCompletedContainers to fix it.
I've attached a draft patch for this issue so that we can speed up and conclude 
our ideas. [~wzzdreamer], feel free to attach a new PR if you have one.

[~wzzdreamer] [~Jim_Brennan] [~adam.antal]
Any comments are welcome.

> MR job live lock caused by completed state container leak in heartbeat 
> between node manager and RM
> --
>
> Key: YARN-10393
> URL: https://issues.apache.org/jira/browse/YARN-10393
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, yarn
>Affects Versions: 2.6.1, 2.7.2, 2.6.2, 3.0.0, 2.9.2, 3.3.0, 3.2.1, 3.1.3, 
> 3.4.0
>Reporter: zhenzhao wang
>Assignee: zhenzhao wang
>Priority: Major
>
> This was a bug we had seen multiple times on Hadoop 2.6.2, and the following 
> analysis is based on the core dump, logs, and code from 2017 with Hadoop 
> 2.6.2. We hadn't seen it after 2.9 in our env; however, that was because of 
> the RPC retry policy change and other changes. There's still a possibility 
> even with the current code, if I didn't miss anything.
> *High-level description:*
>  We had seen a starving-mapper issue several times. The MR job was stuck in a 
> live-lock state and couldn't make any progress. The queue was full, so the 
> pending mapper couldn't get any resources to continue, and the application 
> master failed to preempt the reducer, causing the job to be stuck. The reason 
> the application master didn't preempt the reducer was that there was a leaked 
> container in the assigned mappers: the node manager failed to report the 
> completed container to the resource manager.
> *Detailed steps:*
>  
>  # Container_1501226097332_249991_01_000199 was assigned to 
> attempt_1501226097332_249991_m_95_0 on 2017-08-08 16:00:00,417.
> {code:java}
> appmaster.log:6464:2017-08-08 16:00:00,417 INFO [RMCommunicator Allocator] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned 
> container container_1501226097332_249991_01_000199 to 
> attempt_1501226097332_249991_m_95_0
> {code}
>  # The container finished on 2017-08-08 16:02:53,313.
> {code:java}
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
>  Container container_1501226097332_249991_01_000199 transitioned from RUNNING 
> to EXITED_WITH_SUCCESS
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>  Cleaning up container container_1501226097332_249991_01_000199
> {code}
>  # The NodeStatusUpdater got an exception in the heartbeat on 2017-08-08 
> 16:07:04,238. The heartbeat request was actually handled by the resource 
> manager; however, the node manager failed to receive the response. Let's 
> assume heartBeatResponseId=$hid in the node manager. With our current 
> configuration, the next heartbeat will be 10s later.
> {code:java}
> 2017-08-08 16:07:04,238 ERROR 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught 
> exception in status-updater
> java.io.IOException: Failed on local exception: java.io.IOException: 
> Connection reset by peer; Host Details : local host is: ; destination host 
> is: XXX
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
> at org.apache.hadoop.ipc.Client.call(Client.java:1472)
> at org.apache.hadoop.ipc.Client.call(Client.java:1399)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
> at com.sun.proxy.$Proxy33.nodeHeartbeat(Unknown Source)
> at 
> org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
> at sun.reflect.GeneratedMethodAccessor61.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy34.nodeHeartbeat(Unknown Source)
> at 
> 

[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-09-01 Thread Yuanbo Liu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188358#comment-17188358
 ] 

Yuanbo Liu commented on YARN-10393:
---

[~adam.antal] Thanks for your comments. I misunderstood [~Jim_Brennan]'s 
solution.
+1 for that.

BTW, the normal relation between the request ID and the response ID would be

responseID == requestID + 1

so the check would look like:
{code:java}
// Clear pendingCompletedContainers only when the response id does not
// match the expected lastHeartbeatID + 1.
if (responseId != lastHeartbeatID + 1) {
  pendingCompletedContainers.clear();
}
{code}
Correct me if I'm wrong.

 

[~wzzdreamer] What are your thoughts about the solution?

> MR job live lock caused by completed state container leak in heartbeat 
> between node manager and RM
> --
>
> Key: YARN-10393
> URL: https://issues.apache.org/jira/browse/YARN-10393
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, yarn
>Affects Versions: 2.6.1, 2.7.2, 2.6.2, 3.0.0, 2.9.2, 3.3.0, 3.2.1, 3.1.3, 
> 3.4.0
>Reporter: zhenzhao wang
>Assignee: zhenzhao wang
>Priority: Major
>
> This was a bug we had seen multiple times on Hadoop 2.6.2, and the following 
> analysis is based on the core dump, logs, and code from 2017 with Hadoop 
> 2.6.2. We hadn't seen it after 2.9 in our env; however, that was because of 
> the RPC retry policy change and other changes. There's still a possibility 
> even with the current code, if I didn't miss anything.
> *High-level description:*
>  We had seen a starving-mapper issue several times. The MR job was stuck in a 
> live-lock state and couldn't make any progress. The queue was full, so the 
> pending mapper couldn't get any resources to continue, and the application 
> master failed to preempt the reducer, causing the job to be stuck. The reason 
> the application master didn't preempt the reducer was that there was a leaked 
> container in the assigned mappers: the node manager failed to report the 
> completed container to the resource manager.
> *Detailed steps:*
>  
>  # Container_1501226097332_249991_01_000199 was assigned to 
> attempt_1501226097332_249991_m_95_0 on 2017-08-08 16:00:00,417.
> {code:java}
> appmaster.log:6464:2017-08-08 16:00:00,417 INFO [RMCommunicator Allocator] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned 
> container container_1501226097332_249991_01_000199 to 
> attempt_1501226097332_249991_m_95_0
> {code}
>  # The container finished on 2017-08-08 16:02:53,313.
> {code:java}
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
>  Container container_1501226097332_249991_01_000199 transitioned from RUNNING 
> to EXITED_WITH_SUCCESS
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>  Cleaning up container container_1501226097332_249991_01_000199
> {code}
>  # The NodeStatusUpdater got an exception in the heartbeat on 2017-08-08 
> 16:07:04,238. The heartbeat request was actually handled by the resource 
> manager; however, the node manager failed to receive the response. Let's 
> assume heartBeatResponseId=$hid in the node manager. With our current 
> configuration, the next heartbeat will be 10s later.
> {code:java}
> 2017-08-08 16:07:04,238 ERROR 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught 
> exception in status-updater
> java.io.IOException: Failed on local exception: java.io.IOException: 
> Connection reset by peer; Host Details : local host is: ; destination host 
> is: XXX
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
> at org.apache.hadoop.ipc.Client.call(Client.java:1472)
> at org.apache.hadoop.ipc.Client.call(Client.java:1399)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
> at com.sun.proxy.$Proxy33.nodeHeartbeat(Unknown Source)
> at 
> org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
> at sun.reflect.GeneratedMethodAccessor61.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy34.nodeHeartbeat(Unknown Source)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:597)
> at 

[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-09-01 Thread Yuanbo Liu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188276#comment-17188276
 ] 

Yuanbo Liu commented on YARN-10393:
---

[~adam.antal] Sorry to interrupt; any thoughts on this issue?

> MR job live lock caused by completed state container leak in heartbeat 
> between node manager and RM
> --
>
> Key: YARN-10393
> URL: https://issues.apache.org/jira/browse/YARN-10393
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, yarn
>Affects Versions: 2.6.1, 2.7.2, 2.6.2, 3.0.0, 2.9.2, 3.3.0, 3.2.1, 3.1.3, 
> 3.4.0
>Reporter: zhenzhao wang
>Assignee: zhenzhao wang
>Priority: Major
>
> This was a bug we had seen multiple times on Hadoop 2.6.2, and the following 
> analysis is based on the core dump, logs, and code from 2017 with Hadoop 
> 2.6.2. We hadn't seen it after 2.9 in our env; however, that was because of 
> the RPC retry policy change and other changes. There's still a possibility 
> even with the current code, if I didn't miss anything.
> *High-level description:*
>  We had seen a starving-mapper issue several times. The MR job was stuck in a 
> live-lock state and couldn't make any progress. The queue was full, so the 
> pending mapper couldn't get any resources to continue, and the application 
> master failed to preempt the reducer, causing the job to be stuck. The reason 
> the application master didn't preempt the reducer was that there was a leaked 
> container in the assigned mappers: the node manager failed to report the 
> completed container to the resource manager.
> *Detailed steps:*
>  
>  # Container_1501226097332_249991_01_000199 was assigned to 
> attempt_1501226097332_249991_m_95_0 on 2017-08-08 16:00:00,417.
> {code:java}
> appmaster.log:6464:2017-08-08 16:00:00,417 INFO [RMCommunicator Allocator] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned 
> container container_1501226097332_249991_01_000199 to 
> attempt_1501226097332_249991_m_95_0
> {code}
>  # The container finished on 2017-08-08 16:02:53,313.
> {code:java}
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
>  Container container_1501226097332_249991_01_000199 transitioned from RUNNING 
> to EXITED_WITH_SUCCESS
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>  Cleaning up container container_1501226097332_249991_01_000199
> {code}
>  # The NodeStatusUpdater got an exception in the heartbeat on 2017-08-08 
> 16:07:04,238. The heartbeat request was actually handled by the resource 
> manager; however, the node manager failed to receive the response. Let's 
> assume heartBeatResponseId=$hid in the node manager. With our current 
> configuration, the next heartbeat will be 10s later.
> {code:java}
> 2017-08-08 16:07:04,238 ERROR 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught 
> exception in status-updater
> java.io.IOException: Failed on local exception: java.io.IOException: 
> Connection reset by peer; Host Details : local host is: ; destination host 
> is: XXX
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
> at org.apache.hadoop.ipc.Client.call(Client.java:1472)
> at org.apache.hadoop.ipc.Client.call(Client.java:1399)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
> at com.sun.proxy.$Proxy33.nodeHeartbeat(Unknown Source)
> at 
> org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
> at sun.reflect.GeneratedMethodAccessor61.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy34.nodeHeartbeat(Unknown Source)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:597)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Connection reset by peer
> at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
> at sun.nio.ch.IOUtil.read(IOUtil.java:197)
>   

[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-09-01 Thread Yuanbo Liu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188270#comment-17188270
 ] 

Yuanbo Liu commented on YARN-10393:
---

[~Jim_Brennan] Thanks for the reply.

Basically I'm open to these two options:

1. Change the heartbeat-sending code to send the heartbeat again if no response 
is received (like [~wzzdreamer] did in the PR), but I'd prefer to introduce 
some max-retry code to guard against a potential infinite loop; see the sketch 
after this list.

2. Resend container ids from recentlyStoppedContainers periodically (maybe 
every 1 min?); once the NM gets a response from the RM, the id is removed from 
recentlyStoppedContainers and never retried again.
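A rough, self-contained sketch of option 1, with an illustrative HeartbeatCall interface standing in for the real ResourceTracker RPC and a hypothetical retry bound (none of these names are the actual NodeStatusUpdaterImpl code):
{code:java}
import java.io.IOException;

// Illustrative sketch of option 1: resend the same heartbeat (same
// responseId) when the response is lost, bounded by a max retry count so
// a dead RM cannot trap the NM in an infinite loop.
public class BoundedHeartbeatRetry {
  interface HeartbeatCall {
    String call() throws IOException; // response placeholder for the sketch
  }

  static String heartbeatWithRetry(HeartbeatCall rpc, int maxRetries)
      throws IOException {
    IOException last = null;
    for (int attempt = 1; attempt <= maxRetries; attempt++) {
      try {
        return rpc.call(); // identical request on every attempt
      } catch (IOException e) {
        last = e; // response lost: retry instead of giving up immediately
      }
    }
    // Retry budget spent: surface the last failure to the caller.
    throw (last != null) ? last : new IOException("maxRetries must be >= 1");
  }
}
{code}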

> MR job live lock caused by completed state container leak in heartbeat 
> between node manager and RM
> --
>
> Key: YARN-10393
> URL: https://issues.apache.org/jira/browse/YARN-10393
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, yarn
>Affects Versions: 2.6.1, 2.7.2, 2.6.2, 3.0.0, 2.9.2, 3.3.0, 3.2.1, 3.1.3, 
> 3.4.0
>Reporter: zhenzhao wang
>Assignee: zhenzhao wang
>Priority: Major
>
> This was a bug we had seen multiple times on Hadoop 2.6.2, and the following 
> analysis is based on the core dump, logs, and code from 2017 with Hadoop 
> 2.6.2. We hadn't seen it after 2.9 in our env; however, that was because of 
> the RPC retry policy change and other changes. There's still a possibility 
> even with the current code, if I didn't miss anything.
> *High-level description:*
>  We had seen a starving-mapper issue several times. The MR job was stuck in a 
> live-lock state and couldn't make any progress. The queue was full, so the 
> pending mapper couldn't get any resources to continue, and the application 
> master failed to preempt the reducer, causing the job to be stuck. The reason 
> the application master didn't preempt the reducer was that there was a leaked 
> container in the assigned mappers: the node manager failed to report the 
> completed container to the resource manager.
> *Detailed steps:*
>  
>  # Container_1501226097332_249991_01_000199 was assigned to 
> attempt_1501226097332_249991_m_95_0 on 2017-08-08 16:00:00,417.
> {code:java}
> appmaster.log:6464:2017-08-08 16:00:00,417 INFO [RMCommunicator Allocator] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned 
> container container_1501226097332_249991_01_000199 to 
> attempt_1501226097332_249991_m_95_0
> {code}
>  # The container finished on 2017-08-08 16:02:53,313.
> {code:java}
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
>  Container container_1501226097332_249991_01_000199 transitioned from RUNNING 
> to EXITED_WITH_SUCCESS
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>  Cleaning up container container_1501226097332_249991_01_000199
> {code}
>  # The NodeStatusUpdater got an exception in the heartbeat on 2017-08-08 
> 16:07:04,238. The heartbeat request was actually handled by the resource 
> manager; however, the node manager failed to receive the response. Let's 
> assume heartBeatResponseId=$hid in the node manager. With our current 
> configuration, the next heartbeat will be 10s later.
> {code:java}
> 2017-08-08 16:07:04,238 ERROR 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught 
> exception in status-updater
> java.io.IOException: Failed on local exception: java.io.IOException: 
> Connection reset by peer; Host Details : local host is: ; destination host 
> is: XXX
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
> at org.apache.hadoop.ipc.Client.call(Client.java:1472)
> at org.apache.hadoop.ipc.Client.call(Client.java:1399)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
> at com.sun.proxy.$Proxy33.nodeHeartbeat(Unknown Source)
> at 
> org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
> at sun.reflect.GeneratedMethodAccessor61.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy34.nodeHeartbeat(Unknown Source)
> at 
> 

[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-08-19 Thread Yuanbo Liu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17180959#comment-17180959
 ] 

Yuanbo Liu commented on YARN-10393:
---

[~Jim_Brennan]

Thanks for the comments.

> I think the concern is that if we remove that
> pendingCompletedContainers.clear()

This would be a potential memory leak if we removed 
"pendingCompletedContainers.clear()".
I'd suggest instead that removing "!isContainerRecentlyStopped(containerId)" in 
NodeStatusUpdaterImpl.java (line 613) would be a good way to fix this issue:
{code:java}
// Current guard: a container already in the recently-stopped cache is
// never re-added to pendingCompletedContainers, so its completed status
// can be lost if the acknowledging heartbeat response goes missing.
if (!isContainerRecentlyStopped(containerId)) {
  pendingCompletedContainers.put(containerId, containerStatus);
}
{code}
Completed containers will be cached for 10 minutes (the default value) until 
the cache entry times out or a heartbeat response is received. A 10-minute 
cache for completed containers is long enough to retry sending requests 
through the heartbeat (the default interval is 10s).
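To make the suggestion concrete, here is a minimal, self-contained sketch of 
the intended behavior, with the YARN container types replaced by plain strings 
so it runs in isolation; only the pendingCompletedContainers name and the 
10-minute/10-second figures come from the discussion above, everything else is 
illustrative:
{code:java}
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch only: with the recently-stopped guard removed, a completed container
// is re-reported on every heartbeat until the RM acknowledges it, so a lost
// heartbeat response cannot leak a container.
public class PendingCompletedContainersSketch {
  // containerId -> completed status, kept until the RM acknowledges it
  private final Map<String, String> pendingCompletedContainers =
      new HashMap<>();

  // Called while building each heartbeat request.
  void addCompletedContainers(List<String> completedContainerIds) {
    for (String containerId : completedContainerIds) {
      pendingCompletedContainers.put(containerId, "EXITED_WITH_SUCCESS");
    }
  }

  // Called when a heartbeat response arrives: only acknowledged containers
  // are removed, so a dropped response simply causes a retry ~10s later.
  void onHeartbeatResponse(List<String> acknowledgedContainerIds) {
    pendingCompletedContainers.keySet().removeAll(acknowledgedContainerIds);
  }
}
{code}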

> MR job live lock caused by completed state container leak in heartbeat 
> between node manager and RM
> --
>
> Key: YARN-10393
> URL: https://issues.apache.org/jira/browse/YARN-10393
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, yarn
>Affects Versions: 2.6.1, 2.7.2, 2.6.2, 3.0.0, 2.9.2, 3.3.0, 3.2.1, 3.1.3, 
> 3.4.0
>Reporter: zhenzhao wang
>Assignee: zhenzhao wang
>Priority: Major
>
> This was a bug we had seen multiple times on Hadoop 2.6.2, and the following 
> analysis is based on the core dump, logs, and code in 2017 with Hadoop 2.6.2. 
> We hadn't seen it after 2.9 in our environment; however, that was because of 
> the RPC retry policy change and other changes. There's still a possibility 
> even with the current code, if I didn't miss anything.
> *High-level description:*
>  We had seen a starving-mapper issue several times. The MR job was stuck in a 
> live-lock state and couldn't make any progress. The queue was full, so the 
> pending mapper couldn't get any resources to continue, and the application 
> master failed to preempt the reducer, thus causing the job to be stuck. The 
> reason the application master didn't preempt the reducer was that there was a 
> leaked container in its assigned mappers: the node manager failed to report 
> the completed container to the resource manager.
> *Detailed steps:*
>  
>  # Container_1501226097332_249991_01_000199 was assigned to 
> attempt_1501226097332_249991_m_95_0 on 2017-08-08 16:00:00,417.
> {code:java}
> appmaster.log:6464:2017-08-08 16:00:00,417 INFO [RMCommunicator Allocator] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned 
> container container_1501226097332_249991_01_000199 to 
> attempt_1501226097332_249991_m_95_0
> {code}
>  # The container finished on 2017-08-08 16:02:53,313.
> {code:java}
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
>  Container container_1501226097332_249991_01_000199 transitioned from RUNNING 
> to EXITED_WITH_SUCCESS
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>  Cleaning up container container_1501226097332_249991_01_000199
> {code}
>  # The NodeStatusUpdater got an exception in the heartbeat on 2017-08-08 
> 16:07:04,238. The heartbeat request was actually handled by the resource 
> manager; however, the node manager failed to receive the response. Let's 
> assume heartBeatResponseId=$hid in the node manager. According to our 
> current configuration, the next heartbeat will be 10s later.
> {code:java}
> 2017-08-08 16:07:04,238 ERROR 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught 
> exception in status-updater
> java.io.IOException: Failed on local exception: java.io.IOException: 
> Connection reset by peer; Host Details : local host is: ; destination host 
> is: XXX
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
> at org.apache.hadoop.ipc.Client.call(Client.java:1472)
> at org.apache.hadoop.ipc.Client.call(Client.java:1399)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
> at com.sun.proxy.$Proxy33.nodeHeartbeat(Unknown Source)
> at 
> org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
> at sun.reflect.GeneratedMethodAccessor61.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> 

[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-08-13 Thread Yuanbo Liu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17176820#comment-17176820
 ] 

Yuanbo Liu commented on YARN-10393:
---

[~wzzdreamer] Thanks for your patch, I've made some comments about your patch.

> MR job live lock caused by completed state container leak in heartbeat 
> between node manager and RM
> --
>
> Key: YARN-10393
> URL: https://issues.apache.org/jira/browse/YARN-10393
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, yarn
>Affects Versions: 2.6.1, 2.7.2, 2.6.2, 3.0.0, 2.9.2, 3.3.0, 3.2.1, 3.1.3, 
> 3.4.0
>Reporter: zhenzhao wang
>Assignee: zhenzhao wang
>Priority: Major
>
> This was a bug we had seen multiple times on Hadoop 2.6.2, and the following 
> analysis is based on the core dump, logs, and code in 2017 with Hadoop 2.6.2. 
> We hadn't seen it after 2.9 in our environment; however, that was because of 
> the RPC retry policy change and other changes. There's still a possibility 
> even with the current code, if I didn't miss anything.
> *High-level description:*
>  We had seen a starving-mapper issue several times. The MR job was stuck in a 
> live-lock state and couldn't make any progress. The queue was full, so the 
> pending mapper couldn't get any resources to continue, and the application 
> master failed to preempt the reducer, thus causing the job to be stuck. The 
> reason the application master didn't preempt the reducer was that there was a 
> leaked container in its assigned mappers: the node manager failed to report 
> the completed container to the resource manager.
> *Detailed steps:*
>  
>  # Container_1501226097332_249991_01_000199 was assigned to 
> attempt_1501226097332_249991_m_95_0 on 2017-08-08 16:00:00,417.
> {code:java}
> appmaster.log:6464:2017-08-08 16:00:00,417 INFO [RMCommunicator Allocator] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned 
> container container_1501226097332_249991_01_000199 to 
> attempt_1501226097332_249991_m_95_0
> {code}
>  # The container finished on 2017-08-08 16:02:53,313.
> {code:java}
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
>  Container container_1501226097332_249991_01_000199 transitioned from RUNNING 
> to EXITED_WITH_SUCCESS
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>  Cleaning up container container_1501226097332_249991_01_000199
> {code}
>  # The NodeStatusUpdater got an exception in the heartbeat on 2017-08-08 
> 16:07:04,238. The heartbeat request was actually handled by the resource 
> manager; however, the node manager failed to receive the response. Let's 
> assume heartBeatResponseId=$hid in the node manager. According to our 
> current configuration, the next heartbeat will be 10s later.
> {code:java}
> 2017-08-08 16:07:04,238 ERROR 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught 
> exception in status-updater
> java.io.IOException: Failed on local exception: java.io.IOException: 
> Connection reset by peer; Host Details : local host is: ; destination host 
> is: XXX
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
> at org.apache.hadoop.ipc.Client.call(Client.java:1472)
> at org.apache.hadoop.ipc.Client.call(Client.java:1399)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
> at com.sun.proxy.$Proxy33.nodeHeartbeat(Unknown Source)
> at 
> org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
> at sun.reflect.GeneratedMethodAccessor61.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy34.nodeHeartbeat(Unknown Source)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:597)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Connection reset by peer
> at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
> at 

[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-08-10 Thread Yuanbo Liu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17175148#comment-17175148
 ] 

Yuanbo Liu commented on YARN-10393:
---

Thanks for opening this issue; we happened to hit a similar situation on 
hadoop-2.7.0. The mapper lost its heartbeat and never finished. Currently we 
just use "mapred fail-task" to put those mappers into the failure state and 
re-execute them. Looking forward to your patch!

 

> MR job live lock caused by completed state container leak in heartbeat 
> between node manager and RM
> --
>
> Key: YARN-10393
> URL: https://issues.apache.org/jira/browse/YARN-10393
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, yarn
>Reporter: zhenzhao wang
>Assignee: zhenzhao wang
>Priority: Major
>
> This was a bug we had seen multiple times on Hadoop 2.6.2, and the following 
> analysis is based on the core dump, logs, and code in 2017 with Hadoop 2.6.2. 
> We hadn't seen it after 2.9 in our environment; however, that was because of 
> the RPC retry policy change and other changes. There's still a possibility 
> even with the current code, if I didn't miss anything.
> *High-level description:*
>  We had seen a starving-mapper issue several times. The MR job was stuck in a 
> live-lock state and couldn't make any progress. The queue was full, so the 
> pending mapper couldn't get any resources to continue, and the application 
> master failed to preempt the reducer, thus causing the job to be stuck. The 
> reason the application master didn't preempt the reducer was that there was a 
> leaked container in its assigned mappers: the node manager failed to report 
> the completed container to the resource manager.
> *Detailed steps:*
>  
>  # Container_1501226097332_249991_01_000199 was assigned to 
> attempt_1501226097332_249991_m_95_0 on 2017-08-08 16:00:00,417.
> {code:java}
> appmaster.log:6464:2017-08-08 16:00:00,417 INFO [RMCommunicator Allocator] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned 
> container container_1501226097332_249991_01_000199 to 
> attempt_1501226097332_249991_m_95_0
> {code}
>  # The container finished on 2017-08-08 16:02:53,313.
> {code:java}
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
>  Container container_1501226097332_249991_01_000199 transitioned from RUNNING 
> to EXITED_WITH_SUCCESS
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>  Cleaning up container container_1501226097332_249991_01_000199
> {code}
>  # The NodeStatusUpdater got an exception in the heartbeat on 2017-08-08 
> 16:07:04,238. The heartbeat request was actually handled by the resource 
> manager; however, the node manager failed to receive the response. Let's 
> assume heartBeatResponseId=$hid in the node manager. According to our 
> current configuration, the next heartbeat will be 10s later.
> {code:java}
> 2017-08-08 16:07:04,238 ERROR 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught 
> exception in status-updater
> java.io.IOException: Failed on local exception: java.io.IOException: 
> Connection reset by peer; Host Details : local host is: ; destination host 
> is: XXX
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
> at org.apache.hadoop.ipc.Client.call(Client.java:1472)
> at org.apache.hadoop.ipc.Client.call(Client.java:1399)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
> at com.sun.proxy.$Proxy33.nodeHeartbeat(Unknown Source)
> at 
> org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
> at sun.reflect.GeneratedMethodAccessor61.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy34.nodeHeartbeat(Unknown Source)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:597)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Connection reset by peer
> at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
> at 

[jira] [Commented] (YARN-10380) Import logic of multi-node allocation in CapacityScheduler

2020-08-10 Thread Yuanbo Liu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17174287#comment-17174287
 ] 

Yuanbo Liu commented on YARN-10380:
---

[~wangda]

Thanks for opening this issue.

Not sure whether you're working on it; I'd be glad to help with it.

> Import logic of multi-node allocation in CapacityScheduler
> --
>
> Key: YARN-10380
> URL: https://issues.apache.org/jira/browse/YARN-10380
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Wangda Tan
>Priority: Critical
>
> *1) Entry point:* 
> When we do multi-node allocation, we're using the same logic of async 
> scheduling:
> {code:java}
> // Allocate containers of node [start, end)
>  for (FiCaSchedulerNode node : nodes) {
>   if (current++ >= start) {
>      if (shouldSkipNodeSchedule(node, cs, printSkipedNodeLogging)) {
>         continue;
>      }
>      cs.allocateContainersToNode(node.getNodeID(), false);
>   }
>  } {code}
> Is this the most effective way to do multi-node scheduling? Should we 
> allocate based on partitions? In the above logic, if we have thousands of 
> nodes in one partition, we will repeatedly access all nodes of the partition 
> thousands of times.
> I would suggest making the entry points for node-heartbeat, async-scheduling 
> (single node), and async-scheduling (multi-node) different.
> Node-heartbeat and async-scheduling (single node) can still be similar and 
> share most of the code. 
> async-scheduling (multi-node) should iterate partitions first, using pseudo 
> code like: 
> {code:java}
> for (partition : all partitions) {
>   allocateContainersOnMultiNodes(getCandidate(partition))
> } {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10380) Import logic of multi-node allocation in CapacityScheduler

2020-08-07 Thread Yuanbo Liu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172930#comment-17172930
 ] 

Yuanbo Liu commented on YARN-10380:
---

Seems multi-node allocation only works on reserved container assignment.
We can reorganize that code to improve assignment speed.
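As a rough sketch of what a partition-first reorganization could look like 
(the types and method names below are stand-ins, not the actual 
CapacityScheduler API):
{code:java}
import java.util.List;
import java.util.Map;

// Sketch: iterate partitions first, so one scheduling pass walks the nodes of
// each partition once instead of re-walking all nodes per async pass.
public class MultiNodeAllocationSketch {
  interface Scheduler {
    void allocateContainersOnMultiNodes(List<String> candidateNodes);
  }

  // partition name -> candidate nodes for that partition
  private final Map<String, List<String>> candidatesByPartition;
  private final Scheduler scheduler;

  MultiNodeAllocationSketch(Map<String, List<String>> candidatesByPartition,
      Scheduler scheduler) {
    this.candidatesByPartition = candidatesByPartition;
    this.scheduler = scheduler;
  }

  void scheduleOnePass() {
    // One multi-node allocation attempt per partition, mirroring the pseudo
    // code in the description: allocateContainersOnMultiNodes(getCandidate(p))
    for (Map.Entry<String, List<String>> e :
        candidatesByPartition.entrySet()) {
      scheduler.allocateContainersOnMultiNodes(e.getValue());
    }
  }
}
{code}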

> Import logic of multi-node allocation in CapacityScheduler
> --
>
> Key: YARN-10380
> URL: https://issues.apache.org/jira/browse/YARN-10380
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Wangda Tan
>Priority: Critical
>
> *1) Entry point:* 
> When we do multi-node allocation, we're using the same logic of async 
> scheduling:
> {code:java}
> // Allocate containers of node [start, end)
>  for (FiCaSchedulerNode node : nodes) {
>   if (current++ >= start) {
>      if (shouldSkipNodeSchedule(node, cs, printSkipedNodeLogging)) {
>         continue;
>      }
>      cs.allocateContainersToNode(node.getNodeID(), false);
>   }
>  } {code}
> Is this the most effective way to do multi-node scheduling? Should we 
> allocate based on partitions? In the above logic, if we have thousands of 
> nodes in one partition, we will repeatedly access all nodes of the partition 
> thousands of times.
> I would suggest making the entry points for node-heartbeat, async-scheduling 
> (single node), and async-scheduling (multi-node) different.
> Node-heartbeat and async-scheduling (single node) can still be similar and 
> share most of the code. 
> async-scheduling (multi-node) should iterate partitions first, using pseudo 
> code like: 
> {code:java}
> for (partition : all partitions) {
>   allocateContainersOnMultiNodes(getCandidate(partition))
> } {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-6325) ParentQueue and LeafQueue with same name can cause queue name based operations to fail

2019-04-24 Thread Yuanbo Liu (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-6325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuanbo Liu reassigned YARN-6325:


Assignee: Yuanbo Liu

> ParentQueue and LeafQueue with same name can cause queue name based 
> operations to fail
> --
>
> Key: YARN-6325
> URL: https://issues.apache.org/jira/browse/YARN-6325
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Reporter: Jonathan Hung
>Assignee: Yuanbo Liu
>Priority: Major
> Attachments: Screen Shot 2017-03-13 at 2.28.30 PM.png, 
> capacity-scheduler.xml
>
>
> For example, configure capacity scheduler with two leaf queues: {{root.a.a1}} 
> and {{root.b.a}}, with {{yarn.scheduler.capacity.root.queues}} as {{b,a}} (in 
> that order).
> Then add a mapping e.g. {{u:username:a}} to {{capacity-scheduler.xml}} and 
> call {{refreshQueues}}. Operation fails with {noformat}refreshQueues: 
> java.io.IOException: Failed to re-init queues : mapping contains invalid or 
> non-leaf queue a
>   at 
> org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.logAndWrapException(AdminService.java:866)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:391)
>   at 
> org.apache.hadoop.yarn.server.api.impl.pb.service.ResourceManagerAdministrationProtocolPBServiceImpl.refreshQueues(ResourceManagerAdministrationProtocolPBServiceImpl.java:114)
>   at 
> org.apache.hadoop.yarn.proto.ResourceManagerAdministrationProtocol$ResourceManagerAdministrationProtocolService$2.callBlockingMethod(ResourceManagerAdministrationProtocol.java:271)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:522)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:867)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:813)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1857)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2653)
> Caused by: java.io.IOException: Failed to re-init queues : mapping contains 
> invalid or non-leaf queue a
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:404)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:396)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:386)
>   ... 10 more
> Caused by: java.io.IOException: mapping contains invalid or non-leaf queue a
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.getUserGroupMappingPlacementRule(CapacityScheduler.java:547)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.updatePlacementRules(CapacityScheduler.java:571)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitializeQueues(CapacityScheduler.java:595)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:400)
>   ... 12 more
> {noformat}
> Part of the issue is that the {{queues}} map in 
> {{CapacitySchedulerQueueManager}} stores queues by queue name. We could do 
> one of a few things:
> # Disallow ParentQueues and LeafQueues to have the same queue name. (this 
> breaks compatibility)
> # Store queues by queue path instead of queue name. But this might require 
> changes in lots of places, e.g. in this case the queue-mappings would have to 
> map to a queue path instead of a queue name (which also breaks compatibility)
> and possibly others.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6325) ParentQueue and LeafQueue with same name can cause queue name based operations to fail

2019-04-24 Thread Yuanbo Liu (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-6325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16825656#comment-16825656
 ] 

Yuanbo Liu commented on YARN-6325:
--

[~leftnoteasy] We have this kind of issue in our environment. I'd like to patch 
it. Any further comments are welcome.

> ParentQueue and LeafQueue with same name can cause queue name based 
> operations to fail
> --
>
> Key: YARN-6325
> URL: https://issues.apache.org/jira/browse/YARN-6325
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Reporter: Jonathan Hung
>Priority: Major
> Attachments: Screen Shot 2017-03-13 at 2.28.30 PM.png, 
> capacity-scheduler.xml
>
>
> For example, configure capacity scheduler with two leaf queues: {{root.a.a1}} 
> and {{root.b.a}}, with {{yarn.scheduler.capacity.root.queues}} as {{b,a}} (in 
> that order).
> Then add a mapping e.g. {{u:username:a}} to {{capacity-scheduler.xml}} and 
> call {{refreshQueues}}. Operation fails with {noformat}refreshQueues: 
> java.io.IOException: Failed to re-init queues : mapping contains invalid or 
> non-leaf queue a
>   at 
> org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.logAndWrapException(AdminService.java:866)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:391)
>   at 
> org.apache.hadoop.yarn.server.api.impl.pb.service.ResourceManagerAdministrationProtocolPBServiceImpl.refreshQueues(ResourceManagerAdministrationProtocolPBServiceImpl.java:114)
>   at 
> org.apache.hadoop.yarn.proto.ResourceManagerAdministrationProtocol$ResourceManagerAdministrationProtocolService$2.callBlockingMethod(ResourceManagerAdministrationProtocol.java:271)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:522)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:867)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:813)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1857)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2653)
> Caused by: java.io.IOException: Failed to re-init queues : mapping contains 
> invalid or non-leaf queue a
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:404)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:396)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:386)
>   ... 10 more
> Caused by: java.io.IOException: mapping contains invalid or non-leaf queue a
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.getUserGroupMappingPlacementRule(CapacityScheduler.java:547)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.updatePlacementRules(CapacityScheduler.java:571)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitializeQueues(CapacityScheduler.java:595)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:400)
>   ... 12 more
> {noformat}
> Part of the issue is that the {{queues}} map in 
> {{CapacitySchedulerQueueManager}} stores queues by queue name. We could do 
> one of a few things:
> # Disallow ParentQueues and LeafQueues to have the same queue name. (this 
> breaks compatibility)
> # Store queues by queue path instead of queue name. But this might require 
> changes in lots of places, e.g. in this case the queue-mappings would have to 
> map to a queue path instead of a queue name (which also breaks compatibility)
> and possibly others.
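To illustrate option 2 above (store queues by full queue path), a minimal, 
self-contained sketch; the Queue type and the method names are invented for 
the example and are not the CapacitySchedulerQueueManager API:
{code:java}
import java.util.HashMap;
import java.util.Map;

// Sketch: keying the queue map by full path ("root.b.a") instead of short
// name ("a"), so a ParentQueue and a LeafQueue sharing a short name no longer
// collide in lookups such as queue-mapping validation.
public class QueuePathMapSketch {
  static class Queue {
    final String path; // e.g. "root.a.a1" or "root.b.a"
    Queue(String path) { this.path = path; }
  }

  private final Map<String, Queue> queuesByPath = new HashMap<>();

  void addQueue(Queue q) {
    // "root.a.a1" and "root.b.a" get distinct keys even though the short
    // names could clash with a parent queue named "a".
    queuesByPath.put(q.path, q);
  }

  Queue getQueue(String fullPath) {
    return queuesByPath.get(fullPath);
  }
}
{code}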



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8513) CapacityScheduler infinite loop when queue is near fully utilized

2018-07-21 Thread Yuanbo Liu (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16551551#comment-16551551
 ] 

Yuanbo Liu commented on YARN-8513:
--

Sorry for the late response; it has been quite a busy week. I will go through 
the dump files today.

> CapacityScheduler infinite loop when queue is near fully utilized
> -
>
> Key: YARN-8513
> URL: https://issues.apache.org/jira/browse/YARN-8513
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, yarn
>Affects Versions: 2.9.1
> Environment: Ubuntu 14.04.5
> YARN is configured with one label and 5 queues.
>Reporter: Chen Yufei
>Priority: Major
> Attachments: jstack-1.log, jstack-2.log, jstack-3.log, jstack-4.log, 
> jstack-5.log, top-during-lock.log, top-when-normal.log
>
>
> Sometimes the ResourceManager does not respond to any request when the queue 
> is nearly fully utilized. Sending SIGTERM won't stop the RM; only SIGKILL 
> can. After an RM restart, it can recover running jobs and start accepting 
> new ones.
>  
> Seems like CapacityScheduler is in an infinite loop printing out the 
> following log messages (more than 25,000 lines in a second):
>  
> {{2018-07-10 17:16:29,227 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> assignedContainer queue=root usedCapacity=0.99816763 
> absoluteUsedCapacity=0.99816763 used= 
> cluster=}}
> {{2018-07-10 17:16:29,227 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Failed to accept allocation proposal}}
> {{2018-07-10 17:16:29,227 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.AbstractContainerAllocator:
>  assignedContainer application attempt=appattempt_1530619767030_1652_01 
> container=null 
> queue=org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator@14420943
>  clusterResource= type=NODE_LOCAL 
> requestedPartition=}}
>  
> I have encountered this problem several times after upgrading to YARN 2.9.1, 
> while the same configuration works fine under version 2.7.3.
>  
> YARN-4477 is an infinite-loop bug in FairScheduler; not sure if this is a 
> similar problem.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-6482) TestSLSRunner runs but doesn't executed jobs (.json parsing issue)

2018-07-14 Thread Yuanbo Liu (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-6482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544074#comment-16544074
 ] 

Yuanbo Liu edited comment on YARN-6482 at 7/14/18 6:09 AM:
---

The nodes in the rumen file are not correct. Attaching a v1 patch to fix this 
issue.
 [~djp] / [~cheersyang] Please take a look if you get time. Thanks in advance.


was (Author: yuanbo):
The nodes in the rumen file are not correct. Attaching a v1 patch to fix this 
issue.
[~djp] Please take a look if you get time. Thanks in advance.

> TestSLSRunner runs but doesn't executed jobs (.json parsing issue)
> --
>
> Key: YARN-6482
> URL: https://issues.apache.org/jira/browse/YARN-6482
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Carlo Curino
>Assignee: Yuanbo Liu
>Priority: Minor
> Attachments: YARN-6482.001.patch
>
>
> The TestSLSRunner runs correctly, bringing up an RM, but the parsing of the 
> rumen trace somehow fails silently, and no nodes or jobs are loaded. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6482) TestSLSRunner runs but doesn't executed jobs (.json parsing issue)

2018-07-14 Thread Yuanbo Liu (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-6482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544074#comment-16544074
 ] 

Yuanbo Liu commented on YARN-6482:
--

The nodes in the rumen file are not correct. Attaching a v1 patch to fix this 
issue.
[~djp] Please take a look if you get time. Thanks in advance.

> TestSLSRunner runs but doesn't executed jobs (.json parsing issue)
> --
>
> Key: YARN-6482
> URL: https://issues.apache.org/jira/browse/YARN-6482
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Carlo Curino
>Assignee: Yuanbo Liu
>Priority: Minor
> Attachments: YARN-6482.001.patch
>
>
> The TestSLSRunner runs correctly, bringing up an RM, but the parsing of the 
> rumen trace somehow fails silently, and no nodes or jobs are loaded. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-6482) TestSLSRunner runs but doesn't executed jobs (.json parsing issue)

2018-07-14 Thread Yuanbo Liu (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-6482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuanbo Liu updated YARN-6482:
-
Attachment: YARN-6482.001.patch

> TestSLSRunner runs but doesn't executed jobs (.json parsing issue)
> --
>
> Key: YARN-6482
> URL: https://issues.apache.org/jira/browse/YARN-6482
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Carlo Curino
>Assignee: Yuanbo Liu
>Priority: Minor
> Attachments: YARN-6482.001.patch
>
>
> The TestSLSRunner runs correctly, bringing up an RM, but the parsing of the 
> rumen trace somehow fails silently, and no nodes or jobs are loaded. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8510) Timezone offset in YARN UI does not format minutes component correctly.

2018-07-13 Thread Yuanbo Liu (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16542729#comment-16542729
 ] 

Yuanbo Liu commented on YARN-8510:
--

[~tmoschou] Can you provide a screenshot of the YARN UI so that we can address 
the issue quickly?

> Timezone offset in YARN UI does not format minutes component correctly.
> ---
>
> Key: YARN-8510
> URL: https://issues.apache.org/jira/browse/YARN-8510
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.8.4
>Reporter: Terry Moschou
>Priority: Trivial
>
> Offsets with a non-zero mm component like +9.5hr (ACST) are formatted as 
> +0950 rather than +0930 in the YARN UI.
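The YARN UI formatting code itself is not shown here; the following standalone 
Java sketch only demonstrates the expected arithmetic. A +9.5 h offset is 570 
minutes and must render as +0930; treating the fractional hour 0.5 as the 
digits "50" yields the buggy +0950.
{code:java}
// Sketch: format a UTC offset given in minutes as +HHmm.
public class OffsetFormatSketch {
  static String format(int offsetMinutes) {
    String sign = offsetMinutes < 0 ? "-" : "+";
    int abs = Math.abs(offsetMinutes);
    return String.format("%s%02d%02d", sign, abs / 60, abs % 60);
  }

  public static void main(String[] args) {
    System.out.println(format(570));  // +0930, the expected ACST rendering
    System.out.println(format(-330)); // -0530
  }
}
{code}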



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8513) CapacityScheduler infinite loop when queue is near fully utilized

2018-07-13 Thread Yuanbo Liu (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16542719#comment-16542719
 ] 

Yuanbo Liu commented on YARN-8513:
--

[~cyfdecyf] Can you reproduce this issue and capture the state of the RM:
 # jstack -F pid
 # top -H -p pid 

then attach that info to this Jira so that we can figure out the cause of the 
infinite loop here.

> CapacityScheduler infinite loop when queue is near fully utilized
> -
>
> Key: YARN-8513
> URL: https://issues.apache.org/jira/browse/YARN-8513
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, yarn
>Affects Versions: 2.9.1
> Environment: Ubuntu 14.04.5
> YARN is configured with one label and 5 queues.
>Reporter: Chen Yufei
>Priority: Major
>
> Sometimes the ResourceManager does not respond to any request when the queue 
> is nearly fully utilized. Sending SIGTERM won't stop the RM; only SIGKILL 
> can. After an RM restart, it can recover running jobs and start accepting 
> new ones.
>  
> Seems like CapacityScheduler is in an infinite loop printing out the 
> following log messages (more than 25,000 lines in a second):
>  
> {{2018-07-10 17:16:29,227 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> assignedContainer queue=root usedCapacity=0.99816763 
> absoluteUsedCapacity=0.99816763 used= 
> cluster=}}
> {{2018-07-10 17:16:29,227 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Failed to accept allocation proposal}}
> {{2018-07-10 17:16:29,227 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.AbstractContainerAllocator:
>  assignedContainer application attempt=appattempt_1530619767030_1652_01 
> container=null 
> queue=org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator@14420943
>  clusterResource= type=NODE_LOCAL 
> requestedPartition=}}
>  
> I have encountered this problem several times after upgrading to YARN 2.9.1, 
> while the same configuration works fine under version 2.7.3.
>  
> YARN-4477 is an infinite-loop bug in FairScheduler; not sure if this is a 
> similar problem.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6482) TestSLSRunner runs but doesn't executed jobs (.json parsing issue)

2018-07-09 Thread Yuanbo Liu (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-6482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16536702#comment-16536702
 ] 

Yuanbo Liu commented on YARN-6482:
--

Sorry, I will update this Jira shortly.

> TestSLSRunner runs but doesn't executed jobs (.json parsing issue)
> --
>
> Key: YARN-6482
> URL: https://issues.apache.org/jira/browse/YARN-6482
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Carlo Curino
>Assignee: Yuanbo Liu
>Priority: Minor
>
> The TestSLSRunner runs correctly, bringing up an RM, but the parsing of the 
> rumen trace somehow fails silently, and no nodes or jobs are loaded. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-6265) yarn.resourcemanager.fail-fast is used inconsistently

2018-07-01 Thread Yuanbo Liu (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-6265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuanbo Liu updated YARN-6265:
-
Attachment: YARN-6265.003.patch

> yarn.resourcemanager.fail-fast is used inconsistently
> -
>
> Key: YARN-6265
> URL: https://issues.apache.org/jira/browse/YARN-6265
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.8.0
>Reporter: Daniel Templeton
>Assignee: Yuanbo Liu
>Priority: Major
> Attachments: YARN-6265.001.patch, YARN-6265.002.patch, 
> YARN-6265.003.patch
>
>
> In capacity scheduler, the property is used to control whether an app with 
> no/bad queue should be killed.  In the state store, the property controls 
> whether a state store op failure should cause the RM to exit in non-HA mode.  
> Those are two very different things, and they should be separated.
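To make the inconsistency concrete, a hedged sketch of the two unrelated 
decisions currently keyed off the same property; the key name comes from the 
issue title, while the default value and the surrounding logic are simplified 
and illustrative:
{code:java}
import java.util.Properties;

// Sketch: one key currently answers two unrelated questions; separating them
// would mean two distinct keys (names for the split are not proposed here).
public class FailFastSketch {
  static final String FAIL_FAST = "yarn.resourcemanager.fail-fast";

  // Capacity scheduler use: kill an app submitted to a missing/bad queue?
  static boolean shouldKillAppWithBadQueue(Properties conf) {
    return Boolean.parseBoolean(conf.getProperty(FAIL_FAST, "false"));
  }

  // State store use: exit the RM on a store op failure in non-HA mode?
  static boolean shouldExitOnStoreFailure(Properties conf) {
    return Boolean.parseBoolean(conf.getProperty(FAIL_FAST, "false"));
  }
}
{code}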



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6265) yarn.resourcemanager.fail-fast is used inconsistently

2018-07-01 Thread Yuanbo Liu (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-6265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16528987#comment-16528987
 ] 

Yuanbo Liu commented on YARN-6265:
--

Rebased the patch.

> yarn.resourcemanager.fail-fast is used inconsistently
> -
>
> Key: YARN-6265
> URL: https://issues.apache.org/jira/browse/YARN-6265
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.8.0
>Reporter: Daniel Templeton
>Assignee: Yuanbo Liu
>Priority: Major
> Attachments: YARN-6265.001.patch, YARN-6265.002.patch, 
> YARN-6265.003.patch
>
>
> In capacity scheduler, the property is used to control whether an app with 
> no/bad queue should be killed.  In the state store, the property controls 
> whether a state store op failure should cause the RM to exit in non-HA mode.  
> Those are two very different things, and they should be separated.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-6482) TestSLSRunner runs but doesn't executed jobs (.json parsing issue)

2017-04-18 Thread Yuanbo Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuanbo Liu reassigned YARN-6482:


Assignee: Yuanbo Liu

> TestSLSRunner runs but doesn't executed jobs (.json parsing issue)
> --
>
> Key: YARN-6482
> URL: https://issues.apache.org/jira/browse/YARN-6482
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Carlo Curino
>Assignee: Yuanbo Liu
>Priority: Minor
>
> The TestSLSRunner runs correctly, bringing up an RM, but the parsing of the 
> rumen trace somehow fails silently, and no nodes or jobs are loaded. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6482) TestSLSRunner runs but doesn't executed jobs (.json parsing issue)

2017-04-18 Thread Yuanbo Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15973987#comment-15973987
 ] 

Yuanbo Liu commented on YARN-6482:
--

Taking it over; this defect was introduced by YARN-4612. 

> TestSLSRunner runs but doesn't executed jobs (.json parsing issue)
> --
>
> Key: YARN-6482
> URL: https://issues.apache.org/jira/browse/YARN-6482
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Carlo Curino
>Priority: Minor
>
> The TestSLSRunner runs correctly, bringing up an RM, but the parsing of the 
> rumen trace somehow fails silently, and no nodes or jobs are loaded. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6265) yarn.resourcemanager.fail-fast is used inconsistently

2017-04-12 Thread Yuanbo Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967164#comment-15967164
 ] 

Yuanbo Liu commented on YARN-6265:
--

The test failures are not related.

> yarn.resourcemanager.fail-fast is used inconsistently
> -
>
> Key: YARN-6265
> URL: https://issues.apache.org/jira/browse/YARN-6265
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.8.0
>Reporter: Daniel Templeton
>Assignee: Yuanbo Liu
> Attachments: YARN-6265.001.patch, YARN-6265.002.patch
>
>
> In capacity scheduler, the property is used to control whether an app with 
> no/bad queue should be killed.  In the state store, the property controls 
> whether a state store op failure should cause the RM to exit in non-HA mode.  
> Those are two very different things, and they should be separated.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6265) yarn.resourcemanager.fail-fast is used inconsistently

2017-04-06 Thread Yuanbo Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15958341#comment-15958341
 ] 

Yuanbo Liu commented on YARN-6265:
--

[~djp] and [~templedf] Thanks for your review.
Uploaded a v2 patch to address your comments.

> yarn.resourcemanager.fail-fast is used inconsistently
> -
>
> Key: YARN-6265
> URL: https://issues.apache.org/jira/browse/YARN-6265
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.8.0
>Reporter: Daniel Templeton
>Assignee: Yuanbo Liu
> Attachments: YARN-6265.001.patch, YARN-6265.002.patch
>
>
> In capacity scheduler, the property is used to control whether an app with 
> no/bad queue should be killed.  In the state store, the property controls 
> whether a state store op failure should cause the RM to exit in non-HA mode.  
> Those are two very different things, and they should be separated.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-6265) yarn.resourcemanager.fail-fast is used inconsistently

2017-04-06 Thread Yuanbo Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuanbo Liu updated YARN-6265:
-
Attachment: YARN-6265.002.patch

> yarn.resourcemanager.fail-fast is used inconsistently
> -
>
> Key: YARN-6265
> URL: https://issues.apache.org/jira/browse/YARN-6265
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.8.0
>Reporter: Daniel Templeton
>Assignee: Yuanbo Liu
> Attachments: YARN-6265.001.patch, YARN-6265.002.patch
>
>
> In capacity scheduler, the property is used to control whether an app with 
> no/bad queue should be killed.  In the state store, the property controls 
> whether a state store op failure should cause the RM to exit in non-HA mode.  
> Those are two very different things, and they should be separated.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6265) yarn.resourcemanager.fail-fast is used inconsistently

2017-03-22 Thread Yuanbo Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15937643#comment-15937643
 ] 

Yuanbo Liu commented on YARN-6265:
--

Sorry to interrupt; it has been a while since the last update.
It would be great if somebody could look into my patch and give some thoughts. 
Thanks in advance.

> yarn.resourcemanager.fail-fast is used inconsistently
> -
>
> Key: YARN-6265
> URL: https://issues.apache.org/jira/browse/YARN-6265
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.8.0
>Reporter: Daniel Templeton
>Assignee: Yuanbo Liu
> Attachments: YARN-6265.001.patch
>
>
> In capacity scheduler, the property is used to control whether an app with 
> no/bad queue should be killed.  In the state store, the property controls 
> whether a state store op failure should cause the RM to exit in non-HA mode.  
> Those are two very different things, and they should be separated.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-6341) Redirected tracking UI of application is not correct if web policy is transformed from HTTP_ONLY to HTTPS_ONLY

2017-03-15 Thread Yuanbo Liu (JIRA)
Yuanbo Liu created YARN-6341:


 Summary: Redirected tracking UI of application is not correct if 
web policy is transformed from HTTP_ONLY to HTTPS_ONLY
 Key: YARN-6341
 URL: https://issues.apache.org/jira/browse/YARN-6341
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Yuanbo Liu


Before users enable Hadoop HTTPS, they submit an MR job. After the job is 
finished and the web policy is configured as HTTPS_ONLY, users navigate as 
follows:
Resource Manager UI -> Applications -> Tracking UI
The address is then redirected to an HTTP address of the job history server 
instead of an HTTPS address. I think this behavior is related to 
{{WebAppProxyServlet#getTrackingUri}}.
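A minimal sketch of the kind of scheme rewrite implied here, using only the 
JDK; this is an assumption about the shape of a fix, not the actual 
WebAppProxyServlet code, and the policy flag is passed in rather than read 
from the configuration:
{code:java}
import java.net.URI;
import java.net.URISyntaxException;

// Sketch: rewrite a stored http:// tracking URI to https:// when the cluster
// web policy is HTTPS_ONLY. Port remapping (the HTTPS port usually differs
// from the HTTP port) is deliberately elided.
public class TrackingUriSketch {
  static URI applyWebPolicy(URI trackingUri, boolean httpsOnly)
      throws URISyntaxException {
    if (httpsOnly && "http".equalsIgnoreCase(trackingUri.getScheme())) {
      return new URI("https", trackingUri.getUserInfo(), trackingUri.getHost(),
          trackingUri.getPort(), trackingUri.getPath(), trackingUri.getQuery(),
          trackingUri.getFragment());
    }
    return trackingUri;
  }
}
{code}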



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6265) yarn.resourcemanager.fail-fast is used inconsistently

2017-03-14 Thread Yuanbo Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15925400#comment-15925400
 ] 

Yuanbo Liu commented on YARN-6265:
--

[~djp] Thanks

> yarn.resourcemanager.fail-fast is used inconsistently
> -
>
> Key: YARN-6265
> URL: https://issues.apache.org/jira/browse/YARN-6265
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.8.0
>Reporter: Daniel Templeton
>Assignee: Yuanbo Liu
> Attachments: YARN-6265.001.patch
>
>
> In capacity scheduler, the property is used to control whether an app with 
> no/bad queue should be killed.  In the state store, the property controls 
> whether a state store op failure should cause the RM to exit in non-HA mode.  
> Those are two very different things, and they should be separated.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6265) yarn.resourcemanager.fail-fast is used inconsistently

2017-03-14 Thread Yuanbo Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15925368#comment-15925368
 ] 

Yuanbo Liu commented on YARN-6265:
--

[~templedf] / [~djp] Is Jenkins having any issues? I submitted the patch and 
didn't get the result report.

> yarn.resourcemanager.fail-fast is used inconsistently
> -
>
> Key: YARN-6265
> URL: https://issues.apache.org/jira/browse/YARN-6265
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.8.0
>Reporter: Daniel Templeton
>Assignee: Yuanbo Liu
> Attachments: YARN-6265.001.patch
>
>
> In capacity scheduler, the property is used to control whether an app with 
> no/bad queue should be killed.  In the state store, the property controls 
> whether a state store op failure should cause the RM to exit in non-HA mode.  
> Those are two very different things, and they should be separated.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-6265) yarn.resourcemanager.fail-fast is used inconsistently

2017-03-14 Thread Yuanbo Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuanbo Liu updated YARN-6265:
-
Attachment: YARN-6265.001.patch

Uploaded a v1 patch for this JIRA.

> yarn.resourcemanager.fail-fast is used inconsistently
> -
>
> Key: YARN-6265
> URL: https://issues.apache.org/jira/browse/YARN-6265
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.8.0
>Reporter: Daniel Templeton
>Assignee: Yuanbo Liu
> Attachments: YARN-6265.001.patch
>
>
> In capacity scheduler, the property is used to control whether an app with 
> no/bad queue should be killed.  In the state store, the property controls 
> whether a state store op failure should cause the RM to exit in non-HA mode.  
> Those are two very different things, and they should be separated.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-6265) yarn.resourcemanager.fail-fast is used inconsistently

2017-03-14 Thread Yuanbo Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuanbo Liu reassigned YARN-6265:


Assignee: Yuanbo Liu

> yarn.resourcemanager.fail-fast is used inconsistently
> -
>
> Key: YARN-6265
> URL: https://issues.apache.org/jira/browse/YARN-6265
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.8.0
>Reporter: Daniel Templeton
>Assignee: Yuanbo Liu
>
> In capacity scheduler, the property is used to control whether an app with 
> no/bad queue should be killed.  In the state store, the property controls 
> whether a state store op failure should cause the RM to exit in non-HA mode.  
> Those are two very different things, and they should be separated.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6300) NULL_UPDATE_REQUESTS is redundant in TestFairScheduler

2017-03-09 Thread Yuanbo Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15904329#comment-15904329
 ] 

Yuanbo Liu commented on YARN-6300:
--

[~templedf] Thanks for your commit.
!Selection_124.png!
I've seen this patch in branch-2, so I guess I don't need to provide a 
branch-2 patch any more, right?

> NULL_UPDATE_REQUESTS is redundant in TestFairScheduler
> --
>
> Key: YARN-6300
> URL: https://issues.apache.org/jira/browse/YARN-6300
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 3.0.0-alpha2
>Reporter: Daniel Templeton
>Assignee: Yuanbo Liu
>Priority: Minor
>  Labels: newbie
> Fix For: 3.0.0-alpha3
>
> Attachments: Selection_124.png, YARN-6300.001.patch
>
>
> The {{TestFairScheduler.NULL_UPDATE_REQUESTS}} field hides 
> {{FairSchedulerTestBase.NULL_UPDATE_REQUESTS}}, which has the same value.  
> The {{NULL_UPDATE_REQUESTS}} field should be removed from 
> {{TestFairScheduler}}.
> While you're at it, maybe also remove the unused import.
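For readers unfamiliar with field hiding, a minimal self-contained 
illustration of the problem this cleanup removes; the class and field names 
mirror the test classes, but the bodies are invented:
{code:java}
// Sketch of field hiding: the subclass redeclares a field with the same name,
// so it shadows the base-class field instead of reusing it. Deleting the
// redundant declaration, as the issue suggests, changes nothing observable.
public class FieldHidingSketch {
  static class FairSchedulerTestBase {
    protected static final Object NULL_UPDATE_REQUESTS = null;
  }

  static class TestFairScheduler extends FairSchedulerTestBase {
    // Hides FairSchedulerTestBase.NULL_UPDATE_REQUESTS despite having the
    // same value.
    protected static final Object NULL_UPDATE_REQUESTS = null;
  }
}
{code}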



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-6300) NULL_UPDATE_REQUESTS is redundant in TestFairScheduler

2017-03-09 Thread Yuanbo Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuanbo Liu updated YARN-6300:
-
Attachment: Selection_124.png

> NULL_UPDATE_REQUESTS is redundant in TestFairScheduler
> --
>
> Key: YARN-6300
> URL: https://issues.apache.org/jira/browse/YARN-6300
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 3.0.0-alpha2
>Reporter: Daniel Templeton
>Assignee: Yuanbo Liu
>Priority: Minor
>  Labels: newbie
> Fix For: 3.0.0-alpha3
>
> Attachments: Selection_124.png, YARN-6300.001.patch
>
>
> The {{TestFairScheduler.NULL_UPDATE_REQUESTS}} field hides 
> {{FairSchedulerTestBase.NULL_UPDATE_REQUESTS}}, which has the same value.  
> The {{NULL_UPDATE_REQUESTS}} field should be removed from 
> {{TestFairScheduler}}.
> While you're at it, maybe also remove the unused import.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6300) NULL_UPDATE_REQUESTS is redundant in TestFairScheduler

2017-03-08 Thread Yuanbo Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15902391#comment-15902391
 ] 

Yuanbo Liu commented on YARN-6300:
--

[~haibochen] Thanks for your review.

> NULL_UPDATE_REQUESTS is redundant in TestFairScheduler
> --
>
> Key: YARN-6300
> URL: https://issues.apache.org/jira/browse/YARN-6300
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 3.0.0-alpha2
>Reporter: Daniel Templeton
>Assignee: Yuanbo Liu
>Priority: Minor
>  Labels: newbie
> Attachments: YARN-6300.001.patch
>
>
> The {{TestFairScheduler.NULL_UPDATE_REQUESTS}} field hides 
> {{FairSchedulerTestBase.NULL_UPDATE_REQUESTS}}, which has the same value.  
> The {{NULL_UPDATE_REQUESTS}} field should be removed from 
> {{TestFairScheduler}}.
> While you're at it, maybe also remove the unused import.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-6300) NULL_UPDATE_REQUESTS is redundant in TestFairScheduler

2017-03-08 Thread Yuanbo Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuanbo Liu updated YARN-6300:
-
Attachment: YARN-6300.001.patch

Uploaded a v1 patch for this JIRA.

> NULL_UPDATE_REQUESTS is redundant in TestFairScheduler
> --
>
> Key: YARN-6300
> URL: https://issues.apache.org/jira/browse/YARN-6300
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 3.0.0-alpha2
>Reporter: Daniel Templeton
>Assignee: Yuanbo Liu
>Priority: Minor
>  Labels: newbie
> Attachments: YARN-6300.001.patch
>
>
> The {{TestFairScheduler.NULL_UPDATE_REQUESTS}} field hides 
> {{FairSchedulerTestBase.NULL_UPDATE_REQUESTS}}, which has the same value.  
> The {{NULL_UPDATE_REQUESTS}} field should be removed from 
> {{TestFairScheduler}}.
> While you're at it, maybe also remove the unused import.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-6300) NULL_UPDATE_REQUESTS is redundant in TestFairScheduler

2017-03-08 Thread Yuanbo Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuanbo Liu reassigned YARN-6300:


Assignee: Yuanbo Liu

> NULL_UPDATE_REQUESTS is redundant in TestFairScheduler
> --
>
> Key: YARN-6300
> URL: https://issues.apache.org/jira/browse/YARN-6300
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 3.0.0-alpha2
>Reporter: Daniel Templeton
>Assignee: Yuanbo Liu
>Priority: Minor
>  Labels: newbie
>
> The {{TestFairScheduler.NULL_UPDATE_REQUESTS}} field hides 
> {{FairSchedulerTestBase.NULL_UPDATE_REQUESTS}}, which has the same value.  
> The {{NULL_UPDATE_REQUESTS}} field should be removed from 
> {{TestFairScheduler}}.
> While you're at it, maybe also remove the unused import.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-1728) Workaround guice3x-undecoded pathInfo in YARN WebApp

2017-02-28 Thread Yuanbo Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889359#comment-15889359
 ] 

Yuanbo Liu commented on YARN-1728:
--

Thanks a lot!

> Workaround guice3x-undecoded pathInfo in YARN WebApp
> 
>
> Key: YARN-1728
> URL: https://issues.apache.org/jira/browse/YARN-1728
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Abraham Elmahrek
>Assignee: Yuanbo Liu
> Fix For: 2.8.0, 2.7.4, 3.0.0-alpha3
>
> Attachments: test-case-for-trunk.patch, YARN-1728-branch-2.001.patch, 
> YARN-1728-branch-2.002.patch, YARN-1728-branch-2.003.patch, 
> YARN-1728-branch-2.004.patch, YARN-1728-branch-2.005.patch
>
>
> For example, going to the job history server page 
> http://localhost:19888/jobhistory/logs/localhost%3A8041/container_1391466602060_0011_01_01/job_1391466602060_0011/admin/stderr
>  results in the following error:
> {code}
> Cannot get container logs. Invalid nodeId: 
> test-cdh5-hue.ent.cloudera.com%3A8041
> {code}
> Where the url decoded version works:
> http://localhost:19888/jobhistory/logs/localhost:8041/container_1391466602060_0011_01_01/job_1391466602060_0011/admin/stderr
> It seems like both should be supported as the former is simply percent 
> encoding.
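
As a quick illustration of what the two URLs differ by (a standalone sketch, not the server code): {{%3A}} is simply the percent-encoded colon, so decoding the nodeId segment recovers the form the history server accepts.

{code}
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

public class DecodeNodeId {
  public static void main(String[] args) throws Exception {
    String encoded = "localhost%3A8041";
    // %3A is the percent-encoded ':', so decoding yields "localhost:8041",
    // i.e. the nodeId form the history server understands.
    String decoded = URLDecoder.decode(encoded, StandardCharsets.UTF_8.name());
    System.out.println(decoded); // localhost:8041
  }
}
{code}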



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-1728) History server doesn't understand percent encoded paths

2017-02-27 Thread Yuanbo Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15887139#comment-15887139
 ] 

Yuanbo Liu commented on YARN-1728:
--

[~jira.shegalov] Thanks for your comments.
Uploaded v5 patch and a test case for trunk to address them.

> History server doesn't understand percent encoded paths
> ---
>
> Key: YARN-1728
> URL: https://issues.apache.org/jira/browse/YARN-1728
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Abraham Elmahrek
>Assignee: Yuanbo Liu
> Attachments: test-case-for-trunk.patch, YARN-1728-branch-2.001.patch, 
> YARN-1728-branch-2.002.patch, YARN-1728-branch-2.003.patch, 
> YARN-1728-branch-2.004.patch, YARN-1728-branch-2.005.patch
>
>
> For example, going to the job history server page 
> http://localhost:19888/jobhistory/logs/localhost%3A8041/container_1391466602060_0011_01_01/job_1391466602060_0011/admin/stderr
>  results in the following error:
> {code}
> Cannot get container logs. Invalid nodeId: 
> test-cdh5-hue.ent.cloudera.com%3A8041
> {code}
> Where the url decoded version works:
> http://localhost:19888/jobhistory/logs/localhost:8041/container_1391466602060_0011_01_01/job_1391466602060_0011/admin/stderr
> It seems like both should be supported as the former is simply percent 
> encoding.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-1728) History server doesn't understand percent encoded paths

2017-02-27 Thread Yuanbo Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuanbo Liu updated YARN-1728:
-
Attachment: test-case-for-trunk.patch

> History server doesn't understand percent encoded paths
> ---
>
> Key: YARN-1728
> URL: https://issues.apache.org/jira/browse/YARN-1728
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Abraham Elmahrek
>Assignee: Yuanbo Liu
> Attachments: test-case-for-trunk.patch, YARN-1728-branch-2.001.patch, 
> YARN-1728-branch-2.002.patch, YARN-1728-branch-2.003.patch, 
> YARN-1728-branch-2.004.patch, YARN-1728-branch-2.005.patch
>
>
> For example, going to the job history server page 
> http://localhost:19888/jobhistory/logs/localhost%3A8041/container_1391466602060_0011_01_01/job_1391466602060_0011/admin/stderr
>  results in the following error:
> {code}
> Cannot get container logs. Invalid nodeId: 
> test-cdh5-hue.ent.cloudera.com%3A8041
> {code}
> Where the url decoded version works:
> http://localhost:19888/jobhistory/logs/localhost:8041/container_1391466602060_0011_01_01/job_1391466602060_0011/admin/stderr
> It seems like both should be supported as the former is simply percent 
> encoding.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-1728) History server doesn't understand percent encoded paths

2017-02-27 Thread Yuanbo Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuanbo Liu updated YARN-1728:
-
Attachment: YARN-1728-branch-2.005.patch

> History server doesn't understand percent encoded paths
> ---
>
> Key: YARN-1728
> URL: https://issues.apache.org/jira/browse/YARN-1728
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Abraham Elmahrek
>Assignee: Yuanbo Liu
> Attachments: YARN-1728-branch-2.001.patch, 
> YARN-1728-branch-2.002.patch, YARN-1728-branch-2.003.patch, 
> YARN-1728-branch-2.004.patch, YARN-1728-branch-2.005.patch
>
>
> For example, going to the job history server page 
> http://localhost:19888/jobhistory/logs/localhost%3A8041/container_1391466602060_0011_01_01/job_1391466602060_0011/admin/stderr
>  results in the following error:
> {code}
> Cannot get container logs. Invalid nodeId: 
> test-cdh5-hue.ent.cloudera.com%3A8041
> {code}
> Where the url decoded version works:
> http://localhost:19888/jobhistory/logs/localhost:8041/container_1391466602060_0011_01_01/job_1391466602060_0011/admin/stderr
> It seems like both should be supported as the former is simply percent 
> encoding.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-1728) History server doesn't understand percent encoded paths

2017-02-26 Thread Yuanbo Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuanbo Liu updated YARN-1728:
-
Attachment: YARN-1728-branch-2.004.patch

> History server doesn't understand percent encoded paths
> ---
>
> Key: YARN-1728
> URL: https://issues.apache.org/jira/browse/YARN-1728
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Abraham Elmahrek
>Assignee: Yuanbo Liu
> Attachments: YARN-1728-branch-2.001.patch, 
> YARN-1728-branch-2.002.patch, YARN-1728-branch-2.003.patch, 
> YARN-1728-branch-2.004.patch
>
>
> For example, going to the job history server page 
> http://localhost:19888/jobhistory/logs/localhost%3A8041/container_1391466602060_0011_01_01/job_1391466602060_0011/admin/stderr
>  results in the following error:
> {code}
> Cannot get container logs. Invalid nodeId: 
> test-cdh5-hue.ent.cloudera.com%3A8041
> {code}
> Where the url decoded version works:
> http://localhost:19888/jobhistory/logs/localhost:8041/container_1391466602060_0011_01_01/job_1391466602060_0011/admin/stderr
> It seems like both should be supported as the former is simply percent 
> encoding.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-1728) History server doesn't understand percent encoded paths

2017-02-26 Thread Yuanbo Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15885088#comment-15885088
 ] 

Yuanbo Liu commented on YARN-1728:
--

[~jira.shegalov] Thanks for your comments.
Agreed. Since the guice patch uses {{pathInfo = new URI(pathInfo).getPath()}} 
inside a try..catch block, and {{URI.create}} raises an unchecked run-time 
exception instead, I used {{new URI}} rather than {{URI.create}} for 
compatibility. Uploaded v4 patch for this JIRA; please review it.
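
For reference, a standalone sketch (not the patch itself) of the difference in exception behavior between the two constructors:

{code}
import java.net.URI;
import java.net.URISyntaxException;

public class UriBehavior {
  public static void main(String[] args) {
    String pathInfo = "/jobhistory/logs/localhost%3A8041";

    // new URI(...) declares the checked URISyntaxException, so a malformed
    // path is forced through the surrounding try..catch -- matching the
    // structure of the guice patch.
    try {
      System.out.println(new URI(pathInfo).getPath()); // .../localhost:8041
    } catch (URISyntaxException e) {
      System.out.println("fall back to the raw path: " + pathInfo);
    }

    // URI.create(...) wraps the same failure in an unchecked
    // IllegalArgumentException, which the compiler never forces callers
    // to handle -- the compatibility concern mentioned above.
    System.out.println(URI.create(pathInfo).getPath());
  }
}
{code}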

> History server doesn't understand percent encoded paths
> ---
>
> Key: YARN-1728
> URL: https://issues.apache.org/jira/browse/YARN-1728
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Abraham Elmahrek
>Assignee: Yuanbo Liu
> Attachments: YARN-1728-branch-2.001.patch, 
> YARN-1728-branch-2.002.patch, YARN-1728-branch-2.003.patch
>
>
> For example, going to the job history server page 
> http://localhost:19888/jobhistory/logs/localhost%3A8041/container_1391466602060_0011_01_01/job_1391466602060_0011/admin/stderr
>  results in the following error:
> {code}
> Cannot get container logs. Invalid nodeId: 
> test-cdh5-hue.ent.cloudera.com%3A8041
> {code}
> Where the url decoded version works:
> http://localhost:19888/jobhistory/logs/localhost:8041/container_1391466602060_0011_01_01/job_1391466602060_0011/admin/stderr
> It seems like both should be supported as the former is simply percent 
> encoding.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-1728) History server doesn't understand percent encoded paths

2017-02-23 Thread Yuanbo Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuanbo Liu updated YARN-1728:
-
Attachment: YARN-1728-branch-2.003.patch

[~haibochen] Thanks a lot for your comments.
Uploaded v3 patch.

> History server doesn't understand percent encoded paths
> ---
>
> Key: YARN-1728
> URL: https://issues.apache.org/jira/browse/YARN-1728
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Abraham Elmahrek
>Assignee: Yuanbo Liu
> Attachments: YARN-1728-branch-2.001.patch, 
> YARN-1728-branch-2.002.patch, YARN-1728-branch-2.003.patch
>
>
> For example, going to the job history server page 
> http://localhost:19888/jobhistory/logs/localhost%3A8041/container_1391466602060_0011_01_01/job_1391466602060_0011/admin/stderr
>  results in the following error:
> {code}
> Cannot get container logs. Invalid nodeId: 
> test-cdh5-hue.ent.cloudera.com%3A8041
> {code}
> Where the url decoded version works:
> http://localhost:19888/jobhistory/logs/localhost:8041/container_1391466602060_0011_01_01/job_1391466602060_0011/admin/stderr
> It seems like both should be supported as the former is simply percent 
> encoding.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-1728) History server doesn't understand percent encoded paths

2017-02-22 Thread Yuanbo Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15879972#comment-15879972
 ] 

Yuanbo Liu commented on YARN-1728:
--

[~haibochen] Thanks for your review.
Uploaded v2 patch to address your comments.

> History server doesn't understand percent encoded paths
> ---
>
> Key: YARN-1728
> URL: https://issues.apache.org/jira/browse/YARN-1728
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Abraham Elmahrek
>Assignee: Yuanbo Liu
> Attachments: YARN-1728-branch-2.001.patch, 
> YARN-1728-branch-2.002.patch
>
>
> For example, going to the job history server page 
> http://localhost:19888/jobhistory/logs/localhost%3A8041/container_1391466602060_0011_01_01/job_1391466602060_0011/admin/stderr
>  results in the following error:
> {code}
> Cannot get container logs. Invalid nodeId: 
> test-cdh5-hue.ent.cloudera.com%3A8041
> {code}
> Where the url decoded version works:
> http://localhost:19888/jobhistory/logs/localhost:8041/container_1391466602060_0011_01_01/job_1391466602060_0011/admin/stderr
> It seems like both should be supported as the former is simply percent 
> encoding.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-1728) History server doesn't understand percent encoded paths

2017-02-22 Thread Yuanbo Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuanbo Liu updated YARN-1728:
-
Attachment: YARN-1728-branch-2.002.patch

> History server doesn't understand percent encoded paths
> ---
>
> Key: YARN-1728
> URL: https://issues.apache.org/jira/browse/YARN-1728
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Abraham Elmahrek
>Assignee: Yuanbo Liu
> Attachments: YARN-1728-branch-2.001.patch, 
> YARN-1728-branch-2.002.patch
>
>
> For example, going to the job history server page 
> http://localhost:19888/jobhistory/logs/localhost%3A8041/container_1391466602060_0011_01_01/job_1391466602060_0011/admin/stderr
>  results in the following error:
> {code}
> Cannot get container logs. Invalid nodeId: 
> test-cdh5-hue.ent.cloudera.com%3A8041
> {code}
> Where the url decoded version works:
> http://localhost:19888/jobhistory/logs/localhost:8041/container_1391466602060_0011_01_01/job_1391466602060_0011/admin/stderr
> It seems like both should be supported as the former is simply percent 
> encoding.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-1728) History server doesn't understand percent encoded paths

2017-02-17 Thread Yuanbo Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15871448#comment-15871448
 ] 

Yuanbo Liu commented on YARN-1728:
--

[~jira.shegalov] / [~rkanter]
Would you mind having a look at my patch and sharing your thoughts? Thanks in 
advance!

> History server doesn't understand percent encoded paths
> ---
>
> Key: YARN-1728
> URL: https://issues.apache.org/jira/browse/YARN-1728
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Abraham Elmahrek
>Assignee: Yuanbo Liu
> Attachments: YARN-1728-branch-2.001.patch
>
>
> For example, going to the job history server page 
> http://localhost:19888/jobhistory/logs/localhost%3A8041/container_1391466602060_0011_01_01/job_1391466602060_0011/admin/stderr
>  results in the following error:
> {code}
> Cannot get container logs. Invalid nodeId: 
> test-cdh5-hue.ent.cloudera.com%3A8041
> {code}
> Where the url decoded version works:
> http://localhost:19888/jobhistory/logs/localhost:8041/container_1391466602060_0011_01_01/job_1391466602060_0011/admin/stderr
> It seems like both should be supported as the former is simply percent 
> encoding.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-1728) History server doesn't understand percent encoded paths

2017-02-17 Thread Yuanbo Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuanbo Liu updated YARN-1728:
-
Attachment: YARN-1728-branch-2.001.patch

The root cause of this defect is that the third-party jar guice-3.0 doesn't 
obey the rule introduced by [~jira.shegalov].
With HADOOP-12064, this defect is no longer a problem in trunk and hadoop-3, 
but it still exists in branch-2.
I strongly suggest addressing this defect in hadoop-2, because percent-encoded 
paths in URLs are quite common and the implementation in guice-3.0 doesn't 
follow the contract of the {{HttpServletRequest#getPathInfo}} interface.
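
The servlet contract requires {{getPathInfo()}} to return an already-decoded path, so a defensive workaround in the webapp layer looks roughly like the sketch below (a hypothetical helper, assuming the raw value may still arrive percent-encoded):

{code}
import java.net.URI;
import java.net.URISyntaxException;

public final class PathInfoDecoder {
  private PathInfoDecoder() {}

  /**
   * Returns the decoded form of a pathInfo value that guice-3.0 may hand
   * back still percent-encoded. Falls back to the raw value if it is not
   * a parseable URI.
   */
  public static String decode(String rawPathInfo) {
    if (rawPathInfo == null) {
      return null;
    }
    try {
      return new URI(rawPathInfo).getPath();
    } catch (URISyntaxException e) {
      return rawPathInfo;
    }
  }
}
{code}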

> History server doesn't understand percent encoded paths
> ---
>
> Key: YARN-1728
> URL: https://issues.apache.org/jira/browse/YARN-1728
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Abraham Elmahrek
>Assignee: Yuanbo Liu
> Attachments: YARN-1728-branch-2.001.patch
>
>
> For example, going to the job history server page 
> http://localhost:19888/jobhistory/logs/localhost%3A8041/container_1391466602060_0011_01_01/job_1391466602060_0011/admin/stderr
>  results in the following error:
> {code}
> Cannot get container logs. Invalid nodeId: 
> test-cdh5-hue.ent.cloudera.com%3A8041
> {code}
> Where the url decoded version works:
> http://localhost:19888/jobhistory/logs/localhost:8041/container_1391466602060_0011_01_01/job_1391466602060_0011/admin/stderr
> It seems like both should be supported as the former is simply percent 
> encoding.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-1728) History server doesn't understand percent encoded paths

2017-02-16 Thread Yuanbo Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15871131#comment-15871131
 ] 

Yuanbo Liu commented on YARN-1728:
--

I'd like to take this over and look into why the path is not decoded. We've 
run into this kind of defect in hadoop-2.7.3.

> History server doesn't understand percent encoded paths
> ---
>
> Key: YARN-1728
> URL: https://issues.apache.org/jira/browse/YARN-1728
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Abraham Elmahrek
>Assignee: Yuanbo Liu
>
> For example, going to the job history server page 
> http://localhost:19888/jobhistory/logs/localhost%3A8041/container_1391466602060_0011_01_01/job_1391466602060_0011/admin/stderr
>  results in the following error:
> {code}
> Cannot get container logs. Invalid nodeId: 
> test-cdh5-hue.ent.cloudera.com%3A8041
> {code}
> Where the url decoded version works:
> http://localhost:19888/jobhistory/logs/localhost:8041/container_1391466602060_0011_01_01/job_1391466602060_0011/admin/stderr
> It seems like both should be supported as the former is simply percent 
> encoding.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-1728) History server doesn't understand percent encoded paths

2017-02-16 Thread Yuanbo Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuanbo Liu reassigned YARN-1728:


Assignee: Yuanbo Liu

> History server doesn't understand percent encoded paths
> ---
>
> Key: YARN-1728
> URL: https://issues.apache.org/jira/browse/YARN-1728
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Abraham Elmahrek
>Assignee: Yuanbo Liu
>
> For example, going to the job history server page 
> http://localhost:19888/jobhistory/logs/localhost%3A8041/container_1391466602060_0011_01_01/job_1391466602060_0011/admin/stderr
>  results in the following error:
> {code}
> Cannot get container logs. Invalid nodeId: 
> test-cdh5-hue.ent.cloudera.com%3A8041
> {code}
> Where the url decoded version works:
> http://localhost:19888/jobhistory/logs/localhost:8041/container_1391466602060_0011_01_01/job_1391466602060_0011/admin/stderr
> It seems like both should be supported as the former is simply percent 
> encoding.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6073) Misuse of format specifier in Preconditions.checkArgument

2017-01-09 Thread Yuanbo Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15814032#comment-15814032
 ] 

Yuanbo Liu commented on YARN-6073:
--

[~templedf] Thanks for your review.

> Misuse of format specifier in Preconditions.checkArgument
> -
>
> Key: YARN-6073
> URL: https://issues.apache.org/jira/browse/YARN-6073
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Yongjun Zhang
>Assignee: Yuanbo Liu
>Priority: Trivial
> Attachments: YARN-6073.001.patch
>
>
> RMAdminCLI.java
> {code}
>  int nLabels = map.get(nodeId).size();
>   Preconditions.checkArgument(nLabels <= 1, "%d labels specified on 
> host=%s"
>   + ", please note that we do not support specifying multiple"
>   + " labels on a single host for now.", nLabels, nodeIdStr);
> {code}
> The {{%d}} should be replaced with {{%s}}, per
> https://google.github.io/guava/releases/19.0/api/docs/com/google/common/base/Preconditions.html
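
For clarity, Guava's {{Preconditions}} message template only substitutes {{%s}} placeholders (anything else is left literal in the message), so the fix is to use {{%s}} for both arguments. A minimal sketch of the corrected call (the wrapper method is hypothetical):

{code}
import com.google.common.base.Preconditions;

public class LabelCheck {
  static void check(int nLabels, String nodeIdStr) {
    // Guava substitutes each %s positionally; %d is not interpreted and
    // would appear literally in the rendered message.
    Preconditions.checkArgument(nLabels <= 1, "%s labels specified on host=%s"
        + ", please note that we do not support specifying multiple"
        + " labels on a single host for now.", nLabels, nodeIdStr);
  }
}
{code}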



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-6073) Misuse of format specifier in Preconditions.checkArgument

2017-01-08 Thread Yuanbo Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuanbo Liu updated YARN-6073:
-
Attachment: YARN-6073.001.patch

Uploaded v1 patch for this JIRA.

> Misuse of format specifier in Preconditions.checkArgument
> -
>
> Key: YARN-6073
> URL: https://issues.apache.org/jira/browse/YARN-6073
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Yongjun Zhang
>Assignee: Yuanbo Liu
>Priority: Trivial
> Attachments: YARN-6073.001.patch
>
>
> RMAdminCLI.java
> {code}
>  int nLabels = map.get(nodeId).size();
>   Preconditions.checkArgument(nLabels <= 1, "%d labels specified on 
> host=%s"
>   + ", please note that we do not support specifying multiple"
>   + " labels on a single host for now.", nLabels, nodeIdStr);
> {code}
> The {{%d}} should be replaced with {{%s}}, per
> https://google.github.io/guava/releases/19.0/api/docs/com/google/common/base/Preconditions.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-6073) Misuse of format specifier in Preconditions.checkArgument

2017-01-08 Thread Yuanbo Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuanbo Liu reassigned YARN-6073:


Assignee: Yuanbo Liu

> Misuse of format specifier in Preconditions.checkArgument
> -
>
> Key: YARN-6073
> URL: https://issues.apache.org/jira/browse/YARN-6073
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Yongjun Zhang
>Assignee: Yuanbo Liu
>Priority: Trivial
>
> RMAdminCLI.java
> {code}
>  int nLabels = map.get(nodeId).size();
>   Preconditions.checkArgument(nLabels <= 1, "%d labels specified on 
> host=%s"
>   + ", please note that we do not support specifying multiple"
>   + " labels on a single host for now.", nLabels, nodeIdStr);
> {code}
> The {{%d}} should be replaced with {{%s}}, per
> https://google.github.io/guava/releases/19.0/api/docs/com/google/common/base/Preconditions.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6073) Misuse of format specifier in Preconditions.checkArgument

2017-01-08 Thread Yuanbo Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15810780#comment-15810780
 ] 

Yuanbo Liu commented on YARN-6073:
--

This JIRA would be a good start for my first YARN patch.
[~yzhangal] Would you mind assigning this JIRA to me? I don't have the 
privilege to assign YARN JIRAs to myself.

> Misuse of format specifier in Preconditions.checkArgument
> -
>
> Key: YARN-6073
> URL: https://issues.apache.org/jira/browse/YARN-6073
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Yongjun Zhang
>Priority: Trivial
>
> RMAdminCLI.java
> {code}
>  int nLabels = map.get(nodeId).size();
>   Preconditions.checkArgument(nLabels <= 1, "%d labels specified on 
> host=%s"
>   + ", please note that we do not support specifying multiple"
>   + " labels on a single host for now.", nLabels, nodeIdStr);
> {code}
> The {{%d}} should be replaced with {{%s}}, per
> https://google.github.io/guava/releases/19.0/api/docs/com/google/common/base/Preconditions.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org