[jira] [Commented] (YARN-7556) Fair scheduler configuration should allow resource types in the minResources and maxResources properties

2018-07-07 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-7556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16535984#comment-16535984
 ] 

Wangda Tan commented on YARN-7556:
--

[~snemeth], please go ahead and create the JIRA to track the issue. 

Thanks a lot!

> Fair scheduler configuration should allow resource types in the minResources 
> and maxResources properties
> 
>
> Key: YARN-7556
> URL: https://issues.apache.org/jira/browse/YARN-7556
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: fairscheduler
>Affects Versions: 3.0.0-beta1
>Reporter: Daniel Templeton
>Assignee: Szilard Nemeth
>Priority: Critical
> Fix For: 3.2.0
>
> Attachments: YARN-7556.001.patch, YARN-7556.002.patch, 
> YARN-7556.003.patch, YARN-7556.004.patch, YARN-7556.005.patch, 
> YARN-7556.006.patch, YARN-7556.007.patch, YARN-7556.008.patch, 
> YARN-7556.009.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7556) Fair scheduler configuration should allow resource types in the minResources and maxResources properties

2018-07-05 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-7556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16533975#comment-16533975
 ] 

Wangda Tan commented on YARN-7556:
--

[~haibochen], [~templedf], [~snemeth], 
I was planning to post some comments, but the patch already went in :).

I would prefer not to change the common classes (like Resource/LightWeightResource) 
to add the ability to set all resource types to the same value. 

The API looks really confusing: 

We have: 
{code} 
  @Public
  @Stable
  public static Resource newInstance(long memory, int vCores) {
return new LightWeightResource(memory, vCores);
  }
{code} 

But the newly added method has the same first parameter type (long value) with 
completely different semantics. 

If you really want to add this to the common layer and reuse it in other places, I suggest 
adding it to Resources with a method name like {{createResourceWithSameValue}}.
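
For illustration, here is a minimal sketch of what such a helper could look like, assuming the Hadoop 3.x {{Resource}} / {{ResourceUtils}} APIs; the method name simply follows the suggestion above and this is not the code that was actually committed:

{code}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.api.records.ResourceInformation;
import org.apache.hadoop.yarn.util.resource.ResourceUtils;

public final class ResourceSameValueSketch {
  private ResourceSameValueSketch() {
  }

  /**
   * Sketch only: build a Resource whose every configured resource type
   * (memory, vcores and any additional types) is set to the same value.
   */
  public static Resource createResourceWithSameValue(long value) {
    // Start from the standard two-type factory quoted above.
    Resource resource = Resource.newInstance(value, (int) value);
    // Then set every known resource type, including the extended ones.
    for (ResourceInformation ri : ResourceUtils.getResourceTypesArray()) {
      resource.setResourceValue(ri.getName(), value);
    }
    return resource;
  }
}
{code}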

> Fair scheduler configuration should allow resource types in the minResources 
> and maxResources properties
> 
>
> Key: YARN-7556
> URL: https://issues.apache.org/jira/browse/YARN-7556
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: fairscheduler
>Affects Versions: 3.0.0-beta1
>Reporter: Daniel Templeton
>Assignee: Szilard Nemeth
>Priority: Critical
> Fix For: 3.2.0
>
> Attachments: YARN-7556.001.patch, YARN-7556.002.patch, 
> YARN-7556.003.patch, YARN-7556.004.patch, YARN-7556.005.patch, 
> YARN-7556.006.patch, YARN-7556.007.patch, YARN-7556.008.patch, 
> YARN-7556.009.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8489) Need to support pluggable termination policy for native services

2018-07-03 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8489:
-
Summary: Need to support pluggable termination policy for native services  
(was: Need to support customer termination policy for native services)

> Need to support pluggable termination policy for native services
> 
>
> Key: YARN-8489
> URL: https://issues.apache.org/jira/browse/YARN-8489
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: yarn-native-services
>Reporter: Wangda Tan
>Priority: Major
>
> The existing YARN service supports termination behavior tied to the component restart 
> policy. For example, ALWAYS means the service will never be terminated, and NEVER 
> means the service will be terminated once all components have terminated. 
> Some jobs/services need a different policy. For example, if the Tensorflow 
> master component terminates (regardless of whether it succeeded or failed), we need to 
> terminate the whole training job regardless of the states of the other components.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8193) YARN RM hangs abruptly (stops allocating resources) when running successive applications.

2018-07-03 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8193:
-
Fix Version/s: (was: 2.9.0)

> YARN RM hangs abruptly (stops allocating resources) when running successive 
> applications.
> -
>
> Key: YARN-8193
> URL: https://issues.apache.org/jira/browse/YARN-8193
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Zian Chen
>Assignee: Zian Chen
>Priority: Critical
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-8193-branch-2.9.0-001.patch, YARN-8193.001.patch, 
> YARN-8193.002.patch
>
>
> When running massive queries successively, at some point the RM just hangs and 
> stops allocating resources. At the point the RM hangs, YARN throws a 
> NullPointerException at RegularContainerAllocator.getLocalityWaitFactor.
> There's sufficient space given to yarn.nodemanager.local-dirs (not a node 
> health issue; the RM didn't report any node being unhealthy). There is no fixed 
> trigger for this (query or operation).
> This problem goes away on restarting the ResourceManager. No NM restart is 
> required. 
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8193) YARN RM hangs abruptly (stops allocating resources) when running successive applications.

2018-07-03 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8193:
-
Fix Version/s: 3.2.0

> YARN RM hangs abruptly (stops allocating resources) when running successive 
> applications.
> -
>
> Key: YARN-8193
> URL: https://issues.apache.org/jira/browse/YARN-8193
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Zian Chen
>Assignee: Zian Chen
>Priority: Critical
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-8193-branch-2.9.0-001.patch, YARN-8193.001.patch, 
> YARN-8193.002.patch
>
>
> When running massive queries successively, at some point the RM just hangs and 
> stops allocating resources. At the point the RM hangs, YARN throws a 
> NullPointerException at RegularContainerAllocator.getLocalityWaitFactor.
> There's sufficient space given to yarn.nodemanager.local-dirs (not a node 
> health issue; the RM didn't report any node being unhealthy). There is no fixed 
> trigger for this (query or operation).
> This problem goes away on restarting the ResourceManager. No NM restart is 
> required. 
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8193) YARN RM hangs abruptly (stops allocating resources) when running successive applications.

2018-07-03 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16531749#comment-16531749
 ] 

Wangda Tan commented on YARN-8193:
--

[~elgoiri], I didn't see this patch go into branch-2.9, so I just removed 2.9.0 
from the fix versions.

> YARN RM hangs abruptly (stops allocating resources) when running successive 
> applications.
> -
>
> Key: YARN-8193
> URL: https://issues.apache.org/jira/browse/YARN-8193
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Zian Chen
>Assignee: Zian Chen
>Priority: Critical
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-8193-branch-2.9.0-001.patch, YARN-8193.001.patch, 
> YARN-8193.002.patch
>
>
> When running massive queries successively, at some point the RM just hangs and 
> stops allocating resources. At the point the RM hangs, YARN throws a 
> NullPointerException at RegularContainerAllocator.getLocalityWaitFactor.
> There's sufficient space given to yarn.nodemanager.local-dirs (not a node 
> health issue; the RM didn't report any node being unhealthy). There is no fixed 
> trigger for this (query or operation).
> This problem goes away on restarting the ResourceManager. No NM restart is 
> required. 
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8193) YARN RM hangs abruptly (stops allocating resources) when running successive applications.

2018-07-03 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8193:
-
Fix Version/s: (was: 3.2.0)

> YARN RM hangs abruptly (stops allocating resources) when running successive 
> applications.
> -
>
> Key: YARN-8193
> URL: https://issues.apache.org/jira/browse/YARN-8193
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Zian Chen
>Assignee: Zian Chen
>Priority: Critical
> Fix For: 2.9.0, 3.1.1
>
> Attachments: YARN-8193-branch-2.9.0-001.patch, YARN-8193.001.patch, 
> YARN-8193.002.patch
>
>
> When running massive queries successively, at some point the RM just hangs and 
> stops allocating resources. At the point the RM hangs, YARN throws a 
> NullPointerException at RegularContainerAllocator.getLocalityWaitFactor.
> There's sufficient space given to yarn.nodemanager.local-dirs (not a node 
> health issue; the RM didn't report any node being unhealthy). There is no fixed 
> trigger for this (query or operation).
> This problem goes away on restarting the ResourceManager. No NM restart is 
> required. 
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-8489) Need to support customer termination policy for native services

2018-07-02 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-8489:


 Summary: Need to support customer termination policy for native 
services
 Key: YARN-8489
 URL: https://issues.apache.org/jira/browse/YARN-8489
 Project: Hadoop YARN
  Issue Type: Task
  Components: yarn-native-services
Reporter: Wangda Tan


The existing YARN service supports termination behavior tied to the component restart 
policy. For example, ALWAYS means the service will never be terminated, and NEVER 
means the service will be terminated once all components have terminated. 

Some jobs/services need a different policy. For example, if the Tensorflow 
master component terminates (regardless of whether it succeeded or failed), we need to 
terminate the whole training job regardless of the states of the other components.
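
To make the idea concrete, here is a purely hypothetical sketch of what a pluggable termination policy could look like; the interface, enum and class names are illustrative only and are not part of any existing YARN API:

{code}
import java.util.Map;

/** Hypothetical pluggable termination policy; not an existing YARN interface. */
public interface ServiceTerminationPolicy {

  enum ComponentState { RUNNING, SUCCEEDED, FAILED }

  /**
   * Decide whether the whole service should be terminated, given the current
   * state of each component (keyed by component name).
   */
  boolean shouldTerminate(Map<String, ComponentState> componentStates);
}

/** Example policy: terminate the service as soon as the "master" component ends. */
class TerminateOnMasterExitPolicy implements ServiceTerminationPolicy {
  @Override
  public boolean shouldTerminate(Map<String, ComponentState> componentStates) {
    ComponentState master = componentStates.get("master");
    // Terminate regardless of whether the master succeeded or failed.
    return master != null && master != ComponentState.RUNNING;
  }
}
{code}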



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8489) Need to support customer termination policy for native services

2018-07-02 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16530825#comment-16530825
 ] 

Wangda Tan commented on YARN-8489:
--

cc: [~gsaha], [~csingh], [~billie.rinaldi], [~eyang]

> Need to support customer termination policy for native services
> ---
>
> Key: YARN-8489
> URL: https://issues.apache.org/jira/browse/YARN-8489
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: yarn-native-services
>Reporter: Wangda Tan
>Priority: Major
>
> The existing YARN service supports termination behavior tied to the component restart 
> policy. For example, ALWAYS means the service will never be terminated, and NEVER 
> means the service will be terminated once all components have terminated. 
> Some jobs/services need a different policy. For example, if the Tensorflow 
> master component terminates (regardless of whether it succeeded or failed), we need to 
> terminate the whole training job regardless of the states of the other components.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8488) Need to add "SUCCEED" state to YARN service

2018-07-02 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8488:
-
Target Version/s: 3.2.0

> Need to add "SUCCEED" state to YARN service
> ---
>
> Key: YARN-8488
> URL: https://issues.apache.org/jira/browse/YARN-8488
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: yarn-native-services
>Reporter: Wangda Tan
>Priority: Major
>
> The existing YARN service has the following states:
> {code} 
> public enum ServiceState {
>   ACCEPTED, STARTED, STABLE, STOPPED, FAILED, FLEX, UPGRADING,
>   UPGRADING_AUTO_FINALIZE;
> }
> {code} 
> Ideally we should add a "SUCCEEDED" state in order to support long-running 
> applications like Tensorflow.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8488) Need to add "SUCCEED" state to YARN service

2018-07-02 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16530821#comment-16530821
 ] 

Wangda Tan commented on YARN-8488:
--

cc: [~gsaha], [~csingh], [~billie.rinaldi], [~eyang]

> Need to add "SUCCEED" state to YARN service
> ---
>
> Key: YARN-8488
> URL: https://issues.apache.org/jira/browse/YARN-8488
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: yarn-native-services
>Reporter: Wangda Tan
>Priority: Major
>
> The existing YARN service has the following states:
> {code} 
> public enum ServiceState {
>   ACCEPTED, STARTED, STABLE, STOPPED, FAILED, FLEX, UPGRADING,
>   UPGRADING_AUTO_FINALIZE;
> }
> {code} 
> Ideally we should add a "SUCCEEDED" state in order to support long-running 
> applications like Tensorflow.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-8488) Need to add "SUCCEED" state to YARN service

2018-07-02 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-8488:


 Summary: Need to add "SUCCEED" state to YARN service
 Key: YARN-8488
 URL: https://issues.apache.org/jira/browse/YARN-8488
 Project: Hadoop YARN
  Issue Type: Task
Reporter: Wangda Tan


The existing YARN service has the following states:

{code} 
public enum ServiceState {
  ACCEPTED, STARTED, STABLE, STOPPED, FAILED, FLEX, UPGRADING,
  UPGRADING_AUTO_FINALIZE;
}
{code} 

Ideally we should add a "SUCCEEDED" state in order to support long-running 
applications like Tensorflow.
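
As a concrete sketch of the proposal, the enum would only gain the new constant (everything else mirrors the code quoted above; the exact placement is of course up to the actual patch):

{code}
public enum ServiceState {
  // existing states, unchanged
  ACCEPTED, STARTED, STABLE, STOPPED, FAILED, FLEX, UPGRADING,
  UPGRADING_AUTO_FINALIZE,
  // proposed addition for services whose components finish successfully
  SUCCEEDED;
}
{code}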



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8488) Need to add "SUCCEED" state to YARN service

2018-07-02 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8488:
-
Component/s: yarn-native-services

> Need to add "SUCCEED" state to YARN service
> ---
>
> Key: YARN-8488
> URL: https://issues.apache.org/jira/browse/YARN-8488
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: yarn-native-services
>Reporter: Wangda Tan
>Priority: Major
>
> The existing YARN service has the following states:
> {code} 
> public enum ServiceState {
>   ACCEPTED, STARTED, STABLE, STOPPED, FAILED, FLEX, UPGRADING,
>   UPGRADING_AUTO_FINALIZE;
> }
> {code} 
> Ideally we should add a "SUCCEEDED" state in order to support long-running 
> applications like Tensorflow.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8459) Improve logs of Capacity Scheduler to better debug invalid states

2018-07-02 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16530233#comment-16530233
 ] 

Wangda Tan commented on YARN-8459:
--

Attached patch (004), which moves the re-reservation message to debug level and 
addresses comments from [~bibinchundatt] / [~Tao Yang]. 

Please review and let me know your thoughts.
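
For reference, the kind of change involved is simply demoting the hot-path message and guarding it, roughly like the sketch below (illustrative only, using SLF4J; this is not the actual YARN-8459 patch):

{code}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class ReservationLoggingSketch {
  private static final Logger LOG =
      LoggerFactory.getLogger(ReservationLoggingSketch.class);

  void logReservationAttempt(String appId, String nodeId) {
    // Guarded DEBUG instead of INFO, so repeated re-reservations of the same
    // container no longer flood the scheduler log.
    if (LOG.isDebugEnabled()) {
      LOG.debug("Trying to fulfill reservation for application {} on node: {}",
          appId, nodeId);
    }
  }
}
{code}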

> Improve logs of Capacity Scheduler to better debug invalid states
> -
>
> Key: YARN-8459
> URL: https://issues.apache.org/jira/browse/YARN-8459
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.1.0
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Major
> Attachments: YARN-8459.001.patch, YARN-8459.002.patch, 
> YARN-8459.003.patch, YARN-8459.004.patch
>
>
> Improve logs in CS to better debug invalid states



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8459) Improve logs of Capacity Scheduler to better debug invalid states

2018-07-02 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8459:
-
Attachment: YARN-8459.004.patch

> Improve logs of Capacity Scheduler to better debug invalid states
> -
>
> Key: YARN-8459
> URL: https://issues.apache.org/jira/browse/YARN-8459
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.1.0
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Major
> Attachments: YARN-8459.001.patch, YARN-8459.002.patch, 
> YARN-8459.003.patch, YARN-8459.004.patch
>
>
> Improve logs in CS to better debug invalid states



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8193) YARN RM hangs abruptly (stops allocating resources) when running successive applications.

2018-06-30 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16528957#comment-16528957
 ] 

Wangda Tan commented on YARN-8193:
--

[~elgoiri], Jenkins will be triggered after the patch is submitted.

> YARN RM hangs abruptly (stops allocating resources) when running successive 
> applications.
> -
>
> Key: YARN-8193
> URL: https://issues.apache.org/jira/browse/YARN-8193
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Zian Chen
>Assignee: Zian Chen
>Priority: Critical
> Fix For: 2.9.0, 3.2.0, 3.1.1
>
> Attachments: YARN-8193-branch-2.9.0-001.patch, YARN-8193.001.patch, 
> YARN-8193.002.patch
>
>
> When running massive queries successively, at some point the RM just hangs and 
> stops allocating resources. At the point the RM hangs, YARN throws a 
> NullPointerException at RegularContainerAllocator.getLocalityWaitFactor.
> There's sufficient space given to yarn.nodemanager.local-dirs (not a node 
> health issue; the RM didn't report any node being unhealthy). There is no fixed 
> trigger for this (query or operation).
> This problem goes away on restarting the ResourceManager. No NM restart is 
> required. 
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8471) YARN RM hangs and stops allocating resources when applications successively running

2018-06-29 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16528545#comment-16528545
 ] 

Wangda Tan commented on YARN-8471:
--

[~jutia], if this is mostly the same as YARN-8193, could you reopen YARN-8193 and 
attach the patch there? cc: [~elgoiri]

> YARN RM hangs and stops allocating resources when applications successively 
> running
> ---
>
> Key: YARN-8471
> URL: https://issues.apache.org/jira/browse/YARN-8471
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 2.9.0
>Reporter: tianjuan
>Assignee: tianjuan
>Priority: Major
> Fix For: 2.9.0
>
> Attachments: YARN-8471-branch-2.9.0-001.patch, YARN-8471.001.patch
>
>
> At some point the RM just hangs and stops allocating resources. At the point the 
> RM hangs, YARN throws a NullPointerException at 
> RegularContainerAllocator#allocate, 
> RegularContainerAllocator#preCheckForPlacementSet, and 
> RegularContainerAllocator#getLocalityWaitFactor.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-8478) The capacity scheduler logs too frequently seriously affecting performance

2018-06-29 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan resolved YARN-8478.
--
Resolution: Duplicate

> The capacity scheduler logs too frequently seriously affecting performance
> --
>
> Key: YARN-8478
> URL: https://issues.apache.org/jira/browse/YARN-8478
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, capacityscheduler
>Reporter: YunFan Zhou
>Assignee: YunFan Zhou
>Priority: Critical
> Attachments: image-2018-06-29-14-08-50-981.png
>
>
> The capacity scheduler logs too frequently, seriously affecting performance.
> Our tests show that the scheduling speed of the capacity scheduler struggles 
> to reach 5000/s in a production scenario, and it quickly hits a logging 
> bottleneck.
> My current work is to change many log statements from INFO to DEBUG level.
> [~wangda] [~leftnoteasy] Any suggestions?
> !image-2018-06-29-14-08-50-981.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8479) The capacity scheduler logs too frequently seriously affecting performance

2018-06-29 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16528543#comment-16528543
 ] 

Wangda Tan commented on YARN-8479:
--

Thanks [~daemon], [~cheersyang], 

Basically, inside the scheduling logic, I suggest moving all scheduler logs that 
are not about allocation/reservation/release to debug level. Otherwise, when 
async scheduling is enabled, it can be very annoying to see such logs.

The most annoying log to me is the one below. It happens on re-reservation: the 
scheduler does lots of re-reservations for the same reserved container, and 
ideally we should only log once.

{code} 
2018-06-14 21:49:33,918 INFO  capacity.CapacityScheduler 
(CapacityScheduler.java:allocateContainerOnSingleNode(1431)) - Trying to 
fulfill reservation for application application_1527807533249_0089 on node: 
ctr-e138-1518143905142-92974-01-09.hwx.site:45454
2018-06-14 21:49:33,918 INFO  capacity.CapacityScheduler 
(CapacityScheduler.java:tryCommit(2794)) - Allocation proposal accepted
2018-06-14 21:49:33,918 INFO  allocator.AbstractContainerAllocator 
(AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(97)) - 
Reserved container  application=application_1527807533249_0089 
resource=
{code}

> The capacity scheduler logs too frequently seriously affecting performance
> --
>
> Key: YARN-8479
> URL: https://issues.apache.org/jira/browse/YARN-8479
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler, capacityscheduler
>Reporter: YunFan Zhou
>Assignee: YunFan Zhou
>Priority: Critical
> Attachments: image-2018-06-29-14-16-06-332.png
>
>
> The capacity scheduler logs too frequently, seriously affecting performance.
> Our tests show that the scheduling speed of the capacity scheduler struggles 
> to reach 5000/s in a production scenario, and it quickly hits a logging 
> bottleneck.
> My current work is to change many log statements from INFO to DEBUG level.
> [~wangda] [~leftnoteasy] Any suggestions?
>  
> !image-2018-06-29-14-16-06-332.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8459) Improve logs of Capacity Scheduler to better debug invalid states

2018-06-29 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16528540#comment-16528540
 ] 

Wangda Tan commented on YARN-8459:
--

Thanks [~bibinchundatt],

Can we move this to YARN-8471? There is a whole bunch of logs we need to 
remove when no new allocation/reservation happens.

> Improve logs of Capacity Scheduler to better debug invalid states
> -
>
> Key: YARN-8459
> URL: https://issues.apache.org/jira/browse/YARN-8459
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.1.0
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Major
> Attachments: YARN-8459.001.patch, YARN-8459.002.patch, 
> YARN-8459.003.patch
>
>
> Improve logs in CS to better debug invalid states



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8453) Additional Unit tests to verify queue limit and max-limit with multiple resource types

2018-06-28 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16526621#comment-16526621
 ] 

Wangda Tan commented on YARN-8453:
--

+1 to the patch, thanks [~sunilg].

> Additional Unit  tests to verify queue limit and max-limit with multiple 
> resource types
> ---
>
> Key: YARN-8453
> URL: https://issues.apache.org/jira/browse/YARN-8453
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.0.2
>Reporter: Sunil Govindan
>Assignee: Sunil Govindan
>Priority: Major
> Attachments: YARN-8453.001.patch
>
>
> After adding support for resource types other than CPU and memory, it is 
> possible that one such new resource has exhausted its quota on a given 
> queue while other resources such as memory/CPU are still available beyond the 
> guaranteed limit (under the max-limit). Adding more unit tests to ensure we are 
> not starving such allocation requests.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8453) Additional Unit tests to verify queue limit and max-limit with multiple resource types

2018-06-27 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8453:
-
Priority: Major  (was: Blocker)

> Additional Unit  tests to verify queue limit and max-limit with multiple 
> resource types
> ---
>
> Key: YARN-8453
> URL: https://issues.apache.org/jira/browse/YARN-8453
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.0.2
>Reporter: Sunil Govindan
>Assignee: Sunil Govindan
>Priority: Major
> Attachments: YARN-8453.001.patch
>
>
> After adding support for resource types other than CPU and memory, it is 
> possible that one such new resource has exhausted its quota on a given 
> queue while other resources such as memory/CPU are still available beyond the 
> guaranteed limit (under the max-limit). Adding more unit tests to ensure we are 
> not starving such allocation requests.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8459) Improve logs of Capacity Scheduler to better debug invalid states

2018-06-27 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8459:
-
Priority: Major  (was: Critical)

> Improve logs of Capacity Scheduler to better debug invalid states
> -
>
> Key: YARN-8459
> URL: https://issues.apache.org/jira/browse/YARN-8459
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.1.0
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Major
> Attachments: YARN-8459.001.patch, YARN-8459.002.patch, 
> YARN-8459.003.patch
>
>
> Improve logs in CS to better debug invalid states



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8459) Improve logs of Capacity Scheduler to better debug invalid states

2018-06-27 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16525519#comment-16525519
 ] 

Wangda Tan commented on YARN-8459:
--

[~cheersyang], 
I came back to check the logic; this should not happen. 
Basically, tryCommit / doneAppAttempt / removeNode all hold the CS write lock. I 
spent some time but could not find the root cause. The logs have rolled, so I cannot 
see the initial state either. I just converted this JIRA to improving the logs and 
downgraded it to critical.

cc: [~gopalv].

> Improve logs of Capacity Scheduler to better debug invalid states
> -
>
> Key: YARN-8459
> URL: https://issues.apache.org/jira/browse/YARN-8459
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.1.0
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Critical
> Attachments: YARN-8459.001.patch, YARN-8459.002.patch, 
> YARN-8459.003.patch
>
>
> Improve logs in CS to better debug invalid states



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8459) Improve logs of Capacity Scheduler to better debug invalid states

2018-06-27 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8459:
-
Description: Improve logs in CS to better debug invalid states  (was: 
Improve logs in CS to better )

> Improve logs of Capacity Scheduler to better debug invalid states
> -
>
> Key: YARN-8459
> URL: https://issues.apache.org/jira/browse/YARN-8459
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.1.0
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Critical
> Attachments: YARN-8459.001.patch, YARN-8459.002.patch, 
> YARN-8459.003.patch
>
>
> Improve logs in CS to better debug invalid states



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8459) Improve logs of Capacity Scheduler to better debug invalid states

2018-06-27 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8459:
-
Attachment: YARN-8459.003.patch

> Improve logs of Capacity Scheduler to better debug invalid states
> -
>
> Key: YARN-8459
> URL: https://issues.apache.org/jira/browse/YARN-8459
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.1.0
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Critical
> Attachments: YARN-8459.001.patch, YARN-8459.002.patch, 
> YARN-8459.003.patch
>
>
> Improve logs in CS to better debug invalid states



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8459) Improve logs of Capacity Scheduler to better debug invalid states

2018-06-27 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8459:
-
Description: Improve logs in CS to better   (was: Thanks [~gopalv] for 
reporting this issue. 

In async mode, capacity scheduler can allocate/reserve containers on node/app 
when node/app is being removed ({{doneApplicationAttempt}}/{{removeNode}}).

This will cause some issues, for example.

a. Container for app_1 reserved on node_x.
b. At the same time, app_1 is being removed.
c. Reserve on node operation finished after app_1 removed 
({{doneApplicationAttempt}}). 

For all future runs, node_x is completely blocked by the invalid 
reservation. It keeps reporting "Trying to schedule for a finished app, please 
double check" for node_x.

We need a fix to make sure this won't happen.)

> Improve logs of Capacity Scheduler to better debug invalid states
> -
>
> Key: YARN-8459
> URL: https://issues.apache.org/jira/browse/YARN-8459
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.1.0
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Critical
> Attachments: YARN-8459.001.patch, YARN-8459.002.patch, 
> YARN-8459.003.patch
>
>
> Improve logs in CS to better 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8459) Improve logs of Capacity Scheduler to better debug invalid states

2018-06-27 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8459:
-
Summary: Improve logs of Capacity Scheduler to better debug invalid states  
(was: Capacity Scheduler should properly handle container allocation on 
app/node when app/node being removed by scheduler)

> Improve logs of Capacity Scheduler to better debug invalid states
> -
>
> Key: YARN-8459
> URL: https://issues.apache.org/jira/browse/YARN-8459
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.1.0
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Blocker
> Attachments: YARN-8459.001.patch, YARN-8459.002.patch
>
>
> Thanks [~gopalv] for reporting this issue. 
> In async mode, capacity scheduler can allocate/reserve containers on node/app 
> when node/app is being removed ({{doneApplicationAttempt}}/{{removeNode}}).
> This will cause some issues, for example.
> a. Container for app_1 reserved on node_x.
> b. At the same time, app_1 is being removed.
> c. Reserve on node operation finished after app_1 removed 
> ({{doneApplicationAttempt}}). 
> For all future runs, node_x is completely blocked by the invalid 
> reservation. It keeps reporting "Trying to schedule for a finished app, please 
> double check" for node_x.
> We need a fix to make sure this won't happen.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8459) Improve logs of Capacity Scheduler to better debug invalid states

2018-06-27 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8459:
-
Priority: Critical  (was: Blocker)

> Improve logs of Capacity Scheduler to better debug invalid states
> -
>
> Key: YARN-8459
> URL: https://issues.apache.org/jira/browse/YARN-8459
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.1.0
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Critical
> Attachments: YARN-8459.001.patch, YARN-8459.002.patch
>
>
> Thanks [~gopalv] for reporting this issue. 
> In async mode, capacity scheduler can allocate/reserve containers on node/app 
> when node/app is being removed ({{doneApplicationAttempt}}/{{removeNode}}).
> This will cause some issues, for example.
> a. Container for app_1 reserved on node_x.
> b. At the same time, app_1 is being removed.
> c. Reserve on node operation finished after app_1 removed 
> ({{doneApplicationAttempt}}). 
> For all future runs, node_x is completely blocked by the invalid 
> reservation. It keeps reporting "Trying to schedule for a finished app, please 
> double check" for node_x.
> We need a fix to make sure this won't happen.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8379) Add an option to allow Capacity Scheduler preemption to balance satisfied queues

2018-06-27 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16525328#comment-16525328
 ] 

Wangda Tan commented on YARN-8379:
--

bq. we could definitely make a method inside PreemptionCandidatesSelector, and 
call it explicitly to reset curCandidates per round, but this way it makes the 
code even harder to read. Any better suggestions here?
Can we simply create a new curCandidates map inside {{selectCandidates}} for each 
selector? 

bq. This test case was intend to demonstrate selected candidates will be 
actually killed after custom timeout was reached. This part of code is the 
intention.
What I can see from the UT is: queue1 gets all containers (39G) and queue2 asks for 
a 4G container; after the wait, 4G of containers will be preempted from queue1. I 
think our purpose is: both queue1 / queue2 are overutilized, we need to balance 
resources from queue1 to queue2, and only after X secs should containers from queue1 
be preempted. Correct? It would be similar to the example 
{{testPreemptionToBalanceUsedPlusPendingLessThanGuaranteed}}.
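
A rough sketch of the curCandidates suggestion above (the types and signature here are simplified placeholders, not the real PreemptionCandidatesSelector API): each selector creates its per-round map internally instead of having the caller pass it in.

{code}
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class CandidatesSelectorSketch {

  /** Simplified stand-in for one preemption-candidates selection round. */
  Map<String, Set<String>> selectCandidates(
      Map<String, Set<String>> selectedCandidates) {
    // Per-round candidates are created here, inside the selector, so callers
    // no longer have to keep a second map in sync.
    Map<String, Set<String>> curCandidates = new HashMap<>();

    // ... real selection logic would populate curCandidates here ...
    curCandidates.computeIfAbsent("app_1", k -> new HashSet<>())
        .add("container_1");

    // Merge this round's picks into the cross-selector accumulator.
    curCandidates.forEach((app, containers) ->
        selectedCandidates.computeIfAbsent(app, k -> new HashSet<>())
            .addAll(containers));
    return curCandidates;
  }
}
{code}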

> Add an option to allow Capacity Scheduler preemption to balance satisfied 
> queues
> 
>
> Key: YARN-8379
> URL: https://issues.apache.org/jira/browse/YARN-8379
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Wangda Tan
>Assignee: Zian Chen
>Priority: Major
> Attachments: YARN-8379.001.patch, YARN-8379.002.patch, 
> YARN-8379.003.patch, YARN-8379.004.patch, YARN-8379.005.patch, 
> ericpayne.confs.tgz
>
>
> The existing capacity scheduler only supports preemption to bring an underutilized 
> queue up to its guaranteed resource. In addition to that, there is a 
> requirement to get a better balance between queues when all of them have reached their 
> guaranteed resource but with different degrees of fairness.
> An example: 3 queues with capacities queue_a = 30%, queue_b = 30%, queue_c 
> = 40%. At time T, queue_a is using 30%, queue_b is using 70%. Existing 
> scheduler preemption won't happen. But this is unfair to queue_a since 
> queue_a has the same guaranteed resources.
> Before YARN-5864, the capacity scheduler did additional preemption to balance 
> queues. We changed the logic since it could preempt too many containers 
> between queues when all queues are satisfied.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8379) Add an option to allow Capacity Scheduler preemption to balance satisfied queues

2018-06-26 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16524583#comment-16524583
 ] 

Wangda Tan commented on YARN-8379:
--

[~Zian Chen], 

Thanks for updating the patch,

Few comments: 

1) testPreemptionToBalanceWithCustomTimeout is better moved to a separate 
class (maybe something like TestCapacitySchedulerQueueBalancePreemption). The 
test does not look like it is testing this feature, could you check it? I might have 
misunderstood what you did here. 

2) For the interface of {{selectCandidates}}, I think we can avoid passing the 
curCandidates, correct? According to the semantics of curCandidates, it should be 
the candidates selected *within the selector*. 

An additional comment:
- Now all selectors need to update two maps, curCandidates and 
selectedCandidates, which causes confusion, and developers could forget to update 
both of them in some cases. Instead of doing this, I think we should refactor 
this part of the code to simplify the logic. This can be done in a separate JIRA. 
[~Zian Chen], could you create a JIRA for this?

> Add an option to allow Capacity Scheduler preemption to balance satisfied 
> queues
> 
>
> Key: YARN-8379
> URL: https://issues.apache.org/jira/browse/YARN-8379
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Wangda Tan
>Assignee: Zian Chen
>Priority: Major
> Attachments: YARN-8379.001.patch, YARN-8379.002.patch, 
> YARN-8379.003.patch, YARN-8379.004.patch, YARN-8379.005.patch, 
> ericpayne.confs.tgz
>
>
> The existing capacity scheduler only supports preemption to bring an underutilized 
> queue up to its guaranteed resource. In addition to that, there is a 
> requirement to get a better balance between queues when all of them have reached their 
> guaranteed resource but with different degrees of fairness.
> An example: 3 queues with capacities queue_a = 30%, queue_b = 30%, queue_c 
> = 40%. At time T, queue_a is using 30%, queue_b is using 70%. Existing 
> scheduler preemption won't happen. But this is unfair to queue_a since 
> queue_a has the same guaranteed resources.
> Before YARN-5864, the capacity scheduler did additional preemption to balance 
> queues. We changed the logic since it could preempt too many containers 
> between queues when all queues are satisfied.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8464) Async scheduling thread could be interrupted when there are no NodeManagers in cluster

2018-06-26 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8464:
-
Fix Version/s: 3.2.0

> Async scheduling thread could be interrupted when there are no NodeManagers 
> in cluster
> --
>
> Key: YARN-8464
> URL: https://issues.apache.org/jira/browse/YARN-8464
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Charan Hebri
>Assignee: Sunil Govindan
>Priority: Blocker
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-8464.001.patch, YARN-8464.002.patch
>
>
> Test scenario:
> 1. Make either yarn.nodemanager.log-dirs/yarn.nodemanager.local-dirs read-only
> 2. Restart NMs via Ambari, none of them show up in the RM UI as expected
> 3. Revert back the read-only dirs and restart NMs
> 4. Include a non-existent dir in either 
> yarn.nodemanager.log-dirs/yarn.nodemanager.local-dirs (1 good existing dir + 
> 1 non-existing dir)
> 5. Restart NMs via Ambari, all NMs show as RUNNING with a Health Report 
> message as expected
> 6. Submit a MapReduce sleep job, job goes into ACCEPTED state
> 7. Job stays in ACCEPTED state forever even though all NMs are running and 
> have available memory
>  
> Credits to [~charanh] who found this issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8466) Add Chaos Monkey unit test framework for feature validation in scale

2018-06-26 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16524485#comment-16524485
 ] 

Wangda Tan commented on YARN-8466:
--

Thanks [~cheersyang], actually this JIRA is inspired by the distributed chaos 
monkey framework you mentioned offline. 

For the UT-like binary, the benefit is that we can really run smoke tests in a 
self-contained way. Without any environment setup, we can do sanity tests in 
minutes, and the mock framework allows starting/stopping apps/nodes really fast. 

I can definitely see the value of a distributed chaos monkey framework. If we 
can make the test easy to run, it will be super useful to run before any 
release! 

[~sunilg], 
To me, the UT does not necessarily have to use the same code base as the distributed 
one (ideally it would share the same one, but in practice that could be hard). 

> Add Chaos Monkey unit test framework for feature validation in scale
> 
>
> Key: YARN-8466
> URL: https://issues.apache.org/jira/browse/YARN-8466
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Wangda Tan
>Priority: Critical
> Attachments: YARN-8466.poc.001.patch
>
>
> Currently we don't have such a framework for testing. 
> We need a framework to do this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8459) Capacity Scheduler should properly handle container allocation on app/node when app/node being removed by scheduler

2018-06-26 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16524472#comment-16524472
 ] 

Wangda Tan commented on YARN-8459:
--

Thanks [~sunilg], 

Addressed #1. For #2, it is required since we need to revert changes in 
previous commonReserve.

> Capacity Scheduler should properly handle container allocation on app/node 
> when app/node being removed by scheduler
> ---
>
> Key: YARN-8459
> URL: https://issues.apache.org/jira/browse/YARN-8459
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.1.0
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Blocker
> Attachments: YARN-8459.001.patch, YARN-8459.002.patch
>
>
> Thanks [~gopalv] for reporting this issue. 
> In async mode, capacity scheduler can allocate/reserve containers on node/app 
> when node/app is being removed ({{doneApplicationAttempt}}/{{removeNode}}).
> This will cause some issues, for example.
> a. Container for app_1 reserved on node_x.
> b. At the same time, app_1 is being removed.
> c. Reserve on node operation finished after app_1 removed 
> ({{doneApplicationAttempt}}). 
> For all future runs, node_x is completely blocked by the invalid 
> reservation. It keeps reporting "Trying to schedule for a finished app, please 
> double check" for node_x.
> We need a fix to make sure this won't happen.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8459) Capacity Scheduler should properly handle container allocation on app/node when app/node being removed by scheduler

2018-06-26 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8459:
-
Attachment: YARN-8459.002.patch

> Capacity Scheduler should properly handle container allocation on app/node 
> when app/node being removed by scheduler
> ---
>
> Key: YARN-8459
> URL: https://issues.apache.org/jira/browse/YARN-8459
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.1.0
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Blocker
> Attachments: YARN-8459.001.patch, YARN-8459.002.patch
>
>
> Thanks [~gopalv] for reporting this issue. 
> In async mode, capacity scheduler can allocate/reserve containers on node/app 
> when node/app is being removed ({{doneApplicationAttempt}}/{{removeNode}}).
> This will cause some issues, for example.
> a. Container for app_1 reserved on node_x.
> b. At the same time, app_1 is being removed.
> c. Reserve on node operation finished after app_1 removed 
> ({{doneApplicationAttempt}}). 
> For all future runs, node_x is completely blocked by the invalid 
> reservation. It keeps reporting "Trying to schedule for a finished app, please 
> double check" for node_x.
> We need a fix to make sure this won't happen.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8466) Add Chaos Monkey unit test framework for feature validation in scale

2018-06-26 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8466:
-
Summary: Add Chaos Monkey unit test framework for feature validation in 
scale  (was: Add Chaos Monkey unit test framework for validation in scale)

> Add Chaos Monkey unit test framework for feature validation in scale
> 
>
> Key: YARN-8466
> URL: https://issues.apache.org/jira/browse/YARN-8466
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Wangda Tan
>Priority: Critical
> Attachments: YARN-8466.poc.001.patch
>
>
> Currently we don't have such a framework for testing. 
> We need a framework to do this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8466) Add Chaos Monkey unit test framework for validation in scale

2018-06-26 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16524411#comment-16524411
 ] 

Wangda Tan commented on YARN-8466:
--

And btw: this is interesting work, but I may not have the bandwidth to finish 
it in a comprehensive way. If anybody is interested in working on the whole 
feature or a sub-feature, please let me know so that we can coordinate :).

> Add Chaos Monkey unit test framework for validation in scale
> 
>
> Key: YARN-8466
> URL: https://issues.apache.org/jira/browse/YARN-8466
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Wangda Tan
>Priority: Critical
> Attachments: YARN-8466.poc.001.patch
>
>
> Currently we don't have such a framework for testing. 
> We need a framework to do this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8466) Add Chaos Monkey unit test framework for validation in scale

2018-06-26 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8466:
-
Attachment: YARN-8466.poc.001.patch

> Add Chaos Monkey unit test framework for validation in scale
> 
>
> Key: YARN-8466
> URL: https://issues.apache.org/jira/browse/YARN-8466
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Wangda Tan
>Priority: Critical
> Attachments: YARN-8466.poc.001.patch
>
>
> Currently we don't have such a framework for testing. 
> We need a framework to do this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8466) Add Chaos Monkey unit test framework for validation in scale

2018-06-26 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16524409#comment-16524409
 ] 

Wangda Tan commented on YARN-8466:
--

Added a prototype which includes example chaos monkey tests for schedulers.

Future work includes: 
1) Do validation after the test.
2) Leverage the Invariance checker.

Adding folks who might be interested: [~curino], [~cheersyang], [~Tao Yang], 
[~sunil.gov...@gmail.com], [~kkaranasos], [~jlowe]. 



> Add Chaos Monkey unit test framework for validation in scale
> 
>
> Key: YARN-8466
> URL: https://issues.apache.org/jira/browse/YARN-8466
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Wangda Tan
>Priority: Critical
>
> Currently we don't have such a framework for testing. 
> We need a framework to do this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-8466) Add Chaos Monkey unit test framework for validation in scale

2018-06-26 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-8466:


 Summary: Add Chaos Monkey unit test framework for validation in 
scale
 Key: YARN-8466
 URL: https://issues.apache.org/jira/browse/YARN-8466
 Project: Hadoop YARN
  Issue Type: Task
Reporter: Wangda Tan


Currently we don't have such a framework for testing. 

We need a framework to do this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8464) Async scheduling thread could be interrupted when there are no NodeManagers in cluster

2018-06-26 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16524322#comment-16524322
 ] 

Wangda Tan commented on YARN-8464:
--

Patch LGTM, thanks [~sunilg], will commit today if no objections.

> Async scheduling thread could be interrupted when there are no NodeManagers 
> in cluster
> --
>
> Key: YARN-8464
> URL: https://issues.apache.org/jira/browse/YARN-8464
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Charan Hebri
>Assignee: Sunil Govindan
>Priority: Critical
> Attachments: YARN-8464.001.patch, YARN-8464.002.patch
>
>
> Test scenario:
> 1. Make either yarn.nodemanager.log-dirs/yarn.nodemanager.local-dirs read-only
> 2. Restart NMs via Ambari, none of them show up in the RM UI as expected
> 3. Revert back the read-only dirs and restart NMs
> 4. Include a non-existent dir in either 
> yarn.nodemanager.log-dirs/yarn.nodemanager.local-dirs (1 good existing dir + 
> 1 non-existing dir)
> 5. Restart NMs via Ambari, all NMs show as RUNNING with a Health Report 
> message as expected
> 6. Submit a MapReduce sleep job, job goes into ACCEPTED state
> 7. Job stays in ACCEPTED state forever even though all NMs are running and 
> have available memory
>  
> Credits to [~charanh] who found this issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8464) Async scheduling thread could be interrupted when there are no NodeManagers in cluster

2018-06-26 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8464:
-
Priority: Blocker  (was: Critical)

> Async scheduling thread could be interrupted when there are no NodeManagers 
> in cluster
> --
>
> Key: YARN-8464
> URL: https://issues.apache.org/jira/browse/YARN-8464
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Charan Hebri
>Assignee: Sunil Govindan
>Priority: Blocker
> Attachments: YARN-8464.001.patch, YARN-8464.002.patch
>
>
> Test scenario:
> 1. Make either yarn.nodemanager.log-dirs/yarn.nodemanager.local-dirs read-only
> 2. Restart NMs via Ambari, none of them show up in the RM UI as expected
> 3. Revert back the read-only dirs and restart NMs
> 4. Include a non-existent dir in either 
> yarn.nodemanager.log-dirs/yarn.nodemanager.local-dirs (1 good existing dir + 
> 1 non-existing dir)
> 5. Restart NMs via Ambari, all NMs show as RUNNING with a Health Report 
> message as expected
> 6. Submit a MapReduce sleep job, job goes into ACCEPTED state
> 7. Job stays in ACCEPTED state forever even though all NMs are running and 
> have available memory
>  
> Credits to [~charanh] who found this issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8464) Application does not get to Running state even with available resources on node managers when async scheduling is enabled

2018-06-26 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16524309#comment-16524309
 ] 

Wangda Tan commented on YARN-8464:
--

[~sunilg], mind updating the description/title to reflect the root cause?

{code}
526 // we can return from here itself.
527 if(nodes.size() == 0) {
528   return;
529 }
530 int start = random.nextInt(nodes.size());
{code}

The second nodes.size() call will cause an issue if a race condition happens. 
Instead, you should cache the first value.
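In other words, a sketch of the safer pattern (paraphrasing the snippet above, 
not the exact patch):

{code:java}
// Sketch only: capture the size once, so the later nextInt() call cannot
// observe a different (possibly zero) size if nodes changes concurrently.
int numNodes = nodes.size();
if (numNodes == 0) {
  return;
}
int start = random.nextInt(numNodes);
{code}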

> Application does not get to Running state even with available resources on 
> node managers when async scheduling is enabled
> -
>
> Key: YARN-8464
> URL: https://issues.apache.org/jira/browse/YARN-8464
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Charan Hebri
>Assignee: Sunil Govindan
>Priority: Critical
> Attachments: YARN-8464.001.patch
>
>
> Test scenario:
> 1. Make either yarn.nodemanager.log-dirs/yarn.nodemanager.local-dirs read-only
> 2. Restart NMs via Ambari, none of them show up in the RM UI as expected
> 3. Revert back the read-only dirs and restart NMs
> 4. Include a non-existent dir in either 
> yarn.nodemanager.log-dirs/yarn.nodemanager.local-dirs (1 good existing dir + 
> 1 non-existing dir)
> 5. Restart NMs via Ambari, all NMs show as RUNNING with a Health Report 
> message as expected
> 6. Submit a MapReduce sleep job, job goes into ACCEPTED state
> 7. Job stays in ACCEPTED state forever even though all NMs are running and 
> have available memory
>  
> Credits to [~charanh] who found this issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-1013) CS should watch resource utilization of containers and allocate speculative containers if appropriate

2018-06-26 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-1013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16524225#comment-16524225
 ] 

Wangda Tan commented on YARN-1013:
--

Thanks [~haibochen] for explanations,

bq. we are trying to just handle G resource requests with their enforcement 
flag set to false
This is the part I don't quite understand: where is the enforcement flag? Is it 
per app, per request, or global? 

bq. but the fair scheduler implementation (YARN-1015) tries to take into 
account of queue weight ...
Does this consider resource usage of O containers, or does it only consider G 
container usage?

> CS should watch resource utilization of containers and allocate speculative 
> containers if appropriate
> -
>
> Key: YARN-1013
> URL: https://issues.apache.org/jira/browse/YARN-1013
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Arun C Murthy
>Assignee: Weiwei Yang
>Priority: Major
>
> CS should watch resource utilization of containers (provided by NM in 
> heartbeat) and allocate speculative containers (at lower OS priority) if 
> appropriate.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-1013) CS should watch resource utilization of containers and allocate speculative containers if appropriate

2018-06-26 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-1013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16524101#comment-16524101
 ] 

Wangda Tan commented on YARN-1013:
--

Just took a very quick look at YARN-1015. IIUC, the scheduler allocates O 
containers when a node uses more than its guaranteed resources.

In my mind, the problem with this approach is that it cannot guarantee the 
allocated containers satisfy the user's requirements. It doesn't check the 
getExecutionTypeRequest of the user's ResourceRequest, and it doesn't consider 
each app's pending O resource requests, the queue's pending O resource 
requests, etc. What if the user doesn't want O containers? Similarly, YARN-6794 
randomly promotes O containers even if the user doesn't care about the 
container execution type.

The syntax of YARN-8178 is much simpler: an application can avoid getting O 
containers if the resource request is not preemptable. I like the proposal from 
[~curino] that we should add a flag to indicate a resource request is 
Guaranteed and non-preemptable. Once we have that, we can get G containers even 
if the queue is preemptable.
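For reference, a minimal sketch of how a per-request preference can already be 
expressed with the existing ExecutionTypeRequest records (assuming the current 
Hadoop 3.x record signatures; the wrapper class and resource sizes below are 
made up for illustration):

{code:java}
import org.apache.hadoop.yarn.api.records.ExecutionType;
import org.apache.hadoop.yarn.api.records.ExecutionTypeRequest;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.api.records.ResourceRequest;

public class GuaranteedOnlyRequestSketch {
  // Build a request for GUARANTEED containers with the execution type
  // enforced, so the scheduler should not substitute OPPORTUNISTIC ones.
  public static ResourceRequest newGuaranteedOnlyRequest(int numContainers) {
    ResourceRequest request = ResourceRequest.newInstance(
        Priority.newInstance(1), ResourceRequest.ANY,
        Resource.newInstance(2048, 1), numContainers);
    request.setExecutionTypeRequest(
        ExecutionTypeRequest.newInstance(ExecutionType.GUARANTEED,
            true /* enforce the execution type */));
    return request;
  }
}
{code}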

Considering that all CS features (user-limit, node partition, application 
priority, queue priority) may interact with O containers, I'm not sure how much 
effort is required to cleanly support this in CS. Simply porting YARN-1015 to 
CS seems like an oversimplification to me.

> CS should watch resource utilization of containers and allocate speculative 
> containers if appropriate
> -
>
> Key: YARN-1013
> URL: https://issues.apache.org/jira/browse/YARN-1013
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Arun C Murthy
>Assignee: Weiwei Yang
>Priority: Major
>
> CS should watch resource utilization of containers (provided by NM in 
> heartbeat) and allocate speculative containers (at lower OS priority) if 
> appropriate.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8459) Capacity Scheduler should properly handle container allocation on app/node when app/node being removed by scheduler

2018-06-26 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16524057#comment-16524057
 ] 

Wangda Tan commented on YARN-8459:
--

[~cheersyang], 

According to our current locking design of CapacityScheduler:
1) Add/remove node/app requires CS lock. 
2) Allocate/release container acquires only the app/node/queue locks, for 
better performance. 

The simplest solution is to put allocate/release container under the CS lock, 
but that would cause a performance regression. Adding a stopping flag to the 
app/node seems like the cleanest solution in my mind; please share if you have 
a better idea.

[~sunilg], 

My intention is to put setStopping under the app/node lock instead of using 
volatile. We don't want a node to be allocating a container while another 
thread is trying to remove that node.
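To make this concrete, a minimal sketch of a stopping flag guarded by the 
node's write lock (class, field, and method names are illustrative, not the 
actual patch):

{code:java}
import java.util.concurrent.locks.ReentrantReadWriteLock;

/**
 * Illustrative sketch of the "stopping flag under the node lock" idea;
 * names are not from the actual patch.
 */
public class NodeStoppingFlagSketch {
  private final ReentrantReadWriteLock.WriteLock writeLock =
      new ReentrantReadWriteLock().writeLock();
  private boolean stopping = false; // guarded by writeLock, not volatile

  /** Called by the removeNode() path, which already holds the CS lock. */
  public void markStopping() {
    writeLock.lock();
    try {
      stopping = true;
    } finally {
      writeLock.unlock();
    }
  }

  /** Called by the async allocation threads. */
  public boolean tryAllocate(Runnable doAllocate) {
    writeLock.lock();
    try {
      if (stopping) {
        // Node is being removed; refuse new allocations/reservations.
        return false;
      }
      // Allocation happens under the same lock that guards the flag, so the
      // flag cannot flip in the middle of an ongoing allocation.
      doAllocate.run();
      return true;
    } finally {
      writeLock.unlock();
    }
  }
}
{code}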

> Capacity Scheduler should properly handle container allocation on app/node 
> when app/node being removed by scheduler
> ---
>
> Key: YARN-8459
> URL: https://issues.apache.org/jira/browse/YARN-8459
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.1.0
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Blocker
> Attachments: YARN-8459.001.patch
>
>
> Thanks [~gopalv] for reporting this issue. 
> In async mode, capacity scheduler can allocate/reserve containers on node/app 
> when node/app is being removed ({{doneApplicationAttempt}}/{{removeNode}}).
> This will cause some issues, for example.
> a. Container for app_1 reserved on node_x.
> b. At the same time, app_1 is being removed.
> c. Reserve on node operation finished after app_1 removed 
> ({{doneApplicationAttempt}}). 
> For all the future runs, the node_x is completely blocked by the invalid 
> reservation. It keep reporting "Trying to schedule for a finished app, please 
> double check" for the node_x.
> We need a fix to make sure this won't happen.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8462) Resource Manager shutdown with FATAL Exception

2018-06-26 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16524042#comment-16524042
 ] 

Wangda Tan commented on YARN-8462:
--

[~jlowe], it seems this issue is fixed by YARN-8193 already.

> Resource Manager shutdown with FATAL Exception
> --
>
> Key: YARN-8462
> URL: https://issues.apache.org/jira/browse/YARN-8462
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, resourcemanager
>Affects Versions: 2.9.0
>Reporter: Amithsha
>Priority: Critical
>
> Intermediately Resource manager going down with following exceptions 
>  
> 2018-06-25 15:24:30,572 FATAL event.EventDispatcher 
> (EventDispatcher.java:run(75)) - Error in handling event type NODE_UPDATE to 
> the Event Dispatcher
> java.lang.NullPointerException
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.getLocalityWaitFactor(RegularContainerAllocator.java:268)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.canAssign(RegularContainerAllocator.java:315)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignOffSwitchContainers(RegularContainerAllocator.java:388)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignContainersOnNode(RegularContainerAllocator.java:469)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.tryAllocateOnNode(RegularContainerAllocator.java:250)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.allocate(RegularContainerAllocator.java:819)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignContainers(RegularContainerAllocator.java:857)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.ContainerAllocator.assignContainers(ContainerAllocator.java:55)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.assignContainers(FiCaSchedulerApp.java:868)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:1121)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:734)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:558)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:734)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:558)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:734)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:558)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1338)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1333)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1422)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1197)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:1059)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1464)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:150)
>         at 
> org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66)
>         at java.lang.Thread.run(Thread.java:745)
> 2018-06-25 15:24:30,573 INFO  event.EventDispatcher 
> (EventDispatcher.java:run(79)) - Exiting, bbye..
> 2018-06-25 15:24:30,579 ERROR delegation.AbstractDelegationTokenSecretManager 
> (AbstractDelegationTokenSecretManager.java:run(690)) - ExpiredTokenRemover 
> received java.lang.InterruptedException: sleep interrupted
>  
> Before the build we applied t

[jira] [Commented] (YARN-8423) GPU does not get released even though the application gets killed.

2018-06-25 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16523208#comment-16523208
 ] 

Wangda Tan commented on YARN-8423:
--

+1, thanks [~sunilg]. Could you create a JIRA to add tests? Let's get this in 
first.

> GPU does not get released even though the application gets killed.
> --
>
> Key: YARN-8423
> URL: https://issues.apache.org/jira/browse/YARN-8423
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Sumana Sathish
>Assignee: Sunil Govindan
>Priority: Critical
> Attachments: YARN-8423.001.patch, YARN-8423.002.patch, 
> YARN-8423.003.patch, kill-container-nm.log
>
>
> Run an Tensor flow app requesting one GPU.
> Kill the application once the GPU is allocated
> Query the nodemanger once the application is killed.We see that GPU is not 
> being released.
> {code}
>  curl -i /ws/v1/node/resources/yarn.io%2Fgpu
> {"gpuDeviceInformation":{"gpus":[{"productName":"","uuid":"GPU-","minorNumber":0,"gpuUtilizations":{"overallGpuUtilization":0.0},"gpuMemoryUsage":{"usedMemoryMiB":73,"availMemoryMiB":12125,"totalMemoryMiB":12198},"temperature":{"currentGpuTemp":28.0,"maxGpuTemp":85.0,"slowThresholdGpuTemp":82.0}},{"productName":"","uuid":"GPU-","minorNumber":1,"gpuUtilizations":{"overallGpuUtilization":0.0},"gpuMemoryUsage":{"usedMemoryMiB":73,"availMemoryMiB":12125,"totalMemoryMiB":12198},"temperature":{"currentGpuTemp":28.0,"maxGpuTemp":85.0,"slowThresholdGpuTemp":82.0}}],"driverVersion":""},"totalGpuDevices":[{"index":0,"minorNumber":0},{"index":1,"minorNumber":1}],"assignedGpuDevices":[{"index":0,"minorNumber":0,"containerId":"container_"}]}
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8459) Capacity Scheduler should properly handle container allocation on app/node when app/node being removed by scheduler

2018-06-25 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16522972#comment-16522972
 ] 

Wangda Tan commented on YARN-8459:
--

Attached ver.1 patch to run Jenkins. I felt it might not be straightforward to 
add tests; we would need a lot of mocking. I'm thinking of adding a 
chaos-monkey-like UT that just randomly starts/stops nodes/apps. We should be 
able to get some interesting results from that. 

Will update ver.2 patch with tests. 

cc: [~sunil.gov...@gmail.com], [~Tao Yang], [~cheersyang]. 

> Capacity Scheduler should properly handle container allocation on app/node 
> when app/node being removed by scheduler
> ---
>
> Key: YARN-8459
> URL: https://issues.apache.org/jira/browse/YARN-8459
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.1.0
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Blocker
> Attachments: YARN-8459.001.patch
>
>
> Thanks [~gopalv] for reporting this issue. 
> In async mode, capacity scheduler can allocate/reserve containers on node/app 
> when node/app is being removed ({{doneApplicationAttempt}}/{{removeNode}}).
> This will cause some issues, for example.
> a. Container for app_1 reserved on node_x.
> b. At the same time, app_1 is being removed.
> c. Reserve on node operation finished after app_1 removed 
> ({{doneApplicationAttempt}}). 
> For all the future runs, the node_x is completely blocked by the invalid 
> reservation. It keep reporting "Trying to schedule for a finished app, please 
> double check" for the node_x.
> We need a fix to make sure this won't happen.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8459) Capacity Scheduler should properly handle container allocation on app/node when app/node being removed by scheduler

2018-06-25 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8459:
-
Attachment: YARN-8459.001.patch

> Capacity Scheduler should properly handle container allocation on app/node 
> when app/node being removed by scheduler
> ---
>
> Key: YARN-8459
> URL: https://issues.apache.org/jira/browse/YARN-8459
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.1.0
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Blocker
> Attachments: YARN-8459.001.patch
>
>
> Thanks [~gopalv] for reporting this issue. 
> In async mode, capacity scheduler can allocate/reserve containers on node/app 
> when node/app is being removed ({{doneApplicationAttempt}}/{{removeNode}}).
> This will cause some issues, for example.
> a. Container for app_1 reserved on node_x.
> b. At the same time, app_1 is being removed.
> c. Reserve on node operation finished after app_1 removed 
> ({{doneApplicationAttempt}}). 
> For all the future runs, the node_x is completely blocked by the invalid 
> reservation. It keep reporting "Trying to schedule for a finished app, please 
> double check" for the node_x.
> We need a fix to make sure this won't happen.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-8459) Capacity Scheduler should properly handle container allocation on app/node when app/node being removed by scheduler

2018-06-25 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-8459:


 Summary: Capacity Scheduler should properly handle container 
allocation on app/node when app/node being removed by scheduler
 Key: YARN-8459
 URL: https://issues.apache.org/jira/browse/YARN-8459
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Wangda Tan
Assignee: Wangda Tan


Thanks [~gopalv] for reporting this issue. 

In async mode, capacity scheduler can allocate/reserve containers on node/app 
when node/app is being removed ({{doneApplicationAttempt}}/{{removeNode}}).

This will cause some issues, for example.

a. Container for app_1 reserved on node_x.
b. At the same time, app_1 is being removed.
c. Reserve on node operation finished after app_1 removed 
({{doneApplicationAttempt}}). 

For all future runs, node_x is completely blocked by the invalid reservation. 
It keeps reporting "Trying to schedule for a finished app, please double check" 
for node_x.

We need a fix to make sure this won't happen.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8459) Capacity Scheduler should properly handle container allocation on app/node when app/node being removed by scheduler

2018-06-25 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8459:
-
Affects Version/s: 3.1.0
 Target Version/s: 3.1.1
 Priority: Blocker  (was: Major)
  Component/s: capacity scheduler

> Capacity Scheduler should properly handle container allocation on app/node 
> when app/node being removed by scheduler
> ---
>
> Key: YARN-8459
> URL: https://issues.apache.org/jira/browse/YARN-8459
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.1.0
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Blocker
>
> Thanks [~gopalv] for reporting this issue. 
> In async mode, capacity scheduler can allocate/reserve containers on node/app 
> when node/app is being removed ({{doneApplicationAttempt}}/{{removeNode}}).
> This will cause some issues, for example.
> a. Container for app_1 reserved on node_x.
> b. At the same time, app_1 is being removed.
> c. Reserve on node operation finished after app_1 removed 
> ({{doneApplicationAttempt}}). 
> For all the future runs, the node_x is completely blocked by the invalid 
> reservation. It keep reporting "Trying to schedule for a finished app, please 
> double check" for the node_x.
> We need a fix to make sure this won't happen.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8453) Allocation to a queue is dishonored if one resource is at the limit

2018-06-25 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8453:
-
Target Version/s: 3.1.1, 3.0.4
Priority: Blocker  (was: Major)

> Allocation to a queue is dishonored if one resource is at the limit
> ---
>
> Key: YARN-8453
> URL: https://issues.apache.org/jira/browse/YARN-8453
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.0.2
>Reporter: Sunil Govindan
>Assignee: Sunil Govindan
>Priority: Blocker
>
> Post support of additional resource types other then CPU and Memory, it could 
> be possible that one such new resource is exhausted its quota on a given 
> queue. But other resources such as Memory / CPU is still there beyond its 
> guaranteed limit (under max-limit). However as new resource is exhausted, 
> still containers will be failed to get that delta resources (cpu and memory). 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8220) Running Tensorflow on YARN with GPU and Docker - Examples

2018-06-25 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16522897#comment-16522897
 ] 

Wangda Tan commented on YARN-8220:
--

Attached ver.4 patch; removed duplicated content inside the Dockerfiles and 
made them build from base images.

> Running Tensorflow on YARN with GPU and Docker - Examples
> -
>
> Key: YARN-8220
> URL: https://issues.apache.org/jira/browse/YARN-8220
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn-native-services
>Reporter: Sunil Govindan
>Assignee: Sunil Govindan
>Priority: Critical
> Attachments: YARN-8220.001.patch, YARN-8220.002.patch, 
> YARN-8220.003.patch, YARN-8220.004.patch
>
>
> Tensorflow could be run on YARN and could leverage YARN's distributed 
> features.
> This spec fill will help to run Tensorflow on yarn with GPU/docker



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8220) Running Tensorflow on YARN with GPU and Docker - Examples

2018-06-25 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8220:
-
Attachment: YARN-8220.004.patch

> Running Tensorflow on YARN with GPU and Docker - Examples
> -
>
> Key: YARN-8220
> URL: https://issues.apache.org/jira/browse/YARN-8220
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn-native-services
>Reporter: Sunil Govindan
>Assignee: Sunil Govindan
>Priority: Critical
> Attachments: YARN-8220.001.patch, YARN-8220.002.patch, 
> YARN-8220.003.patch, YARN-8220.004.patch
>
>
> Tensorflow could be run on YARN and could leverage YARN's distributed 
> features.
> This spec fill will help to run Tensorflow on yarn with GPU/docker



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8220) Running Tensorflow on YARN with GPU and Docker - Examples

2018-06-25 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16522855#comment-16522855
 ] 

Wangda Tan commented on YARN-8220:
--

Attached ver.3 patch: added several fixes to the submit-tf-job.py helper 
script and added tensorboard to the example launch spec. Thanks [~yanboliang] 
for the offline suggestions and help with these changes.

> Running Tensorflow on YARN with GPU and Docker - Examples
> -
>
> Key: YARN-8220
> URL: https://issues.apache.org/jira/browse/YARN-8220
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn-native-services
>Reporter: Sunil Govindan
>Assignee: Sunil Govindan
>Priority: Critical
> Attachments: YARN-8220.001.patch, YARN-8220.002.patch, 
> YARN-8220.003.patch
>
>
> Tensorflow could be run on YARN and could leverage YARN's distributed 
> features.
> This spec fill will help to run Tensorflow on yarn with GPU/docker



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8220) Running Tensorflow on YARN with GPU and Docker - Examples

2018-06-25 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8220:
-
Attachment: YARN-8220.003.patch

> Running Tensorflow on YARN with GPU and Docker - Examples
> -
>
> Key: YARN-8220
> URL: https://issues.apache.org/jira/browse/YARN-8220
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn-native-services
>Reporter: Sunil Govindan
>Assignee: Sunil Govindan
>Priority: Critical
> Attachments: YARN-8220.001.patch, YARN-8220.002.patch, 
> YARN-8220.003.patch
>
>
> Tensorflow could be run on YARN and could leverage YARN's distributed 
> features.
> This spec fill will help to run Tensorflow on yarn with GPU/docker



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8423) GPU does not get released even though the application gets killed.

2018-06-22 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520968#comment-16520968
 ] 

Wangda Tan commented on YARN-8423:
--

Thanks [~sunilg], 

Overall looks good, except: 
{code:java}
266 if (container.isContainerInFinalStates()) {
267 releasingGpus++;
268 }{code}
Instead of ++, you should add the actual number of GPUs allocated to the 
container. 
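For example (a sketch only; {{getAllocatedGpus}} is a placeholder for however 
the container's GPU count is looked up, not a real method):

{code:java}
// Sketch of the suggested change: count the GPUs still held by containers
// in final states, rather than counting such containers.
if (container.isContainerInFinalStates()) {
  releasingGpus += getAllocatedGpus(container); // instead of releasingGpus++
}
{code}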

And even though the patch looks safe, could you add a test to make sure there's 
no regression in the future? 

> GPU does not get released even though the application gets killed.
> --
>
> Key: YARN-8423
> URL: https://issues.apache.org/jira/browse/YARN-8423
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Sumana Sathish
>Assignee: Sunil Govindan
>Priority: Critical
> Attachments: YARN-8423.001.patch, YARN-8423.002.patch, 
> kill-container-nm.log
>
>
> Run an Tensor flow app requesting one GPU.
> Kill the application once the GPU is allocated
> Query the nodemanger once the application is killed.We see that GPU is not 
> being released.
> {code}
>  curl -i /ws/v1/node/resources/yarn.io%2Fgpu
> {"gpuDeviceInformation":{"gpus":[{"productName":"","uuid":"GPU-","minorNumber":0,"gpuUtilizations":{"overallGpuUtilization":0.0},"gpuMemoryUsage":{"usedMemoryMiB":73,"availMemoryMiB":12125,"totalMemoryMiB":12198},"temperature":{"currentGpuTemp":28.0,"maxGpuTemp":85.0,"slowThresholdGpuTemp":82.0}},{"productName":"","uuid":"GPU-","minorNumber":1,"gpuUtilizations":{"overallGpuUtilization":0.0},"gpuMemoryUsage":{"usedMemoryMiB":73,"availMemoryMiB":12125,"totalMemoryMiB":12198},"temperature":{"currentGpuTemp":28.0,"maxGpuTemp":85.0,"slowThresholdGpuTemp":82.0}}],"driverVersion":""},"totalGpuDevices":[{"index":0,"minorNumber":0},{"index":1,"minorNumber":1}],"assignedGpuDevices":[{"index":0,"minorNumber":0,"containerId":"container_"}]}
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8423) GPU does not get released even though the application gets killed.

2018-06-14 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16512889#comment-16512889
 ] 

Wangda Tan commented on YARN-8423:
--

Thanks [~shaneku...@gmail.com], I saw this error happen outside of 
Docker-in-Docker as well; it may be caused by the same issue. We need the 
proposed workaround fix in any case to make the code more robust.

> GPU does not get released even though the application gets killed.
> --
>
> Key: YARN-8423
> URL: https://issues.apache.org/jira/browse/YARN-8423
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Sumana Sathish
>Assignee: Sunil Govindan
>Priority: Critical
> Attachments: YARN-8423.001.patch, kill-container-nm.log
>
>
> Run an Tensor flow app requesting one GPU.
> Kill the application once the GPU is allocated
> Query the nodemanger once the application is killed.We see that GPU is not 
> being released.
> {code}
>  curl -i /ws/v1/node/resources/yarn.io%2Fgpu
> {"gpuDeviceInformation":{"gpus":[{"productName":"","uuid":"GPU-","minorNumber":0,"gpuUtilizations":{"overallGpuUtilization":0.0},"gpuMemoryUsage":{"usedMemoryMiB":73,"availMemoryMiB":12125,"totalMemoryMiB":12198},"temperature":{"currentGpuTemp":28.0,"maxGpuTemp":85.0,"slowThresholdGpuTemp":82.0}},{"productName":"","uuid":"GPU-","minorNumber":1,"gpuUtilizations":{"overallGpuUtilization":0.0},"gpuMemoryUsage":{"usedMemoryMiB":73,"availMemoryMiB":12125,"totalMemoryMiB":12198},"temperature":{"currentGpuTemp":28.0,"maxGpuTemp":85.0,"slowThresholdGpuTemp":82.0}}],"driverVersion":""},"totalGpuDevices":[{"index":0,"minorNumber":0},{"index":1,"minorNumber":1}],"assignedGpuDevices":[{"index":0,"minorNumber":0,"containerId":"container_"}]}
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-8423) GPU does not get released even though the application gets killed.

2018-06-13 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan reassigned YARN-8423:


Assignee: Sunil Govindan  (was: Wangda Tan)

> GPU does not get released even though the application gets killed.
> --
>
> Key: YARN-8423
> URL: https://issues.apache.org/jira/browse/YARN-8423
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Sumana Sathish
>Assignee: Sunil Govindan
>Priority: Critical
> Attachments: kill-container-nm.log
>
>
> Run an Tensor flow app requesting one GPU.
> Kill the application once the GPU is allocated
> Query the nodemanger once the application is killed.We see that GPU is not 
> being released.
> {code}
>  curl -i /ws/v1/node/resources/yarn.io%2Fgpu
> {"gpuDeviceInformation":{"gpus":[{"productName":"","uuid":"GPU-","minorNumber":0,"gpuUtilizations":{"overallGpuUtilization":0.0},"gpuMemoryUsage":{"usedMemoryMiB":73,"availMemoryMiB":12125,"totalMemoryMiB":12198},"temperature":{"currentGpuTemp":28.0,"maxGpuTemp":85.0,"slowThresholdGpuTemp":82.0}},{"productName":"","uuid":"GPU-","minorNumber":1,"gpuUtilizations":{"overallGpuUtilization":0.0},"gpuMemoryUsage":{"usedMemoryMiB":73,"availMemoryMiB":12125,"totalMemoryMiB":12198},"temperature":{"currentGpuTemp":28.0,"maxGpuTemp":85.0,"slowThresholdGpuTemp":82.0}}],"driverVersion":""},"totalGpuDevices":[{"index":0,"minorNumber":0},{"index":1,"minorNumber":1}],"assignedGpuDevices":[{"index":0,"minorNumber":0,"containerId":"container_"}]}
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8423) GPU does not get released even though the application gets killed.

2018-06-13 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16512022#comment-16512022
 ] 

Wangda Tan commented on YARN-8423:
--

A possible simple fix to work around the issue is to mark GPUs as "releasing" 
for a container in the killing stage, and to add wait logic inside 
{{GpuResourceAllocator#assignGpus}} before throwing the exception. But we may 
need a more comprehensive solution since we have more resource types to add, 
and this is a common, severe issue for any resource that needs hard binding 
(like GPU / FPGA / CPU hard-binding: YARN-8320, cc: [~cheersyang]). 
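Roughly, the wait logic could look like the sketch below (class and method 
names and the timeout values are illustrative, not the actual 
GpuResourceAllocator code):

{code:java}
import java.util.Set;

/**
 * Illustrative sketch of the proposed workaround: before failing an
 * allocation, wait briefly for devices held by containers that are still
 * being killed. Names are hypothetical, not the real GpuResourceAllocator.
 */
public abstract class WaitingGpuAllocatorSketch<T> {
  private static final long MAX_WAIT_MS = 10_000;
  private static final long RETRY_INTERVAL_MS = 250;

  /** Devices that are neither assigned nor marked as "releasing". */
  protected abstract Set<T> getFreeDevices();

  /** Hand out {@code num} devices from the free set. */
  protected abstract Set<T> doAssign(Set<T> free, int num);

  public synchronized Set<T> assignWithWait(int num)
      throws InterruptedException {
    long deadline = System.currentTimeMillis() + MAX_WAIT_MS;
    while (getFreeDevices().size() < num) {
      if (System.currentTimeMillis() > deadline) {
        throw new IllegalStateException(
            "Not enough free GPUs after waiting " + MAX_WAIT_MS + " ms");
      }
      // A container in the killing stage may still hold its GPUs; wait()
      // releases the monitor so the releasing thread can update state,
      // giving the kill a chance to finish instead of failing immediately.
      wait(RETRY_INTERVAL_MS);
    }
    return doAssign(getFreeDevices(), num);
  }
}
{code}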

Attached kill-container-nm.log which shows NM takes 2 mins to kill the (Docker) 
container. 

[~eyang], [~ebadger], [~shaneku...@gmail.com], is it normal to wait minutes to 
kill a Docker container? If yes, is there any way to speed it up? If not, what 
can I provide to troubleshoot the issue? I see this happen frequently in our 
docker-in-docker setup.

> GPU does not get released even though the application gets killed.
> --
>
> Key: YARN-8423
> URL: https://issues.apache.org/jira/browse/YARN-8423
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Sumana Sathish
>Assignee: Wangda Tan
>Priority: Critical
> Attachments: kill-container-nm.log
>
>
> Run an Tensor flow app requesting one GPU.
> Kill the application once the GPU is allocated
> Query the nodemanger once the application is killed.We see that GPU is not 
> being released.
> {code}
>  curl -i /ws/v1/node/resources/yarn.io%2Fgpu
> {"gpuDeviceInformation":{"gpus":[{"productName":"","uuid":"GPU-","minorNumber":0,"gpuUtilizations":{"overallGpuUtilization":0.0},"gpuMemoryUsage":{"usedMemoryMiB":73,"availMemoryMiB":12125,"totalMemoryMiB":12198},"temperature":{"currentGpuTemp":28.0,"maxGpuTemp":85.0,"slowThresholdGpuTemp":82.0}},{"productName":"","uuid":"GPU-","minorNumber":1,"gpuUtilizations":{"overallGpuUtilization":0.0},"gpuMemoryUsage":{"usedMemoryMiB":73,"availMemoryMiB":12125,"totalMemoryMiB":12198},"temperature":{"currentGpuTemp":28.0,"maxGpuTemp":85.0,"slowThresholdGpuTemp":82.0}}],"driverVersion":""},"totalGpuDevices":[{"index":0,"minorNumber":0},{"index":1,"minorNumber":1}],"assignedGpuDevices":[{"index":0,"minorNumber":0,"containerId":"container_"}]}
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8423) GPU does not get released even though the application gets killed.

2018-06-13 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8423:
-
Target Version/s: 3.1.1

> GPU does not get released even though the application gets killed.
> --
>
> Key: YARN-8423
> URL: https://issues.apache.org/jira/browse/YARN-8423
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Sumana Sathish
>Assignee: Wangda Tan
>Priority: Critical
> Attachments: kill-container-nm.log
>
>
> Run an Tensor flow app requesting one GPU.
> Kill the application once the GPU is allocated
> Query the nodemanger once the application is killed.We see that GPU is not 
> being released.
> {code}
>  curl -i /ws/v1/node/resources/yarn.io%2Fgpu
> {"gpuDeviceInformation":{"gpus":[{"productName":"","uuid":"GPU-","minorNumber":0,"gpuUtilizations":{"overallGpuUtilization":0.0},"gpuMemoryUsage":{"usedMemoryMiB":73,"availMemoryMiB":12125,"totalMemoryMiB":12198},"temperature":{"currentGpuTemp":28.0,"maxGpuTemp":85.0,"slowThresholdGpuTemp":82.0}},{"productName":"","uuid":"GPU-","minorNumber":1,"gpuUtilizations":{"overallGpuUtilization":0.0},"gpuMemoryUsage":{"usedMemoryMiB":73,"availMemoryMiB":12125,"totalMemoryMiB":12198},"temperature":{"currentGpuTemp":28.0,"maxGpuTemp":85.0,"slowThresholdGpuTemp":82.0}}],"driverVersion":""},"totalGpuDevices":[{"index":0,"minorNumber":0},{"index":1,"minorNumber":1}],"assignedGpuDevices":[{"index":0,"minorNumber":0,"containerId":"container_"}]}
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8423) GPU does not get released even though the application gets killed.

2018-06-13 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8423:
-
Attachment: kill-container-nm.log

> GPU does not get released even though the application gets killed.
> --
>
> Key: YARN-8423
> URL: https://issues.apache.org/jira/browse/YARN-8423
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Sumana Sathish
>Assignee: Wangda Tan
>Priority: Critical
> Attachments: kill-container-nm.log
>
>
> Run an Tensor flow app requesting one GPU.
> Kill the application once the GPU is allocated
> Query the nodemanger once the application is killed.We see that GPU is not 
> being released.
> {code}
>  curl -i /ws/v1/node/resources/yarn.io%2Fgpu
> {"gpuDeviceInformation":{"gpus":[{"productName":"","uuid":"GPU-","minorNumber":0,"gpuUtilizations":{"overallGpuUtilization":0.0},"gpuMemoryUsage":{"usedMemoryMiB":73,"availMemoryMiB":12125,"totalMemoryMiB":12198},"temperature":{"currentGpuTemp":28.0,"maxGpuTemp":85.0,"slowThresholdGpuTemp":82.0}},{"productName":"","uuid":"GPU-","minorNumber":1,"gpuUtilizations":{"overallGpuUtilization":0.0},"gpuMemoryUsage":{"usedMemoryMiB":73,"availMemoryMiB":12125,"totalMemoryMiB":12198},"temperature":{"currentGpuTemp":28.0,"maxGpuTemp":85.0,"slowThresholdGpuTemp":82.0}}],"driverVersion":""},"totalGpuDevices":[{"index":0,"minorNumber":0},{"index":1,"minorNumber":1}],"assignedGpuDevices":[{"index":0,"minorNumber":0,"containerId":"container_"}]}
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8423) GPU does not get released even though the application gets killed.

2018-06-13 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16512019#comment-16512019
 ] 

Wangda Tan commented on YARN-8423:
--

Thanks [~ssath...@hortonworks.com] for filing the issue.

I took a while to check the issue; it seems the YARN NM takes more than 2 
minutes to kill a running container, which causes the issue in the description.

Here's a possible order to trigger the issue:

1) Container_1 has GPU resources on node_1.
2) The application of container_1 gets killed from the RM (so the scheduler 
thinks the container is released).
3) Container_2 from another app gets allocated on node_1 with some GPU 
resources. 
4) At the same time, the RM notifies node_1 to kill container_1.
5) For some reason, container_1 is not killed immediately. (In the failed job, 
the container got killed after 2 minutes!)
6) The container_2 launch request arrives at node_1 before container_1 is 
killed. 
7) container_2 fails to launch because the GPU resources are not marked as 
released on the NM side.

This issue is not only related to GPU, but GPU fails fast since it needs hard 
binding to specific GPU devices. I think we may need to revisit the NM 
container launch behavior: in extreme cases, the NM's memory could be 
overcommitted if a new container arrives before an old container is fully 
killed.

cc: [~sunil.gov...@gmail.com]

> GPU does not get released even though the application gets killed.
> --
>
> Key: YARN-8423
> URL: https://issues.apache.org/jira/browse/YARN-8423
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Sumana Sathish
>Assignee: Wangda Tan
>Priority: Critical
>
> Run an Tensor flow app requesting one GPU.
> Kill the application once the GPU is allocated
> Query the nodemanger once the application is killed.We see that GPU is not 
> being released.
> {code}
>  curl -i /ws/v1/node/resources/yarn.io%2Fgpu
> {"gpuDeviceInformation":{"gpus":[{"productName":"","uuid":"GPU-","minorNumber":0,"gpuUtilizations":{"overallGpuUtilization":0.0},"gpuMemoryUsage":{"usedMemoryMiB":73,"availMemoryMiB":12125,"totalMemoryMiB":12198},"temperature":{"currentGpuTemp":28.0,"maxGpuTemp":85.0,"slowThresholdGpuTemp":82.0}},{"productName":"","uuid":"GPU-","minorNumber":1,"gpuUtilizations":{"overallGpuUtilization":0.0},"gpuMemoryUsage":{"usedMemoryMiB":73,"availMemoryMiB":12125,"totalMemoryMiB":12198},"temperature":{"currentGpuTemp":28.0,"maxGpuTemp":85.0,"slowThresholdGpuTemp":82.0}}],"driverVersion":""},"totalGpuDevices":[{"index":0,"minorNumber":0},{"index":1,"minorNumber":1}],"assignedGpuDevices":[{"index":0,"minorNumber":0,"containerId":"container_"}]}
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8415) TimelineWebServices.getEntity should throw a ForbiddenException(403) instead of 404 when ACL checks fail

2018-06-12 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8415:
-
Reporter: Sumana Sathish  (was: Suma Shivaprasad)

> TimelineWebServices.getEntity should throw a ForbiddenException(403) instead 
> of 404 when ACL checks fail
> 
>
> Key: YARN-8415
> URL: https://issues.apache.org/jira/browse/YARN-8415
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Sumana Sathish
>Assignee: Suma Shivaprasad
>Priority: Major
> Attachments: YARN-8415.1.patch, YARN-8415.2.patch, YARN-8415.3.patch
>
>
> {noformat}
> private TimelineEntity doGetEntity(
>   String entityType,
>   String entityId,
>   EnumSet fields,
>   UserGroupInformation callerUGI) throws YarnException, IOException {
> TimelineEntity entity = null;
> entity =
> store.getEntity(entityId, entityType, fields);
> if (entity != null) {
>   addDefaultDomainIdIfAbsent(entity);
>   // check ACLs
>   if (!timelineACLsManager.checkAccess(
>   callerUGI, ApplicationAccessType.VIEW_APP, entity)) {
>   entity = null;   //Should differentiate from an entity get failure 
> vs ACL check failure here by throwing an Exception.*
>   }
> }
> return entity;
>   }
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8220) Running Tensorflow on YARN with GPU and Docker - Examples

2018-06-11 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16509094#comment-16509094
 ] 

Wangda Tan commented on YARN-8220:
--

Discussed with [~eyang] about this and did some tests:

Currently, the YARN NM passes the JAVA_HOME, HDFS_HOME, CLASSPATH, etc. 
environment variables before launching a Docker container, regardless of 
whether ENTRY_POINT is used. This overwrites environment variables defined 
inside the Dockerfile (via \{{ENV}}). For a Docker container, it doesn't make 
sense to pass JAVA_HOME, HDFS_HOME, etc., because inside the Docker image we 
have a separate Java/Hadoop installation, or one mounted to exactly the same 
directory as on the host machine.

I just filed YARN-8417 to revisit this behavior.

Once the above change is done, we won't need to pre-set common configs inside 
the service spec or presetup.sh; everything can be done cleanly inside the 
Dockerfile.

For this patch:

Considering the size of this patch, I suggest getting it merged before 
YARN-8417. We can continuously improve it (e.g., using ENV/ENTRY_POINT) after 
YARN-8417, based on feedback from others.

Really appreciate valuable inputs from [~eyang]!

> Running Tensorflow on YARN with GPU and Docker - Examples
> -
>
> Key: YARN-8220
> URL: https://issues.apache.org/jira/browse/YARN-8220
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn-native-services
>Reporter: Sunil Govindan
>Assignee: Sunil Govindan
>Priority: Critical
> Attachments: YARN-8220.001.patch, YARN-8220.002.patch
>
>
> Tensorflow could be run on YARN and could leverage YARN's distributed 
> features.
> This spec fill will help to run Tensorflow on yarn with GPU/docker



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-8417) Should skip passing HDFS_HOME, HADOOP_CONF_DIR, JAVA_HOME, etc. to Docker container.

2018-06-11 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-8417:


 Summary: Should skip passing HDFS_HOME, HADOOP_CONF_DIR, 
JAVA_HOME, etc. to Docker container.
 Key: YARN-8417
 URL: https://issues.apache.org/jira/browse/YARN-8417
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Wangda Tan


Currently, the YARN NM passes the JAVA_HOME, HDFS_HOME, CLASSPATH, etc. 
environment variables before launching a Docker container, regardless of 
whether ENTRY_POINT is used. This overwrites environment variables defined 
inside the Dockerfile (via \{{ENV}}). For a Docker container, it doesn't make 
sense to pass JAVA_HOME, HDFS_HOME, etc., because inside the Docker image we 
have a separate Java/Hadoop installation, or one mounted to exactly the same 
directory as on the host machine.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8242) YARN NM: OOM error while reading back the state store on recovery

2018-06-11 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16508958#comment-16508958
 ] 

Wangda Tan commented on YARN-8242:
--

Bulk update on non-blocker issues which are targeted to 3.1.1:

If this issue is absolutely required for 3.1.1, please upgrade its priority to 
blocker. I'm working on the 3.1.1 release now and will move these JIRAs to 
3.1.2 during the week. Thanks.

> YARN NM: OOM error while reading back the state store on recovery
> -
>
> Key: YARN-8242
> URL: https://issues.apache.org/jira/browse/YARN-8242
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 2.6.0, 2.9.0, 2.6.5, 2.8.3, 3.1.0, 2.7.6, 3.0.2
>Reporter: Kanwaljeet Sachdev
>Assignee: Kanwaljeet Sachdev
>Priority: Critical
> Attachments: YARN-8242.001.patch, YARN-8242.002.patch, 
> YARN-8242.003.patch
>
>
> On startup the NM reads its state store and builds a list of application in 
> the state store to process. If the number of applications in the state store 
> is large and have a lot of "state" connected to it the NM can run OOM and 
> never get to the point that it can start processing the recovery.
> Since it never starts the recovery there is no way for the NM to ever pass 
> this point. It will require a change in heap size to get the NM started.
>  
> Following is the stack trace
> {code:java}
> at java.lang.OutOfMemoryError. (OutOfMemoryError.java:48) at 
> com.google.protobuf.ByteString.copyFrom (ByteString.java:192) at 
> com.google.protobuf.CodedInputStream.readBytes (CodedInputStream.java:324) at 
> org.apache.hadoop.yarn.proto.YarnProtos$StringStringMapProto. 
> (YarnProtos.java:47069) at 
> org.apache.hadoop.yarn.proto.YarnProtos$StringStringMapProto. 
> (YarnProtos.java:47014) at 
> org.apache.hadoop.yarn.proto.YarnProtos$StringStringMapProto$1.parsePartialFrom
>  (YarnProtos.java:47102) at 
> org.apache.hadoop.yarn.proto.YarnProtos$StringStringMapProto$1.parsePartialFrom
>  (YarnProtos.java:47097) at com.google.protobuf.CodedInputStream.readMessage 
> (CodedInputStream.java:309) at 
> org.apache.hadoop.yarn.proto.YarnProtos$ContainerLaunchContextProto. 
> (YarnProtos.java:41016) at 
> org.apache.hadoop.yarn.proto.YarnProtos$ContainerLaunchContextProto. 
> (YarnProtos.java:40942) at 
> org.apache.hadoop.yarn.proto.YarnProtos$ContainerLaunchContextProto$1.parsePartialFrom
>  (YarnProtos.java:41080) at 
> org.apache.hadoop.yarn.proto.YarnProtos$ContainerLaunchContextProto$1.parsePartialFrom
>  (YarnProtos.java:41075) at com.google.protobuf.CodedInputStream.readMessage 
> (CodedInputStream.java:309) at 
> org.apache.hadoop.yarn.proto.YarnServiceProtos$StartContainerRequestProto.
>  (YarnServiceProtos.java:24517) at 
> org.apache.hadoop.yarn.proto.YarnServiceProtos$StartContainerRequestProto.
>  (YarnServiceProtos.java:24464) at 
> org.apache.hadoop.yarn.proto.YarnServiceProtos$StartContainerRequestProto$1.parsePartialFrom
>  (YarnServiceProtos.java:24568) at 
> org.apache.hadoop.yarn.proto.YarnServiceProtos$StartContainerRequestProto$1.parsePartialFrom
>  (YarnServiceProtos.java:24563) at 
> com.google.protobuf.AbstractParser.parsePartialFrom (AbstractParser.java:141) 
> at com.google.protobuf.AbstractParser.parseFrom (AbstractParser.java:176) at 
> com.google.protobuf.AbstractParser.parseFrom (AbstractParser.java:188) at 
> com.google.protobuf.AbstractParser.parseFrom (AbstractParser.java:193) at 
> com.google.protobuf.AbstractParser.parseFrom (AbstractParser.java:49) at 
> org.apache.hadoop.yarn.proto.YarnServiceProtos$StartContainerRequestProto.parseFrom
>  (YarnServiceProtos.java:24739) at 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.loadContainerState
>  (NMLeveldbStateStoreService.java:217) at 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.loadContainersState
>  (NMLeveldbStateStoreService.java:170) at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recover
>  (ContainerManagerImpl.java:253) at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit
>  (ContainerManagerImpl.java:237) at 
> org.apache.hadoop.service.AbstractService.init (AbstractService.java:163) at 
> org.apache.hadoop.service.CompositeService.serviceInit 
> (CompositeService.java:107) at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit 
> (NodeManager.java:255) at org.apache.hadoop.service.AbstractService.init 
> (AbstractService.java:163) at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager 
> (NodeManager.java:474) at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.main 
> (NodeManager.java:521){code}



--
This message was sent by 

[jira] [Commented] (YARN-8414) Nodemanager crashes soon if ATSv2 HBase is either down or absent

2018-06-11 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16508954#comment-16508954
 ] 

Wangda Tan commented on YARN-8414:
--

Bulk update on non-blocker issues which are targeted to 3.1.1:

If this issue is absolutely required for 3.1.1, please upgrade its priority to 
blocker. I'm working on the 3.1.1 release now and will move these JIRAs to 
3.1.2 during the week. Thanks.

> Nodemanager crashes soon if ATSv2 HBase is either down or absent
> 
>
> Key: YARN-8414
> URL: https://issues.apache.org/jira/browse/YARN-8414
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.1.0
>Reporter: Eric Yang
>Priority: Critical
>
> Test cluster has 1000 apps running, and a user trigger capacity scheduler 
> queue changes.  This crashes all node managers.  It looks like node manager 
> encounter too many files open while aggregating logs for containers:
> {code}
> 2018-06-07 21:17:59,307 WARN  server.AbstractConnector 
> (AbstractConnector.java:handleAcceptFailure(544)) -
> java.io.IOException: Too many open files
> at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
> at 
> sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
> at 
> sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
> at 
> org.eclipse.jetty.server.ServerConnector.accept(ServerConnector.java:371)
> at 
> org.eclipse.jetty.server.AbstractConnector$Acceptor.run(AbstractConnector.java:601)
> at 
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
> at 
> org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
> at java.lang.Thread.run(Thread.java:745)
> 2018-06-07 21:17:59,758 WARN  util.SysInfoLinux 
> (SysInfoLinux.java:readProcMemInfoFile(238)) - Couldn't read /proc/meminfo; 
> can't determine memory settings
> 2018-06-07 21:17:59,758 WARN  util.SysInfoLinux 
> (SysInfoLinux.java:readProcMemInfoFile(238)) - Couldn't read /proc/meminfo; 
> can't determine memory settings
> 2018-06-07 21:18:00,842 WARN  client.ConnectionUtils 
> (ConnectionUtils.java:getStubKey(236)) - Can not resolve host12.example.com, 
> please check your network
> java.net.UnknownHostException: host1.example.com: System error
> at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)
> at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928)
> at 
> java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1323)
> at java.net.InetAddress.getAllByName0(InetAddress.java:1276)
> at java.net.InetAddress.getAllByName(InetAddress.java:1192)
> at java.net.InetAddress.getAllByName(InetAddress.java:1126)
> at java.net.InetAddress.getByName(InetAddress.java:1076)
> at 
> org.apache.hadoop.hbase.client.ConnectionUtils.getStubKey(ConnectionUtils.java:233)
> at 
> org.apache.hadoop.hbase.client.ConnectionImplementation.getClient(ConnectionImplementation.java:1189)
> at 
> org.apache.hadoop.hbase.client.ReversedScannerCallable.prepare(ReversedScannerCallable.java:111)
> at 
> org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.prepare(ScannerCallableWithReplicas.java:399)
> at 
> org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:105)
> at 
> org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:80)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Timeline service has thousands of exceptions:
> {code}
> 2018-06-07 21:18:34,182 ERROR client.AsyncProcess 
> (AsyncProcess.java:submit(291)) - Failed to get region location
> java.io.InterruptedIOException
> at 
> org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:265)
> at 
> org.apache.hadoop.hbase.client.ClientScanner.loadCache(ClientScanner.java:437)
> at 
> org.apache.hadoop.hbase.client.ClientScanner.nextWithSyncCache(ClientScanner.java:312)
> at 
> org.apache.hadoop.hbase.client.ClientScanner.next(ClientScanner.java:597)
> at 
> org.apache.hadoop.hbase.client.ConnectionImplementation.locateRegionInMeta(ConnectionImplementation.java:834)
> at 
> org.apache.hadoop.hbase.client.ConnectionImplementation.locateRegion(ConnectionImplementation.java:732)
> at 
> org.apache.hadoop.hbase.client.AsyncProcess.submit(AsyncProcess.java:281)
> at 
> org.apache.hadoo

[jira] [Updated] (YARN-8257) Native service should automatically adding escapes for environment/launch cmd before sending to YARN

2018-06-11 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8257:
-
Target Version/s: 3.1.2

> Native service should automatically adding escapes for environment/launch cmd 
> before sending to YARN
> 
>
> Key: YARN-8257
> URL: https://issues.apache.org/jira/browse/YARN-8257
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-native-services
>Reporter: Wangda Tan
>Assignee: Gour Saha
>Priority: Critical
>
> Noticed this issue while using native service: 
> Basically, when a string for environment / launch command contains chars like 
> ", /, `: it needs to be escaped twice.
> The first escape is required by the JSON spec: because JSON only accepts double 
> quotes, the value needs one level of escaping.
> The second is required by container launch; what we do for the command line is 
> (ContainerLaunch.java):
> {code:java}
> line("exec /bin/bash -c \"", StringUtils.join(" ", command), "\"");{code}
> And for environment:
> {code:java}
> line("export ", key, "=\"", value, "\"");{code}
> An example of launch_command: 
> {code:java}
> "launch_command": "export CLASSPATH=\\`\\$HADOOP_HDFS_HOME/bin/hadoop 
> classpath --glob\\`"{code}
> An example of environment:
> {code:java}
> "TF_CONFIG" : "{\\\"cluster\\\": {\\\"master\\\": 
> [\\\"master-0.distributed-tf.ambari-qa.tensorflow.site:8000\\\"], \\\"ps\\\": 
> [\\\"ps-0.distributed-tf.ambari-qa.tensorflow.site:8000\\\"], \\\"worker\\\": 
> [\\\"worker-0.distributed-tf.ambari-qa.tensorflow.site:8000\\\"]}, 
> \\\"task\\\": {\\\"type\\\":\\\"${COMPONENT_NAME}\\\", 
> \\\"index\\\":${COMPONENT_ID}}, \\\"environment\\\":\\\"cloud\\\"}",{code}
> To improve usability, I think we should automatically escape the input string 
> once. (For example, if the user specifies 
> {code}
> "TF_CONFIG": "\"key\""
> {code}
> We will automatically escape it to:
> {code}
> "TF_CONFIG": \\\"key\\\"
> {code}
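To make the proposed auto-escaping concrete, here is a minimal sketch, assuming a 
standalone helper that applies the single extra level of escaping before a value 
is embedded in the generated {{exec /bin/bash -c "..."}} line. The class and 
method names below are invented for illustration and are not part of any attached 
patch:
{code:java}
// Hypothetical helper, for illustration only: add one level of escaping for the
// characters the generated bash -c "..." wrapper would otherwise interpret.
public final class LaunchValueEscaper {

  private LaunchValueEscaper() {
  }

  public static String escapeForLaunchScript(String value) {
    StringBuilder sb = new StringBuilder(value.length() + 8);
    for (char c : value.toCharArray()) {
      if (c == '"' || c == '`' || c == '\\' || c == '$') {
        sb.append('\\');
      }
      sb.append(c);
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // "key" (the value after JSON parsing) becomes \"key\" in the launch script.
    System.out.println(escapeForLaunchScript("\"key\""));
  }
}
{code}
Under this assumption, the JSON-level escaping stays the user's responsibility, 
while the launch-script level is handled automatically.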



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8257) Native service should automatically adding escapes for environment/launch cmd before sending to YARN

2018-06-11 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8257:
-
Target Version/s:   (was: 3.1.1)

> Native service should automatically adding escapes for environment/launch cmd 
> before sending to YARN
> 
>
> Key: YARN-8257
> URL: https://issues.apache.org/jira/browse/YARN-8257
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-native-services
>Reporter: Wangda Tan
>Assignee: Gour Saha
>Priority: Critical
>
> Noticed this issue while using native service: 
> Basically, when a string for environment / launch command contains chars like 
> ", /, `: it needs to be escaped twice.
> The first escape is required by the JSON spec: because JSON only accepts double 
> quotes, the value needs one level of escaping.
> The second is required by container launch; what we do for the command line is 
> (ContainerLaunch.java):
> {code:java}
> line("exec /bin/bash -c \"", StringUtils.join(" ", command), "\"");{code}
> And for environment:
> {code:java}
> line("export ", key, "=\"", value, "\"");{code}
> An example of launch_command: 
> {code:java}
> "launch_command": "export CLASSPATH=\\`\\$HADOOP_HDFS_HOME/bin/hadoop 
> classpath --glob\\`"{code}
> An example of environment:
> {code:java}
> "TF_CONFIG" : "{\\\"cluster\\\": {\\\"master\\\": 
> [\\\"master-0.distributed-tf.ambari-qa.tensorflow.site:8000\\\"], \\\"ps\\\": 
> [\\\"ps-0.distributed-tf.ambari-qa.tensorflow.site:8000\\\"], \\\"worker\\\": 
> [\\\"worker-0.distributed-tf.ambari-qa.tensorflow.site:8000\\\"]}, 
> \\\"task\\\": {\\\"type\\\":\\\"${COMPONENT_NAME}\\\", 
> \\\"index\\\":${COMPONENT_ID}}, \\\"environment\\\":\\\"cloud\\\"}",{code}
> To improve usability, I think we should automatically escape the input string 
> once. (For example, if the user specifies 
> {code}
> "TF_CONFIG": "\"key\""
> {code}
> We will automatically escape it to:
> {code}
> "TF_CONFIG": \\\"key\\\"
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8234) Improve RM system metrics publisher's performance by pushing events to timeline server in batch

2018-06-11 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16508952#comment-16508952
 ] 

Wangda Tan commented on YARN-8234:
--

Bulk update on non-blocker issues which are targeted to 3.1.1:

If this issue is absolutely required for 3.1.1, please upgrade its priority to 
blocker. I'm working on the 3.1.1 release now and will move these JIRAs to 3.1.2 
during the week. Thanks.

> Improve RM system metrics publisher's performance by pushing events to 
> timeline server in batch
> ---
>
> Key: YARN-8234
> URL: https://issues.apache.org/jira/browse/YARN-8234
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager, timelineserver
>Affects Versions: 2.8.3
>Reporter: Hu Ziqian
>Assignee: Hu Ziqian
>Priority: Critical
> Attachments: YARN-8234-branch-2.8.3.001.patch, 
> YARN-8234-branch-2.8.3.002.patch, YARN-8234-branch-2.8.3.003.patch
>
>
> When the system metrics publisher is enabled, the RM pushes events to the 
> timeline server via its REST API. If the cluster load is heavy, many events are 
> sent to the timeline server and the timeline server's event handler thread gets 
> locked. YARN-7266 discusses the details of this problem. Because of the lock, 
> the timeline server can't receive events as fast as the RM generates them, and 
> lots of timeline events stay in the RM's memory. Eventually those events consume 
> all of the RM's memory and the RM triggers a full GC (which causes a JVM 
> stop-the-world pause and a timeout from the RM to ZooKeeper) or even hits an OOM.
> The main problem is that the timeline server can't receive events as fast as 
> they are generated. Today the RM system metrics publisher puts only one event in 
> each request, so most of the time is spent handling HTTP headers and connection 
> setup on the timeline side; only a small fraction is spent on the timeline event 
> itself, which is what is truly valuable.
> In this issue, we add a buffer to the system metrics publisher and let the 
> publisher send events to the timeline server in batches, one batch per request. 
> With the batch size set to 1000, in our experiment the rate at which the 
> timeline server receives events improved by 100x. We have implemented this 
> function in our production environment, which accepts 2 apps in one hour, and it 
> works fine.
> We add the following configuration:
>  * yarn.resourcemanager.system-metrics-publisher.batch-size: the number of 
> events the system metrics publisher sends in one request. Default value is 1000.
>  * yarn.resourcemanager.system-metrics-publisher.buffer-size: the size of the 
> event buffer in the system metrics publisher.
>  * yarn.resourcemanager.system-metrics-publisher.interval-seconds: when batch 
> publishing is enabled, we must avoid the publisher waiting for a batch to fill 
> up and holding events in the buffer for a long time, so we add another thread 
> that sends the buffered events periodically. This config sets the interval of 
> that periodic sending thread. The default value is 60s.
>  
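To illustrate the batching idea described above, here is a minimal sketch under 
the assumption of a bounded buffer that is flushed either when it reaches the 
batch size or when a periodic timer fires. All class, method, and field names are 
invented; this is not the code in the attached patches:
{code:java}
// Illustrative sketch only: buffer events, flush in batches, plus a periodic
// flush so events never sit in the buffer for too long.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

class BatchingTimelinePublisher<E> {
  private final BlockingQueue<E> buffer;
  private final int batchSize;
  private final ScheduledExecutorService timer =
      Executors.newSingleThreadScheduledExecutor();

  BatchingTimelinePublisher(int bufferSize, int batchSize, long intervalSeconds) {
    this.buffer = new LinkedBlockingQueue<>(bufferSize);
    this.batchSize = batchSize;
    // Periodic flush corresponding to the interval-seconds setting.
    timer.scheduleAtFixedRate(this::flush, intervalSeconds, intervalSeconds,
        TimeUnit.SECONDS);
  }

  void publish(E event) {
    buffer.offer(event);                 // bounded buffer: drop or block on overflow
    if (buffer.size() >= batchSize) {
      flush();
    }
  }

  private synchronized void flush() {
    List<E> batch = new ArrayList<>(batchSize);
    buffer.drainTo(batch, batchSize);
    if (!batch.isEmpty()) {
      sendBatchToTimelineServer(batch);  // one REST request carrying the whole batch
    }
  }

  private void sendBatchToTimelineServer(List<E> batch) {
    // placeholder for the single REST call that carries the whole batch
  }
}
{code}
The three knobs in this sketch map directly to the batch-size, buffer-size, and 
interval-seconds settings listed above.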



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8242) YARN NM: OOM error while reading back the state store on recovery

2018-06-11 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16508948#comment-16508948
 ] 

Wangda Tan commented on YARN-8242:
--

Given there's no more progress on this Jira and it is not a regression, 
downgrading the priority to Critical.

> YARN NM: OOM error while reading back the state store on recovery
> -
>
> Key: YARN-8242
> URL: https://issues.apache.org/jira/browse/YARN-8242
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 2.6.0, 2.9.0, 2.6.5, 2.8.3, 3.1.0, 2.7.6, 3.0.2
>Reporter: Kanwaljeet Sachdev
>Assignee: Kanwaljeet Sachdev
>Priority: Critical
> Attachments: YARN-8242.001.patch, YARN-8242.002.patch, 
> YARN-8242.003.patch
>
>
> On startup the NM reads its state store and builds a list of applications in 
> the state store to process. If the number of applications in the state store 
> is large and they have a lot of "state" attached to them, the NM can hit an OOM 
> and never get to the point where it can start processing the recovery.
> Since it never starts the recovery, there is no way for the NM to ever get past 
> this point; it requires a heap size increase to get the NM started.
>  
> Following is the stack trace
> {code:java}
> at java.lang.OutOfMemoryError. (OutOfMemoryError.java:48) at 
> com.google.protobuf.ByteString.copyFrom (ByteString.java:192) at 
> com.google.protobuf.CodedInputStream.readBytes (CodedInputStream.java:324) at 
> org.apache.hadoop.yarn.proto.YarnProtos$StringStringMapProto. 
> (YarnProtos.java:47069) at 
> org.apache.hadoop.yarn.proto.YarnProtos$StringStringMapProto. 
> (YarnProtos.java:47014) at 
> org.apache.hadoop.yarn.proto.YarnProtos$StringStringMapProto$1.parsePartialFrom
>  (YarnProtos.java:47102) at 
> org.apache.hadoop.yarn.proto.YarnProtos$StringStringMapProto$1.parsePartialFrom
>  (YarnProtos.java:47097) at com.google.protobuf.CodedInputStream.readMessage 
> (CodedInputStream.java:309) at 
> org.apache.hadoop.yarn.proto.YarnProtos$ContainerLaunchContextProto. 
> (YarnProtos.java:41016) at 
> org.apache.hadoop.yarn.proto.YarnProtos$ContainerLaunchContextProto. 
> (YarnProtos.java:40942) at 
> org.apache.hadoop.yarn.proto.YarnProtos$ContainerLaunchContextProto$1.parsePartialFrom
>  (YarnProtos.java:41080) at 
> org.apache.hadoop.yarn.proto.YarnProtos$ContainerLaunchContextProto$1.parsePartialFrom
>  (YarnProtos.java:41075) at com.google.protobuf.CodedInputStream.readMessage 
> (CodedInputStream.java:309) at 
> org.apache.hadoop.yarn.proto.YarnServiceProtos$StartContainerRequestProto.
>  (YarnServiceProtos.java:24517) at 
> org.apache.hadoop.yarn.proto.YarnServiceProtos$StartContainerRequestProto.
>  (YarnServiceProtos.java:24464) at 
> org.apache.hadoop.yarn.proto.YarnServiceProtos$StartContainerRequestProto$1.parsePartialFrom
>  (YarnServiceProtos.java:24568) at 
> org.apache.hadoop.yarn.proto.YarnServiceProtos$StartContainerRequestProto$1.parsePartialFrom
>  (YarnServiceProtos.java:24563) at 
> com.google.protobuf.AbstractParser.parsePartialFrom (AbstractParser.java:141) 
> at com.google.protobuf.AbstractParser.parseFrom (AbstractParser.java:176) at 
> com.google.protobuf.AbstractParser.parseFrom (AbstractParser.java:188) at 
> com.google.protobuf.AbstractParser.parseFrom (AbstractParser.java:193) at 
> com.google.protobuf.AbstractParser.parseFrom (AbstractParser.java:49) at 
> org.apache.hadoop.yarn.proto.YarnServiceProtos$StartContainerRequestProto.parseFrom
>  (YarnServiceProtos.java:24739) at 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.loadContainerState
>  (NMLeveldbStateStoreService.java:217) at 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.loadContainersState
>  (NMLeveldbStateStoreService.java:170) at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recover
>  (ContainerManagerImpl.java:253) at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit
>  (ContainerManagerImpl.java:237) at 
> org.apache.hadoop.service.AbstractService.init (AbstractService.java:163) at 
> org.apache.hadoop.service.CompositeService.serviceInit 
> (CompositeService.java:107) at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit 
> (NodeManager.java:255) at org.apache.hadoop.service.AbstractService.init 
> (AbstractService.java:163) at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager 
> (NodeManager.java:474) at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.main 
> (NodeManager.java:521){code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org

[jira] [Updated] (YARN-8242) YARN NM: OOM error while reading back the state store on recovery

2018-06-11 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8242:
-
Priority: Critical  (was: Blocker)

> YARN NM: OOM error while reading back the state store on recovery
> -
>
> Key: YARN-8242
> URL: https://issues.apache.org/jira/browse/YARN-8242
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 2.6.0, 2.9.0, 2.6.5, 2.8.3, 3.1.0, 2.7.6, 3.0.2
>Reporter: Kanwaljeet Sachdev
>Assignee: Kanwaljeet Sachdev
>Priority: Critical
> Attachments: YARN-8242.001.patch, YARN-8242.002.patch, 
> YARN-8242.003.patch
>
>
> On startup the NM reads its state store and builds a list of applications in 
> the state store to process. If the number of applications in the state store 
> is large and they have a lot of "state" attached to them, the NM can hit an OOM 
> and never get to the point where it can start processing the recovery.
> Since it never starts the recovery, there is no way for the NM to ever get past 
> this point; it requires a heap size increase to get the NM started.
>  
> Following is the stack trace
> {code:java}
> at java.lang.OutOfMemoryError. (OutOfMemoryError.java:48) at 
> com.google.protobuf.ByteString.copyFrom (ByteString.java:192) at 
> com.google.protobuf.CodedInputStream.readBytes (CodedInputStream.java:324) at 
> org.apache.hadoop.yarn.proto.YarnProtos$StringStringMapProto. 
> (YarnProtos.java:47069) at 
> org.apache.hadoop.yarn.proto.YarnProtos$StringStringMapProto. 
> (YarnProtos.java:47014) at 
> org.apache.hadoop.yarn.proto.YarnProtos$StringStringMapProto$1.parsePartialFrom
>  (YarnProtos.java:47102) at 
> org.apache.hadoop.yarn.proto.YarnProtos$StringStringMapProto$1.parsePartialFrom
>  (YarnProtos.java:47097) at com.google.protobuf.CodedInputStream.readMessage 
> (CodedInputStream.java:309) at 
> org.apache.hadoop.yarn.proto.YarnProtos$ContainerLaunchContextProto. 
> (YarnProtos.java:41016) at 
> org.apache.hadoop.yarn.proto.YarnProtos$ContainerLaunchContextProto. 
> (YarnProtos.java:40942) at 
> org.apache.hadoop.yarn.proto.YarnProtos$ContainerLaunchContextProto$1.parsePartialFrom
>  (YarnProtos.java:41080) at 
> org.apache.hadoop.yarn.proto.YarnProtos$ContainerLaunchContextProto$1.parsePartialFrom
>  (YarnProtos.java:41075) at com.google.protobuf.CodedInputStream.readMessage 
> (CodedInputStream.java:309) at 
> org.apache.hadoop.yarn.proto.YarnServiceProtos$StartContainerRequestProto.
>  (YarnServiceProtos.java:24517) at 
> org.apache.hadoop.yarn.proto.YarnServiceProtos$StartContainerRequestProto.
>  (YarnServiceProtos.java:24464) at 
> org.apache.hadoop.yarn.proto.YarnServiceProtos$StartContainerRequestProto$1.parsePartialFrom
>  (YarnServiceProtos.java:24568) at 
> org.apache.hadoop.yarn.proto.YarnServiceProtos$StartContainerRequestProto$1.parsePartialFrom
>  (YarnServiceProtos.java:24563) at 
> com.google.protobuf.AbstractParser.parsePartialFrom (AbstractParser.java:141) 
> at com.google.protobuf.AbstractParser.parseFrom (AbstractParser.java:176) at 
> com.google.protobuf.AbstractParser.parseFrom (AbstractParser.java:188) at 
> com.google.protobuf.AbstractParser.parseFrom (AbstractParser.java:193) at 
> com.google.protobuf.AbstractParser.parseFrom (AbstractParser.java:49) at 
> org.apache.hadoop.yarn.proto.YarnServiceProtos$StartContainerRequestProto.parseFrom
>  (YarnServiceProtos.java:24739) at 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.loadContainerState
>  (NMLeveldbStateStoreService.java:217) at 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.loadContainersState
>  (NMLeveldbStateStoreService.java:170) at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recover
>  (ContainerManagerImpl.java:253) at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit
>  (ContainerManagerImpl.java:237) at 
> org.apache.hadoop.service.AbstractService.init (AbstractService.java:163) at 
> org.apache.hadoop.service.CompositeService.serviceInit 
> (CompositeService.java:107) at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit 
> (NodeManager.java:255) at org.apache.hadoop.service.AbstractService.init 
> (AbstractService.java:163) at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager 
> (NodeManager.java:474) at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.main 
> (NodeManager.java:521){code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8220) Running Tensorflow on YARN with GPU and Docker - Examples

2018-06-10 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16507553#comment-16507553
 ] 

Wangda Tan commented on YARN-8220:
--

[~eyang],

Fair enough, could you give some examples of how to use ENTRYPOINT (to expose 
multiple env vars) and pass launch_command at the same time? Are any configs 
needed?

> Running Tensorflow on YARN with GPU and Docker - Examples
> -
>
> Key: YARN-8220
> URL: https://issues.apache.org/jira/browse/YARN-8220
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn-native-services
>Reporter: Sunil Govindan
>Assignee: Sunil Govindan
>Priority: Critical
> Attachments: YARN-8220.001.patch, YARN-8220.002.patch
>
>
> Tensorflow could be run on YARN and could leverage YARN's distributed 
> features.
> This spec file will help to run Tensorflow on YARN with GPU/Docker



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8220) Running Tensorflow on YARN with GPU and Docker - Examples

2018-06-10 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16507523#comment-16507523
 ] 

Wangda Tan commented on YARN-8220:
--

Attached ver.2 patch; fixed the warnings Jenkins reported.

Addressed the {{git clone}} suggestion from [~eyang]; the scripts are now 
embedded inside the project. 

For the entry point, it is a good feature, but I think it may not best suit the 
training example. We can consider using it when we want to add the Zeppelin + TF 
or TensorFlow Serving example. Sounds good, [~eyang]? 

> Running Tensorflow on YARN with GPU and Docker - Examples
> -
>
> Key: YARN-8220
> URL: https://issues.apache.org/jira/browse/YARN-8220
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn-native-services
>Reporter: Sunil Govindan
>Assignee: Sunil Govindan
>Priority: Critical
> Attachments: YARN-8220.001.patch, YARN-8220.002.patch
>
>
> Tensorflow could be run on YARN and could leverage YARN's distributed 
> features.
> This spec file will help to run Tensorflow on YARN with GPU/Docker



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8220) Running Tensorflow on YARN with GPU and Docker - Examples

2018-06-10 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8220:
-
Attachment: YARN-8220.002.patch

> Running Tensorflow on YARN with GPU and Docker - Examples
> -
>
> Key: YARN-8220
> URL: https://issues.apache.org/jira/browse/YARN-8220
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn-native-services
>Reporter: Sunil Govindan
>Assignee: Sunil Govindan
>Priority: Critical
> Attachments: YARN-8220.001.patch, YARN-8220.002.patch
>
>
> Tensorflow could be run on YARN and could leverage YARN's distributed 
> features.
> This spec file will help to run Tensorflow on YARN with GPU/Docker



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8394) Improve data locality documentation for Capacity Scheduler

2018-06-06 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16503721#comment-16503721
 ] 

Wangda Tan commented on YARN-8394:
--

+1, thanks [~cheersyang] for the patch.

> Improve data locality documentation for Capacity Scheduler
> --
>
> Key: YARN-8394
> URL: https://issues.apache.org/jira/browse/YARN-8394
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Weiwei Yang
>Assignee: Weiwei Yang
>Priority: Major
> Attachments: YARN-8394.001.patch
>
>
> YARN-6344 introduces a new parameter, 
> {{yarn.scheduler.capacity.rack-locality-additional-delay}}, in 
> capacity-scheduler.xml; we need to add some documentation to 
> {{CapacityScheduler.md}} accordingly.
> Moreover, we are seeing more and more clusters that separate storage and 
> computation, where the file system is always remote. In such cases we need to 
> explain how to relax data locality in CS, otherwise MR jobs suffer.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5139) [Umbrella] Move YARN scheduler towards global scheduler

2018-06-05 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-5139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16501947#comment-16501947
 ] 

Wangda Tan commented on YARN-5139:
--

[~zhuqi], it is committed to 2.9.0 and later. You're welcome to help :). Not sure 
if any FS committers can help review this. cc: [~haibochen]

> [Umbrella] Move YARN scheduler towards global scheduler
> ---
>
> Key: YARN-5139
> URL: https://issues.apache.org/jira/browse/YARN-5139
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Major
> Attachments: Explanantions of Global Scheduling (YARN-5139) 
> Implementation.pdf, YARN-5139-Concurrent-scheduling-performance-report.pdf, 
> YARN-5139-Global-Schedulingd-esign-and-implementation-notes-v2.pdf, 
> YARN-5139-Global-Schedulingd-esign-and-implementation-notes.pdf, 
> YARN-5139.000.patch, wip-1.YARN-5139.patch, wip-2.YARN-5139.patch, 
> wip-3.YARN-5139.patch, wip-4.YARN-5139.patch, wip-5.YARN-5139.patch
>
>
> The existing YARN scheduler is based on node heartbeats. This can lead to 
> sub-optimal decisions because the scheduler can only look at one node at a time 
> when scheduling resources.
> Pseudo code of the existing scheduling logic looks like:
> {code}
> for node in allNodes:
>    Go to parentQueue
>       Go to leafQueue
>          for application in leafQueue.applications:
>             for resource-request in application.resource-requests
>                try to schedule on node
> {code}
> Considering future complex resource placement requirements, such as node 
> constraints (give me "a && b || c") or anti-affinity (do not allocate HBase 
> regionservers and Storm workers on the same host), we may need to consider 
> moving the YARN scheduler towards global scheduling.
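For contrast, here is a deliberately simplified, hypothetical sketch of the 
global-scheduling direction (invented types; not the actual scheduler code): the 
scheduler starts from a resource request and chooses the best node among a set of 
candidates, instead of only considering the single node whose heartbeat is being 
processed:
{code:java}
// Toy sketch with invented types: a request-driven ("global") view picks the
// best node from a candidate set rather than whichever node just heartbeated.
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

class GlobalSchedulingSketch {
  record Node(String host, long availableMb) { }
  record ResourceRequest(long memoryMb) { }

  /** Choose a node that fits the request; here simply the one with most free memory. */
  static Optional<Node> selectBestNode(ResourceRequest req, List<Node> candidates) {
    return candidates.stream()
        .filter(n -> n.availableMb() >= req.memoryMb())
        .max(Comparator.comparingLong(Node::availableMb));
  }

  public static void main(String[] args) {
    List<Node> candidates = List.of(new Node("n1", 4096), new Node("n2", 8192));
    // Prints Optional[Node[host=n2, availableMb=8192]]
    System.out.println(selectBestNode(new ResourceRequest(2048), candidates));
  }
}
{code}
The real implementation also has to respect the queue hierarchy, placement 
constraints, and locking, which this toy deliberately ignores.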



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8220) Running Tensorflow on YARN with GPU and Docker - Examples

2018-06-02 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16499130#comment-16499130
 ] 

Wangda Tan commented on YARN-8220:
--

Reopened Jira to trigger Jenkins.

> Running Tensorflow on YARN with GPU and Docker - Examples
> -
>
> Key: YARN-8220
> URL: https://issues.apache.org/jira/browse/YARN-8220
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn-native-services
>Reporter: Sunil Govindan
>Assignee: Sunil Govindan
>Priority: Critical
> Attachments: YARN-8220.001.patch
>
>
> Tensorflow could be run on YARN and could leverage YARN's distributed 
> features.
> This spec file will help to run Tensorflow on YARN with GPU/Docker



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Reopened] (YARN-8220) Running Tensorflow on YARN with GPU and Docker - Examples

2018-06-02 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan reopened YARN-8220:
--

> Running Tensorflow on YARN with GPU and Docker - Examples
> -
>
> Key: YARN-8220
> URL: https://issues.apache.org/jira/browse/YARN-8220
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn-native-services
>Reporter: Sunil Govindan
>Assignee: Sunil Govindan
>Priority: Critical
> Attachments: YARN-8220.001.patch
>
>
> Tensorflow could be run on YARN and could leverage YARN's distributed 
> features.
> This spec file will help to run Tensorflow on YARN with GPU/Docker



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-8220) Running Tensorflow on YARN with GPU and Docker - Examples

2018-06-02 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan resolved YARN-8220.
--
Resolution: Later

> Running Tensorflow on YARN with GPU and Docker - Examples
> -
>
> Key: YARN-8220
> URL: https://issues.apache.org/jira/browse/YARN-8220
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn-native-services
>Reporter: Sunil Govindan
>Assignee: Sunil Govindan
>Priority: Critical
> Attachments: YARN-8220.001.patch
>
>
> Tensorflow could be run on YARN and could leverage YARN's distributed 
> features.
> This spec file will help to run Tensorflow on YARN with GPU/Docker



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8349) Remove YARN registry entries when a service is killed by the RM

2018-06-01 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8349:
-
Priority: Critical  (was: Major)

> Remove YARN registry entries when a service is killed by the RM
> ---
>
> Key: YARN-8349
> URL: https://issues.apache.org/jira/browse/YARN-8349
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn-native-services
>Affects Versions: 3.2.0, 3.1.1
>Reporter: Shane Kumpf
>Assignee: Billie Rinaldi
>Priority: Critical
> Attachments: YARN-8349.1.patch, YARN-8349.2.patch, YARN-8349.3.patch, 
> YARN-8349.4.patch
>
>
> As the title states, when a service is killed by the RM (for exceeding its 
> lifetime for example), the YARN registry entries should be cleaned up.
> Without cleanup, DNS can contain multiple hostnames for a single IP address 
> in the case where IPs are reused. This impacts reverse lookups, which breaks 
> services, such as kerberos, that depend on those lookups.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8372) Distributed shell app master should not release containers when shutdown if keep-container is true

2018-06-01 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8372:
-
Priority: Critical  (was: Major)

> Distributed shell app master should not release containers when shutdown if 
> keep-container is true
> --
>
> Key: YARN-8372
> URL: https://issues.apache.org/jira/browse/YARN-8372
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: distributed-shell
>Reporter: Charan Hebri
>Assignee: Suma Shivaprasad
>Priority: Critical
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-8372.1.patch, YARN-8372.2.patch, YARN-8372.3.patch
>
>
> {noformat}
> try {
>   response = client.allocate(progress);
> } catch (ApplicationAttemptNotFoundException e) {
> handler.onShutdownRequest();
> LOG.info("Shutdown requested. Stopping callback.");
> return;{noformat}
> is a code snippet from AMRMClientAsyncImpl. The corresponding 
> onShutdownRequest callback in the Distributed Shell App Master is:
> {noformat}
> @Override
> public void onShutdownRequest() {
>   done = true;
> }{noformat}
> Due to the above, the current behavior is that whenever an application 
> attempt fails due to an NM restart (the NM where the DS AM is running), an 
> ApplicationAttemptNotFoundException is thrown and all containers for that 
> attempt, including the ones running on other NMs, are killed by the AM and 
> marked as COMPLETE. The subsequent attempt then spawns new containers just 
> like a fresh attempt. This behavior is different from a MapReduce application, 
> where the containers are not killed.
> cc [~rohithsharma]
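A minimal sketch of the behavior the summary asks for, assuming a keep-containers 
flag is available to the AM. All names below are invented for illustration; this 
is not the attached patch:
{code:java}
// Illustrative sketch only: when keep-containers-across-application-attempts is
// set, a shutdown request should not cause the AM to give up running containers.
class ShutdownHandlerSketch {
  private final boolean keepContainers;   // from the application submission context
  private volatile boolean done = false;  // the existing "stop the AM loop" flag

  ShutdownHandlerSketch(boolean keepContainers) {
    this.keepContainers = keepContainers;
  }

  void onShutdownRequest() {
    if (keepContainers) {
      // A new attempt can take over the running containers, so do not stop the
      // loop in a way that releases them or marks them COMPLETE.
      System.out.println("Shutdown requested; keep-containers enabled, "
          + "leaving running containers for the next attempt.");
    } else {
      done = true;  // current behavior: stop callbacks and clean up containers
    }
  }
}
{code}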



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8372) Distributed shell app master should not release containers when shutdown if keep-container is true

2018-06-01 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8372:
-
Summary: Distributed shell app master should not release containers when 
shutdown if keep-container is true  (was: ApplicationAttemptNotFoundException 
should be handled correctly by Distributed Shell App Master)

> Distributed shell app master should not release containers when shutdown if 
> keep-container is true
> --
>
> Key: YARN-8372
> URL: https://issues.apache.org/jira/browse/YARN-8372
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: distributed-shell
>Reporter: Charan Hebri
>Assignee: Suma Shivaprasad
>Priority: Major
> Attachments: YARN-8372.1.patch, YARN-8372.2.patch, YARN-8372.3.patch
>
>
> {noformat}
> try {
>   response = client.allocate(progress);
> } catch (ApplicationAttemptNotFoundException e) {
> handler.onShutdownRequest();
> LOG.info("Shutdown requested. Stopping callback.");
> return;{noformat}
> is a code snippet from AMRMClientAsyncImpl. The corresponding 
> onShutdownRequest callback in the Distributed Shell App Master is:
> {noformat}
> @Override
> public void onShutdownRequest() {
>   done = true;
> }{noformat}
> Due to the above, the current behavior is that whenever an application 
> attempt fails due to an NM restart (the NM where the DS AM is running), an 
> ApplicationAttemptNotFoundException is thrown and all containers for that 
> attempt, including the ones running on other NMs, are killed by the AM and 
> marked as COMPLETE. The subsequent attempt then spawns new containers just 
> like a fresh attempt. This behavior is different from a MapReduce application, 
> where the containers are not killed.
> cc [~rohithsharma]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7962) Race Condition When Stopping DelegationTokenRenewer causes RM crash during failover

2018-06-01 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-7962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-7962:
-
Summary: Race Condition When Stopping DelegationTokenRenewer causes RM 
crash during failover  (was: Race Condition When Stopping 
DelegationTokenRenewer)

> Race Condition When Stopping DelegationTokenRenewer causes RM crash during 
> failover
> ---
>
> Key: YARN-7962
> URL: https://issues.apache.org/jira/browse/YARN-7962
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.0.0
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Critical
> Attachments: YARN-7962.1.patch, YARN-7962.2.patch, YARN-7962.3.patch, 
> YARN-7962.4.patch, YARN-7962.6.patch, YARN-7962.7.patch
>
>
> [https://github.com/apache/hadoop/blob/69fa81679f59378fd19a2c65db8019393d7c05a2/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/security/DelegationTokenRenewer.java]
> {code:java}
>   private ThreadPoolExecutor renewerService;
>   private void processDelegationTokenRenewerEvent(
>   DelegationTokenRenewerEvent evt) {
> serviceStateLock.readLock().lock();
> try {
>   if (isServiceStarted) {
> renewerService.execute(new DelegationTokenRenewerRunnable(evt));
>   } else {
> pendingEventQueue.add(evt);
>   }
> } finally {
>   serviceStateLock.readLock().unlock();
> }
>   }
>   @Override
>   protected void serviceStop() {
> if (renewalTimer != null) {
>   renewalTimer.cancel();
> }
> appTokens.clear();
> allTokens.clear();
> this.renewerService.shutdown();
> {code}
> {code:java}
> 2018-02-21 11:18:16,253  FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: 
> Error in dispatcher thread
> java.util.concurrent.RejectedExecutionException: Task 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable@39bddaf2
>  rejected from java.util.concurrent.ThreadPoolExecutor@5f71637b[Terminated, 
> pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 15487]
>   at 
> java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2048)
>   at 
> java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:821)
>   at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1372)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.processDelegationTokenRenewerEvent(DelegationTokenRenewer.java:196)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.applicationFinished(DelegationTokenRenewer.java:734)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.finishApplication(RMAppManager.java:199)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.handle(RMAppManager.java:424)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.handle(RMAppManager.java:65)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:177)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:109)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> What I think is going on here is that the {{serviceStop}} method is not 
> setting the {{isServiceStarted}} flag to 'false'.
> Please update so that the {{serviceStop}} method grabs the 
> {{serviceStateLock}} and sets {{isServiceStarted}} to _false_, before 
> shutting down the {{renewerService}} thread pool, to avoid this condition.
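A sketch of the fix suggested above, assuming the same fields shown in the 
snippet; this is illustrative and not necessarily the exact committed patch. The 
idea is to flip {{isServiceStarted}} under the write lock before shutting down 
the executor, so new events are routed to {{pendingEventQueue}} instead of a 
terminated pool:
{code:java}
  @Override
  protected void serviceStop() {
    if (renewalTimer != null) {
      renewalTimer.cancel();
    }
    appTokens.clear();
    allTokens.clear();

    // Flip the flag under the write lock first, so that concurrent
    // processDelegationTokenRenewerEvent() calls queue new events in
    // pendingEventQueue instead of submitting them to a terminated executor.
    serviceStateLock.writeLock().lock();
    try {
      isServiceStarted = false;
    } finally {
      serviceStateLock.writeLock().unlock();
    }

    this.renewerService.shutdown();
  }
{code}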



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-7962) Race Condition When Stopping DelegationTokenRenewer

2018-06-01 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-7962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan reassigned YARN-7962:


Assignee: BELUGA BEHR

> Race Condition When Stopping DelegationTokenRenewer
> ---
>
> Key: YARN-7962
> URL: https://issues.apache.org/jira/browse/YARN-7962
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.0.0
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Critical
> Attachments: YARN-7962.1.patch, YARN-7962.2.patch, YARN-7962.3.patch, 
> YARN-7962.4.patch, YARN-7962.6.patch, YARN-7962.7.patch
>
>
> [https://github.com/apache/hadoop/blob/69fa81679f59378fd19a2c65db8019393d7c05a2/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/security/DelegationTokenRenewer.java]
> {code:java}
>   private ThreadPoolExecutor renewerService;
>   private void processDelegationTokenRenewerEvent(
>   DelegationTokenRenewerEvent evt) {
> serviceStateLock.readLock().lock();
> try {
>   if (isServiceStarted) {
> renewerService.execute(new DelegationTokenRenewerRunnable(evt));
>   } else {
> pendingEventQueue.add(evt);
>   }
> } finally {
>   serviceStateLock.readLock().unlock();
> }
>   }
>   @Override
>   protected void serviceStop() {
> if (renewalTimer != null) {
>   renewalTimer.cancel();
> }
> appTokens.clear();
> allTokens.clear();
> this.renewerService.shutdown();
> {code}
> {code:java}
> 2018-02-21 11:18:16,253  FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: 
> Error in dispatcher thread
> java.util.concurrent.RejectedExecutionException: Task 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable@39bddaf2
>  rejected from java.util.concurrent.ThreadPoolExecutor@5f71637b[Terminated, 
> pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 15487]
>   at 
> java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2048)
>   at 
> java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:821)
>   at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1372)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.processDelegationTokenRenewerEvent(DelegationTokenRenewer.java:196)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.applicationFinished(DelegationTokenRenewer.java:734)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.finishApplication(RMAppManager.java:199)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.handle(RMAppManager.java:424)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.handle(RMAppManager.java:65)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:177)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:109)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> What I think is going on here is that the {{serviceStop}} method is not 
> setting the {{isServiceStarted}} flag to 'false'.
> Please update so that the {{serviceStop}} method grabs the 
> {{serviceStateLock}} and sets {{isServiceStarted}} to _false_, before 
> shutting down the {{renewerService}} thread pool, to avoid this condition.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8384) stdout.txt, stderr.txt logs of a launched docker container is coming with primary group of submit user instead of hadoop

2018-06-01 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8384:
-
Description: 
When {{yarn.nodemanager.linux-container-executor.nonsecure-mode.limit-users}} 
is set to true, and 
{{yarn.nodemanager.linux-container-executor.nonsecure-mode.local-user}} is set 
to nobody.

This will cause the docker to run as nobody:nobody in yarn mode.
The log files will be initialized as nobody:nobody:

{noformat}
rw-rr- 1 nobody hadoop 354 May 31 17:33 container-localizer-syslog
rw-rr- 1 nobody hadoop 1042 May 31 17:35 directory.info
rw-r 1 nobody hadoop 4944 May 31 17:35 launch_container.sh
rw-rr- 1 nobody hadoop 440 May 31 17:35 prelaunch.err
rw-rr- 1 nobody hadoop 100 May 31 17:35 prelaunch.out
rw-r 1 nobody nobody 18733 May 31 17:37 stderr.txt
rw-r 1 nobody nobody 400 May 31 17:35 stdout.txt
{noformat}

This causes the YARN NM to be unable to read stderr.txt and stdout.txt


  was:
When {{yarn.nodemanager.linux-container-executor.nonsecure-mode.limit-users}} 
is set to true, and 
{{yarn.nodemanager.linux-container-executor.nonsecure-mode.local-user}} is set 
to nobody.

This will cause the docker to run as nobody:nobody in yarn mode.
The log files will be initialized as nobody:nobody:

{noformat}
rw-rr- 1 nobody hadoop 354 May 31 17:33 container-localizer-syslog
rw-rr- 1 nobody hadoop 1042 May 31 17:35 directory.info
rw-r 1 nobody hadoop 4944 May 31 17:35 launch_container.sh
rw-rr- 1 nobody hadoop 440 May 31 17:35 prelaunch.err
rw-rr- 1 nobody hadoop 100 May 31 17:35 prelaunch.out
rw-r 1 nobody nobody 18733 May 31 17:37 stderr.txt
rw-r 1 nobody nobody 400 May 31 17:35 stdout.txt
{noformat}




> stdout.txt, stderr.txt logs of a launched docker container is coming with 
> primary group of submit user instead of hadoop
> 
>
> Key: YARN-8384
> URL: https://issues.apache.org/jira/browse/YARN-8384
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn-native-services
>Reporter: Sunil Govindan
>Assignee: Eric Yang
>Priority: Critical
>  Labels: docker
> Attachments: YARN-8384.001.patch
>
>
> When {{yarn.nodemanager.linux-container-executor.nonsecure-mode.limit-users}} 
> is set to true, and 
> {{yarn.nodemanager.linux-container-executor.nonsecure-mode.local-user}} is 
> set to nobody.
> This will cause the docker to run as nobody:nobody in yarn mode.
> The log files will be initialized as nobody:nobody:
> {noformat}
> rw-rr- 1 nobody hadoop 354 May 31 17:33 container-localizer-syslog
> rw-rr- 1 nobody hadoop 1042 May 31 17:35 directory.info
> rw-r 1 nobody hadoop 4944 May 31 17:35 launch_container.sh
> rw-rr- 1 nobody hadoop 440 May 31 17:35 prelaunch.err
> rw-rr- 1 nobody hadoop 100 May 31 17:35 prelaunch.out
> rw-r 1 nobody nobody 18733 May 31 17:37 stderr.txt
> rw-r 1 nobody nobody 400 May 31 17:35 stdout.txt
> {noformat}
> This causes the YARN NM to be unable to read stderr.txt and stdout.txt



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8384) stdout.txt, stderr.txt logs of a launched docker container is coming with primary group of submit user instead of hadoop

2018-06-01 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8384:
-
Summary: stdout.txt, stderr.txt logs of a launched docker container is 
coming with primary group of submit user instead of hadoop  (was: stdout, 
stderr logs of a Native Service container is coming with group as nobody)

> stdout.txt, stderr.txt logs of a launched docker container is coming with 
> primary group of submit user instead of hadoop
> 
>
> Key: YARN-8384
> URL: https://issues.apache.org/jira/browse/YARN-8384
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn-native-services
>Reporter: Sunil Govindan
>Assignee: Eric Yang
>Priority: Critical
>  Labels: docker
> Attachments: YARN-8384.001.patch
>
>
> When {{yarn.nodemanager.linux-container-executor.nonsecure-mode.limit-users}} 
> is set to true, and 
> {{yarn.nodemanager.linux-container-executor.nonsecure-mode.local-user}} is 
> set to nobody.
> This will cause the docker to run as nobody:nobody in yarn mode.
> The log files will be initialized as nobody:nobody:
> {noformat}
> rw-rr- 1 nobody hadoop 354 May 31 17:33 container-localizer-syslog
> rw-rr- 1 nobody hadoop 1042 May 31 17:35 directory.info
> rw-r 1 nobody hadoop 4944 May 31 17:35 launch_container.sh
> rw-rr- 1 nobody hadoop 440 May 31 17:35 prelaunch.err
> rw-rr- 1 nobody hadoop 100 May 31 17:35 prelaunch.out
> rw-r 1 nobody nobody 18733 May 31 17:37 stderr.txt
> rw-r 1 nobody nobody 400 May 31 17:35 stdout.txt
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8372) ApplicationAttemptNotFoundException should be handled correctly by Distributed Shell App Master

2018-06-01 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16498548#comment-16498548
 ] 

Wangda Tan commented on YARN-8372:
--

Patch LGTM, +1. Will commit today if no objections.

> ApplicationAttemptNotFoundException should be handled correctly by 
> Distributed Shell App Master
> ---
>
> Key: YARN-8372
> URL: https://issues.apache.org/jira/browse/YARN-8372
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: distributed-shell
>Reporter: Charan Hebri
>Assignee: Suma Shivaprasad
>Priority: Major
> Attachments: YARN-8372.1.patch, YARN-8372.2.patch, YARN-8372.3.patch
>
>
> {noformat}
> try {
>   response = client.allocate(progress);
> } catch (ApplicationAttemptNotFoundException e) {
> handler.onShutdownRequest();
> LOG.info("Shutdown requested. Stopping callback.");
> return;{noformat}
> is a code snippet from AMRMClientAsyncImpl. The corresponding 
> onShutdownRequest callback in the Distributed Shell App Master is:
> {noformat}
> @Override
> public void onShutdownRequest() {
>   done = true;
> }{noformat}
> Due to the above, the current behavior is that whenever an application 
> attempt fails due to an NM restart (the NM where the DS AM is running), an 
> ApplicationAttemptNotFoundException is thrown and all containers for that 
> attempt, including the ones running on other NMs, are killed by the AM and 
> marked as COMPLETE. The subsequent attempt then spawns new containers just 
> like a fresh attempt. This behavior is different from a MapReduce application, 
> where the containers are not killed.
> cc [~rohithsharma]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8349) Remove YARN registry entries when a service is killed by the RM

2018-06-01 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16498543#comment-16498543
 ] 

Wangda Tan commented on YARN-8349:
--

Thanks [~billie.rinaldi] for the patch, +1. Will commit by today if there are no 
objections.

> Remove YARN registry entries when a service is killed by the RM
> ---
>
> Key: YARN-8349
> URL: https://issues.apache.org/jira/browse/YARN-8349
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn-native-services
>Affects Versions: 3.2.0, 3.1.1
>Reporter: Shane Kumpf
>Assignee: Billie Rinaldi
>Priority: Major
> Attachments: YARN-8349.1.patch, YARN-8349.2.patch, YARN-8349.3.patch, 
> YARN-8349.4.patch
>
>
> As the title states, when a service is killed by the RM (for exceeding its 
> lifetime for example), the YARN registry entries should be cleaned up.
> Without cleanup, DNS can contain multiple hostnames for a single IP address 
> in the case where IPs are reused. This impacts reverse lookups, which breaks 
> services, such as kerberos, that depend on those lookups.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8379) Add an option to allow Capacity Scheduler preemption to balance satisfied queues

2018-06-01 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8379:
-
Description: 
Existing capacity scheduler only supports preemption for an underutilized queue 
to reach its guaranteed resource. In addition to that, there’s an requirement 
to get better balance between queues when all of them reach guaranteed resource 
but with different fairness resource.

An example is, 3 queues with capacity, queue_a = 30%, queue_b = 30%, queue_c = 
40%. At time T. queue_a is using 30%, queue_b is using 70%. Existing scheduler 
preemption won't happen. But this is unfair to queue_b since queue_a has the 
same guaranteed resources.

Before YARN-5864, capacity scheduler do additional preemption to balance 
queues. We changed the logic since it could preempt too many containers between 
queues when all queues are satisfied.

  was:
Existing capacity scheduler only supports preemption for an underutilized queue 
to reach its guaranteed resource. In addition to that, there’s an requirement 
to get better balance between queues when all of them reach guaranteed resource 
but with different fairness resource.

An example is, 3 queues with capacity, queue_a = 30%, queue_b = 30%, queue_c = 
40%. At time T. queue_a is using 30%, queue_b is using 70%. Existing scheduler 
preemption won't happen. But this is unfair to queue_b since queue_b has the 
same guaranteed resources.

Before YARN-5864, capacity scheduler do additional preemption to balance 
queues. We changed the logic since it could preempt too many containers between 
queues when all queues are satisfied.


> Add an option to allow Capacity Scheduler preemption to balance satisfied 
> queues
> 
>
> Key: YARN-8379
> URL: https://issues.apache.org/jira/browse/YARN-8379
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Major
>
> Existing capacity scheduler only supports preemption for an underutilized 
> queue to reach its guaranteed resource. In addition to that, there’s an 
> requirement to get better balance between queues when all of them reach 
> guaranteed resource but with different fairness resource.
> An example is, 3 queues with capacity, queue_a = 30%, queue_b = 30%, queue_c 
> = 40%. At time T. queue_a is using 30%, queue_b is using 70%. Existing 
> scheduler preemption won't happen. But this is unfair to queue_b since 
> queue_a has the same guaranteed resources.
> Before YARN-5864, capacity scheduler do additional preemption to balance 
> queues. We changed the logic since it could preempt too many containers 
> between queues when all queues are satisfied.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8379) Add an option to allow Capacity Scheduler preemption to balance satisfied queues

2018-06-01 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8379:
-
Description: 
Existing capacity scheduler only supports preemption for an underutilized queue 
to reach its guaranteed resource. In addition to that, there’s an requirement 
to get better balance between queues when all of them reach guaranteed resource 
but with different fairness resource.

An example is, 3 queues with capacity, queue_a = 30%, queue_b = 30%, queue_c = 
40%. At time T. queue_a is using 30%, queue_b is using 70%. Existing scheduler 
preemption won't happen. But this is unfair to queue_a since queue_a has the 
same guaranteed resources.

Before YARN-5864, capacity scheduler do additional preemption to balance 
queues. We changed the logic since it could preempt too many containers between 
queues when all queues are satisfied.

  was:
Existing capacity scheduler only supports preemption for an underutilized queue 
to reach its guaranteed resource. In addition to that, there’s an requirement 
to get better balance between queues when all of them reach guaranteed resource 
but with different fairness resource.

An example is, 3 queues with capacity, queue_a = 30%, queue_b = 30%, queue_c = 
40%. At time T. queue_a is using 30%, queue_b is using 70%. Existing scheduler 
preemption won't happen. But this is unfair to queue_b since queue_a has the 
same guaranteed resources.

Before YARN-5864, capacity scheduler do additional preemption to balance 
queues. We changed the logic since it could preempt too many containers between 
queues when all queues are satisfied.


> Add an option to allow Capacity Scheduler preemption to balance satisfied 
> queues
> 
>
> Key: YARN-8379
> URL: https://issues.apache.org/jira/browse/YARN-8379
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Major
>
> The existing capacity scheduler only supports preemption for an underutilized 
> queue to reach its guaranteed resource. In addition to that, there is a 
> requirement to get a better balance between queues when all of them have 
> reached their guaranteed resources but their shares beyond the guarantee differ.
> An example: 3 queues with capacities queue_a = 30%, queue_b = 30%, queue_c 
> = 40%. At time T, queue_a is using 30% and queue_b is using 70%. Existing 
> scheduler preemption won't happen. But this is unfair to queue_a since 
> queue_a has the same guaranteed resources.
> Before YARN-5864, the capacity scheduler did additional preemption to balance 
> queues. We changed the logic because it could preempt too many containers 
> between queues when all queues are satisfied.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8220) Running Tensorflow on YARN with GPU and Docker - Examples

2018-06-01 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16498270#comment-16498270
 ] 

Wangda Tan commented on YARN-8220:
--

Thanks [~eyang] for your comments.

bq. 1. Avoid using bash style launch command 
An entry point is a nice feature for a static command (for example, the default TF 
docker image, which starts a notebook by default: 
https://github.com/tensorflow/tensorflow/tree/r1.8/tensorflow/tools/docker). 
For a training program, since users need to do a lot of hyper-parameter tuning, 
they will keep updating such parameters to make it work.

bq. 2. It might be good to show case some yarnfile features:
We intentionally want to avoid having users specify this; it is a burden for users 
to specify such mounts. Inside submit_tf.py, we use the feature you mentioned.

bq. 3. Downloading source code from individual github contributors might be 
risky and prone to break
This is a good suggestion; I will check whether it is possible to commit the 
example code to a subfolder of this example.

> Running Tensorflow on YARN with GPU and Docker - Examples
> -
>
> Key: YARN-8220
> URL: https://issues.apache.org/jira/browse/YARN-8220
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn-native-services
>Reporter: Sunil Govindan
>Assignee: Sunil Govindan
>Priority: Critical
> Attachments: YARN-8220.001.patch
>
>
> Tensorflow could be run on YARN and could leverage YARN's distributed 
> features.
> This spec file will help to run Tensorflow on YARN with GPU/Docker.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7962) Race Condition When Stopping DelegationTokenRenewer

2018-06-01 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-7962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16498246#comment-16498246
 ] 

Wangda Tan commented on YARN-7962:
--

Thanks [~billie.rinaldi]. To me, the difference between the ver.6 and ver.7 
patches is minimal. Since isServiceStarted is volatile, both patches are safe. 
+1 to the ver.7 patch; I will commit it today if there are no objections.
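
For reference, a minimal self-contained sketch of the ordering change under discussion, using the field names from the snippet quoted below (serviceStateLock, isServiceStarted, renewerService, renewalTimer); everything else is an illustrative assumption, not the committed patch:

{code:java}
import java.util.Timer;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Skeleton only: flip isServiceStarted under the write lock *before* shutting
// the pool down, so a concurrent submit can never hit a terminated executor.
public class TokenRenewerSkeleton {
  private final ReentrantReadWriteLock serviceStateLock = new ReentrantReadWriteLock();
  private volatile boolean isServiceStarted = true;
  private final ThreadPoolExecutor renewerService =
      new ThreadPoolExecutor(1, 5, 3, TimeUnit.SECONDS, new LinkedBlockingQueue<>());
  private final Timer renewalTimer = new Timer(true);

  void processEvent(Runnable evt) {
    serviceStateLock.readLock().lock();
    try {
      if (isServiceStarted) {
        renewerService.execute(evt); // only reachable while the flag is still true
      } // else: the real class queues the event in pendingEventQueue
    } finally {
      serviceStateLock.readLock().unlock();
    }
  }

  void serviceStop() {
    renewalTimer.cancel();
    serviceStateLock.writeLock().lock();
    try {
      isServiceStarted = false; // no new submissions after this point
    } finally {
      serviceStateLock.writeLock().unlock();
    }
    renewerService.shutdown();
  }

  public static void main(String[] args) {
    TokenRenewerSkeleton renewer = new TokenRenewerSkeleton();
    renewer.processEvent(() -> System.out.println("renewed"));
    renewer.serviceStop();
    renewer.processEvent(() -> System.out.println("never runs")); // dropped, no RejectedExecutionException
  }
}
{code}

Whether the flag flip happens before or after clearing the token maps does not matter for the race; what matters is that it happens under the write lock and before {{renewerService.shutdown()}}.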

> Race Condition When Stopping DelegationTokenRenewer
> ---
>
> Key: YARN-7962
> URL: https://issues.apache.org/jira/browse/YARN-7962
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.0.0
>Reporter: BELUGA BEHR
>Priority: Critical
> Attachments: YARN-7962.1.patch, YARN-7962.2.patch, YARN-7962.3.patch, 
> YARN-7962.4.patch, YARN-7962.6.patch, YARN-7962.7.patch
>
>
> [https://github.com/apache/hadoop/blob/69fa81679f59378fd19a2c65db8019393d7c05a2/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/security/DelegationTokenRenewer.java]
> {code:java}
>   private ThreadPoolExecutor renewerService;
>   private void processDelegationTokenRenewerEvent(
>   DelegationTokenRenewerEvent evt) {
> serviceStateLock.readLock().lock();
> try {
>   if (isServiceStarted) {
> renewerService.execute(new DelegationTokenRenewerRunnable(evt));
>   } else {
> pendingEventQueue.add(evt);
>   }
> } finally {
>   serviceStateLock.readLock().unlock();
> }
>   }
>   @Override
>   protected void serviceStop() {
> if (renewalTimer != null) {
>   renewalTimer.cancel();
> }
> appTokens.clear();
> allTokens.clear();
> this.renewerService.shutdown();
> {code}
> {code:java}
> 2018-02-21 11:18:16,253  FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: 
> Error in dispatcher thread
> java.util.concurrent.RejectedExecutionException: Task 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable@39bddaf2
>  rejected from java.util.concurrent.ThreadPoolExecutor@5f71637b[Terminated, 
> pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 15487]
>   at 
> java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2048)
>   at 
> java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:821)
>   at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1372)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.processDelegationTokenRenewerEvent(DelegationTokenRenewer.java:196)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.applicationFinished(DelegationTokenRenewer.java:734)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.finishApplication(RMAppManager.java:199)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.handle(RMAppManager.java:424)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.handle(RMAppManager.java:65)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:177)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:109)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> What I think is going on here is that the {{serviceStop}} method is not 
> setting the {{isServiceStarted}} flag to 'false'.
> Please update so that the {{serviceStop}} method grabs the 
> {{serviceStateLock}} and sets {{isServiceStarted}} to _false_, before 
> shutting down the {{renewerService}} thread pool, to avoid this condition.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8220) Tensorflow yarn spec file to add to native service examples

2018-05-31 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8220:
-
Description: 
Tensorflow could be run on YARN and could leverage YARN's distributed features.

This spec file will help to run Tensorflow on YARN with GPU/Docker.

  was:
Tensorflow could be run on YARN and could leverage YARN's distributed features.

This spec file will help to run TF 1.3/1.4 on YARN with CUDA 8.


> Tensorflow yarn spec file to add to native service examples
> ---
>
> Key: YARN-8220
> URL: https://issues.apache.org/jira/browse/YARN-8220
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn-native-services
>Reporter: Sunil Govindan
>Assignee: Sunil Govindan
>Priority: Critical
>
> Tensorflow could be run on YARN and could leverage YARN's distributed 
> features.
> This spec file will help to run Tensorflow on YARN with GPU/Docker.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8384) stdout, stderr logs of a Native Service container is coming with group as nobody

2018-05-31 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16497406#comment-16497406
 ] 

Wangda Tan commented on YARN-8384:
--

Thanks [~eyang]. I will commit the patch by tomorrow if there are no objections.

> stdout, stderr logs of a Native Service container is coming with group as 
> nobody
> 
>
> Key: YARN-8384
> URL: https://issues.apache.org/jira/browse/YARN-8384
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn-native-services
>Reporter: Sunil Govindan
>Assignee: Eric Yang
>Priority: Critical
>  Labels: docker
> Attachments: YARN-8384.001.patch
>
>
> When {{yarn.nodemanager.linux-container-executor.nonsecure-mode.limit-users}} 
> is set to true and 
> {{yarn.nodemanager.linux-container-executor.nonsecure-mode.local-user}} is 
> set to nobody, the Docker container will run as nobody:nobody in YARN mode.
> The stdout and stderr log files are then initialized as nobody:nobody:
> {noformat}
> -rw-r--r-- 1 nobody hadoop 354 May 31 17:33 container-localizer-syslog
> -rw-r--r-- 1 nobody hadoop 1042 May 31 17:35 directory.info
> -rw-r----- 1 nobody hadoop 4944 May 31 17:35 launch_container.sh
> -rw-r--r-- 1 nobody hadoop 440 May 31 17:35 prelaunch.err
> -rw-r--r-- 1 nobody hadoop 100 May 31 17:35 prelaunch.out
> -rw-r----- 1 nobody nobody 18733 May 31 17:37 stderr.txt
> -rw-r----- 1 nobody nobody 400 May 31 17:35 stdout.txt
> {noformat}
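
For illustration of the end state being asked for (stdout/stderr sharing the same group as the other container logs in the listing above), a hedged sketch follows; the "hadoop" group name and the log directory are assumptions, and the actual fix belongs in the container log initialization path rather than a chown-after-the-fact like this:

{code:java}
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.GroupPrincipal;
import java.nio.file.attribute.PosixFileAttributeView;

// Conceptual sketch only: align the group of stdout.txt/stderr.txt with the
// group used for the other container logs so they can be read for aggregation.
public class FixContainerLogGroup {
  public static void main(String[] args) throws Exception {
    GroupPrincipal group = FileSystems.getDefault()
        .getUserPrincipalLookupService().lookupPrincipalByGroupName("hadoop");
    for (String name : new String[] {"stdout.txt", "stderr.txt"}) {
      Path log = Paths.get("/path/to/container/logs", name); // hypothetical location
      Files.getFileAttributeView(log, PosixFileAttributeView.class).setGroup(group);
    }
  }
}
{code}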



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7962) Race Condition When Stopping DelegationTokenRenewer

2018-05-31 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-7962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16497368#comment-16497368
 ] 

Wangda Tan commented on YARN-7962:
--

The failed tests happened in other JIRAs as well: 
https://builds.apache.org/job/PreCommit-YARN-Build/20853/testReport/ 

If everybody agrees, I will commit the patch by tomorrow.

> Race Condition When Stopping DelegationTokenRenewer
> ---
>
> Key: YARN-7962
> URL: https://issues.apache.org/jira/browse/YARN-7962
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.0.0
>Reporter: BELUGA BEHR
>Priority: Critical
> Attachments: YARN-7962.1.patch, YARN-7962.2.patch, YARN-7962.3.patch, 
> YARN-7962.4.patch, YARN-7962.6.patch
>
>
> [https://github.com/apache/hadoop/blob/69fa81679f59378fd19a2c65db8019393d7c05a2/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/security/DelegationTokenRenewer.java]
> {code:java}
>   private ThreadPoolExecutor renewerService;
>   private void processDelegationTokenRenewerEvent(
>   DelegationTokenRenewerEvent evt) {
> serviceStateLock.readLock().lock();
> try {
>   if (isServiceStarted) {
> renewerService.execute(new DelegationTokenRenewerRunnable(evt));
>   } else {
> pendingEventQueue.add(evt);
>   }
> } finally {
>   serviceStateLock.readLock().unlock();
> }
>   }
>   @Override
>   protected void serviceStop() {
> if (renewalTimer != null) {
>   renewalTimer.cancel();
> }
> appTokens.clear();
> allTokens.clear();
> this.renewerService.shutdown();
> {code}
> {code:java}
> 2018-02-21 11:18:16,253  FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: 
> Error in dispatcher thread
> java.util.concurrent.RejectedExecutionException: Task 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable@39bddaf2
>  rejected from java.util.concurrent.ThreadPoolExecutor@5f71637b[Terminated, 
> pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 15487]
>   at 
> java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2048)
>   at 
> java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:821)
>   at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1372)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.processDelegationTokenRenewerEvent(DelegationTokenRenewer.java:196)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.applicationFinished(DelegationTokenRenewer.java:734)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.finishApplication(RMAppManager.java:199)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.handle(RMAppManager.java:424)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.handle(RMAppManager.java:65)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:177)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:109)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> What I think is going on here is that the {{serviceStop}} method is not 
> setting the {{isServiceStarted}} flag to 'false'.
> Please update so that the {{serviceStop}} method grabs the 
> {{serviceStateLock}} and sets {{isServiceStarted}} to _false_, before 
> shutting down the {{renewerService}} thread pool, to avoid this condition.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8349) Remove YARN registry entries when a service is killed by the RM

2018-05-31 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16497362#comment-16497362
 ] 

Wangda Tan commented on YARN-8349:
--

Gotcha, makes sense to me. Thanks [~billie.rinaldi]!

> Remove YARN registry entries when a service is killed by the RM
> ---
>
> Key: YARN-8349
> URL: https://issues.apache.org/jira/browse/YARN-8349
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn-native-services
>Affects Versions: 3.2.0, 3.1.1
>Reporter: Shane Kumpf
>Assignee: Billie Rinaldi
>Priority: Major
> Attachments: YARN-8349.1.patch, YARN-8349.2.patch
>
>
> As the title states, when a service is killed by the RM (for exceeding its 
> lifetime for example), the YARN registry entries should be cleaned up.
> Without cleanup, DNS can contain multiple hostnames for a single IP address 
> in the case where IPs are reused. This impacts reverse lookups, which breaks 
> services, such as kerberos, that depend on those lookups.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8384) stdout, stderr logs of a Native Service container is coming with group as nobody

2018-05-31 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16497358#comment-16497358
 ] 

Wangda Tan commented on YARN-8384:
--

Thanks [~eyang] for working on this patch.

I discussed this with [~eyang]; the patch looks good and there is no 
backward-incompatible change.

[~eyang], have you done any verification of the patch? I will commit the patch if 
it is verified.

> stdout, stderr logs of a Native Service container is coming with group as 
> nobody
> 
>
> Key: YARN-8384
> URL: https://issues.apache.org/jira/browse/YARN-8384
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn-native-services
>Reporter: Sunil Govindan
>Assignee: Eric Yang
>Priority: Critical
>  Labels: docker
> Attachments: YARN-8384.001.patch
>
>
> When {{yarn.nodemanager.linux-container-executor.nonsecure-mode.limit-users}} 
> is set to true and 
> {{yarn.nodemanager.linux-container-executor.nonsecure-mode.local-user}} is 
> set to nobody, the Docker container will run as nobody:nobody in YARN mode.
> The stdout and stderr log files are then initialized as nobody:nobody:
> {noformat}
> -rw-r--r-- 1 nobody hadoop 354 May 31 17:33 container-localizer-syslog
> -rw-r--r-- 1 nobody hadoop 1042 May 31 17:35 directory.info
> -rw-r----- 1 nobody hadoop 4944 May 31 17:35 launch_container.sh
> -rw-r--r-- 1 nobody hadoop 440 May 31 17:35 prelaunch.err
> -rw-r--r-- 1 nobody hadoop 100 May 31 17:35 prelaunch.out
> -rw-r----- 1 nobody nobody 18733 May 31 17:37 stderr.txt
> -rw-r----- 1 nobody nobody 400 May 31 17:35 stdout.txt
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org


