[jira] [Updated] (YARN-10354) deadlock in ContainerMetrics and MetricsSystemImpl

2020-07-16 Thread Lee young gon (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lee young gon updated YARN-10354:
-
Description: 
Could not get JMX information from the NodeManager, and I found a deadlock through a thread dump.

Below are the deadlocked threads.
{code:java}
"Timer for 'NodeManager' metrics system" - Thread t@42
   java.lang.Thread.State: BLOCKED
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainerMetrics.getMetrics(ContainerMetrics.java:235)
- waiting to lock <7668d6f0> (a 
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainerMetrics)
 owned by "NM ContainerManager dispatcher" t@299
at 
org.apache.hadoop.metrics2.impl.MetricsSourceAdapter.getMetrics(MetricsSourceAdapter.java:200)
at 
org.apache.hadoop.metrics2.impl.MetricsSystemImpl.snapshotMetrics(MetricsSystemImpl.java:419)
at 
org.apache.hadoop.metrics2.impl.MetricsSystemImpl.sampleMetrics(MetricsSystemImpl.java:406)
- locked <3b956878> (a 
org.apache.hadoop.metrics2.impl.MetricsSystemImpl)
at 
org.apache.hadoop.metrics2.impl.MetricsSystemImpl.onTimerEvent(MetricsSystemImpl.java:381)
- locked <3b956878> (a 
org.apache.hadoop.metrics2.impl.MetricsSystemImpl)
at 
org.apache.hadoop.metrics2.impl.MetricsSystemImpl$4.run(MetricsSystemImpl.java:368)
at java.util.TimerThread.mainLoop(Timer.java:555)
at java.util.TimerThread.run(Timer.java:505)   Locked ownable 
synchronizers:
- None



"NM ContainerManager dispatcher" - Thread t@299
   java.lang.Thread.State: BLOCKED
at 
org.apache.hadoop.metrics2.impl.MetricsSystemImpl.unregisterSource(MetricsSystemImpl.java:247)
- waiting to lock <3b956878> (a 
org.apache.hadoop.metrics2.impl.MetricsSystemImpl) owned by "Timer for 
'NodeManager' metrics system" t@42
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainerMetrics.unregisterContainerMetrics(ContainerMetrics.java:228)
- locked <4e31c3ec> (a java.lang.Class)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainerMetrics.finished(ContainerMetrics.java:255)
- locked <7668d6f0> (a 
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainerMetrics)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl.updateContainerMetrics(ContainersMonitorImpl.java:813)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl.onStopMonitoringContainer(ContainersMonitorImpl.java:935)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl.handle(ContainersMonitorImpl.java:900)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl.handle(ContainersMonitorImpl.java:57)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
at java.lang.Thread.run(Thread.java:745)   Locked ownable synchronizers:
- None


{code}
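For reference, the dump shows a plain lock-ordering inversion: the metrics timer thread holds the MetricsSystemImpl monitor (<3b956878>) and waits for the ContainerMetrics instance monitor (<7668d6f0>) in getMetrics(), while the dispatcher thread holds the ContainerMetrics instance monitor in finished() and waits for the MetricsSystemImpl monitor in unregisterSource(). The stand-alone sketch below is not the Hadoop code; the two lock objects and the comments are only analogies, but it acquires the same two locks in opposite orders and hangs the same way:
{code:java}
// Minimal AB/BA deadlock sketch (illustrative only, not the YARN classes).
public class LockOrderDeadlockSketch {
  private static final Object metricsSystemLock = new Object();    // stands in for the MetricsSystemImpl monitor
  private static final Object containerMetricsLock = new Object(); // stands in for one ContainerMetrics instance

  public static void main(String[] args) {
    Thread timer = new Thread(() -> {
      synchronized (metricsSystemLock) {        // like onTimerEvent()/sampleMetrics()
        sleepQuietly(100);
        synchronized (containerMetricsLock) {   // like ContainerMetrics.getMetrics()
          System.out.println("snapshot done");
        }
      }
    }, "metrics-timer");

    Thread dispatcher = new Thread(() -> {
      synchronized (containerMetricsLock) {     // like ContainerMetrics.finished()
        sleepQuietly(100);
        synchronized (metricsSystemLock) {      // like MetricsSystemImpl.unregisterSource()
          System.out.println("unregister done");
        }
      }
    }, "container-dispatcher");

    timer.start();
    dispatcher.start();  // with the sleeps, each thread ends up waiting on the lock the other holds
  }

  private static void sleepQuietly(long ms) {
    try { Thread.sleep(ms); } catch (InterruptedException ignored) { }
  }
}
{code}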
 

 

[jira] [Created] (YARN-10354) deadlock in ContainerMetrics and MetricsSystemImpl

2020-07-16 Thread Lee young gon (Jira)
Lee young gon created YARN-10354:


 Summary: deadlock in ContainerMetrics and MetricsSystemImpl
 Key: YARN-10354
 URL: https://issues.apache.org/jira/browse/YARN-10354
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
 Environment: hadoop 3.1.2
Reporter: Lee young gon
 Attachments: full_thread_dump.txt

Could not get JMX information from the NodeManager, and I found a deadlock through a thread dump.

Below are the deadlocked threads.
{code:java}
"Timer for 'NodeManager' metrics system" - Thread t@42
   java.lang.Thread.State: BLOCKED
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainerMetrics.getMetrics(ContainerMetrics.java:235)
- waiting to lock <7668d6f0> (a 
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainerMetrics)
 owned by "NM ContainerManager dispatcher" t@299
at 
org.apache.hadoop.metrics2.impl.MetricsSourceAdapter.getMetrics(MetricsSourceAdapter.java:200)
at 
org.apache.hadoop.metrics2.impl.MetricsSystemImpl.snapshotMetrics(MetricsSystemImpl.java:419)
at 
org.apache.hadoop.metrics2.impl.MetricsSystemImpl.sampleMetrics(MetricsSystemImpl.java:406)
- locked <3b956878> (a 
org.apache.hadoop.metrics2.impl.MetricsSystemImpl)
at 
org.apache.hadoop.metrics2.impl.MetricsSystemImpl.onTimerEvent(MetricsSystemImpl.java:381)
- locked <3b956878> (a 
org.apache.hadoop.metrics2.impl.MetricsSystemImpl)
at 
org.apache.hadoop.metrics2.impl.MetricsSystemImpl$4.run(MetricsSystemImpl.java:368)
at java.util.TimerThread.mainLoop(Timer.java:555)
at java.util.TimerThread.run(Timer.java:505)   Locked ownable 
synchronizers:
- None
"NM ContainerManager dispatcher" - Thread t@299
   java.lang.Thread.State: BLOCKED
at 
org.apache.hadoop.metrics2.impl.MetricsSystemImpl.unregisterSource(MetricsSystemImpl.java:247)
- waiting to lock <3b956878> (a 
org.apache.hadoop.metrics2.impl.MetricsSystemImpl) owned by "Timer for 
'NodeManager' metrics system" t@42
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainerMetrics.unregisterContainerMetrics(ContainerMetrics.java:228)
- locked <4e31c3ec> (a java.lang.Class)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainerMetrics.finished(ContainerMetrics.java:255)
- locked <7668d6f0> (a 
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainerMetrics)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl.updateContainerMetrics(ContainersMonitorImpl.java:813)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl.onStopMonitoringContainer(ContainersMonitorImpl.java:935)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl.handle(ContainersMonitorImpl.java:900)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl.handle(ContainersMonitorImpl.java:57)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
at java.lang.Thread.run(Thread.java:745)   Locked ownable synchronizers:
- None

{code}
 

 






[jira] [Commented] (YARN-10343) Legacy RM UI should include labeled metrics for allocated, total, and reserved resources.

2020-07-16 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17159584#comment-17159584
 ] 

Hadoop QA commented on YARN-10343:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  1m 
31s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  
0s{color} | {color:green} No case conflicting files found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 22m 
37s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
47s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
36s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
49s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
17m  1s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
31s{color} | {color:green} trunk passed {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue}  1m 
41s{color} | {color:blue} Used deprecated FindBugs config; considering 
switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
39s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
46s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
42s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
42s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 31s{color} | {color:orange} 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:
 The patch generated 1 new + 17 unchanged - 0 fixed = 18 total (was 17) {color} 
|
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
45s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
15m 14s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
28s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
43s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 93m  2s{color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
34s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}159m 11s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | 
hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairSchedulerPreemption |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | ClientAPI=1.40 ServerAPI=1.40 base: 
https://builds.apache.org/job/PreCommit-YARN-Build/26278/artifact/out/Dockerfile
 |
| JIRA Issue | YARN-10343 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/13007813/YARN-10343.001.patch |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite 
unit shadedclient findbugs checkstyle |
| uname | Linux 30db49811734 4.15.0-101-generic #102-Ubuntu SMP Mon May 11 
10:07:26 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | personality/hadoop.sh |
| git revision | trunk / cc71d50b219 |
| Default Java | Private Build-1.8.0_252-8u252-b09-1~18.04-b09 |
| 

[jira] [Commented] (YARN-10353) Log vcores used and cumulative cpu in containers monitor

2020-07-16 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17159543#comment-17159543
 ] 

Hadoop QA commented on YARN-10353:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  1m 
56s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  
0s{color} | {color:green} No case conflicting files found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 23m 
 3s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
8s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
26s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
40s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
16m 39s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
27s{color} | {color:green} trunk passed {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue}  1m 
21s{color} | {color:blue} Used deprecated FindBugs config; considering 
switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
19s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
37s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
1s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m  
1s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
21s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
35s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
16m 26s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
23s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
25s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 22m  
5s{color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed. 
{color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
33s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 89m 24s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | ClientAPI=1.40 ServerAPI=1.40 base: 
https://builds.apache.org/job/PreCommit-YARN-Build/26277/artifact/out/Dockerfile
 |
| JIRA Issue | YARN-10353 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/13007810/YARN-10353.001.patch |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite 
unit shadedclient findbugs checkstyle |
| uname | Linux c239910f5afc 4.15.0-101-generic #102-Ubuntu SMP Mon May 11 
10:07:26 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | personality/hadoop.sh |
| git revision | trunk / cc71d50b219 |
| Default Java | Private Build-1.8.0_252-8u252-b09-1~18.04-b09 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/26277/testReport/ |
| Max. process+thread count | 332 (vs. ulimit of 5500) |
| 

[jira] [Commented] (YARN-10343) Legacy RM UI should include labeled metrics for allocated, total, and reserved resources.

2020-07-16 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17159527#comment-17159527
 ] 

Eric Payne commented on YARN-10343:
---

Thanks again [~Jim_Brennan] and [~jhung] for your reviews.

I have attached patch 001 for trunk.
- This patch renamed the MB variables to indicate they are bytes, not megabytes.
- I left the code as it is wrt subtracting reserved resources from used 
resources:
{code:title=MetricsOverviewTable#render}
+  usedMemoryBytes -= reservedMemoryBytes;
+  usedVCores -= reservedVCores;
{code}
-- I didn't want to make changes to the various allocateResource paths that 
call {{ResourceUsage#incUsed}}.
- I was able to use the root queue's {{ParentQueue#getNumContainers}} to get 
all labeled and non-labeled containers. However, it also includes reserved 
containers. In this case, it may be fine since there is not a separate 
"Reserved Containers" field.

> Legacy RM UI should include labeled metrics for allocated, total, and 
> reserved resources.
> -
>
> Key: YARN-10343
> URL: https://issues.apache.org/jira/browse/YARN-10343
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 2.10.0, 3.2.1, 3.1.3
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Major
> Attachments: Screen Shot 2020-07-07 at 1.00.22 PM.png, Screen Shot 
> 2020-07-07 at 1.03.26 PM.png, YARN-10343.000.patch, YARN-10343.001.patch
>
>
> The current legacy RM UI only includes resource metrics for the default 
> partition. If a cluster has labeled nodes, those are not included in the 
> resource metrics for allocated, total, and reserved resources.






[jira] [Updated] (YARN-10343) Legacy RM UI should include labeled metrics for allocated, total, and reserved resources.

2020-07-16 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-10343:
--
Attachment: YARN-10343.001.patch

> Legacy RM UI should include labeled metrics for allocated, total, and 
> reserved resources.
> -
>
> Key: YARN-10343
> URL: https://issues.apache.org/jira/browse/YARN-10343
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 2.10.0, 3.2.1, 3.1.3
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Major
> Attachments: Screen Shot 2020-07-07 at 1.00.22 PM.png, Screen Shot 
> 2020-07-07 at 1.03.26 PM.png, YARN-10343.000.patch, YARN-10343.001.patch
>
>
> The current legacy RM UI only includes resource metrics for the default 
> partition. If a cluster has labeled nodes, those are not included in the 
> resource metrics for allocated, total, and reserved resources.






[jira] [Created] (YARN-10353) Log vcores used and cumulative cpu in containers monitor

2020-07-16 Thread Jim Brennan (Jira)
Jim Brennan created YARN-10353:
--

 Summary: Log vcores used and cumulative cpu in containers monitor
 Key: YARN-10353
 URL: https://issues.apache.org/jira/browse/YARN-10353
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: yarn
Affects Versions: 3.4.0
Reporter: Jim Brennan
Assignee: Jim Brennan


We currently log the percentage/cpu and percentage/cpus-used-by-yarn in the 
Containers Monitor log. It would be useful to also log vcores used vs vcores 
assigned, and total accumulated CPU time.

For example, currently we have an audit log that looks like this:
{noformat}
2020-07-16 20:33:51,550 DEBUG [Container Monitor] ContainersMonitorImpl.audit 
(ContainersMonitorImpl.java:recordUsage(651)) - Resource usage of ProcessTree 
809 for container-id container_1594931466123_0002_01_07: 309.5 MB of 2 GB 
physical memory used; 2.8 GB of 4.2 GB virtual memory used CPU:143.0905 
CPU/core:35.772625
{noformat}
The proposal is to add two more fields to show vCores and Cumulative CPU ms:
{noformat}
2020-07-16 20:33:51,550 DEBUG [Container Monitor] ContainersMonitorImpl.audit 
(ContainersMonitorImpl.java:recordUsage(651)) - Resource usage of ProcessTree 
809 for container-id container_1594931466123_0002_01_07: 309.5 MB of 2 GB 
physical memory used; 2.8 GB of 4.2 GB virtual memory used CPU:143.0905 
CPU/core:35.772625 vCores:2/1 CPU-ms:4180
{noformat}
This is a snippet of a log from one of our clusters running branch-2.8 with a 
similar change.
{noformat}
2020-07-16 21:00:02,240 [Container Monitor] DEBUG ContainersMonitorImpl.audit: 
Memory usage of ProcessTree 5267 for container-id 
container_e04_1594079801456_1397450_01_001992: 1.6 GB of 2.5 GB physical memory 
used; 3.8 GB of 5.3 GB virtual memory used. CPU usage: 18 of 10 CPU vCores 
used. Cumulative CPU time: 157410
2020-07-16 21:00:02,269 [Container Monitor] DEBUG ContainersMonitorImpl.audit: 
Memory usage of ProcessTree 18801 for container-id 
container_e04_1594079801456_1390375_01_19: 413.2 MB of 2.5 GB physical 
memory used; 3.8 GB of 5.3 GB virtual memory used. CPU usage: 0 of 10 CPU 
vCores used. Cumulative CPU time: 113830
2020-07-16 21:00:02,298 [Container Monitor] DEBUG ContainersMonitorImpl.audit: 
Memory usage of ProcessTree 5279 for container-id 
container_e04_1594079801456_1397450_01_001991: 2.2 GB of 2.5 GB physical memory 
used; 3.8 GB of 5.3 GB virtual memory used. CPU usage: 17 of 10 CPU vCores 
used. Cumulative CPU time: 128630
2020-07-16 21:00:02,339 [Container Monitor] DEBUG ContainersMonitorImpl.audit: 
Memory usage of ProcessTree 24189 for container-id 
container_e04_1594079801456_1390430_01_000415: 392.7 MB of 2.5 GB physical 
memory used; 3.8 GB of 5.3 GB virtual memory used. CPU usage: 0 of 10 CPU 
vCores used. Cumulative CPU time: 96060
2020-07-16 21:00:02,367 [Container Monitor] DEBUG ContainersMonitorImpl.audit: 
Memory usage of ProcessTree 6751 for container-id 
container_e04_1594079801456_1397923_01_003248: 1.3 GB of 3 GB physical memory 
used; 4.3 GB of 6.3 GB virtual memory used. CPU usage: 12 of 10 CPU vCores 
used. Cumulative CPU time: 116820
2020-07-16 21:00:02,396 [Container Monitor] DEBUG ContainersMonitorImpl.audit: 
Memory usage of ProcessTree 12138 for container-id 
container_e04_1594079801456_1397760_01_44: 4.4 GB of 6 GB physical memory 
used; 6.9 GB of 12.6 GB virtual memory used. CPU usage: 15 of 10 CPU vCores 
used. Cumulative CPU time: 45900
2020-07-16 21:00:02,424 [Container Monitor] DEBUG ContainersMonitorImpl.audit: 
Memory usage of ProcessTree 101918 for container-id 
container_e04_1594079801456_1391130_01_002378: 2.4 GB of 4 GB physical memory 
used; 5.8 GB of 8.4 GB virtual memory used. CPU usage: 13 of 10 CPU vCores 
used. Cumulative CPU time: 2572390
2020-07-16 21:00:02,456 [Container Monitor] DEBUG ContainersMonitorImpl.audit: 
Memory usage of ProcessTree 26596 for container-id 
container_e04_1594079801456_1390446_01_000665: 418.6 MB of 2.5 GB physical 
memory used; 3.8 GB of 5.3 GB virtual memory used. CPU usage: 0 of 10 CPU 
vCores used. Cumulative CPU time: 101210
{noformat}
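To make the proposal concrete, here is a small, self-contained sketch of how the two extra fields could be formatted onto the existing audit line. The variable names (vcoresUsed, vcoresAllocated, cumulativeCpuMs) are illustrative assumptions, not the actual ContainersMonitorImpl fields; the real values would come from the container's process-tree accounting.
{code:java}
// Illustrative formatting only; field names are assumptions, not the patch itself.
public class AuditLineSketch {
  public static void main(String[] args) {
    int vcoresUsed = 2;           // vcores the container is actually consuming
    int vcoresAllocated = 1;      // vcores assigned to the container
    long cumulativeCpuMs = 4180;  // total CPU time consumed so far, in milliseconds

    String suffix = String.format("vCores:%d/%d CPU-ms:%d",
        vcoresUsed, vcoresAllocated, cumulativeCpuMs);

    // Appended to the existing audit message, this yields the proposed format:
    System.out.println("... physical memory used; ... virtual memory used"
        + " CPU:143.0905 CPU/core:35.772625 " + suffix);
  }
}
{code}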






[jira] [Commented] (YARN-10339) Timeline Client in Nodemanager gets 403 errors when simple auth is used in kerberos environments

2020-07-16 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17159398#comment-17159398
 ] 

Hudson commented on YARN-10339:
---

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #18445 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/18445/])
YARN-10339. Fix TimelineClient in NodeManager failing when Simple Http 
(pjoseph: rev cc71d50b219c1cc682b4185ea739b485e519501f)
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/main/java/org/apache/hadoop/yarn/client/api/impl/YarnClientImpl.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestTimelineClient.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/impl/TimelineClientImpl.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/impl/TimelineConnector.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/security/TestTimelineAuthenticationFilterForV1.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestYarnClient.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/test/java/org/apache/hadoop/yarn/server/timeline/security/TestTimelineAuthenticationFilterInitializer.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/src/test/java/org/apache/hadoop/yarn/server/timelineservice/security/TestTimelineAuthFilterForV2.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/timeline/security/TimelineAuthenticationFilterInitializer.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestYarnClientImpl.java


> Timeline Client in Nodemanager gets 403 errors when simple auth is used in 
> kerberos environments
> 
>
> Key: YARN-10339
> URL: https://issues.apache.org/jira/browse/YARN-10339
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineclient
>Affects Versions: 3.1.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10339.001.patch, YARN-10339.002.patch
>
>
> We get below errors in NodeManager logs whenever we set 
> yarn.timeline-service.http-authentication.type=simple in a cluster which has 
> kerberos enabled. There are use cases where simple auth is used only in 
> timeline server for convenience although kerberos is enabled.
> {code:java}
> 2020-05-20 20:06:30,181 ERROR impl.TimelineV2ClientImpl 
> (TimelineV2ClientImpl.java:putObjects(321)) - Response from the timeline 
> server is not successful, HTTP error code: 403, Server response:
> {"exception":"ForbiddenException","message":"java.lang.Exception: The owner 
> of the posted timeline entities is not 
> set","javaClassName":"org.apache.hadoop.yarn.webapp.ForbiddenException"}
> {code}
> This seems to affect the NM timeline publisher which uses 
> TimelineV2ClientImpl. Doing a simple auth directly to timeline service via 
> curl works fine. So this issue is in the authenticator configuration in 
> timeline client.






[jira] [Commented] (YARN-10339) Timeline Client in Nodemanager gets 403 errors when simple auth is used in kerberos environments

2020-07-16 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17159389#comment-17159389
 ] 

Prabhu Joseph commented on YARN-10339:
--

Have committed the  [^YARN-10339.002.patch]  to trunk. Will resolve this Jira.

> Timeline Client in Nodemanager gets 403 errors when simple auth is used in 
> kerberos environments
> 
>
> Key: YARN-10339
> URL: https://issues.apache.org/jira/browse/YARN-10339
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineclient
>Affects Versions: 3.1.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Attachments: YARN-10339.001.patch, YARN-10339.002.patch
>
>
> We get below errors in NodeManager logs whenever we set 
> yarn.timeline-service.http-authentication.type=simple in a cluster which has 
> kerberos enabled. There are use cases where simple auth is used only in 
> timeline server for convenience although kerberos is enabled.
> {code:java}
> 2020-05-20 20:06:30,181 ERROR impl.TimelineV2ClientImpl 
> (TimelineV2ClientImpl.java:putObjects(321)) - Response from the timeline 
> server is not successful, HTTP error code: 403, Server response:
> {"exception":"ForbiddenException","message":"java.lang.Exception: The owner 
> of the posted timeline entities is not 
> set","javaClassName":"org.apache.hadoop.yarn.webapp.ForbiddenException"}
> {code}
> This seems to affect the NM timeline publisher which uses 
> TimelineV2ClientImpl. Doing a simple auth directly to timeline service via 
> curl works fine. So this issue is in the authenticator configuration in 
> timeline client.






[jira] [Commented] (YARN-10339) Timeline Client in Nodemanager gets 403 errors when simple auth is used in kerberos environments

2020-07-16 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17159383#comment-17159383
 ] 

Prabhu Joseph commented on YARN-10339:
--

Thanks [~tarunparimi] for the patch.

+1, will commit it shortly.

> Timeline Client in Nodemanager gets 403 errors when simple auth is used in 
> kerberos environments
> 
>
> Key: YARN-10339
> URL: https://issues.apache.org/jira/browse/YARN-10339
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineclient
>Affects Versions: 3.1.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Attachments: YARN-10339.001.patch, YARN-10339.002.patch
>
>
> We get below errors in NodeManager logs whenever we set 
> yarn.timeline-service.http-authentication.type=simple in a cluster which has 
> kerberos enabled. There are use cases where simple auth is used only in 
> timeline server for convenience although kerberos is enabled.
> {code:java}
> 2020-05-20 20:06:30,181 ERROR impl.TimelineV2ClientImpl 
> (TimelineV2ClientImpl.java:putObjects(321)) - Response from the timeline 
> server is not successful, HTTP error code: 403, Server response:
> {"exception":"ForbiddenException","message":"java.lang.Exception: The owner 
> of the posted timeline entities is not 
> set","javaClassName":"org.apache.hadoop.yarn.webapp.ForbiddenException"}
> {code}
> This seems to affect the NM timeline publisher which uses 
> TimelineV2ClientImpl. Doing a simple auth directly to timeline service via 
> curl works fine. So this issue is in the authenticator configuration in 
> timeline client.






[jira] [Comment Edited] (YARN-1741) XInclude support broken for YARN ResourceManager

2020-07-16 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-1741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17159326#comment-17159326
 ] 

Jim Brennan edited comment on YARN-1741 at 7/16/20, 4:10 PM:
-

While reviewing changes we made for our internal branch-2.8, I came across this 
one.
We have an internal fix for this that we have been running with for quite some 
time.   I have verified that this is still broken in branch-2.8 - if I remove 
our internal fix, the resourcemanager still fails to load xi:include files.

We are moving to branch-2.10 internally, so I was mainly interested in 
determining if we still needed this internal change.  As far as I can tell, it 
is not needed in branch-2.9 or later.   I believe this is mainly due to these 
changes: [HADOOP-14216], [HADOOP-14399], [HADOOP-15973], which are all included 
in branch-2.9 and later.

Since branch-2.8 is EOL, I propose that we close this as Won't Fix.  Although 
if there is interest, I can put up a patch for branch-2.8.




was (Author: jim_brennan):
While reviewing changes we made for our internal branch-2.8, I came across this 
one.
We have an internal fix for this that we have been running with for quite some 
time.   I have verified that this is still broken in branch-2.8 - if I remove 
our internal fix, the resourcemanager still fails to load xi:include files.

We are moving to branch-2.10 internally, so I was mainly interested in 
determining if we still needed this internal change.  As far as I can tell, it 
is not needed in branch-2.10 or later.   I believe this is mainly due to these 
changes: [HADOOP-14216], [HADOOP-14399], [HADOOP-15973], which are all included 
in branch-2.9 and later.

Since branch-2.8 is EOL, I propose that we close this as Won't Fix.  Although 
if there is interest, I can put up a patch for branch-2.8.



> XInclude support broken for YARN ResourceManager
> 
>
> Key: YARN-1741
> URL: https://issues.apache.org/jira/browse/YARN-1741
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Eric Sirianni
>Assignee: Xuan Gong
>Priority: Critical
>  Labels: regression
>
> The XInclude support in Hadoop configuration files (introduced via 
> HADOOP-4944) was broken by the recent {{ConfigurationProvider}} changes to 
> YARN ResourceManager.  Specifically, YARN-1459 and, more generally, the 
> YARN-1611 family of JIRAs for ResourceManager HA.
> The issue is that {{ConfigurationProvider}} provides a raw {{InputStream}} as 
> a {{Configuration}} resource for what was previously a {{Path}}-based 
> resource.  
> For {{Path}} resources, the absolute file path is used as the {{systemId}} 
> for the {{DocumentBuilder.parse()}} call:
> {code}
>   } else if (resource instanceof Path) {  // a file resource
> ...
>   doc = parse(builder, new BufferedInputStream(
>   new FileInputStream(file)), ((Path)resource).toString());
> }
> {code}
> The {{systemId}} is used to resolve XIncludes (among other things):
> {code}
> /**
>  * Parse the content of the given InputStream as an
>  * XML document and return a new DOM Document object.
> ...
>  * @param systemId Provide a base for resolving relative URIs.
> ...
>  */
> public Document parse(InputStream is, String systemId)
> {code}
> However, for loading raw {{InputStream}} resources, the {{systemId}} is set 
> to {{null}}:
> {code}
>   } else if (resource instanceof InputStream) {
> doc = parse(builder, (InputStream) resource, null);
> {code}
> causing XInclude resolution to fail.
> In our particular environment, we make extensive use of XIncludes to 
> standardize common configuration parameters across multiple Hadoop clusters.
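
For context, a minimal stand-alone sketch (not the Hadoop Configuration code, and not a proposed patch) of the JDK behavior described above: with XInclude processing enabled, passing the file's path as the systemId gives the parser a base URI for resolving relative xi:include hrefs, whereas parse(in, null) leaves it with nothing to resolve against. The config file path used here is hypothetical.
{code:java}
import java.io.FileInputStream;
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class XIncludeSystemIdSketch {
  public static void main(String[] args) throws Exception {
    DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
    dbf.setNamespaceAware(true);
    dbf.setXIncludeAware(true);  // process xi:include elements while parsing

    DocumentBuilder builder = dbf.newDocumentBuilder();
    String path = "/etc/hadoop/conf/yarn-site.xml";  // hypothetical config file

    try (InputStream in = new FileInputStream(path)) {
      // The second argument is the systemId: the base URI used to resolve
      // relative xi:include hrefs. Passing null here is what breaks XInclude.
      Document doc = builder.parse(in, path);
      System.out.println("root element: " + doc.getDocumentElement().getTagName());
    }
  }
}
{code}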






[jira] [Commented] (YARN-1741) XInclude support broken for YARN ResourceManager

2020-07-16 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-1741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17159326#comment-17159326
 ] 

Jim Brennan commented on YARN-1741:
---

While reviewing changes we made for our internal branch-2.8, I came across this 
one.
We have an internal fix for this that we have been running with for quite some 
time.   I have verified that this is still broken in branch-2.8 - if I remove 
our internal fix, the resourcemanager still fails to load xi:include files.

We are moving to branch-2.10 internally, so I was mainly interested in 
determining if we still needed this internal change.  As far as I can tell, it 
is not needed in branch-2.10 or later.   I believe this is mainly due to these 
changes: [HADOOP-14216], [HADOOP-14399], [HADOOP-15973], which are all included 
in branch-2.9 and later.

Since branch-2.8 is EOL, I propose that we close this as Won't Fix.  Although 
if there is interest, I can put up a patch for branch-2.8.



> XInclude support broken for YARN ResourceManager
> 
>
> Key: YARN-1741
> URL: https://issues.apache.org/jira/browse/YARN-1741
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Eric Sirianni
>Assignee: Xuan Gong
>Priority: Critical
>  Labels: regression
>
> The XInclude support in Hadoop configuration files (introduced via 
> HADOOP-4944) was broken by the recent {{ConfigurationProvider}} changes to 
> YARN ResourceManager.  Specifically, YARN-1459 and, more generally, the 
> YARN-1611 family of JIRAs for ResourceManager HA.
> The issue is that {{ConfigurationProvider}} provides a raw {{InputStream}} as 
> a {{Configuration}} resource for what was previously a {{Path}}-based 
> resource.  
> For {{Path}} resources, the absolute file path is used as the {{systemId}} 
> for the {{DocumentBuilder.parse()}} call:
> {code}
>   } else if (resource instanceof Path) {  // a file resource
> ...
>   doc = parse(builder, new BufferedInputStream(
>   new FileInputStream(file)), ((Path)resource).toString());
> }
> {code}
> The {{systemId}} is used to resolve XIncludes (among other things):
> {code}
> /**
>  * Parse the content of the given InputStream as an
>  * XML document and return a new DOM Document object.
> ...
>  * @param systemId Provide a base for resolving relative URIs.
> ...
>  */
> public Document parse(InputStream is, String systemId)
> {code}
> However, for loading raw {{InputStream}} resources, the {{systemId}} is set 
> to {{null}}:
> {code}
>   } else if (resource instanceof InputStream) {
> doc = parse(builder, (InputStream) resource, null);
> {code}
> causing XInclude resolution to fail.
> In our particular environment, we make extensive use of XIncludes to 
> standardize common configuration parameters across multiple Hadoop clusters.






[jira] [Updated] (YARN-10352) MultiNode Placement assigns container on stopped NodeManagers

2020-07-16 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-10352:
-
Summary: MultiNode Placement assigns container on stopped NodeManagers  
(was: MultiNode Placament assigns container on stopped NodeManagers)

> MultiNode Placement assigns container on stopped NodeManagers
> -
>
> Key: YARN-10352
> URL: https://issues.apache.org/jira/browse/YARN-10352
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>  Labels: capacityscheduler, multi-node-placement
>
> When node recovery is enabled, stopping an NM does not unregister it from the RM,
> so the RM's active node list still contains the stopped nodes until the NM
> liveness monitor expires them after the configured timeout
> (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During those 10 minutes,
> multi-node placement still assigns containers to those nodes. Placement should
> exclude nodes that have not heartbeated within the configured heartbeat interval
> (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms = 1000 ms), similar to
> the asynchronous CapacityScheduler threads (CapacityScheduler#shouldSkipNodeSchedule).
> *Repro:*
> 1. Enable multi-node placement (yarn.scheduler.capacity.multi-node-placement-enabled)
> and node recovery (yarn.node.recovery.enabled)
> 2. Have only one NM running, say worker0
> 3. Stop worker0 and start another NM, say worker1
> 4. Submit a sleep job. The containers will time out because they were assigned to
> the stopped NM worker0.






[jira] [Updated] (YARN-10352) MultiNode Placament assigns container on stopped NodeManagers

2020-07-16 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-10352:
-
Labels: capacityscheduler multi-node-placement  (was: )

> MultiNode Placament assigns container on stopped NodeManagers
> -
>
> Key: YARN-10352
> URL: https://issues.apache.org/jira/browse/YARN-10352
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>  Labels: capacityscheduler, multi-node-placement
>
> When node recovery is enabled, stopping an NM does not unregister it from the RM,
> so the RM's active node list still contains the stopped nodes until the NM
> liveness monitor expires them after the configured timeout
> (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During those 10 minutes,
> multi-node placement still assigns containers to those nodes. Placement should
> exclude nodes that have not heartbeated within the configured heartbeat interval
> (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms = 1000 ms), similar to
> the asynchronous CapacityScheduler threads (CapacityScheduler#shouldSkipNodeSchedule).
> *Repro:*
> 1. Enable multi-node placement (yarn.scheduler.capacity.multi-node-placement-enabled)
> and node recovery (yarn.node.recovery.enabled)
> 2. Have only one NM running, say worker0
> 3. Stop worker0 and start another NM, say worker1
> 4. Submit a sleep job. The containers will time out because they were assigned to
> the stopped NM worker0.






[jira] [Updated] (YARN-10352) MultiNode Placament assigns container on stopped NodeManagers

2020-07-16 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-10352:
-
Affects Version/s: 3.3.0

> MultiNode Placament assigns container on stopped NodeManagers
> -
>
> Key: YARN-10352
> URL: https://issues.apache.org/jira/browse/YARN-10352
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>
> When node recovery is enabled, stopping an NM does not unregister it from the RM,
> so the RM's active node list still contains the stopped nodes until the NM
> liveness monitor expires them after the configured timeout
> (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During those 10 minutes,
> multi-node placement still assigns containers to those nodes. Placement should
> exclude nodes that have not heartbeated within the configured heartbeat interval
> (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms = 1000 ms), similar to
> the asynchronous CapacityScheduler threads (CapacityScheduler#shouldSkipNodeSchedule).
> *Repro:*
> 1. Enable multi-node placement (yarn.scheduler.capacity.multi-node-placement-enabled)
> and node recovery (yarn.node.recovery.enabled)
> 2. Have only one NM running, say worker0
> 3. Stop worker0 and start another NM, say worker1
> 4. Submit a sleep job. The containers will time out because they were assigned to
> the stopped NM worker0.






[jira] [Updated] (YARN-10352) MultiNode Placament assigns container on stopped NodeManagers

2020-07-16 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-10352:
-
Affects Version/s: 3.4.0

> MultiNode Placament assigns container on stopped NodeManagers
> -
>
> Key: YARN-10352
> URL: https://issues.apache.org/jira/browse/YARN-10352
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>
> When node recovery is enabled, stopping an NM does not unregister it from the RM,
> so the RM's active node list still contains the stopped nodes until the NM
> liveness monitor expires them after the configured timeout
> (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During those 10 minutes,
> multi-node placement still assigns containers to those nodes. Placement should
> exclude nodes that have not heartbeated within the configured heartbeat interval
> (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms = 1000 ms), similar to
> the asynchronous CapacityScheduler threads (CapacityScheduler#shouldSkipNodeSchedule).
> *Repro:*
> 1. Enable multi-node placement (yarn.scheduler.capacity.multi-node-placement-enabled)
> and node recovery (yarn.node.recovery.enabled)
> 2. Have only one NM running, say worker0
> 3. Stop worker0 and start another NM, say worker1
> 4. Submit a sleep job. The containers will time out because they were assigned to
> the stopped NM worker0.






[jira] [Created] (YARN-10352) MultiNode Placament assigns container on stopped NodeManagers

2020-07-16 Thread Prabhu Joseph (Jira)
Prabhu Joseph created YARN-10352:


 Summary: MultiNode Placament assigns container on stopped 
NodeManagers
 Key: YARN-10352
 URL: https://issues.apache.org/jira/browse/YARN-10352
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Prabhu Joseph
Assignee: Prabhu Joseph


When node recovery is enabled, stopping an NM does not unregister it from the RM,
so the RM's active node list still contains the stopped nodes until the NM liveness
monitor expires them after the configured timeout
(yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During those 10 minutes,
multi-node placement still assigns containers to those nodes. Placement should
exclude nodes that have not heartbeated within the configured heartbeat interval
(yarn.resourcemanager.nodemanagers.heartbeat-interval-ms = 1000 ms), similar to
the asynchronous CapacityScheduler threads (CapacityScheduler#shouldSkipNodeSchedule).


*Repro:*

1. Enable multi-node placement (yarn.scheduler.capacity.multi-node-placement-enabled)
and node recovery (yarn.node.recovery.enabled)

2. Have only one NM running, say worker0

3. Stop worker0 and start another NM, say worker1

4. Submit a sleep job. The containers will time out because they were assigned to
the stopped NM worker0.
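
As a rough illustration of the exclusion described above (a sketch only; the real CapacityScheduler#shouldSkipNodeSchedule works on SchedulerNode state and the scheduler's own configuration, and the threshold used here is an assumption):
{code:java}
// Illustrative heartbeat-staleness check; names and the 2x multiplier are assumptions.
public class SkipStaleNodeSketch {
  // yarn.resourcemanager.nodemanagers.heartbeat-interval-ms
  static final long HEARTBEAT_INTERVAL_MS = 1000;

  static boolean shouldSkipNodeSchedule(long lastHeartbeatMs, long nowMs) {
    // A node that has missed several heartbeats is likely stopped or unreachable,
    // even though it stays in the active list until the liveness monitor expires it.
    return nowMs - lastHeartbeatMs > 2 * HEARTBEAT_INTERVAL_MS;
  }

  public static void main(String[] args) {
    long now = System.currentTimeMillis();
    System.out.println(shouldSkipNodeSchedule(now - 500, now));    // false: recent heartbeat
    System.out.println(shouldSkipNodeSchedule(now - 60_000, now)); // true: stale for a minute
  }
}
{code}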








[jira] [Updated] (YARN-10350) TestUserGroupMappingPlacementRule fails

2020-07-16 Thread Bibin Chundatt (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin Chundatt updated YARN-10350:
--
Fix Version/s: 3.4.0

> TestUserGroupMappingPlacementRule fails
> ---
>
> Key: YARN-10350
> URL: https://issues.apache.org/jira/browse/YARN-10350
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Reporter: Akira Ajisaka
>Assignee: Bilwa S T
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10350.001.patch, YARN-10350.002.patch
>
>
> TestUserGroupMappingPlacementRule fails on trunk:
> {noformat}
> [INFO] Running 
> org.apache.hadoop.yarn.server.resourcemanager.placement.TestUserGroupMappingPlacementRule
> [ERROR] Tests run: 31, Failures: 1, Errors: 2, Skipped: 0, Time elapsed: 
> 2.662 s <<< FAILURE! - in 
> org.apache.hadoop.yarn.server.resourcemanager.placement.TestUserGroupMappingPlacementRule
> [ERROR] 
> testResolvedQueueIsNotManaged(org.apache.hadoop.yarn.server.resourcemanager.placement.TestUserGroupMappingPlacementRule)
>   Time elapsed: 0.03 s  <<< ERROR!
> java.lang.Exception: Unexpected exception, 
> expected but 
> was
>   at 
> org.junit.internal.runners.statements.ExpectException.evaluate(ExpectException.java:28)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
> Caused by: java.lang.AssertionError: Queue expected: but was:
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:834)
>   at org.junit.Assert.assertEquals(Assert.java:118)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.placement.TestUserGroupMappingPlacementRule.verifyQueueMapping(TestUserGroupMappingPlacementRule.java:236)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.placement.TestUserGroupMappingPlacementRule.testResolvedQueueIsNotManaged(TestUserGroupMappingPlacementRule.java:516)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.ExpectException.evaluate(ExpectException.java:19)
>   ... 18 more
> {noformat}


