[jira] [Commented] (YARN-3933) FairScheduler: Multiple calls to completedContainer are not safe

2023-10-11 Thread dzcxzl (Jira)


[ https://issues.apache.org/jira/browse/YARN-3933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17774321#comment-17774321 ]

dzcxzl commented on YARN-3933:
--

Note that the branch-2.8 fix is not consistent with the branch-3.0 and branch-2.9 fixes.

branch 3.0

https://github.com/apache/hadoop/commit/646c6d6509f515b1373288869fb92807fa2ddc9b

branch 2.8

https://github.com/apache/hadoop/commit/233252aa9a2f1e2dfc2a5206676b6b8e462d5e7e

> FairScheduler: Multiple calls to completedContainer are not safe
> 
>
> Key: YARN-3933
> URL: https://issues.apache.org/jira/browse/YARN-3933
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.7.1
>Reporter: Lavkesh Lahngir
>Assignee: Shiwei Guo
>Priority: Major
>  Labels: oct16-medium
> Fix For: 2.8.0, 3.0.0-alpha4
>
> Attachments: YARN-3933.001.patch, YARN-3933.002.patch, 
> YARN-3933.003.patch, YARN-3933.004.patch, YARN-3933.005.patch, 
> YARN-3933.006.patch, yarn-3933-branch-2.8.patch
>
>
> In our cluster we are seeing available memory and cores go negative. 
> Initial inspection:
> Scenario no. 1: 
> In the capacity scheduler, the method allocateContainersToNode() checks 
> whether there are excess container reservations for an application that are 
> no longer needed; if so, it calls queue.completedContainer() for them, which 
> drives the resources negative, since those containers were never assigned in 
> the first place. 
> I am still looking through the code. Can somebody suggest how to simulate 
> excess container assignments?
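
As a hedged illustration of the kind of guard that makes repeated completedContainer() calls safe (a minimal sketch with simplified stand-ins for ContainerId tracking and the queue's usage counter, not the actual FairScheduler code):

{code:java}
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class IdempotentCompletionSketch {
  // Containers currently charged against the queue (illustrative stand-in).
  private final Set<String> liveContainers = ConcurrentHashMap.newKeySet();
  private long usedMb = 0;

  void containerAssigned(String containerId, long containerMb) {
    // add() returns false on a duplicate assignment, so usage is counted once.
    if (liveContainers.add(containerId)) {
      usedMb += containerMb;
    }
  }

  void completedContainer(String containerId, long containerMb) {
    // remove() returns false if the container was never assigned or was
    // already completed, so usedMb is decremented at most once per container
    // and duplicate completion calls cannot drive the metrics negative.
    if (!liveContainers.remove(containerId)) {
      return;
    }
    usedMb -= containerMb;
  }
}
{code}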






[jira] [Created] (YARN-11487) WebAppProxyServlet log output request ip

2023-05-04 Thread dzcxzl (Jira)
dzcxzl created YARN-11487:
-

 Summary: WebAppProxyServlet log output request ip
 Key: YARN-11487
 URL: https://issues.apache.org/jira/browse/YARN-11487
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: dzcxzl









[jira] [Created] (YARN-11476) Add NodeManager metric for event queue size of dispatcher

2023-04-26 Thread dzcxzl (Jira)
dzcxzl created YARN-11476:
-

 Summary: Add NodeManager metric for event queue size of dispatcher
 Key: YARN-11476
 URL: https://issues.apache.org/jira/browse/YARN-11476
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: dzcxzl









[jira] [Created] (YARN-11467) RM failover may fail when the nodes.exclude-path file does not exist

2023-04-18 Thread dzcxzl (Jira)
dzcxzl created YARN-11467:
-

 Summary: RM failover may fail when the nodes.exclude-path file does not exist
 Key: YARN-11467
 URL: https://issues.apache.org/jira/browse/YARN-11467
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: dzcxzl


{code:java}
Caused by: org.apache.hadoop.ha.ServiceFailedException: RefreshAll operation failed
    at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:788)
    at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:315)
    ... 29 more
Caused by: java.nio.file.NoSuchFileException: /tmp/non-existent-path-788aa744-1395-40ca-bdb5-f93bffc92cfb
    at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
    at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214)
    at java.nio.file.Files.newByteChannel(Files.java:361)
    at java.nio.file.Files.newByteChannel(Files.java:407)
    at java.nio.file.spi.FileSystemProvider.newInputStream(FileSystemProvider.java:384)
    at java.nio.file.Files.newInputStream(Files.java:152)
    at org.apache.hadoop.util.HostsFileReader.readFileToMap(HostsFileReader.java:126)
    at org.apache.hadoop.util.HostsFileReader.refreshInternal(HostsFileReader.java:214)
    at org.apache.hadoop.util.HostsFileReader.refresh(HostsFileReader.java:192)
    at org.apache.hadoop.yarn.server.resourcemanager.NodesListManager.refreshHostsReader(NodesListManager.java:258)
    at org.apache.hadoop.yarn.server.resourcemanager.NodesListManager.refreshNodes(NodesListManager.java:232)
    at org.apache.hadoop.yarn.server.resourcemanager.NodesListManager.refreshNodes(NodesListManager.java:224)
    at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshNodes(AdminService.java:490)
    at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:778)
{code}
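
For illustration, a hedged sketch of the defensive behavior that would avoid this failure (the class and method names below are hypothetical, not the actual HostsFileReader API): a missing exclude file is treated as an empty host list instead of letting NoSuchFileException abort refreshAll() during the transition to active.

{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.List;

public final class ExcludeHostsLoaderSketch {
  // Returns the hosts listed in the exclude file, or an empty list when the
  // file does not exist, so an RM failover is not failed by a missing path.
  static List<String> readExcludedHosts(String excludePath) throws IOException {
    Path path = Paths.get(excludePath);
    if (!Files.exists(path)) {
      return Collections.emptyList();
    }
    return Files.readAllLines(path);
  }
}
{code}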






[jira] [Issue Comment Deleted] (YARN-3585) NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled

2020-11-19 Thread dzcxzl (Jira)


 [ https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dzcxzl updated YARN-3585:
-
Comment: was deleted

(was: LevelDB will have problems when using a logger)

> NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled
> --
>
> Key: YARN-3585
> URL: https://issues.apache.org/jira/browse/YARN-3585
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Peng Zhang
>Assignee: Rohith Sharma K S
>Priority: Critical
>  Labels: 2.6.1-candidate
> Fix For: 2.6.1, 2.8.0, 2.7.1, 3.0.0-alpha1
>
> Attachments: 0001-YARN-3585.patch, YARN-3585.patch
>
>
> With NM recovery enabled, after decommission the NodeManager log shows it 
> has stopped, but the process cannot exit. 
> Non-daemon threads:
> {noformat}
> "DestroyJavaVM" prio=10 tid=0x7f3460011800 nid=0x29ec waiting on 
> condition [0x]
> "leveldb" prio=10 tid=0x7f3354001800 nid=0x2a97 runnable 
> [0x]
> "VM Thread" prio=10 tid=0x7f3460167000 nid=0x29f8 runnable 
> "Gang worker#0 (Parallel GC Threads)" prio=10 tid=0x7f346002 
> nid=0x29ed runnable 
> "Gang worker#1 (Parallel GC Threads)" prio=10 tid=0x7f3460022000 
> nid=0x29ee runnable 
> "Gang worker#2 (Parallel GC Threads)" prio=10 tid=0x7f3460024000 
> nid=0x29ef runnable 
> "Gang worker#3 (Parallel GC Threads)" prio=10 tid=0x7f3460025800 
> nid=0x29f0 runnable 
> "Gang worker#4 (Parallel GC Threads)" prio=10 tid=0x7f3460027800 
> nid=0x29f1 runnable 
> "Gang worker#5 (Parallel GC Threads)" prio=10 tid=0x7f3460029000 
> nid=0x29f2 runnable 
> "Gang worker#6 (Parallel GC Threads)" prio=10 tid=0x7f346002b000 
> nid=0x29f3 runnable 
> "Gang worker#7 (Parallel GC Threads)" prio=10 tid=0x7f346002d000 
> nid=0x29f4 runnable 
> "Concurrent Mark-Sweep GC Thread" prio=10 tid=0x7f3460120800 nid=0x29f7 
> runnable 
> "Gang worker#0 (Parallel CMS Threads)" prio=10 tid=0x7f346011c800 
> nid=0x29f5 runnable 
> "Gang worker#1 (Parallel CMS Threads)" prio=10 tid=0x7f346011e800 
> nid=0x29f6 runnable 
> "VM Periodic Task Thread" prio=10 tid=0x7f346019f800 nid=0x2a01 waiting 
> on condition 
> {noformat}
> and the JNI leveldb thread stack:
> {noformat}
> Thread 12 (Thread 0x7f33dd842700 (LWP 10903)):
> #0  0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from 
> /lib64/libpthread.so.0
> #1  0x7f33dfce2a3b in leveldb::(anonymous 
> namespace)::PosixEnv::BGThreadWrapper(void*) () from 
> /tmp/libleveldbjni-64-1-6922178968300745716.8
> #2  0x003d83407851 in start_thread () from /lib64/libpthread.so.0
> #3  0x003d830e811d in clone () from /lib64/libc.so.6
> {noformat}
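
The "leveldb" entry in the dump above is a non-daemon JNI thread, which is why DestroyJavaVM never proceeds. As a hedged, standalone illustration (an assumption drawn from the dump, not the actual NM recovery-store code; the path and logger are placeholders), configuring a leveldbjni Logger is one way such a native background thread can end up attached to the JVM:

{code:java}
import java.io.File;
import org.fusesource.leveldbjni.JniDBFactory;
import org.iq80.leveldb.DB;
import org.iq80.leveldb.Logger;
import org.iq80.leveldb.Options;

public class LeveldbLoggerDemo {
  public static void main(String[] args) throws Exception {
    Options options = new Options().createIfMissing(true);
    // Assumption: with a Logger set, native log callbacks run on leveldb's
    // background thread, which attaches to the JVM as a non-daemon thread.
    options.logger(new Logger() {
      @Override
      public void log(String message) {
        System.out.println("leveldb: " + message);
      }
    });
    DB db = JniDBFactory.factory.open(
        new File("/tmp/leveldb-logger-demo"), options);  // placeholder path
    db.close();
    // A non-daemon "leveldb" thread may remain after close(), so the JVM may
    // not exit here without explicit termination.
  }
}
{code}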






[jira] [Commented] (YARN-3585) NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled

2020-11-19 Thread dzcxzl (Jira)


[ https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17235357#comment-17235357 ]

dzcxzl commented on YARN-3585:
--

LevelDB will have problems when using a logger.

> NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled
> --
>
> Key: YARN-3585
> URL: https://issues.apache.org/jira/browse/YARN-3585
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Peng Zhang
>Assignee: Rohith Sharma K S
>Priority: Critical
>  Labels: 2.6.1-candidate
> Fix For: 2.6.1, 2.8.0, 2.7.1, 3.0.0-alpha1
>
> Attachments: 0001-YARN-3585.patch, YARN-3585.patch
>
>
> With NM recovery enabled, after decommission the NodeManager log shows it 
> has stopped, but the process cannot exit. 
> Non-daemon threads:
> {noformat}
> "DestroyJavaVM" prio=10 tid=0x7f3460011800 nid=0x29ec waiting on 
> condition [0x]
> "leveldb" prio=10 tid=0x7f3354001800 nid=0x2a97 runnable 
> [0x]
> "VM Thread" prio=10 tid=0x7f3460167000 nid=0x29f8 runnable 
> "Gang worker#0 (Parallel GC Threads)" prio=10 tid=0x7f346002 
> nid=0x29ed runnable 
> "Gang worker#1 (Parallel GC Threads)" prio=10 tid=0x7f3460022000 
> nid=0x29ee runnable 
> "Gang worker#2 (Parallel GC Threads)" prio=10 tid=0x7f3460024000 
> nid=0x29ef runnable 
> "Gang worker#3 (Parallel GC Threads)" prio=10 tid=0x7f3460025800 
> nid=0x29f0 runnable 
> "Gang worker#4 (Parallel GC Threads)" prio=10 tid=0x7f3460027800 
> nid=0x29f1 runnable 
> "Gang worker#5 (Parallel GC Threads)" prio=10 tid=0x7f3460029000 
> nid=0x29f2 runnable 
> "Gang worker#6 (Parallel GC Threads)" prio=10 tid=0x7f346002b000 
> nid=0x29f3 runnable 
> "Gang worker#7 (Parallel GC Threads)" prio=10 tid=0x7f346002d000 
> nid=0x29f4 runnable 
> "Concurrent Mark-Sweep GC Thread" prio=10 tid=0x7f3460120800 nid=0x29f7 
> runnable 
> "Gang worker#0 (Parallel CMS Threads)" prio=10 tid=0x7f346011c800 
> nid=0x29f5 runnable 
> "Gang worker#1 (Parallel CMS Threads)" prio=10 tid=0x7f346011e800 
> nid=0x29f6 runnable 
> "VM Periodic Task Thread" prio=10 tid=0x7f346019f800 nid=0x2a01 waiting 
> on condition 
> {noformat}
> and the JNI leveldb thread stack:
> {noformat}
> Thread 12 (Thread 0x7f33dd842700 (LWP 10903)):
> #0  0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from 
> /lib64/libpthread.so.0
> #1  0x7f33dfce2a3b in leveldb::(anonymous 
> namespace)::PosixEnv::BGThreadWrapper(void*) () from 
> /tmp/libleveldbjni-64-1-6922178968300745716.8
> #2  0x003d83407851 in start_thread () from /lib64/libpthread.so.0
> #3  0x003d830e811d in clone () from /lib64/libc.so.6
> {noformat}






[jira] [Updated] (YARN-10462) Configurable shutdown cleanup slop

2020-10-15 Thread dzcxzl (Jira)


 [ https://issues.apache.org/jira/browse/YARN-10462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dzcxzl updated YARN-10462:
--
Attachment: YARN-10462.001.patch

> Configurable shutdown cleanup slop
> --
>
> Key: YARN-10462
> URL: https://issues.apache.org/jira/browse/YARN-10462
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Affects Versions: 3.1.0
>Reporter: dzcxzl
>Priority: Trivial
> Attachments: YARN-10462.001.patch
>
>
> When stopping or decommissioning an NM, all containers are stopped, and the 
> waiting time is the sum of three values: 
> sleep-delay-before-sigkill + process-kill-wait + SHUTDOWN_CLEANUP_SLOP_MS 
> (a constant of 1000 ms).
> yarn.nodemanager.sleep-delay-before-sigkill.ms=250
> yarn.nodemanager.process-kill-wait.ms=5000
> SHUTDOWN_CLEANUP_SLOP_MS=1000
> The sleep-delay-before-sigkill and process-kill-wait parameters bound the 
> time to kill a single container/process. When many containers must be 
> killed, they are usually not all killed within this window.
> We can make SHUTDOWN_CLEANUP_SLOP_MS a configurable parameter so that, in 
> some scenarios, we can wait long enough for all containers to be killed.






[jira] [Created] (YARN-10462) Configurable shutdown cleanup slop

2020-10-15 Thread dzcxzl (Jira)
dzcxzl created YARN-10462:
-

 Summary: Configurable shutdown cleanup slop
 Key: YARN-10462
 URL: https://issues.apache.org/jira/browse/YARN-10462
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Affects Versions: 3.1.0
Reporter: dzcxzl


When stopping or decommissioning an NM, all containers are stopped, and the 
waiting time is the sum of three values: 
sleep-delay-before-sigkill + process-kill-wait + SHUTDOWN_CLEANUP_SLOP_MS (a 
constant of 1000 ms).

yarn.nodemanager.sleep-delay-before-sigkill.ms=250
yarn.nodemanager.process-kill-wait.ms=5000
SHUTDOWN_CLEANUP_SLOP_MS=1000

The sleep-delay-before-sigkill and process-kill-wait parameters bound the time 
to kill a single container/process. When many containers must be killed, they 
are usually not all killed within this window.

We can make SHUTDOWN_CLEANUP_SLOP_MS a configurable parameter so that, in some 
scenarios, we can wait long enough for all containers to be killed.
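
As a worked example of the total wait described above, using the default values (a plain sketch; the variable names are illustrative, not the NodeManager's internal fields):

{code:java}
public class ShutdownSlopExample {
  public static void main(String[] args) {
    long sleepDelayBeforeSigkillMs = 250;  // yarn.nodemanager.sleep-delay-before-sigkill.ms
    long processKillWaitMs = 5000;         // yarn.nodemanager.process-kill-wait.ms
    long shutdownCleanupSlopMs = 1000;     // SHUTDOWN_CLEANUP_SLOP_MS (hard-coded)

    // Total time the NM waits for containers to stop during shutdown:
    long totalWaitMs = sleepDelayBeforeSigkillMs + processKillWaitMs + shutdownCleanupSlopMs;
    System.out.println("cleanup wait = " + totalWaitMs + " ms");  // 6250 ms
  }
}
{code}

With many containers to kill, 6250 ms is often not enough, which is the motivation for making the slop configurable.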

 

 


