[jira] [Commented] (YARN-3933) FairScheduler: Multiple calls to completedContainer are not safe
[ https://issues.apache.org/jira/browse/YARN-3933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17774321#comment-17774321 ]

dzcxzl commented on YARN-3933:
------------------------------

Note that the branch-2.8 fix is not consistent with the fixes on branch-3.0 and branch-2.9.

branch-3.0: https://github.com/apache/hadoop/commit/646c6d6509f515b1373288869fb92807fa2ddc9b
branch-2.8: https://github.com/apache/hadoop/commit/233252aa9a2f1e2dfc2a5206676b6b8e462d5e7e

> FairScheduler: Multiple calls to completedContainer are not safe
> -----------------------------------------------------------------
>
>                 Key: YARN-3933
>                 URL: https://issues.apache.org/jira/browse/YARN-3933
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: fairscheduler
>    Affects Versions: 2.7.1
>            Reporter: Lavkesh Lahngir
>            Assignee: Shiwei Guo
>            Priority: Major
>              Labels: oct16-medium
>             Fix For: 2.8.0, 3.0.0-alpha4
>
>         Attachments: YARN-3933.001.patch, YARN-3933.002.patch, YARN-3933.003.patch, YARN-3933.004.patch, YARN-3933.005.patch, YARN-3933.006.patch, yarn-3933-branch-2.8.patch
>
> In our cluster we are seeing available memory and cores go negative.
> Initial inspection:
> Scenario no. 1:
> In the capacity scheduler, the method allocateContainersToNode() checks whether an application has excess container reservations that are no longer needed; if so, it calls queue.completedContainer(), which drives resources negative because those containers were never assigned in the first place.
> I am still looking through the code. Can somebody suggest how to simulate excess container assignments?
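The failure mode described above is easiest to see in a stripped-down sketch. The class below is purely illustrative and not FairScheduler code or the actual YARN-3933 patch (SimpleQueue, liveContainers, and usedMemoryMb are hypothetical names): a completedContainer() that does not check whether the container is actually live subtracts the same resources twice when called twice, which is exactly how used/available metrics can go negative. The guard makes the release idempotent.

{code:java}
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only -- not the actual YARN-3933 patch.
public class SimpleQueue {
  private int usedMemoryMb = 0;
  // Containers this queue actually allocated and has not yet released.
  private final Map<String, Integer> liveContainers = new HashMap<>();

  public synchronized void allocateContainer(String containerId, int memoryMb) {
    liveContainers.put(containerId, memoryMb);
    usedMemoryMb += memoryMb;
  }

  public synchronized void completedContainer(String containerId) {
    // Guard: ignore containers that were never allocated here or were
    // already completed. Without this check, a duplicate call subtracts
    // the same memory twice and usedMemoryMb eventually goes negative.
    Integer memoryMb = liveContainers.remove(containerId);
    if (memoryMb == null) {
      return;
    }
    usedMemoryMb -= memoryMb;
  }
}
{code}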
[jira] [Created] (YARN-11487) WebAppProxyServlet log output request ip
dzcxzl created YARN-11487:
--------------------------

             Summary: WebAppProxyServlet log output request ip
                 Key: YARN-11487
                 URL: https://issues.apache.org/jira/browse/YARN-11487
             Project: Hadoop YARN
          Issue Type: Improvement
            Reporter: dzcxzl
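A minimal sketch of the kind of change the summary suggests: include the requesting client's IP when WebAppProxyServlet logs a proxied request. This is not the actual patch; the helper class and log message are hypothetical, but getRemoteAddr() and the X-Forwarded-For header are the standard servlet-level sources for the client address.

{code:java}
import javax.servlet.http.HttpServletRequest;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical helper -- illustrates the proposal, not the actual patch.
public final class RequestIpLogging {
  private static final Logger LOG =
      LoggerFactory.getLogger(RequestIpLogging.class);

  static void logProxyRequest(HttpServletRequest req) {
    // Prefer X-Forwarded-For when the proxy sits behind a load balancer,
    // falling back to the TCP peer address.
    String forwarded = req.getHeader("X-Forwarded-For");
    String clientIp = forwarded != null
        ? forwarded.split(",")[0].trim()
        : req.getRemoteAddr();
    LOG.info("Proxying {} for user {} from client IP {}",
        req.getRequestURI(), req.getRemoteUser(), clientIp);
  }
}
{code}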
[jira] [Created] (YARN-11476) Add NodeManager metric for event queue size of dispatcher
dzcxzl created YARN-11476:
--------------------------

             Summary: Add NodeManager metric for event queue size of dispatcher
                 Key: YARN-11476
                 URL: https://issues.apache.org/jira/browse/YARN-11476
             Project: Hadoop YARN
          Issue Type: Improvement
            Reporter: dzcxzl
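One way such a metric could be exposed with Hadoop's metrics2 annotations is sketched below. The class name, registration string, and update hook are hypothetical, not the actual patch; the dispatcher would call setEventQueueSize() as events are enqueued and dequeued.

{code:java}
import org.apache.hadoop.metrics2.annotation.Metric;
import org.apache.hadoop.metrics2.annotation.Metrics;
import org.apache.hadoop.metrics2.lib.DefaultMetricsSystem;
import org.apache.hadoop.metrics2.lib.MutableGaugeInt;

// Hedged sketch of the proposed improvement using the metrics2 library.
// Names here (class, registration string, update hook) are hypothetical.
@Metrics(about = "NodeManager dispatcher metrics", context = "yarn")
public class NodeManagerDispatcherMetrics {
  // Populated by the metrics system on registration.
  @Metric("Current size of the dispatcher event queue")
  MutableGaugeInt eventQueueSize;

  public static NodeManagerDispatcherMetrics create() {
    return DefaultMetricsSystem.instance().register(
        "NodeManagerDispatcherMetrics",
        "Metrics for the NM async dispatcher",
        new NodeManagerDispatcherMetrics());
  }

  // Called from the dispatcher whenever an event is enqueued or dequeued.
  public void setEventQueueSize(int size) {
    eventQueueSize.set(size);
  }
}
{code}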
[jira] [Created] (YARN-11467) RM failover may fail when the nodes.exclude-path file does not exist
dzcxzl created YARN-11467:
--------------------------

             Summary: RM failover may fail when the nodes.exclude-path file does not exist
                 Key: YARN-11467
                 URL: https://issues.apache.org/jira/browse/YARN-11467
             Project: Hadoop YARN
          Issue Type: Bug
            Reporter: dzcxzl

{code:java}
Caused by: org.apache.hadoop.ha.ServiceFailedException: RefreshAll operation failed
	at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:788)
	at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:315)
	... 29 more
Caused by: java.nio.file.NoSuchFileException: /tmp/non-existent-path-788aa744-1395-40ca-bdb5-f93bffc92cfb
	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
	at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214)
	at java.nio.file.Files.newByteChannel(Files.java:361)
	at java.nio.file.Files.newByteChannel(Files.java:407)
	at java.nio.file.spi.FileSystemProvider.newInputStream(FileSystemProvider.java:384)
	at java.nio.file.Files.newInputStream(Files.java:152)
	at org.apache.hadoop.util.HostsFileReader.readFileToMap(HostsFileReader.java:126)
	at org.apache.hadoop.util.HostsFileReader.refreshInternal(HostsFileReader.java:214)
	at org.apache.hadoop.util.HostsFileReader.refresh(HostsFileReader.java:192)
	at org.apache.hadoop.yarn.server.resourcemanager.NodesListManager.refreshHostsReader(NodesListManager.java:258)
	at org.apache.hadoop.yarn.server.resourcemanager.NodesListManager.refreshNodes(NodesListManager.java:232)
	at org.apache.hadoop.yarn.server.resourcemanager.NodesListManager.refreshNodes(NodesListManager.java:224)
	at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshNodes(AdminService.java:490)
	at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:778)
{code}
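A hedged sketch of one possible defensive fix, not necessarily the approach an eventual patch would take: treat a missing exclude file as an empty exclude list instead of letting the NoSuchFileException above propagate out of refreshAll() and abort transitionToActive(). The helper class is hypothetical; only standard java.nio APIs are used.

{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration -- not the actual patch.
public final class SafeExcludeRead {
  static List<String> readExcludedHosts(String excludePathConf)
      throws IOException {
    Path excludePath = Paths.get(excludePathConf);
    if (!Files.exists(excludePath)) {
      // yarn.resourcemanager.nodes.exclude-path may point at a file that
      // has not been created yet; failover should tolerate that instead
      // of failing the whole refresh.
      return Collections.emptyList();
    }
    return Files.readAllLines(excludePath);
  }
}
{code}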
[jira] [Issue Comment Deleted] (YARN-3585) NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled
[ https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dzcxzl updated YARN-3585:
-------------------------
    Comment: was deleted

(was: LevelDB will have problems when using the logger.)

> NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled
> -------------------------------------------------------------------------------
>
>                 Key: YARN-3585
>                 URL: https://issues.apache.org/jira/browse/YARN-3585
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.6.0
>            Reporter: Peng Zhang
>            Assignee: Rohith Sharma K S
>            Priority: Critical
>              Labels: 2.6.1-candidate
>             Fix For: 2.6.1, 2.8.0, 2.7.1, 3.0.0-alpha1
>
>         Attachments: 0001-YARN-3585.patch, YARN-3585.patch
>
> With NM recovery enabled, after decommission, the NodeManager log shows it has stopped, but the process cannot exit.
> Non-daemon threads:
> {noformat}
> "DestroyJavaVM" prio=10 tid=0x7f3460011800 nid=0x29ec waiting on condition [0x]
> "leveldb" prio=10 tid=0x7f3354001800 nid=0x2a97 runnable [0x]
> "VM Thread" prio=10 tid=0x7f3460167000 nid=0x29f8 runnable
> "Gang worker#0 (Parallel GC Threads)" prio=10 tid=0x7f346002 nid=0x29ed runnable
> "Gang worker#1 (Parallel GC Threads)" prio=10 tid=0x7f3460022000 nid=0x29ee runnable
> "Gang worker#2 (Parallel GC Threads)" prio=10 tid=0x7f3460024000 nid=0x29ef runnable
> "Gang worker#3 (Parallel GC Threads)" prio=10 tid=0x7f3460025800 nid=0x29f0 runnable
> "Gang worker#4 (Parallel GC Threads)" prio=10 tid=0x7f3460027800 nid=0x29f1 runnable
> "Gang worker#5 (Parallel GC Threads)" prio=10 tid=0x7f3460029000 nid=0x29f2 runnable
> "Gang worker#6 (Parallel GC Threads)" prio=10 tid=0x7f346002b000 nid=0x29f3 runnable
> "Gang worker#7 (Parallel GC Threads)" prio=10 tid=0x7f346002d000 nid=0x29f4 runnable
> "Concurrent Mark-Sweep GC Thread" prio=10 tid=0x7f3460120800 nid=0x29f7 runnable
> "Gang worker#0 (Parallel CMS Threads)" prio=10 tid=0x7f346011c800 nid=0x29f5 runnable
> "Gang worker#1 (Parallel CMS Threads)" prio=10 tid=0x7f346011e800 nid=0x29f6 runnable
> "VM Periodic Task Thread" prio=10 tid=0x7f346019f800 nid=0x2a01 waiting on condition
> {noformat}
> and the JNI leveldb thread stack:
> {noformat}
> Thread 12 (Thread 0x7f33dd842700 (LWP 10903)):
> #0  0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
> #1  0x7f33dfce2a3b in leveldb::(anonymous namespace)::PosixEnv::BGThreadWrapper(void*) () from /tmp/libleveldbjni-64-1-6922178968300745716.8
> #2  0x003d83407851 in start_thread () from /lib64/libpthread.so.0
> #3  0x003d830e811d in clone () from /lib64/libc.so.6
> {noformat}
[jira] [Commented] (YARN-3585) NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled
[ https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17235357#comment-17235357 ]

dzcxzl commented on YARN-3585:
------------------------------

LevelDB will have problems when using the logger.

(The quoted issue description is the same as in the notification above.)
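The comment is terse, but the "leveldb" entry in the non-daemon thread list above suggests the native leveldbjni background thread is what keeps the JVM alive. The sketch below is a speculative illustration of that interaction, not Hadoop's actual NM recovery-store code: it opens a leveldbjni DB with a logger attached and shows the close() call that lets the native thread wind down on shutdown.

{code:java}
import java.io.File;
import java.io.IOException;
import org.fusesource.leveldbjni.JniDBFactory;
import org.iq80.leveldb.DB;
import org.iq80.leveldb.Logger;
import org.iq80.leveldb.Options;

// Speculative illustration only -- class name and path are hypothetical.
public class RecoveryStoreSketch {
  private DB db;

  void open(String path) throws IOException {
    Options options = new Options().createIfMissing(true);
    // The comment above points at the logger; attaching one is optional.
    options.logger(new Logger() {
      @Override
      public void log(String message) {
        System.out.println("leveldb: " + message);
      }
    });
    db = JniDBFactory.factory.open(new File(path), options);
  }

  void close() throws IOException {
    if (db != null) {
      // Without this, the native non-daemon "leveldb" thread can keep the
      // JVM from exiting after the NM stops.
      db.close();
      db = null;
    }
  }
}
{code}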
[jira] [Updated] (YARN-10462) Configurable shutdown cleanup slop
[ https://issues.apache.org/jira/browse/YARN-10462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dzcxzl updated YARN-10462:
--------------------------
    Attachment: YARN-10462.001.patch

> Configurable shutdown cleanup slop
> ----------------------------------
>
>                 Key: YARN-10462
>                 URL: https://issues.apache.org/jira/browse/YARN-10462
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: nodemanager
>    Affects Versions: 3.1.0
>            Reporter: dzcxzl
>            Priority: Trivial
>         Attachments: YARN-10462.001.patch
>
> (Issue description as in the creation notification below.)
[jira] [Created] (YARN-10462) Configurable shutdown cleanup slop
dzcxzl created YARN-10462:
--------------------------

             Summary: Configurable shutdown cleanup slop
                 Key: YARN-10462
                 URL: https://issues.apache.org/jira/browse/YARN-10462
             Project: Hadoop YARN
          Issue Type: Improvement
          Components: nodemanager
    Affects Versions: 3.1.0
            Reporter: dzcxzl

When stopping or decommissioning an NM, all containers are stopped, and the waiting time is the sum of three values:
sleep-delay-before-sigkill + process-kill-wait + SHUTDOWN_CLEANUP_SLOP_MS (a constant of 1000)

yarn.nodemanager.sleep-delay-before-sigkill.ms=250
yarn.nodemanager.process-kill-wait.ms=5000
SHUTDOWN_CLEANUP_SLOP_MS=1000

The sleep-delay-before-sigkill and process-kill-wait parameters cover the time to kill a single container/process. When there are many containers to kill, they are usually not all killed within this window.

We could make SHUTDOWN_CLEANUP_SLOP_MS a configurable parameter, so that in such scenarios we can wait long enough for all containers to be killed.
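A sketch of the proposed computation follows, using the values quoted above. The first two keys are real YARN properties; the key "yarn.nodemanager.shutdown-cleanup-slop.ms" is hypothetical, standing in for whatever name the patch introduces for the currently hard-coded slop.

{code:java}
import org.apache.hadoop.conf.Configuration;

// Sketch of the proposal: replace the hard-coded 1000 ms slop with a
// configurable value. Only the slop key below is hypothetical.
public final class ShutdownWait {
  static long cleanupWaitMs(Configuration conf) {
    long sigkillDelay = conf.getLong(
        "yarn.nodemanager.sleep-delay-before-sigkill.ms", 250);
    long processKillWait = conf.getLong(
        "yarn.nodemanager.process-kill-wait.ms", 5000);
    long slop = conf.getLong(
        "yarn.nodemanager.shutdown-cleanup-slop.ms", 1000); // hypothetical key
    // Total time the NM waits for containers to exit during shutdown:
    // 250 + 5000 + 1000 = 6250 ms with the defaults quoted in the issue.
    return sigkillDelay + processKillWait + slop;
  }
}
{code}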