[jira] [Created] (YARN-5403) yarn top command does not execute correctly
gu-chi created YARN-5403:
----------------------------

             Summary: yarn top command does not execute correctly
                 Key: YARN-5403
                 URL: https://issues.apache.org/jira/browse/YARN-5403
             Project: Hadoop YARN
          Issue Type: Bug
          Components: yarn
    Affects Versions: 2.7.2
            Reporter: gu-chi

When I execute {{yarn top}}, I always get the exception below:
{quote}
16/07/19 19:55:12 ERROR cli.TopCLI: Could not fetch RM start time
java.net.ConnectException: Connection refused
	at java.net.PlainSocketImpl.socketConnect(Native Method)
	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:204)
	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
	at java.net.Socket.connect(Socket.java:589)
	at java.net.Socket.connect(Socket.java:538)
	at sun.net.NetworkClient.doConnect(NetworkClient.java:180)
	at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
	at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
	at sun.net.www.http.HttpClient.<init>(HttpClient.java:211)
	at sun.net.www.http.HttpClient.New(HttpClient.java:308)
	at sun.net.www.http.HttpClient.New(HttpClient.java:326)
	at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1169)
	at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1105)
	at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:999)
	at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:933)
	at org.apache.hadoop.yarn.client.cli.TopCLI.getRMStartTime(TopCLI.java:747)
	at org.apache.hadoop.yarn.client.cli.TopCLI.run(TopCLI.java:443)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
	at org.apache.hadoop.yarn.client.cli.TopCLI.main(TopCLI.java:421)
YARN top - 19:55:13, up 17001d, 11:55, 0 active users, queue(s): root
{quote}
As I looked into it, {{getRMStartTime}} hardcodes HTTP no matter what the {{yarn.http.policy}} setting is; it should use HTTPS when the policy requires it.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
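The fix direction described above can be sketched as follows: derive the URL scheme from the {{yarn.http.policy}} value instead of hardcoding "http". This is a minimal standalone sketch, not the actual TopCLI code; the helper name, host, and ports are illustrative assumptions (the policy value names match Hadoop's HttpConfig.Policy enum).

```java
// Hypothetical helper showing the fix direction for TopCLI.getRMStartTime:
// pick the scheme from yarn.http.policy rather than hardcoding "http://".
public class RmUrlScheme {
    // Returns "https" when the policy is HTTPS_ONLY, otherwise "http".
    static String schemeFor(String httpPolicy) {
        return "HTTPS_ONLY".equalsIgnoreCase(httpPolicy) ? "https" : "http";
    }

    public static void main(String[] args) {
        // Default policy: the CLI keeps using plain HTTP.
        System.out.println(schemeFor("HTTP_ONLY") + "://rm-host:8088/ws/v1/cluster/info");
        // With yarn.http.policy=HTTPS_ONLY it must switch to HTTPS
        // (8090 used here only as an example RM HTTPS port).
        System.out.println(schemeFor("HTTPS_ONLY") + "://rm-host:8090/ws/v1/cluster/info");
    }
}
```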
[jira] [Resolved] (YARN-3678) DelayedProcessKiller may kill a process other than the container
     [ https://issues.apache.org/jira/browse/YARN-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

gu-chi resolved YARN-3678.
--------------------------
    Resolution: Duplicate

> DelayedProcessKiller may kill a process other than the container
> ----------------------------------------------------------------
>
>                 Key: YARN-3678
>                 URL: https://issues.apache.org/jira/browse/YARN-3678
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.6.0, 2.7.2
>            Reporter: gu-chi
>            Priority: Critical
>
> Suppose a container finishes and clean-up runs. The PID file still exists
> and triggers signalContainer once more, killing the process with the pid
> from the PID file. Because the container has already finished, that pid may
> have been reused by another process, which can cause serious issues.
> My NM was killed unexpectedly, and what I described here could be the
> cause, even though it occurs rarely.
[jira] [Resolved] (YARN-4536) DelayedProcessKiller may not work under heavy workload
     [ https://issues.apache.org/jira/browse/YARN-4536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

gu-chi resolved YARN-4536.
--------------------------
    Resolution: Not A Problem

After further analysis, this was introduced by a custom modification on our side; sorry for the noise.

> DelayedProcessKiller may not work under heavy workload
> ------------------------------------------------------
>
>                 Key: YARN-4536
>                 URL: https://issues.apache.org/jira/browse/YARN-4536
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.7.1
>            Reporter: gu-chi
>
> I am seeing orphaned container processes. Here is the scenario: under
> heavy task load, CPU usage on the NM machine can reach almost 100%. When a
> container receives a kill event it gets {{SIGTERM}}, the parent process
> exits, and the container process is left to the OS. The container process
> needs to handle shutdown logic but can hardly get CPU time. We would
> expect a follow-up {{SIGKILL}} from {{DelayedProcessKiller}}, but the
> parent process whose pid was persisted as the container pid no longer
> exists, so the kill command never reaches the container process. This is
> how orphaned container processes arise.
> The orphaned process does exit after some time, but that can take very
> long and degrades the OS in the meantime; I have observed periods of
> several hours.
[jira] [Created] (YARN-4536) DelayedProcessKiller may not work under heavy workload
gu-chi created YARN-4536:
----------------------------

             Summary: DelayedProcessKiller may not work under heavy workload
                 Key: YARN-4536
                 URL: https://issues.apache.org/jira/browse/YARN-4536
             Project: Hadoop YARN
          Issue Type: Bug
          Components: nodemanager
    Affects Versions: 2.7.1
            Reporter: gu-chi

I am seeing orphaned container processes. Here is the scenario: under heavy task load, CPU usage on the NM machine can reach almost 100%. When a container receives a kill event it gets {{SIGTERM}}, the parent process exits, and the container process is left to the OS. The container process needs to handle shutdown logic but can hardly get CPU time. We would expect a follow-up {{SIGKILL}} from {{DelayedProcessKiller}}, but the parent process whose pid was persisted as the container pid no longer exists, so the kill command never reaches the container process. This is how orphaned container processes arise.
The orphaned process does exit after some time, but that can take very long and degrades the OS in the meantime; I have observed periods of several hours.
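One reason signalling a process *group* is more robust than signalling the recorded parent pid is that a group signal still reaches the container's children after the parent has exited. The sketch below only builds the group-kill command line; it is an illustrative assumption about the approach, not NodeManager code (the real NM delegates signalling to the container executor).

```java
// Sketch: signalling a whole process group so a delayed SIGKILL lands on the
// container's children even when the persisted parent pid is already gone.
public class GroupKillSketch {
    // Builds the command that signals every process in a group. The "--"
    // stops the negative pgid from being parsed as an option.
    static String[] killGroupCommand(long pgid, int signal) {
        return new String[] {"kill", "-" + signal, "--", "-" + pgid};
    }

    public static void main(String[] args) {
        // 12345 is a made-up process-group id for illustration.
        String[] cmd = killGroupCommand(12345, 9);
        System.out.println(String.join(" ", cmd)); // kill -9 -- -12345
    }
}
```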
[jira] [Created] (YARN-4481) negative pending resource of queues leads to applications stuck in ACCEPTED state indefinitely
gu-chi created YARN-4481:
----------------------------

             Summary: negative pending resource of queues leads to applications stuck in ACCEPTED state indefinitely
                 Key: YARN-4481
                 URL: https://issues.apache.org/jira/browse/YARN-4481
             Project: Hadoop YARN
          Issue Type: Bug
          Components: capacity scheduler
    Affects Versions: 2.7.2
            Reporter: gu-chi
            Priority: Critical

Met a scenario of negative pending resource with the capacity scheduler. In JMX it shows:
{noformat}
"PendingMB" : -4096,
"PendingVCores" : -1,
"PendingContainers" : -1,
{noformat}
Full JMX information is attached. This is not just a JMX UI issue: the actual pending resource of the queue is also negative, as I see from the debug log
bq. DEBUG | ResourceManager Event Processor | Skip this queue=root, because it doesn't need more resource, schedulingMode=RESPECT_PARTITION_EXCLUSIVITY node-partition= | ParentQueue.java
and this leads to the {{NULL_ASSIGNMENT}}.
The background: hundreds of applications were submitted, consuming all cluster resources, and reservations happened. While they were running, network faults were injected by a tool (delay, jitter, repeat, packet loss, and disorder), and then most of the submitted applications were killed.
Is anyone else seeing negative pending resource, or does anyone have an idea of how this can happen?
[jira] [Created] (YARN-3730) scheduler reserves more resource than required
gu-chi created YARN-3730:
----------------------------

             Summary: scheduler reserves more resource than required
                 Key: YARN-3730
                 URL: https://issues.apache.org/jira/browse/YARN-3730
             Project: Hadoop YARN
          Issue Type: Bug
          Components: scheduler
            Reporter: gu-chi

Using the capacity scheduler in an environment of 3 NMs with 9 vcores each, I ran a Spark job with 4 executors of 5 cores each. As expected, one executor cannot start and gets reserved, but in fact more containers than that are reserved. As a result I cannot run other, smaller jobs. Looking at the capacity scheduler, the {{needContainers}} method in LeafQueue.java computes a 'starvation' term, and this causes the scenario of more containers being reserved than required. Any idea or suggestion on this?
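For readers unfamiliar with the heuristic being discussed, here is a paraphrase (not the exact Hadoop source) of the starvation term in {{needContainers}}: the more often an application's reservations get re-reserved, the more extra demand the queue is allowed beyond its existing reservations, which is how over-reservation can arise. All numeric inputs below are illustrative assumptions.

```java
// Paraphrased sketch of the LeafQueue.needContainers starvation heuristic.
public class StarvationSketch {
    // requiredContainers: containers the app still asks for
    // reservedContainers: containers already reserved for it
    // reReservations: how many times reservations were re-reserved
    // nodeFactor / minAllocationFactor: sizing ratios in [0, 1]
    static boolean needContainers(int requiredContainers, int reservedContainers,
                                  int reReservations, float nodeFactor,
                                  float minAllocationFactor) {
        int starvation = 0;
        if (reservedContainers > 0) {
            // Frequent re-reservation inflates the allowed extra demand.
            starvation = (int) ((reReservations / (float) reservedContainers)
                    * (1.0f - Math.min(nodeFactor, minAllocationFactor)));
        }
        // Demand beyond current reservations keeps the queue reserving.
        return (starvation + requiredContainers) - reservedContainers > 0;
    }

    public static void main(String[] args) {
        // One container required, one reserved: with zero re-reservations the
        // queue is satisfied, but many re-reservations reopen demand, which
        // matches the over-reservation the report describes.
        System.out.println(needContainers(1, 1, 0, 0.5f, 0.9f));  // false
        System.out.println(needContainers(1, 1, 10, 0.5f, 0.9f)); // true
    }
}
```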
[jira] [Created] (YARN-3678) DelayedProcessKiller may kill a process other than the container
gu-chi created YARN-3678:
----------------------------

             Summary: DelayedProcessKiller may kill a process other than the container
                 Key: YARN-3678
                 URL: https://issues.apache.org/jira/browse/YARN-3678
             Project: Hadoop YARN
          Issue Type: Bug
          Components: nodemanager
    Affects Versions: 2.6.0
            Reporter: gu-chi
            Priority: Critical

Suppose a container finishes and clean-up runs. The PID file still exists and triggers signalContainer once more, killing the process with the pid from the PID file. Because the container has already finished, that pid may have been reused by another process, which can cause serious issues. My NM was killed unexpectedly, and what I described here could be the cause, even though it occurs rarely.
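A defensive check against the PID-reuse hazard described above could verify, before signalling, that the pid from the PID file still belongs to the container's launch command. This is a hypothetical Linux-only guard, not NodeManager code: the {{/proc}} lookup and the expected command-line fragment are illustrative assumptions.

```java
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical guard: only signal the pid from the PID file if the process
// still looks like the container, not an unrelated process that received the
// recycled pid after the container exited.
public class PidReuseGuard {
    // Returns true only when /proc/<pid>/cmdline still contains the expected
    // fragment; a dead or recycled pid fails the check.
    static boolean isStillContainer(long pid, String expectedFragment) {
        try {
            byte[] raw = Files.readAllBytes(Paths.get("/proc/" + pid + "/cmdline"));
            // /proc separates command-line arguments with NUL bytes.
            String cmdline = new String(raw).replace('\0', ' ');
            return cmdline.contains(expectedFragment);
        } catch (Exception e) {
            return false; // process already gone: nothing to kill
        }
    }

    public static void main(String[] args) {
        long stalePid = Long.MAX_VALUE; // cannot be a live pid
        if (isStillContainer(stalePid, "container_executor")) {
            System.out.println("safe to kill " + stalePid);
        } else {
            System.out.println("skip kill: pid " + stalePid + " is not the container");
        }
    }
}
```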
[jira] [Created] (YARN-3536) ZK exception occurs when updating AppAttempt status, then NPE is thrown when the RM recovers
gu-chi created YARN-3536:
----------------------------

             Summary: ZK exception occurs when updating AppAttempt status, then NPE is thrown when the RM recovers
                 Key: YARN-3536
                 URL: https://issues.apache.org/jira/browse/YARN-3536
             Project: Hadoop YARN
          Issue Type: Bug
          Components: capacityscheduler, resourcemanager
    Affects Versions: 2.4.1
            Reporter: gu-chi

Here is a scenario where the application status is FAILED/FINISHED but the AppAttempt status is null. This causes an NPE during recovery when yarn.resourcemanager.work-preserving-recovery.enabled is set to true.
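The defensive fix implied by this report is a null guard on the recovered attempt state: if the ZK update failed before the attempt's final state was written, recovery must tolerate a null state instead of dereferencing it. The class, enum, and fallback choice below are hypothetical stand-ins for the RM attempt-recovery path, sketched only to show the shape of the check.

```java
// Minimal sketch of tolerating a null recovered attempt state during
// work-preserving recovery. Names are illustrative, not RM source.
public class RecoverSketch {
    enum AttemptState { FINISHED, FAILED, KILLED }

    // Returns a safe state to resume from; a null recovered state falls back
    // to FAILED rather than triggering an NPE later in recovery.
    static AttemptState recoverAttemptState(AttemptState recovered) {
        return recovered != null ? recovered : AttemptState.FAILED;
    }

    public static void main(String[] args) {
        // Attempt state lost in ZK: fall back instead of crashing.
        System.out.println(recoverAttemptState(null));
        // Normal case: the persisted state is used as-is.
        System.out.println(recoverAttemptState(AttemptState.FINISHED));
    }
}
```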