[jira] [Created] (YARN-5403) yarn top command does not execute correctly

2016-07-19 Thread gu-chi (JIRA)
gu-chi created YARN-5403:


 Summary: yarn top command does not execute correctly
 Key: YARN-5403
 URL: https://issues.apache.org/jira/browse/YARN-5403
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Affects Versions: 2.7.2
Reporter: gu-chi


When I execute {{yarn top}}, I always get the exception below:
{quote}
16/07/19 19:55:12 ERROR cli.TopCLI: Could not fetch RM start time
java.net.ConnectException: Connection refused
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:204)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:589)
    at java.net.Socket.connect(Socket.java:538)
    at sun.net.NetworkClient.doConnect(NetworkClient.java:180)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
    at sun.net.www.http.HttpClient.<init>(HttpClient.java:211)
    at sun.net.www.http.HttpClient.New(HttpClient.java:308)
    at sun.net.www.http.HttpClient.New(HttpClient.java:326)
    at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1169)
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1105)
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:999)
    at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:933)
    at org.apache.hadoop.yarn.client.cli.TopCLI.getRMStartTime(TopCLI.java:747)
    at org.apache.hadoop.yarn.client.cli.TopCLI.run(TopCLI.java:443)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
    at org.apache.hadoop.yarn.client.cli.TopCLI.main(TopCLI.java:421)
YARN top - 19:55:13, up 17001d, 11:55, 0 active users, queue(s): root
{quote}

As I looked into it, the {{getRMStartTime}} function hardcodes HTTP regardless of 
the {{yarn.http.policy}} setting; it should check whether HTTPS should be used instead.
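For illustration only, here is a minimal sketch of what policy-aware scheme selection could look like. This is not a patch; the class and method names are invented, and it only assumes the {{YarnConfiguration}} helpers behave as they do in 2.7:
{code:java}
// Illustrative sketch only: pick the RM web scheme from yarn.http.policy
// instead of hardcoding "http://". Class name is hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class RmWebUrlSketch {
  static String rmWebAppUrl(Configuration conf) {
    // useHttps() resolves yarn.http.policy (HTTPS_ONLY vs. HTTP_ONLY).
    boolean https = YarnConfiguration.useHttps(conf);
    String address = https
        ? conf.get(YarnConfiguration.RM_WEBAPP_HTTPS_ADDRESS,
                   YarnConfiguration.DEFAULT_RM_WEBAPP_HTTPS_ADDRESS)
        : conf.get(YarnConfiguration.RM_WEBAPP_ADDRESS,
                   YarnConfiguration.DEFAULT_RM_WEBAPP_ADDRESS);
    return (https ? "https://" : "http://") + address;
  }
}
{code}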






[jira] [Resolved] (YARN-3678) DelayedProcessKiller may kill a process other than the container

2016-05-10 Thread gu-chi (JIRA)

 [ https://issues.apache.org/jira/browse/YARN-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

gu-chi resolved YARN-3678.
--
Resolution: Duplicate

> DelayedProcessKiller may kill a process other than the container
> 
>
> Key: YARN-3678
> URL: https://issues.apache.org/jira/browse/YARN-3678
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.6.0, 2.7.2
>Reporter: gu-chi
>Priority: Critical
>
> Suppose a container has finished and is doing cleanup. The PID file still 
> exists and triggers one more signalContainer, which kills the process with 
> the PID in the PID file. But since the container has already finished, that 
> PID may by then be occupied by another process, and this can cause a serious 
> issue.
> In my case, the NM was killed unexpectedly, and what I described above can be 
> the cause, even though it occurs only rarely.






[jira] [Resolved] (YARN-4536) DelayedProcessKiller may not work under heavy workload

2016-01-04 Thread gu-chi (JIRA)

 [ https://issues.apache.org/jira/browse/YARN-4536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

gu-chi resolved YARN-4536.
--
Resolution: Not A Problem

After further analysis, this was introduced by a custom modification on our side; 
sorry for the noise.

> DelayedProcessKiller may not work under heavy workload
> --
>
> Key: YARN-4536
> URL: https://issues.apache.org/jira/browse/YARN-4536
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.7.1
>Reporter: gu-chi
>
> I am now facing orphan container processes. Here is the scenario:
> Under a heavy task load, the NM machine's CPU usage can reach almost 100%. 
> When a container gets a kill event, it receives a {{SIGTERM}} and the parent 
> process exits, leaving the container process to the OS. The container process 
> still needs to handle some shutdown events or other logic, but it can hardly 
> get any CPU. We would expect a {{SIGKILL}} to follow, since there is a 
> {{DelayedProcessKiller}}, but the parent process whose PID was persisted as 
> the container PID no longer exists, so the kill command cannot reach the 
> container process. This is how the orphan container process comes about.
> The orphan process does exit eventually, but the period can be very long and 
> makes the OS state worse; as I observed, it can be several hours.





[jira] [Created] (YARN-4536) DelayedProcessKiller may not work under heavy workload

2016-01-04 Thread gu-chi (JIRA)
gu-chi created YARN-4536:


 Summary: DelayedProcessKiller may not work under heavy workload
 Key: YARN-4536
 URL: https://issues.apache.org/jira/browse/YARN-4536
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.7.1
Reporter: gu-chi


I am now facing orphan container processes. Here is the scenario:
Under a heavy task load, the NM machine's CPU usage can reach almost 100%. When a 
container gets a kill event, it receives a {{SIGTERM}} and the parent process 
exits, leaving the container process to the OS. The container process still needs 
to handle some shutdown events or other logic, but it can hardly get any CPU. We 
would expect a {{SIGKILL}} to follow, since there is a {{DelayedProcessKiller}}, 
but the parent process whose PID was persisted as the container PID no longer 
exists, so the kill command cannot reach the container process. This is how the 
orphan container process comes about.
The orphan process does exit eventually, but the period can be very long and 
makes the OS state worse; as I observed, it can be several hours.
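
Just to illustrate the escalation I would expect (SIGTERM, then a delayed SIGKILL that targets the whole process group, so it does not depend on the parent still being alive), here is a rough, Linux-only sketch. The names and structure are invented for illustration and are not the NodeManager code:
{code:java}
// Illustration only (not NodeManager code): SIGTERM the container process,
// then SIGKILL its whole process group after a grace period, so the kill
// still works even if the parent/launcher process has already exited.
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class DelayedGroupKillSketch {
  private static final ScheduledExecutorService SCHEDULER =
      Executors.newSingleThreadScheduledExecutor();

  // Sends the given signal; with processGroup=true, "kill -s SIG -- -PID"
  // targets the whole process group whose group id equals PID (Linux).
  static void signal(String pid, String sig, boolean processGroup) throws Exception {
    String target = processGroup ? "-" + pid : pid;
    new ProcessBuilder("kill", "-s", sig, "--", target).inheritIO().start().waitFor();
  }

  static void killContainer(String pid, long gracePeriodMs) throws Exception {
    signal(pid, "TERM", false);
    SCHEDULER.schedule(() -> {
      try {
        signal(pid, "KILL", true); // group kill does not rely on the parent
      } catch (Exception ignored) {
        // best effort
      }
    }, gracePeriodMs, TimeUnit.MILLISECONDS);
  }
}
{code}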





[jira] [Created] (YARN-4481) negative pending resources of queues leave applications in ACCEPTED state indefinitely

2015-12-18 Thread gu-chi (JIRA)
gu-chi created YARN-4481:


 Summary: negative pending resources of queues leave applications 
in ACCEPTED state indefinitely
 Key: YARN-4481
 URL: https://issues.apache.org/jira/browse/YARN-4481
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacity scheduler
Affects Versions: 2.7.2
Reporter: gu-chi
Priority: Critical


I met a scenario of negative pending resources with the capacity scheduler. In 
JMX it shows:
{noformat}
"PendingMB" : -4096,
"PendingVCores" : -1,
"PendingContainers" : -1,
{noformat}
The full JMX information is attached.
This is not just a JMX UI issue; the actual pending resource of the queue is 
also negative, as I can see from this debug log:
bq. DEBUG | ResourceManager Event Processor | Skip this queue=root, because it doesn't need more resource, schedulingMode=RESPECT_PARTITION_EXCLUSIVITY node-partition= | ParentQueue.java
This leads to the {{NULL_ASSIGNMENT}}.
The background: hundreds of applications were submitted, consuming all cluster 
resources, so reservations happened. While they were running, network faults 
were injected by a tool (injection types: delay, jitter, repeat, packet loss, 
and disorder), and then most of the submitted applications were killed.

Is anyone else facing negative pending resources, or does anyone have an idea 
of how this can happen?
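
For anyone who wants to check their own cluster, here is a rough sketch of the check I did by hand. It simply scans the RM's /jmx output for the fields quoted above; the RM address is assumed to be http://localhost:8088, adjust it to your yarn.resourcemanager.webapp.address:
{code:java}
// Rough check, not a fix: scan the RM /jmx output for negative pending metrics.
// Assumes the RM web UI is reachable at http://localhost:8088.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class NegativePendingCheck {
  public static void main(String[] args) throws Exception {
    URL jmx = new URL("http://localhost:8088/jmx");
    try (BufferedReader in = new BufferedReader(new InputStreamReader(jmx.openStream()))) {
      String line;
      while ((line = in.readLine()) != null) {
        String t = line.trim();
        // e.g. "PendingMB" : -4096,
        if (t.matches("\"Pending(MB|VCores|Containers)\"\\s*:\\s*-\\d+,?")) {
          System.out.println("Negative pending metric: " + t);
        }
      }
    }
  }
}
{code}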





[jira] [Created] (YARN-3730) scheduler reserves more resources than required

2015-05-27 Thread gu-chi (JIRA)
gu-chi created YARN-3730:


 Summary: scheduler reserves more resources than required
 Key: YARN-3730
 URL: https://issues.apache.org/jira/browse/YARN-3730
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Reporter: gu-chi


Using the capacity scheduler, the environment is 3 NMs with 9 vcores each. I ran 
a Spark job with 4 executors, each executor requesting 5 cores. As expected, 
only 1 executor cannot start and should be reserved, but in fact more containers 
than that are reserved. Because of this, I cannot run other, smaller tasks. 
Looking into the capacity scheduler, the {{needContainers}} method in 
LeafQueue.java includes a 'starvation' computation, and this causes more 
containers to be reserved than required. Any ideas or suggestions on this?
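
To make the concern concrete, here is a stripped-down paraphrase of the check as I read it. This is not a verbatim copy of {{LeafQueue.needContainers}}; the point is only that a positive 'starvation' term lets the condition pass even when existing reservations already cover the outstanding request:
{code:java}
// Paraphrased sketch, not the real LeafQueue code: with starvation == 0 the
// queue reserves at most what is still required; a positive starvation term
// allows further reservations beyond that.
public class NeedContainersSketch {
  static boolean needContainers(int requiredContainers, int reservedContainers, int starvation) {
    return (starvation + requiredContainers) - reservedContainers > 0;
  }

  public static void main(String[] args) {
    // One outstanding 5-vcore executor, one container already reserved for it:
    System.out.println(needContainers(1, 1, 0)); // false -> stop reserving
    System.out.println(needContainers(1, 1, 2)); // true  -> keep reserving
  }
}
{code}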





[jira] [Created] (YARN-3678) DelayedProcessKiller may kill a process other than the container

2015-05-19 Thread gu-chi (JIRA)
gu-chi created YARN-3678:


 Summary: DelayedProcessKiller may kill a process other than the 
container
 Key: YARN-3678
 URL: https://issues.apache.org/jira/browse/YARN-3678
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.6.0
Reporter: gu-chi
Priority: Critical


Suppose a container has finished and is doing cleanup. The PID file still exists 
and triggers one more signalContainer, which kills the process with the PID in 
the PID file. But since the container has already finished, that PID may by then 
be occupied by another process, and this can cause a serious issue.
In my case, the NM was killed unexpectedly, and what I described above can be 
the cause, even though it occurs only rarely.
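
Purely as an illustration of the kind of guard I have in mind (not the actual NodeManager code; the /proc check and the method names here are invented), something like this would avoid signalling a recycled PID:
{code:java}
// Illustration only: before killing, re-check that /proc/<pid> still looks like
// the container's launch command, so a recycled PID is not signalled by mistake.
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class PidReuseGuardSketch {
  static boolean looksLikeContainerProcess(String pid, String containerIdMarker) {
    try {
      Path cmdline = Paths.get("/proc", pid, "cmdline");
      if (!Files.exists(cmdline)) {
        return false; // process already gone, nothing to kill
      }
      String cmd = new String(Files.readAllBytes(cmdline)).replace('\0', ' ');
      return cmd.contains(containerIdMarker); // e.g. the container ID in the launch command
    } catch (Exception e) {
      return false; // when in doubt, do not kill
    }
  }

  static void safeKill(String pid, String containerId) throws Exception {
    if (looksLikeContainerProcess(pid, containerId)) {
      new ProcessBuilder("kill", "-9", pid).inheritIO().start().waitFor();
    }
  }
}
{code}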





[jira] [Created] (YARN-3536) ZK exception occurs when updating AppAttempt status, then NPE is thrown when RM recovers

2015-04-23 Thread gu-chi (JIRA)
gu-chi created YARN-3536:


 Summary: ZK exception occurs when updating AppAttempt status, then 
NPE is thrown when RM recovers
 Key: YARN-3536
 URL: https://issues.apache.org/jira/browse/YARN-3536
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler, resourcemanager
Affects Versions: 2.4.1
Reporter: gu-chi


Here is a scenario where the Application status is FAILED/FINISHED but the 
AppAttempt status is null; this causes an NPE during recovery when 
yarn.resourcemanager.work-preserving-recovery.enabled is set to true.
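
To make the failure mode concrete, here is a tiny hedged illustration (invented names, not the RM recovery code): the application's final state is persisted, but the attempt state read back from ZK is null, and anything that dereferences it during work-preserving recovery throws an NPE unless it is guarded:
{code:java}
// Illustration only, not ResourceManager code: the attempt state read back from
// the store can be null even though the application reached FAILED/FINISHED,
// so recovery needs an explicit guard instead of dereferencing it directly.
public class AttemptRecoverySketch {
  enum AttemptState { FINISHED, FAILED, KILLED }

  static AttemptState recoverAttemptState(AttemptState persisted, String attemptId) {
    if (persisted == null) {
      // Without this check, e.g. switch (persisted) throws a NullPointerException
      // during work-preserving recovery.
      throw new IllegalStateException("No state stored for attempt " + attemptId);
    }
    return persisted;
  }
}
{code}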


