Hi George,

     The symptoms of YARN-7163 are: the RM UI shows old completed jobs as
Running, and heap and CPU usage are high.

High CPU usage usually occurs during continuous Full GC, which in turn
causes an OOM if no more heap is available to allocate new objects. High
CPU usage can therefore be a symptom of high heap usage.
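To confirm GC pressure on the RM, a quick check with jstat (shipped with the
JDK) is usually enough. A sketch, assuming the ResourceManager main class name
appears in jps output on your nodes:

```shell
# Locate the ResourceManager JVM (assumes jps lists its main class name)
RM_PID=$(jps | awk '/ResourceManager/ {print $1}')

# Sample GC counters three times, 5 s apart. A steadily growing FGC
# (Full GC count) column together with OU (old-gen utilization) pinned
# near 100% indicates continuous Full GC.
[ -n "$RM_PID" ] && jstat -gcutil "$RM_PID" 5000 3
```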

1. Can you check whether the jobs shown as Running are actually already
completed?

2. A heap dump from the RM, taken while the UI shows old completed jobs as
Running, will help confirm this: it should match the image [1], where the
RMActiveServiceContext applications map still holds the completed
applications.
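The dump itself can be captured with jmap. A sketch, assuming the same jps
lookup as above; the dump path is only an example, and -dump:live will
briefly pause the RM JVM while it forces a Full GC:

```shell
# Locate the ResourceManager JVM (assumes jps lists its main class name)
RM_PID=$(jps | awk '/ResourceManager/ {print $1}')

# Example dump path; pick a volume with free space on the order of the
# RM heap size
DUMP_FILE="/tmp/rm-heap-$(date +%Y%m%d%H%M).hprof"

# -dump:live keeps only reachable objects (forces a Full GC first)
[ -n "$RM_PID" ] && jmap -dump:live,format=b,file="$DUMP_FILE" "$RM_PID"
```

The resulting .hprof file can then be opened in Eclipse MAT or jhat to
inspect the RMActiveServiceContext applications map for completed entries.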

Also check comment [2] and YARN-7065 (a duplicate of YARN-7163), which
match the issue reported.

[1] https://issues.apache.org/jira/secure/attachment/12885607/suspect-1.png
[2]
https://issues.apache.org/jira/browse/YARN-7163?focusedCommentId=16158652&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16158652

Thanks,
Prabhu Joseph



On Tue, Apr 2, 2019 at 8:27 PM George Liaw <george.a.l...@gmail.com> wrote:

> Hi Prabhu,
>
> Unfortunately I don't believe that is the same issue we are seeing. We are
> experiencing high CPU usage and we are not getting OOM errors.
>
> Is there reason to believe they're the same issue?
>
>
> On Tue, Apr 2, 2019, 2:15 AM Prabhu Josephraj <pjos...@cloudera.com>
> wrote:
>
>> Hi George,
>>
>>     Have seen this issue - the RM UI will show the old job list and the RM
>> process heap usage will be high. This is due to a bug fixed by YARN-7163.
>> Can you test with the patch from YARN-7163?
>>
>> Thanks,
>> Prabhu Joseph
>>
>>
>> On Tue, Apr 2, 2019 at 4:59 AM George Liaw <george.a.l...@gmail.com>
>> wrote:
>>
>>> Hi all,
>>>
>>> Using Hadoop 2.7.2.
>>> Wondering if anyone's seen an issue before where every once in a while
>>> the resource manager gets into a weird state where the Applications
>>> dashboard shows jobs running, but there are no actual jobs running on the
>>> cluster. When this happens we'll see RM cpu usage flat-lining at very high
>>> levels (around 85%), but the datanodes/nodemanagers will have no load
>>> because of no jobs running. If we restart the RM and let it fail over to
>>> the stand-by, the cluster will go back to normal behavior and start running
>>> jobs again after 15-30 minutes.
>>>
>>> Bit of a strange situation - not entirely sure why the RM would fail to
>>> realize that the jobs have finished running and that the jobs sitting in
>>> accepted state are free to run. Also strange that the RM gets stuck at high
>>> cpu usage.
>>>
>>> If anyone can point me in the right direction on how to debug or resolve
>>> this, that would be much appreciated!
>>>
>>> --
>>> George A. Liaw
>>>
>>> (408) 318-7920
>>> george.a.l...@gmail.com
>>> LinkedIn <http://www.linkedin.com/in/georgeliaw/>
>>>
>>
