[ https://issues.apache.org/jira/browse/YARN-4736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15169328#comment-15169328 ]

Sangjin Lee commented on YARN-4736:
-----------------------------------

Both the thread dump and the HBase exception log are from the client process 
(NM side), correct? I believe so because both show HBase client stack traces. 
Could you confirm?

Since these are both from the client side, I am not sure what was happening on 
the HBase server side. I'm also not sure why HBase started refusing 
connections, as evidenced by your exception log (that's why I assumed the 
HBase process might have gone away at that point).

Here is how I put together the sequence of events:
- at some point the HBase process starts refusing connections 
(hbaseException.log)
- the periodic flush gets trapped in this bad state, finally logging the 
{{RetriesExhaustedException}} after 36 minutes
{noformat}
2016-02-26 00:02:28,270 INFO 
org.apache.hadoop.yarn.server.timelineservice.collector.TimelineCollectorManager:
 The collector service for application_1456425026132_0001 was removed
2016-02-26 00:39:03,879 ERROR org.apache.hadoop.hbase.client.AsyncProcess: 
Failed to get region location 
org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after 
attempts=36, exceptions:
{noformat}
- some time after that, it looks like you issued a signal to stop the client 
process (NM)?
{noformat}
2016-02-26 01:09:19,799 ERROR 
org.apache.hadoop.yarn.server.nodemanager.NodeManager: RECEIVED SIGNAL 15: 
SIGTERM
{noformat}
- but the service stop fails to shut down the periodic flush task thread
{noformat}
2016-02-26 01:09:50,035 WARN 
org.apache.hadoop.yarn.server.timelineservice.collector.TimelineCollectorManager:
 failed to stop the flusher task in time. will still proceed to close the 
writer.
2016-02-26 01:09:50,035 INFO 
org.apache.hadoop.yarn.server.timelineservice.storage.HBaseTimelineWriterImpl: 
closing the entity table
{noformat}
- at this point the NM process is hung: the flush is stuck while holding the 
{{BufferedMutatorImpl}} lock, and closing the entity table needs to acquire 
that same lock (see the sketches below)
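
To make the sequence concrete, the stop path has roughly this shape. This is 
only a minimal sketch of the pattern; the executor, the 30-second timeout, the 
class and method names, and the writer interface are simplified assumptions, 
not the actual TimelineCollectorManager code:

{code:java}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Minimal sketch of the stop sequence described above; names and the
// 30-second timeout are illustrative assumptions, not the real code.
public class CollectorStopSketch {

  private final ScheduledExecutorService flusher =
      Executors.newSingleThreadScheduledExecutor();
  private final AutoCloseable writer; // e.g. an HBase-backed timeline writer

  public CollectorStopSketch(AutoCloseable writer, Runnable flushTask) {
    this.writer = writer;
    // periodic flush; if flushTask blocks inside the HBase client,
    // this thread never becomes idle
    flusher.scheduleAtFixedRate(flushTask, 60, 60, TimeUnit.SECONDS);
  }

  public void stop() throws Exception {
    flusher.shutdown();
    if (!flusher.awaitTermination(30, TimeUnit.SECONDS)) {
      // corresponds to the "failed to stop the flusher task in time" warning
      System.err.println("failed to stop the flusher task in time. "
          + "will still proceed to close the writer.");
    }
    // if the wedged flush still holds the BufferedMutatorImpl lock,
    // this close() blocks and the NM never exits
    writer.close();
  }
}
{code}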

That's why I think there may be an HBase bug that is causing the flush 
operation to be wedged in this state. At least that would explain why you were 
not able to shut down the collector (and therefore the NM).
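
The hang itself is plain lock contention between the wedged flush and the 
close. Something like the following illustrates the shape of it; this is a 
simplified stand-in for the locking pattern, not HBase's actual 
{{BufferedMutatorImpl}} internals:

{code:java}
// Simplified illustration of the contention described above; the lock and
// method names are stand-ins, not HBase's actual BufferedMutatorImpl code.
public class MutatorContentionSketch {

  // stand-in for the internal BufferedMutatorImpl lock
  private final Object mutatorLock = new Object();

  // flusher thread: wedged in retries while holding the lock
  void periodicFlush() {
    synchronized (mutatorLock) {
      // stuck here retrying against an HBase that is refusing connections,
      // eventually logging RetriesExhaustedException (attempts=36)
      blockedHBaseClientCall();
    }
  }

  // shutdown thread: "closing the entity table" needs the same lock
  void closeEntityTable() {
    synchronized (mutatorLock) {
      // never reached while periodicFlush() is wedged, so the NM hangs
    }
  }

  // placeholder for the HBase client call that never returns
  private void blockedHBaseClientCall() {
    try {
      Thread.sleep(Long.MAX_VALUE);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
}
{code}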

> Issues with HBaseTimelineWriterImpl
> -----------------------------------
>
>                 Key: YARN-4736
>                 URL: https://issues.apache.org/jira/browse/YARN-4736
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>    Affects Versions: YARN-2928
>            Reporter: Naganarasimha G R
>            Assignee: Vrushali C
>            Priority: Critical
>              Labels: yarn-2928-1st-milestone
>         Attachments: hbaseException.log, threaddump.log
>
>
> Faced some issues while running ATSv2 in a single-node Hadoop cluster, with 
> HBase (with embedded ZooKeeper) launched on the same node.
> # Due to some NPE issues I could see the NM was trying to shut down, but the 
> NM daemon process did not complete shutdown due to the locks.
> # Got some exceptions related to HBase after the application finished 
> execution successfully. 
> Will attach the logs and the trace for the same.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
