[ https://issues.apache.org/jira/browse/YARN-4736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15169328#comment-15169328 ]
Sangjin Lee commented on YARN-4736:
-----------------------------------

Both the thread dump and the HBase exception log are from the client process (NM side), correct? I believe so because both show HBase client stack traces. Could you confirm? Since these are both from the client side, I am not sure what was going on on the HBase server side. I am also not sure why HBase started refusing connections, as shown in your exception log (that is why I assumed the HBase process might have gone away at that point).

Here is how I put together the sequence of events:

- At some point the HBase process starts refusing connections (hbaseException.log).
- The periodic flush gets trapped in this bad state, finally logging the {{RetriesExhaustedException}} after 36 minutes:
{noformat}
2016-02-26 00:02:28,270 INFO org.apache.hadoop.yarn.server.timelineservice.collector.TimelineCollectorManager: The collector service for application_1456425026132_0001 was removed
2016-02-26 00:39:03,879 ERROR org.apache.hadoop.hbase.client.AsyncProcess: Failed to get region location
org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=36, exceptions:
{noformat}
- Some time after that, it looks like you issued a signal to stop the client process (NM)?
{noformat}
2016-02-26 01:09:19,799 ERROR org.apache.hadoop.yarn.server.nodemanager.NodeManager: RECEIVED SIGNAL 15: SIGTERM
{noformat}
- But the service stop fails to shut down the periodic flush task thread:
{noformat}
2016-02-26 01:09:50,035 WARN org.apache.hadoop.yarn.server.timelineservice.collector.TimelineCollectorManager: failed to stop the flusher task in time. will still proceed to close the writer.
2016-02-26 01:09:50,035 INFO org.apache.hadoop.yarn.server.timelineservice.storage.HBaseTimelineWriterImpl: closing the entity table
{noformat}
- At this point the NM process is hung: the flush is stuck in this state while holding the {{BufferedMutatorImpl}} lock, and closing the entity table needs to acquire that same lock.

That is why I think there is an HBase bug causing the flush operation to be wedged in this state. At least that explains why you were not able to shut down the collector (and therefore the NM).

> Issues with HBaseTimelineWriterImpl
> -----------------------------------
>
>                 Key: YARN-4736
>                 URL: https://issues.apache.org/jira/browse/YARN-4736
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>    Affects Versions: YARN-2928
>            Reporter: Naganarasimha G R
>            Assignee: Vrushali C
>            Priority: Critical
>              Labels: yarn-2928-1st-milestone
>         Attachments: hbaseException.log, threaddump.log
>
>
> Faced some issues while running ATSv2 in a single-node Hadoop cluster, with HBase and embedded ZooKeeper launched on the same node.
> # Due to some NPE issues, I could see the NM trying to shut down, but the NM daemon process did not terminate because of the locks.
> # Got some exceptions related to HBase after an application finished execution successfully.
> Will attach the logs and the trace for the same.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
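To make the described hang concrete, below is a minimal, self-contained Java sketch of the lock interaction in the comment above. {{StubbedMutator}}, the thread name, and the sleep durations are illustrative stand-ins, not the actual YARN or HBase classes: the sketch only models two threads synchronizing on one monitor, with {{flush()}} wedged while holding it and {{close()}} blocking behind it, which is why the NM shutdown never completes.

{noformat}
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/**
 * Illustrative sketch only: models (does not reproduce) a mutator whose
 * flush() and close() synchronize on the same object, as described for
 * BufferedMutatorImpl in the comment above.
 */
public class FlusherDeadlockSketch {

  static class StubbedMutator {
    synchronized void flush() throws InterruptedException {
      // Stands in for the flush that is wedged in HBase client retries
      // (the RetriesExhaustedException after 36 minutes in the attached log).
      TimeUnit.MINUTES.sleep(36);
    }

    synchronized void close() {
      // Cannot start until flush() releases the monitor.
      System.out.println("writer closed");
    }
  }

  public static void main(String[] args) throws Exception {
    StubbedMutator mutator = new StubbedMutator();
    CountDownLatch flusherStarted = new CountDownLatch(1);

    // Periodic flush task: grabs the mutator lock and gets stuck.
    Thread flusher = new Thread(() -> {
      flusherStarted.countDown();
      try {
        mutator.flush();
      } catch (InterruptedException ignored) {
        // Interrupting alone does not help if the client call ignores it.
      }
    }, "timeline-flusher");
    flusher.start();
    flusherStarted.await();
    Thread.sleep(200); // give the flusher time to enter flush()

    // Service stop path (after SIGTERM): closing the writer needs the same
    // lock the stuck flusher holds, so this call never returns -- the same
    // way the NM shutdown hangs in the scenario above.
    mutator.close();
  }
}
{noformat}

Running this sketch hangs at {{mutator.close()}}; a thread dump at that point shows the main thread blocked waiting for the monitor held by the flusher thread, which is the same pattern visible in the attached threaddump.log.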