[ https://issues.apache.org/jira/browse/HBASE-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13450306#comment-13450306 ]
stack commented on HBASE-3814:
------------------------------

I think the basic idea of a kill switch if the RS is stuck going down is a good one. Let's open a new issue if we see this happen again (even if the scenario is as rare as the one described above seems to be, it seems like a good safety mechanism to have).

> force regionserver to halt
> --------------------------
>
>                 Key: HBASE-3814
>                 URL: https://issues.apache.org/jira/browse/HBASE-3814
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Prakash Khemani
>
> Once abort() on a regionserver is called we should have a timeout thread that
> does Runtime.halt() if the rs gets stuck somewhere during abort processing.
> ===
> Pumahbase132 has the following logs .. the dfsclient is not able to set up a
> write pipeline successfully ... it tries to abort ... but while aborting it
> gets stuck. I know there is a check that if we are aborting because the
> filesystem is closed then we should not try to flush the logs while aborting.
> But in this case the fs is up and running, just that it is not functioning.
> 2011-04-21 23:48:07,082 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream 10.38.131.53:50010 for file /PUMAHBASE002-SNC5-HBASE/.logs/pumahbase132.snc5.facebook.com,60020,1303450732026/pumahbase132.snc5.facebook.com%3A60020.1303450732280java.io.IOException: Bad connect ack with firstBadLink 10.38.133.33:50010
> 2011-04-21 23:48:07,082 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-8967376451767492285_6537229 for file /PUMAHBASE002-SNC5-HBASE/.logs/pumahbase132.snc5.facebook.com,60020,1303450732026/pumahbase132.snc5.facebook.com%3A60020.1303450732280
> 2011-04-21 23:48:07,125 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream 10.38.131.53:50010 for file /PUMAHBASE002-SNC5-HBASE/.logs/pumahbase132.snc5.facebook.com,60020,1303450732026/pumahbase132.snc5.facebook.com%3A60020.1303450732280java.io.IOException: Bad connect ack with firstBadLink 10.38.134.59:50010
> 2011-04-21 23:48:07,125 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_7172251852699100447_6537229 for file /PUMAHBASE002-SNC5-HBASE/.logs/pumahbase132.snc5.facebook.com,60020,1303450732026/pumahbase132.snc5.facebook.com%3A60020.1303450732280
>
> 2011-04-21 23:48:07,169 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream 10.38.131.53:50010 for file /PUMAHBASE002-SNC5-HBASE/.logs/pumahbase132.snc5.facebook.com,60020,1303450732026/pumahbase132.snc5.facebook.com%3A60020.1303450732280java.io.IOException: Bad connect ack with firstBadLink 10.38.134.53:50010
> 2011-04-21 23:48:07,169 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-9153204772467623625_6537229 for file /PUMAHBASE002-SNC5-HBASE/.logs/pumahbase132.snc5.facebook.com,60020,1303450732026/pumahbase132.snc5.facebook.com%3A60020.1303450732280
> 2011-04-21 23:48:07,213 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream 10.38.131.53:50010 for file /PUMAHBASE002-SNC5-HBASE/.logs/pumahbase132.snc5.facebook.com,60020,1303450732026/pumahbase132.snc5.facebook.com%3A60020.1303450732280java.io.IOException: Bad connect ack with firstBadLink 10.38.134.49:50010
> 2011-04-21 23:48:07,213 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-2513098940934276625_6537229 for file /PUMAHBASE002-SNC5-HBASE/.logs/pumahbase132.snc5.facebook.com,60020,1303450732026/pumahbase132.snc5.facebook.com%3A60020.1303450732280
> 2011-04-21 23:48:07,214 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable to create new block.
>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:3560)
>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2700(DFSClient.java:2720)
>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2977)
> 2011-04-21 23:48:07,214 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-2513098940934276625_6537229 bad datanode[1] nodes == null
> 2011-04-21 23:48:07,214 WARN org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source file "/PUMAHBASE002-SNC5-HBASE/.logs/pumahbase132.snc5.facebook.com,60020,1303450732026/pumahbase132.snc5.facebook.com%3A60020.1303450732280" - Aborting...
> 2011-04-21 23:48:07,216 FATAL org.apache.hadoop.hbase.regionserver.wal.HLog: Could not append. Requesting close of hlog
> And then the RS gets stuck trying to roll the logs ...
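
For illustration, the timeout thread proposed in the description might look roughly like the sketch below. It is not HBase code: the class name AbortWatchdog, its methods, and the timeout value are all hypothetical, and the action taken on timeout is passed in as a Runnable only so the timer logic can be exercised without killing the JVM. The one real API it relies on is Runtime.halt(), which terminates the JVM without running shutdown hooks, which is why it can still work when normal shutdown is hung.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper (not part of HBase): fires an action if abort processing hangs. */
final class AbortWatchdog {
    // Counted down once abort processing has finished normally.
    private final CountDownLatch abortFinished = new CountDownLatch(1);
    private final Thread timer;

    AbortWatchdog(long timeoutMillis, Runnable onTimeout) {
        timer = new Thread(() -> {
            try {
                // await() returns false when the timeout elapses first,
                // i.e. abort processing is presumed stuck.
                if (!abortFinished.await(timeoutMillis, TimeUnit.MILLISECONDS)) {
                    onTimeout.run();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "abort-watchdog");
        // Daemon thread, so the watchdog itself never keeps the JVM alive.
        timer.setDaemon(true);
    }

    /** Watchdog whose timeout action is a hard JVM halt (no shutdown hooks run). */
    static AbortWatchdog haltingAfter(long timeoutMillis) {
        return new AbortWatchdog(timeoutMillis, () -> Runtime.getRuntime().halt(1));
    }

    /** Call at the start of abort processing. */
    void start() {
        timer.start();
    }

    /** Call when abort processing completes, to disarm the watchdog. */
    void abortCompleted() {
        abortFinished.countDown();
    }

    /** Waits up to the given time for the timer thread to exit; true if it has exited. */
    boolean awaitDone(long millis) {
        try {
            timer.join(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !timer.isAlive();
    }
}
```

Under these assumptions, an abort path would create the watchdog with AbortWatchdog.haltingAfter(...) and call start() before doing anything that can block on the filesystem (such as rolling or flushing the log), then call abortCompleted() once it is done. If the abort never gets that far, the timer thread halts the process.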