[ https://issues.apache.org/jira/browse/HADOOP-2500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558036#action_12558036 ]
Bryan Duxbury commented on HADOOP-2500: --------------------------------------- So, we should: * Change the "no good" message to something a tad more descriptive, like "assignment of region is invalid" * Enumerate the known ways that a RegionServer can fail to serve a region, trap those problems, and figure out what responses we'd like to give to those events > [HBase] Unreadable region kills region servers > ---------------------------------------------- > > Key: HADOOP-2500 > URL: https://issues.apache.org/jira/browse/HADOOP-2500 > Project: Hadoop > Issue Type: Bug > Components: contrib/hbase > Environment: CentOS 5 > Reporter: Chris Kline > Priority: Critical > > Backgound: The name node (also a DataNode and RegionServer) in our cluster > ran out of disk space. I created some space, restarted HDFS and fsck > reported corruption with an HBase file. I cleared up that corruption and > restarted HBase. I was still unable to read anything from HBase even though > HSFS was now healthy. > The following was gather from the log files. When HMaster starts up, it > finds a region that is no good (Key: 17_125736271): > 2007-12-24 09:07:14,342 DEBUG org.apache.hadoop.hbase.HMaster: Current > assignment of spider_pages,17_125736271,1198286140018 is no good > HMaster then assigns this region to RegionServer X.60: > 2007-12-24 09:07:17,126 INFO org.apache.hadoop.hbase.HMaster: assigning > region spider_pages,17_125736271,1198286140018 to server 10.100.11.60:60020 > 2007-12-24 09:07:20,152 DEBUG org.apache.hadoop.hbase.HMaster: Received > MSG_REPORT_PROCESS_OPEN : spider_pages,17_125736271,1198286140018 from > 10.100.11.60:60020 > The RegionServer has trouble reading that region (from the RegionServer log > on X.60); Note that the worker thread exits > 2007-12-24 09:07:22,611 DEBUG org.apache.hadoop.hbase.HStore: starting > spider_pages,17_125736271,1198286140018/meta (2062710340/meta with > reconstruction log: (/data/hbase1/hregion_2062710340/oldlogfile.log > 2007-12-24 09:07:22,620 DEBUG org.apache.hadoop.hbase.HStore: maximum > sequence id for hstore spider_pages,17_125736271,1198286140018/meta > (2062710340/meta) is 4549496 > 2007-12-24 09:07:22,622 ERROR org.apache.hadoop.hbase.HRegionServer: error > opening region spider_pages,17_125736271,1198286140018 > java.io.EOFException > at java.io.DataInputStream.readFully(DataInputStream.java:180) > at java.io.DataInputStream.readFully(DataInputStream.java:152) > at > org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1383) > at > org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1360) > at > org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1349) > at > org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1344) > at org.apache.hadoop.hbase.HStore.doReconstructionLog(HStore.java:697) > at org.apache.hadoop.hbase.HStore.<init>(HStore.java:632) > at org.apache.hadoop.hbase.HRegion.<init>(HRegion.java:288) > at > org.apache.hadoop.hbase.HRegionServer.openRegion(HRegionServer.java:1211) > at > org.apache.hadoop.hbase.HRegionServer$Worker.run(HRegionServer.java:1162) > at java.lang.Thread.run(Thread.java:619) > 2007-12-24 09:07:22,623 FATAL org.apache.hadoop.hbase.HRegionServer: > Unhandled exception > java.lang.NullPointerException > at > org.apache.hadoop.hbase.HRegionServer.reportClose(HRegionServer.java:1095) > at > org.apache.hadoop.hbase.HRegionServer.openRegion(HRegionServer.java:1217) > at > org.apache.hadoop.hbase.HRegionServer$Worker.run(HRegionServer.java:1162) > at java.lang.Thread.run(Thread.java:619) > 2007-12-24 09:07:22,623 INFO org.apache.hadoop.hbase.HRegionServer: worker > thread exiting > The HMaster then tries to assign the same region to X.60 again and fails. > The HMaster tries to assign the region to X.31 with the same result (X.31 > worker thread exits). > The file it is complaining about, > /data/hbase1/hregion_2062710340/oldlogfile.log, is a zero-length file in > HDFS. After deleting that file and restarting HBase, HBase appears to be > back to normal. > One thing I can't figure out is that the HMaster log show several entries > after the worker thread on X.60 has exited suggesting that the RegionServer > is talking with HMaster: > 2007-12-24 09:08:23,349 DEBUG org.apache.hadoop.hbase.HMaster: Received > MSG_REPORT_PROCESS_OPEN : spider_pages,17_125736271,1198286140018 from > 10.100.11.60:60020 > 2007-12-24 09:10:29,543 DEBUG org.apache.hadoop.hbase.HMaster: Received > MSG_REPORT_PROCESS_OPEN : spider_pages,17_125736271,1198286140018 from > 10.100.11.60:60020 > There is no corresponding entry in the RegionServer's log. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.