[ 
https://issues.apache.org/jira/browse/HADOOP-2500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558036#action_12558036
 ] 

Bryan Duxbury commented on HADOOP-2500:
---------------------------------------

So, we should:

 * Change the "no good" message to something a tad more descriptive, like 
"assignment of region is invalid"
 * Enumerate the known ways that a RegionServer can fail to serve a region, 
trap those problems, and figure out what responses we'd like to give to those 
events
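
A rough sketch of what the second point could look like, to make the idea concrete. Every name here (RegionOpenFailures, Failure, Response, classify, respond) is made up for illustration and none of it comes from the HBase source; the mapping from failure to response is an assumption, not a decision. The only point is that each known failure gets trapped and mapped to a deliberate response instead of escaping and killing the worker thread.

```java
import java.io.EOFException;
import java.io.FileNotFoundException;
import java.io.IOException;

// Hypothetical sketch: these types do not exist in HBase.
final class RegionOpenFailures {

    /** Assumed categories of "RegionServer could not serve the region". */
    enum Failure { TRUNCATED_FILE, MISSING_FILE, OTHER_IO, UNEXPECTED }

    /** Assumed responses; none of them ends the worker thread. */
    enum Response {
        QUARANTINE_FILE_AND_RETRY,
        MARK_REGION_OFFLINE,
        REASSIGN_ELSEWHERE,
        REPORT_AND_CONTINUE
    }

    /** Map a trapped exception to a failure category, most specific first. */
    static Failure classify(Throwable t) {
        if (t instanceof EOFException) {
            return Failure.TRUNCATED_FILE;   // e.g. a zero-length or cut-off file
        }
        if (t instanceof FileNotFoundException) {
            return Failure.MISSING_FILE;
        }
        if (t instanceof IOException) {
            return Failure.OTHER_IO;
        }
        return Failure.UNEXPECTED;           // NullPointerException and friends
    }

    /** Pick a response for each category (an illustrative policy only). */
    static Response respond(Failure f) {
        switch (f) {
            case TRUNCATED_FILE:
                return Response.QUARANTINE_FILE_AND_RETRY;
            case MISSING_FILE:
                return Response.MARK_REGION_OFFLINE;
            case OTHER_IO:
                return Response.REASSIGN_ELSEWHERE;
            default:
                return Response.REPORT_AND_CONTINUE;
        }
    }

    public static void main(String[] args) {
        Throwable trapped = new EOFException();
        Failure failure = classify(trapped);
        System.out.println(failure + " -> " + respond(failure));
    }
}
```

The checks go from most to least specific because EOFException and FileNotFoundException are both subclasses of IOException.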
 

> [HBase] Unreadable region kills region servers
> ----------------------------------------------
>
>                 Key: HADOOP-2500
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2500
>             Project: Hadoop
>          Issue Type: Bug
>          Components: contrib/hbase
>         Environment: CentOS 5
>            Reporter: Chris Kline
>            Priority: Critical
>
> Background: The name node (also a DataNode and RegionServer) in our cluster 
> ran out of disk space.  I created some space, restarted HDFS and fsck 
> reported corruption with an HBase file.  I cleared up that corruption and 
> restarted HBase.  I was still unable to read anything from HBase even though 
> HDFS was now healthy.
> The following was gathered from the log files.  When HMaster starts up, it 
> finds a region that is no good (Key: 17_125736271):
> 2007-12-24 09:07:14,342 DEBUG org.apache.hadoop.hbase.HMaster: Current 
> assignment of spider_pages,17_125736271,1198286140018 is no good
> HMaster then assigns this region to RegionServer X.60:
> 2007-12-24 09:07:17,126 INFO org.apache.hadoop.hbase.HMaster: assigning 
> region spider_pages,17_125736271,1198286140018 to server 10.100.11.60:60020
> 2007-12-24 09:07:20,152 DEBUG org.apache.hadoop.hbase.HMaster: Received 
> MSG_REPORT_PROCESS_OPEN : spider_pages,17_125736271,1198286140018 from 
> 10.100.11.60:60020
> The RegionServer has trouble reading that region (from the RegionServer log 
> on X.60); note that the worker thread exits:
> 2007-12-24 09:07:22,611 DEBUG org.apache.hadoop.hbase.HStore: starting 
> spider_pages,17_125736271,1198286140018/meta (2062710340/meta with 
> reconstruction log: (/data/hbase1/hregion_2062710340/oldlogfile.log
> 2007-12-24 09:07:22,620 DEBUG org.apache.hadoop.hbase.HStore: maximum 
> sequence id for hstore spider_pages,17_125736271,1198286140018/meta 
> (2062710340/meta) is 4549496
> 2007-12-24 09:07:22,622 ERROR org.apache.hadoop.hbase.HRegionServer: error 
> opening region spider_pages,17_125736271,1198286140018
> java.io.EOFException
>         at java.io.DataInputStream.readFully(DataInputStream.java:180)
>         at java.io.DataInputStream.readFully(DataInputStream.java:152)
>         at 
> org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1383)
>         at 
> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1360)
>         at 
> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1349)
>         at 
> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1344)
>         at org.apache.hadoop.hbase.HStore.doReconstructionLog(HStore.java:697)
>         at org.apache.hadoop.hbase.HStore.<init>(HStore.java:632)
>         at org.apache.hadoop.hbase.HRegion.<init>(HRegion.java:288)
>         at 
> org.apache.hadoop.hbase.HRegionServer.openRegion(HRegionServer.java:1211)
>         at 
> org.apache.hadoop.hbase.HRegionServer$Worker.run(HRegionServer.java:1162)
>         at java.lang.Thread.run(Thread.java:619)
> 2007-12-24 09:07:22,623 FATAL org.apache.hadoop.hbase.HRegionServer: 
> Unhandled exception
> java.lang.NullPointerException
>         at 
> org.apache.hadoop.hbase.HRegionServer.reportClose(HRegionServer.java:1095)
>         at 
> org.apache.hadoop.hbase.HRegionServer.openRegion(HRegionServer.java:1217)
>         at 
> org.apache.hadoop.hbase.HRegionServer$Worker.run(HRegionServer.java:1162)
>         at java.lang.Thread.run(Thread.java:619)
> 2007-12-24 09:07:22,623 INFO org.apache.hadoop.hbase.HRegionServer: worker 
> thread exiting
> The HMaster then tries to assign the same region to X.60 again and fails.  
> The HMaster tries to assign the region to X.31 with the same result (X.31 
> worker thread exits).
> The file it is complaining about, 
> /data/hbase1/hregion_2062710340/oldlogfile.log, is a zero-length file in 
> HDFS.  After deleting that file and restarting HBase, HBase appears to be 
> back to normal.
> One thing I can't figure out is that the HMaster log shows several entries 
> after the worker thread on X.60 has exited, suggesting that the RegionServer 
> is still talking with HMaster:
> 2007-12-24 09:08:23,349 DEBUG org.apache.hadoop.hbase.HMaster: Received 
> MSG_REPORT_PROCESS_OPEN : spider_pages,17_125736271,1198286140018 from 
> 10.100.11.60:60020
> 2007-12-24 09:10:29,543 DEBUG org.apache.hadoop.hbase.HMaster: Received 
> MSG_REPORT_PROCESS_OPEN : spider_pages,17_125736271,1198286140018 from 
> 10.100.11.60:60020
> There is no corresponding entry in the RegionServer's log.
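
The report above traces the EOFException to a zero-length oldlogfile.log. A minimal sketch of a guard for that one case follows, assuming a hypothetical helper named hasReplayableLog and using java.nio.file on a local path as a stand-in for the HDFS file system calls; it is not how HStore handles the reconstruction log, only an illustration of checking the length before handing the file to a reader that expects a header.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch: a local-filesystem stand-in for an HDFS length check.
final class ReconstructionLogGuard {

    /**
     * Returns true only if the log is a regular file with at least one byte.
     * A missing or zero-length log has nothing to replay, so a caller could
     * skip it rather than let the reader hit end-of-file on the header.
     */
    static boolean hasReplayableLog(Path log) throws IOException {
        return Files.isRegularFile(log) && Files.size(log) > 0;
    }

    public static void main(String[] args) throws IOException {
        Path log = Files.createTempFile("oldlogfile", ".log"); // created empty
        try {
            System.out.println("replayable: " + hasReplayableLog(log));
        } finally {
            Files.delete(log);
        }
    }
}
```

A non-empty file can still be truncated partway through, so a check like this would complement, not replace, trapping the EOFException itself.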

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
