[
https://issues.apache.org/jira/browse/HADOOP-2500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558032#action_12558032
]
Jim Kellerman commented on HADOOP-2500:
---------------------------------------
Bryan Duxbury wrote:
> At the very least, we should not assign a region to a region server if it is
> detected as "no good".
That is an unfortunate wording of a log message in the Master. It is saying
that the current assignment of the region is no good because the information
it read from the meta region had a server or start code that did not match a
known server. It does not mean that the master thinks the region itself is no
good.
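To make that concrete, the test behind the message boils down to comparing the
server address and start code recorded in the meta row against the master's
current list of live servers; roughly like this (a minimal sketch with
illustrative names, not the real HMaster fields):

import java.util.Map;

// Illustrative only: names are hypothetical, not actual HMaster code.
// The meta region records, for each region, the address and start code of
// the server that last claimed it. If either no longer matches a server the
// master knows about, the *assignment* is "no good" and the region gets
// reassigned; nothing is implied about the region's data.
class AssignmentCheck {
  static boolean assignmentIsGood(String storedServerAddress,
                                  long storedStartCode,
                                  Map<String, Long> liveServerStartCodes) {
    Long liveStartCode = liveServerStartCodes.get(storedServerAddress);
    return liveStartCode != null && liveStartCode == storedStartCode;
  }
}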
> Also, if a RegionServer tries to access a region and it has difficulties, it
> should report to the master that it can't read the region, and the master
> should stop trying to serve it.
> From a more general standpoint, maybe when a bad region is detected, its
> files should be moved to a different location and generally excluded from
> the cluster. This would allow you to recover from problems better.
Yes, we absolutely need to do something; I'm just not sure exactly what yet.
One thing is certain: zero-length files should be ignored/deleted.
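As a first cut, HStore.doReconstructionLog could check for a missing or
zero-length log before handing the path to SequenceFile.Reader, which is what
throws the EOFException in the trace below. A minimal sketch; the method shape
and names here are illustrative, not the actual HStore code:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;

// Sketch of a guard for the reconstruction-log open; not the exact
// HStore.doReconstructionLog signature.
class ReconstructionLogGuard {
  static SequenceFile.Reader openReconstructionLog(FileSystem fs,
      Path reconstructionLog, Configuration conf) throws IOException {
    if (reconstructionLog == null || !fs.exists(reconstructionLog)) {
      return null;                                 // nothing to replay
    }
    if (fs.getFileStatus(reconstructionLog).getLen() == 0) {
      // A zero-length oldlogfile.log has no edits to replay; opening it with
      // SequenceFile.Reader just throws EOFException. Ignore/delete instead.
      fs.delete(reconstructionLog, false);
      return null;
    }
    return new SequenceFile.Reader(fs, reconstructionLog, conf);
  }
}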
> [HBase] Unreadable region kills region servers
> ----------------------------------------------
>
> Key: HADOOP-2500
> URL: https://issues.apache.org/jira/browse/HADOOP-2500
> Project: Hadoop
> Issue Type: Bug
> Components: contrib/hbase
> Environment: CentOS 5
> Reporter: Chris Kline
> Priority: Critical
>
> Background: The name node (also a DataNode and RegionServer) in our cluster
> ran out of disk space. I created some space, restarted HDFS, and fsck
> reported corruption in an HBase file. I cleared up that corruption and
> restarted HBase. I was still unable to read anything from HBase even though
> HDFS was now healthy.
> The following was gathered from the log files. When HMaster starts up, it
> finds a region that is no good (Key: 17_125736271):
> 2007-12-24 09:07:14,342 DEBUG org.apache.hadoop.hbase.HMaster: Current
> assignment of spider_pages,17_125736271,1198286140018 is no good
> HMaster then assigns this region to RegionServer X.60:
> 2007-12-24 09:07:17,126 INFO org.apache.hadoop.hbase.HMaster: assigning
> region spider_pages,17_125736271,1198286140018 to server 10.100.11.60:60020
> 2007-12-24 09:07:20,152 DEBUG org.apache.hadoop.hbase.HMaster: Received
> MSG_REPORT_PROCESS_OPEN : spider_pages,17_125736271,1198286140018 from
> 10.100.11.60:60020
> The RegionServer has trouble reading that region (from the RegionServer log
> on X.60); note that the worker thread exits:
> 2007-12-24 09:07:22,611 DEBUG org.apache.hadoop.hbase.HStore: starting
> spider_pages,17_125736271,1198286140018/meta (2062710340/meta with
> reconstruction log: (/data/hbase1/hregion_2062710340/oldlogfile.log
> 2007-12-24 09:07:22,620 DEBUG org.apache.hadoop.hbase.HStore: maximum
> sequence id for hstore spider_pages,17_125736271,1198286140018/meta
> (2062710340/meta) is 4549496
> 2007-12-24 09:07:22,622 ERROR org.apache.hadoop.hbase.HRegionServer: error
> opening region spider_pages,17_125736271,1198286140018
> java.io.EOFException
> at java.io.DataInputStream.readFully(DataInputStream.java:180)
> at java.io.DataInputStream.readFully(DataInputStream.java:152)
> at
> org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1383)
> at
> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1360)
> at
> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1349)
> at
> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1344)
> at org.apache.hadoop.hbase.HStore.doReconstructionLog(HStore.java:697)
> at org.apache.hadoop.hbase.HStore.<init>(HStore.java:632)
> at org.apache.hadoop.hbase.HRegion.<init>(HRegion.java:288)
> at
> org.apache.hadoop.hbase.HRegionServer.openRegion(HRegionServer.java:1211)
> at
> org.apache.hadoop.hbase.HRegionServer$Worker.run(HRegionServer.java:1162)
> at java.lang.Thread.run(Thread.java:619)
> 2007-12-24 09:07:22,623 FATAL org.apache.hadoop.hbase.HRegionServer:
> Unhandled exception
> java.lang.NullPointerException
> at
> org.apache.hadoop.hbase.HRegionServer.reportClose(HRegionServer.java:1095)
> at
> org.apache.hadoop.hbase.HRegionServer.openRegion(HRegionServer.java:1217)
> at
> org.apache.hadoop.hbase.HRegionServer$Worker.run(HRegionServer.java:1162)
> at java.lang.Thread.run(Thread.java:619)
> 2007-12-24 09:07:22,623 INFO org.apache.hadoop.hbase.HRegionServer: worker
> thread exiting
> The HMaster then tries to assign the same region to X.60 again and fails. It
> then tries to assign the region to X.31 with the same result (the X.31
> worker thread exits).
> The file it is complaining about,
> /data/hbase1/hregion_2062710340/oldlogfile.log, is a zero-length file in
> HDFS. After deleting that file and restarting HBase, HBase appears to be
> back to normal.
> One thing I can't figure out is that the HMaster log shows several entries
> after the worker thread on X.60 has exited, suggesting that the RegionServer
> is still talking with HMaster:
> 2007-12-24 09:08:23,349 DEBUG org.apache.hadoop.hbase.HMaster: Received
> MSG_REPORT_PROCESS_OPEN : spider_pages,17_125736271,1198286140018 from
> 10.100.11.60:60020
> 2007-12-24 09:10:29,543 DEBUG org.apache.hadoop.hbase.HMaster: Received
> MSG_REPORT_PROCESS_OPEN : spider_pages,17_125736271,1198286140018 from
> 10.100.11.60:60020
> There is no corresponding entry in the RegionServer's log.
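Regarding the manual fix above (removing the zero-length oldlogfile.log):
until HBase handles these itself, a small utility along the following lines
could flag such files before a restart. The class name is made up and the
root path is just the one from this report; point it at the cluster's actual
HBase root.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Finds zero-length reconstruction logs (oldlogfile.log) under each
// hregion_* directory -- the files the reporter had to delete by hand.
public class FindEmptyReconstructionLogs {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path root = new Path(args.length > 0 ? args[0] : "/data/hbase1");
    for (FileStatus dir : fs.listStatus(root)) {
      if (!dir.isDir()) {
        continue;                                  // only region directories
      }
      Path oldLog = new Path(dir.getPath(), "oldlogfile.log");
      if (fs.exists(oldLog) && fs.getFileStatus(oldLog).getLen() == 0) {
        System.out.println("zero-length reconstruction log: " + oldLog);
        // fs.delete(oldLog, false);               // uncomment to remove
      }
    }
  }
}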