[
https://issues.apache.org/jira/browse/HADOOP-2500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558032#action_12558032
]
Jim Kellerman commented on HADOOP-2500:
---------------------------------------
Bryan Duxbury wrote:
> At the very least, we should not assign a region to a region server if it is
> detected as "no good".
That is an unfortunate wording of a log message in the Master. It is saying
that the current assignment of the region is no good because the information
it read from the meta region had a server or start code that did not match a
known server. It does not mean that the master thinks the region itself is no
good.
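To make that concrete, the test behind the message boils down to comparing the
server address and start code recorded in the meta row against the master's
current list of live servers; roughly like this (a minimal sketch with
illustrative names, not the real HMaster fields):

import java.util.Map;

// Illustrative only: names are hypothetical, not actual HMaster code.
// The meta region records, for each region, the address and start code of
// the server that last claimed it. If either no longer matches a server the
// master knows about, the *assignment* is "no good" and the region gets
// reassigned; nothing is implied about the region's data.
class AssignmentCheck {
  static boolean assignmentIsGood(String storedServerAddress,
                                  long storedStartCode,
                                  Map<String, Long> liveServerStartCodes) {
    Long liveStartCode = liveServerStartCodes.get(storedServerAddress);
    return liveStartCode != null && liveStartCode == storedStartCode;
  }
}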
> Also, if a RegionServer tries to access a region and it has difficulties, it
> should report to the master that it can't read the region, and the master
> should stop trying to serve it.
> From a more general standpoint, maybe when a bad region is detected, its
> files should be moved to a different location and generally excluded from
> the cluster. This would allow you to recover from problems better.
Yes, we absolutely need to do something; I'm just not sure exactly what yet.
One thing is certain: zero-length files should be ignored/deleted.
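As a first cut, HStore.doReconstructionLog could check for a missing or
zero-length log before handing the path to SequenceFile.Reader, which is what
throws the EOFException in the trace below. A minimal sketch; the method shape
and names here are illustrative, not the actual HStore code:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;

// Sketch of a guard for the reconstruction-log open; not the exact
// HStore.doReconstructionLog signature.
class ReconstructionLogGuard {
  static SequenceFile.Reader openReconstructionLog(FileSystem fs,
      Path reconstructionLog, Configuration conf) throws IOException {
    if (reconstructionLog == null || !fs.exists(reconstructionLog)) {
      return null;                                 // nothing to replay
    }
    if (fs.getFileStatus(reconstructionLog).getLen() == 0) {
      // A zero-length oldlogfile.log has no edits to replay; opening it with
      // SequenceFile.Reader just throws EOFException. Ignore/delete instead.
      fs.delete(reconstructionLog, false);
      return null;
    }
    return new SequenceFile.Reader(fs, reconstructionLog, conf);
  }
}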
> [HBase] Unreadable region kills region servers
> ----------------------------------------------
>
> Key: HADOOP-2500
> URL: https://issues.apache.org/jira/browse/HADOOP-2500
> Project: Hadoop
> Issue Type: Bug
> Components: contrib/hbase
> Environment: CentOS 5
> Reporter: Chris Kline
> Priority: Critical
>
> Background: The name node (also a DataNode and RegionServer) in our cluster
> ran out of disk space. I created some space, restarted HDFS, and fsck
> reported corruption in an HBase file. I cleared up that corruption and
> restarted HBase. I was still unable to read anything from HBase even though
> HDFS was now healthy.
> The following was gathered from the log files. When HMaster starts up, it
> finds a region that is no good (Key: 17_125736271):
> 2007-12-24 09:07:14,342 DEBUG org.apache.hadoop.hbase.HMaster: Current
> assignment of spider_pages,17_125736271,1198286140018 is no good
> HMaster then assigns this region to RegionServer X.60:
> 2007-12-24 09:07:17,126 INFO org.apache.hadoop.hbase.HMaster: assigning
> region spider_pages,17_125736271,1198286140018 to server 10.100.11.60:60020
> 2007-12-24 09:07:20,152 DEBUG org.apache.hadoop.hbase.HMaster: Received
> MSG_REPORT_PROCESS_OPEN : spider_pages,17_125736271,1198286140018 from
> 10.100.11.60:60020
> The RegionServer has trouble reading that region (from the RegionServer log
> on X.60); note that the worker thread exits:
> 2007-12-24 09:07:22,611 DEBUG org.apache.hadoop.hbase.HStore: starting
> spider_pages,17_125736271,1198286140018/meta (2062710340/meta with
> reconstruction log: (/data/hbase1/hregion_2062710340/oldlogfile.log
> 2007-12-24 09:07:22,620 DEBUG org.apache.hadoop.hbase.HStore: maximum
> sequence id for hstore spider_pages,17_125736271,1198286140018/meta
> (2062710340/meta) is 4549496
> 2007-12-24 09:07:22,622 ERROR org.apache.hadoop.hbase.HRegionServer: error
> opening region spider_pages,17_125736271,1198286140018
> java.io.EOFException
> at java.io.DataInputStream.readFully(DataInputStream.java:180)
> at java.io.DataInputStream.readFully(DataInputStream.java:152)
> at
> org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1383)
> at
> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1360)
> at
> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1349)
> at
> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1344)
> at org.apache.hadoop.hbase.HStore.doReconstructionLog(HStore.java:697)
> at org.apache.hadoop.hbase.HStore.<init>(HStore.java:632)
> at org.apache.hadoop.hbase.HRegion.<init>(HRegion.java:288)
> at
> org.apache.hadoop.hbase.HRegionServer.openRegion(HRegionServer.java:1211)
> at
> org.apache.hadoop.hbase.HRegionServer$Worker.run(HRegionServer.java:1162)
> at java.lang.Thread.run(Thread.java:619)
> 2007-12-24 09:07:22,623 FATAL org.apache.hadoop.hbase.HRegionServer:
> Unhandled exception
> java.lang.NullPointerException
> at
> org.apache.hadoop.hbase.HRegionServer.reportClose(HRegionServer.java:1095)
> at
> org.apache.hadoop.hbase.HRegionServer.openRegion(HRegionServer.java:1217)
> at
> org.apache.hadoop.hbase.HRegionServer$Worker.run(HRegionServer.java:1162)
> at java.lang.Thread.run(Thread.java:619)
> 2007-12-24 09:07:22,623 INFO org.apache.hadoop.hbase.HRegionServer: worker
> thread exiting
> The HMaster then tries to assign the same region to X.60 again and fails. It
> then tries to assign the region to X.31 with the same result (the X.31
> worker thread exits).
> The file it is complaining about,
> /data/hbase1/hregion_2062710340/oldlogfile.log, is a zero-length file in
> HDFS. After deleting that file and restarting HBase, HBase appears to be
> back to normal.
> One thing I can't figure out is that the HMaster log shows several entries
> after the worker thread on X.60 has exited, suggesting that the RegionServer
> is still talking with HMaster:
> 2007-12-24 09:08:23,349 DEBUG org.apache.hadoop.hbase.HMaster: Received
> MSG_REPORT_PROCESS_OPEN : spider_pages,17_125736271,1198286140018 from
> 10.100.11.60:60020
> 2007-12-24 09:10:29,543 DEBUG org.apache.hadoop.hbase.HMaster: Received
> MSG_REPORT_PROCESS_OPEN : spider_pages,17_125736271,1198286140018 from
> 10.100.11.60:60020
> There is no corresponding entry in the RegionServer's log.
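Regarding the manual fix above (removing the zero-length oldlogfile.log):
until HBase handles these itself, a small utility along the following lines
could flag such files before a restart. The class name is made up and the
root path is just the one from this report; point it at the cluster's actual
HBase root.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Finds zero-length reconstruction logs (oldlogfile.log) under each
// hregion_* directory -- the files the reporter had to delete by hand.
public class FindEmptyReconstructionLogs {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path root = new Path(args.length > 0 ? args[0] : "/data/hbase1");
    for (FileStatus dir : fs.listStatus(root)) {
      if (!dir.isDir()) {
        continue;                                  // only region directories
      }
      Path oldLog = new Path(dir.getPath(), "oldlogfile.log");
      if (fs.exists(oldLog) && fs.getFileStatus(oldLog).getLen() == 0) {
        System.out.println("zero-length reconstruction log: " + oldLog);
        // fs.delete(oldLog, false);               // uncomment to remove
      }
    }
  }
}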