First, see Jon Gray's response. His postulate that the root of your issues is machines swapping seems likely to me.
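
If you want to confirm whether the boxes are swapping, watching free and vmstat on the nodes while the load runs is usually enough (plain Linux tools, nothing hbase-specific):

    $ free -m       # check how much swap is in use
    $ vmstat 5      # non-zero 'si'/'so' columns mean the box is actively paging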

See below for some particular answers to your queries (thanks for the detail).

Jean-Adrien wrote:
The attempts above can be:
1.
java.io.IOException: java.io.IOException: Premeture EOF from inputStream
        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:102)

Did you say your disks had filled? If so, that is the likely cause of the above (though on our cluster here we've also been seeing this error and are looking at HADOOP-3831).
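
To check for filled disks from the hdfs side, the dfsadmin report shows per-datanode capacity and remaining space (stock hadoop command), and df on each datanode shows the partitions backing dfs.data.dir:

    $ bin/hadoop dfsadmin -report   # per-datanode 'DFS Remaining' and '% used'
    $ df -h                         # on each datanode, check the dfs.data.dir partitions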

2-10
java.io.IOException: java.io.IOException: java.lang.NullPointerException
        at org.apache.hadoop.hbase.HStoreKey.compareTo(HStoreKey.java:354)

Was there more stack trace on this error? May I see it? The above should never happen (smile).

...

Another 10-attempt scenario I have seen:
1-10:
IPC Server handler 3 on 60020, call getRow([EMAIL PROTECTED], [EMAIL 
PROTECTED], null,
1224105427910, -1) from 192.168.1.11:55371: error: java.io.IOException:
Cannot open filename
/hbase/table-0.3/1739432898/header/mapfiles/4558585535524295446/data
java.io.IOException: Cannot open filename
/hbase/table-0.3/1739432898/header/mapfiles/4558585535524295446/data
        at
org.apache.hadoop.dfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1171)

Preceded, in the concerned regionserver's log, by the line:

2008-10-15 23:19:30,461 INFO org.apache.hadoop.dfs.DFSClient: Could not
obtain block blk_-3759213227484579481_226277 from any node: java.io.IOException: No live nodes contain current block

hdfs is hosed; it lost a block from the named file. If hdfs is hosed, so is hbase.


If I look for this block in the hadoop master log I can find

2008-10-15 23:03:45,276 INFO org.apache.hadoop.dfs.StateChange: BLOCK* ask
192.168.1.13:50010 to delete  [...] blk_-3759213227484579481_226277 [...]
(many more blocks)

about 16 min before.

This is interesting. I wonder why hdfs is deleting a block that a regionserver subsequently tries to use? Can you correlate the block's story with hbase actions? (That's probably an unfair question to ask, since it would require deep detective work in the hbase logs, tracing the file whose block is missing and its hosting region as it moved around the cluster.)
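
If you do want to try, plain grep for the block id (and for the mapfile name from the error) across the logs is where I would start; the log file names below are just the usual defaults and will differ on your setup:

    $ grep 'blk_-3759213227484579481' hadoop-*-namenode-*.log    # creation, replication, delete requests
    $ grep 'blk_-3759213227484579481' hadoop-*-datanode-*.log    # which datanodes held/served it
    $ grep '4558585535524295446' hbase-*-regionserver-*.log      # where that mapfile's region was being served
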
In both cases the regionserver fails to serve the concerned region until I
restart hbase (not hadoop).

Not hadoop? And if you run an fsck on the filesystem, is it healthy?
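
For reference, something like the following is what I'd run; hadoop fsck is stock hadoop, and the second path is just the file named in your error:

    $ bin/hadoop fsck / -files -blocks -locations
    $ bin/hadoop fsck /hbase/table-0.3/1739432898/header/mapfiles/4558585535524295446/data -files -blocks -locations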

One last question by the way:
Why is the replication factor of my hbase files in dfs 3, when my hadoop
cluster is configured to keep only 2 copies?
See http://wiki.apache.org/hadoop/Hbase/FAQ#12.

Is it because the default (hadoop-default.xml) config file of the hadoop
client, which is embedded in the hbase distribution, overrides the cluster
configuration for the mapfiles created?
Yes.
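
The replication factor is set by the client that creates the file, so whatever dfs.replication hbase sees on its classpath is what its mapfiles get. One way to make it 2 (a sketch, assuming the usual config layout) is to copy your cluster's hadoop-site.xml into ${HBASE_HOME}/conf, or set the property there yourself:

    <property>
      <name>dfs.replication</name>
      <value>2</value>
    </property>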

Thanks for the questions, J-A.
St.Ack
