Jean-Adrien,

Did you see my reply to your previous email?
I think your machines are underpowered for your current setup and it's creating all kinds of problems. If you have swapping going on in a regionserver/datanode, that must be addressed, because it usually leads to odd behavior in HDFS: timeouts, starvation, etc. Decrease your allotted heap sizes to fit within available memory, or add more memory (a rough sketch of what that might look like is at the very end of this message).

JG

-----Original Message-----
From: Jean-Adrien [mailto:[EMAIL PROTECTED]
Sent: Friday, October 17, 2008 1:02 AM
To: [email protected]
Subject: Regionserver fails to serve region

Hello again. This is my last message for today.

I often get an exception in my HBase client: a regionserver fails to serve a region when the client gets a row from the HBase cluster.

org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to contact region server 192.168.1.15:60020 for region table-0.3,:testrow79063200,1223872616091, row ':testrow22102600', but failed after 10 attempts.

The attempts above can be:

1. java.io.IOException: java.io.IOException: Premeture EOF from inputStream
       at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:102)

2-10. java.io.IOException: java.io.IOException: java.lang.NullPointerException
       at org.apache.hadoop.hbase.HStoreKey.compareTo(HStoreKey.java:354)

After that, every time the client tries to reach the same region, all 10 attempts fail with the NullPointerException at HStoreKey.compareTo.

Another 10-attempt scenario I have seen:

1-10: IPC Server handler 3 on 60020, call getRow([EMAIL PROTECTED], [EMAIL PROTECTED], null, 1224105427910, -1) from 192.168.1.11:55371: error: java.io.IOException: Cannot open filename /hbase/table-0.3/1739432898/header/mapfiles/4558585535524295446/data
java.io.IOException: Cannot open filename /hbase/table-0.3/1739432898/header/mapfiles/4558585535524295446/data
       at org.apache.hadoop.dfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1171)

This is preceded, in the concerned regionserver's log, by the line:

2008-10-15 23:19:30,461 INFO org.apache.hadoop.dfs.DFSClient: Could not obtain block blk_-3759213227484579481_226277 from any node: java.io.IOException: No live nodes contain current block

If I look for this block in the Hadoop master log, I can find, about 16 minutes earlier:

2008-10-15 23:03:45,276 INFO org.apache.hadoop.dfs.StateChange: BLOCK* ask 192.168.1.13:50010 to delete [...] blk_-3759213227484579481_226277 [...] (many more blocks)

In both cases the regionserver fails to serve the concerned region until I restart HBase (not Hadoop). I have no way of knowing whether such a failure is temporary (and for how long) or whether I really need to restart, but I noticed that the failure does not recover within the next 3-4 hours.

One last question, by the way: why is the replication factor of my HBase files in DFS 3, when my Hadoop cluster is configured to keep only 2 copies? Is it because the default config file (hadoop-default.xml) of the Hadoop client embedded in the HBase distribution overrides the cluster configuration for the mapfiles it creates? Is that a good configuration scheme, or is it preferable to let the HBase Hadoop client load the hadoop-site.xml I have set up for the running Hadoop instance, by adding the Hadoop conf directory to the HBase classpath, so that the client uses the same configuration as the server?

Have a nice day. Thank you for your advice.
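P.S. To make the configuration question concrete, here is roughly what I mean (an illustrative sketch, not copied verbatim from my files). My cluster's hadoop-site.xml sets:

    <property>
      <name>dfs.replication</name>
      <value>2</value>
    </property>

while the hadoop-default.xml bundled with the Hadoop client inside the HBase distribution still has dfs.replication = 3, which is what HBase would then use for the mapfiles it writes. The alternative I am asking about would be to point HBase at the server configuration, for example in conf/hbase-env.sh:

    # assumption: the Hadoop conf directory is here; adjust to the real path
    export HBASE_CLASSPATH=/path/to/hadoop/conf

so that the embedded DFS client reads the same hadoop-site.xml as the running Hadoop instance.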
--
Jean-Adrien

Cluster setup:
4 regionservers / datanodes; 1 is master / namenode as well
java-6-sun
Total size of HDFS: 81.98 GB (replication factor 3); fsck -> healthy
hadoop: 0.18.1
hbase: 0.18.0 (jar of hadoop replaced with 0.18.1)
1 GB RAM per node
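[Rough sketch of the heap change suggested above, for nodes with 1 GB of RAM. The exact numbers are illustrative guesses, not tested values; size them to whatever actually fits alongside the OS and the other daemons on each box.]

    # conf/hbase-env.sh: regionserver heap (default is 1000 MB)
    export HBASE_HEAPSIZE=384

    # conf/hadoop-env.sh: datanode/namenode heap (default is 1000 MB)
    export HADOOP_HEAPSIZE=256

With the defaults, a regionserver plus a datanode on one machine already ask for about 2 GB of heap, so on 1 GB nodes the memory is overcommitted and swapping is the symptom.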
