Hi,
We had an interesting failure yesterday on the old 0.20.4 version of HBase.
I realize that this is a very old version, but I am wondering whether this
issue is still present and should be fixed.
We added a new node to a 44-node cluster and started the datanode and
regionserver processes on it. The Unix filesystem was configured
incorrectly: /tmp was not writable by the hadoop process. Both the datanode
and regionserver processes ran into the permission problem.

The datanode process stopped with an error message:

2011-08-06 23:37:20,469 WARN org.mortbay.log: tmpdir
java.io.IOException: Permission denied
        at java.io.UnixFileSystem.createFileExclusively(Native Method)
        at java.io.File.checkAndCreate(File.java:1704)
        at java.io.File.createTempFile(File.java:1792)
        at java.io.File.createTempFile(File.java:1828)
        at org.mortbay.jetty.webapp.WebAppContext.getTempDirectory(WebAppContext.java:745)
        ....
2011-08-06 23:37:20,471 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down DataNode at hdpxxx
************************************************************/

The regionserver did not stop, even though the same error was logged:

2011-08-07 00:07:39.742::WARN:  tmpdir
java.io.IOException: Permission denied
        at java.io.UnixFileSystem.createFileExclusively(Native Method)
        at java.io.File.checkAndCreate(File.java:1704)
        at java.io.File.createTempFile(File.java:1792)
        at java.io.File.createTempFile(File.java:1828)
        at org.mortbay.jetty.webapp.WebAppContext.getTempDirectory(WebAppContext.java:745)
       .......
        at org.apache.hadoop.http.HttpServer.start(HttpServer.java:461)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.startServiceThreads(HRegionServer.java:1168)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.init(HRegionServer.java:792)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:430)

In fact, to the master process the regionserver looked fine, so it kept trying
to send regions its way. The regionserver rejected them, so the master/balancer
went into an assign/reassign cycle that destabilized the cluster. Many puts
and gets simply failed with NotServingRegionExceptions and took a long time
to complete.
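
As a rough sketch of the fail-fast behaviour the datanode showed and the
regionserver did not, a startup check along the following lines would abort
when the temp directory is not writable. This is not HBase or Hadoop code:
the class name TmpDirCheck and the method isWritable are hypothetical and
only meant as an illustration.

```java
import java.io.File;
import java.io.IOException;

// Hypothetical illustration only; not part of HBase or Hadoop.
public class TmpDirCheck {

    /** Returns true if a temp file can be created in (and removed from) dir. */
    public static boolean isWritable(File dir) {
        try {
            // Same JDK call that fails in the stack traces above.
            File probe = File.createTempFile("probe", ".tmp", dir);
            return probe.delete();
        } catch (IOException e) {
            // Covers "Permission denied" as well as a missing directory.
            return false;
        }
    }

    public static void main(String[] args) {
        File tmp = new File(System.getProperty("java.io.tmpdir"));
        if (!isWritable(tmp)) {
            System.err.println("Temp directory not writable: " + tmp
                    + ", aborting startup");
            System.exit(1);
        }
        System.out.println("Temp directory is writable: " + tmp);
    }
}
```

Run before starting the daemons, a check like this exits with a non-zero
status instead of leaving a half-started process behind.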

Please advise whether this may still be a problem in the 0.90.x code.

Cheers,
Matthias
