Hi,
we had an interesting failure yesterday on the old 0.20.4 version of HBase.
I realize that this is a very old version, but I am wondering whether this
is an issue that is still present and should be fixed.
We added a new node to a 44-node cluster and started the datanode and
regionserver processes on it. The Unix filesystem was configured
incorrectly, i.e. /tmp was not writable by the hadoop process. Both the
datanode and regionserver processes had issues with the permissions.
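For reference, a pre-flight check along the following lines would probably
have caught the misconfiguration before the daemons were started. This is
only a sketch and not part of Hadoop or HBase: the class name TmpDirCheck
and the method isWritable are made up for illustration, and it assumes the
daemons use the JVM default java.io.tmpdir. It probes the directory with the
same File.createTempFile call that appears in the stack traces.

```java
import java.io.File;
import java.io.IOException;

class TmpDirCheck {
    /**
     * Returns true if a temp file can be created in (and deleted from) dir.
     * Hypothetical helper for illustration only.
     */
    static boolean isWritable(File dir) {
        try {
            File probe = File.createTempFile("probe", ".tmp", dir);
            return probe.delete();
        } catch (IOException | SecurityException e) {
            // Covers "Permission denied" as well as a missing directory.
            return false;
        }
    }

    public static void main(String[] args) {
        File tmp = new File(System.getProperty("java.io.tmpdir"));
        if (!isWritable(tmp)) {
            System.err.println("Temp dir not writable: " + tmp);
            System.exit(1);
        }
        System.out.println("Temp dir OK: " + tmp);
    }
}
```

It has to be run as the same Unix user that runs the hadoop processes; a
non-zero exit code means the temp directory is not usable.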
The datanode process stopped with an error message:
2011-08-06 23:37:20,469 WARN org.mortbay.log: tmpdir
java.io.IOException: Permission denied
at java.io.UnixFileSystem.createFileExclusively(Native Method)
at java.io.File.checkAndCreate(File.java:1704)
at java.io.File.createTempFile(File.java:1792)
at java.io.File.createTempFile(File.java:1828)
at org.mortbay.jetty.webapp.WebAppContext.getTempDirectory(WebAppContext.java:745)
....
2011-08-06 23:37:20,471 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down DataNode at hdpxxx
************************************************************/
The regionserver did not stop even though the error was logged:
2011-08-07 00:07:39.742::WARN: tmpdir
java.io.IOException: Permission denied
at java.io.UnixFileSystem.createFileExclusively(Native Method)
at java.io.File.checkAndCreate(File.java:1704)
at java.io.File.createTempFile(File.java:1792)
at java.io.File.createTempFile(File.java:1828)
at org.mortbay.jetty.webapp.WebAppContext.getTempDirectory(WebAppContext.java:745)
.......
at org.apache.hadoop.http.HttpServer.start(HttpServer.java:461)
at org.apache.hadoop.hbase.regionserver.HRegionServer.startServiceThreads(HRegionServer.java:1168)
at org.apache.hadoop.hbase.regionserver.HRegionServer.init(HRegionServer.java:792)
at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:430)
In fact, to the master process the regionserver looked fine, so the master
kept trying to send regions its way. The regionserver rejected them, so the
master/balancer went into an assign/reassign cycle that destabilized the
cluster. Many puts and gets simply failed with NotServingRegionException
and took a long time to complete.
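To make the expectation concrete: what we would have expected is fail-fast
startup, where an error while starting the service threads aborts the
process instead of leaving a half-started server registered with the master.
A minimal sketch of that pattern follows. It is not HBase code, and
FailFastStartup, Service and startAll are hypothetical names chosen for the
example.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

class FailFastStartup {
    /** Hypothetical stand-in for a service thread such as the info HTTP server. */
    interface Service {
        void start() throws Exception;
    }

    /**
     * Starts the services in order and returns how many were started.
     * The first failure is rethrown so the caller can abort the process
     * rather than keep running half-started.
     */
    static int startAll(List<Service> services) {
        int started = 0;
        for (Service s : services) {
            try {
                s.start();
                started++;
            } catch (Exception e) {
                throw new IllegalStateException(
                    "startup aborted after " + started + " service(s)", e);
            }
        }
        return started;
    }

    public static void main(String[] args) {
        Service ok = () -> System.out.println("service started");
        Service broken = () -> { throw new IOException("Permission denied"); };
        try {
            startAll(Arrays.asList(ok, broken));
        } catch (IllegalStateException e) {
            System.err.println(e.getMessage() + ": " + e.getCause().getMessage());
            System.exit(1); // stop instead of staying registered with the master
        }
    }
}
```

With this behavior the new node would have dropped out the same way the
datanode did, and the master would have had nothing to assign regions to.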
Please advise whether this may still be a problem in the 0.90.x code.
Cheers Matthias