We are running HBase 0.94.2 on the Hadoop 0.20 append version in production (yes, we have plans to upgrade Hadoop). It's a 5-node cluster, plus a 6th node running just the NameNode and HMaster.

I am seeing frequent RS YouAreDeadExceptions. Logs are here: http://pastebin.com/44aFyYZV

The RS log shows a DFSOutputStream ResponseProcessor exception for block blk_-6695300470410774365_837638 (java.io.EOFException) at 13:41:00, followed by a YouAreDeadException at the same time. I grep'ed for this block in the DataNode log (see http://pastebin.com/2jfwCfcK). At 13:41:00 I see an exception in receiveBlock for block blk_-6695300470410774365_837638: java.nio.channels.ClosedByInterruptException. I have also attached the NameNode logs around the block here: http://pastebin.com/9NE9J8s1
Across several RS failure instances I see the following pattern: the region server YouAreDeadException is always preceded by the EOFException and the DataNode ClosedByInterruptException. Is the error in the movement of the block causing the region server to report a YouAreDeadException? And of course, how do I solve this?

- R