You should take a look at that server's region server process to check its health; it should recover. Use jps to see if the process is still running, maybe tail the log to see what's going on, and worst case you can kill -9 it. For how long was the master stuck? I remember there was an issue for some time with 0.89, HBASE-2975; can you verify that the version you're running has that fix? Check the CHANGES file. The version we currently have on github has it.
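Something along these lines, for example (the log path here is just a
guess, point it at wherever your region server actually writes its logs):

  # on the region server box (10.103.5.6)
  jps | grep HRegionServer      # is the RS JVM still running?
  tail -f /var/log/hbase/hbase-regionserver-mtae6.log  # hypothetical path
  kill -9 <pid from jps>        # last resort, if it's wedged and won't exit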
I agree the master should be able to ride over that, but the issue is at
the HDFS level. If I remember correctly, the append implementation in
hadoop 0.21 doesn't have that problem, but HBase doesn't support it at the
moment. Also, 0.21 is unstable (it didn't go through Y!'s QA as much as
the other releases did). The other option HBase has is what you did:
either removing the file or just ignoring it (see the commands at the
bottom of this mail for the kind of thing I mean). In both cases, you do
lose data.

J-D

On Thu, Oct 21, 2010 at 10:48 AM, Jack Levin <magn...@gmail.com> wrote:
> How do we best resolve something like that? I just deleted that
> file... does it mean I might have lost inserts?
>
> -Jack
>
> On Thu, Oct 21, 2010 at 10:32 AM, Jean-Daniel Cryans
> <jdcry...@apache.org> wrote:
>> This can happen when the original owner of the file is still alive. In
>> your case, is the region server it's recovering from (10.103.5.6) still
>> running? If it GCed hard, then it probably stayed "alive" for a while,
>> but it should shut down when it wakes up.
>>
>> J-D
>>
>> On Thu, Oct 21, 2010 at 10:10 AM, Jack Levin <magn...@gmail.com> wrote:
>>> 2010-10-21 10:08:14,268 WARN org.apache.hadoop.hbase.util.FSUtils:
>>> Waited 2014334ms for lease recovery on
>>> hdfs://namenode-rd.imageshack.us:9000/hbase/.logs/mtae6.prod.imageshack.com,60020,1287624295377/10.103.5.6%3A60020.1287672366636:org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException:
>>> failed to create file
>>> /hbase/.logs/mtae6.prod.imageshack.com,60020,1287624295377/10.103.5.6%3A60020.1287672366636
>>> for DFSClient_hb_m_10.101.7.1:60000_1287616820725 on client
>>> 10.101.7.1, because this file is already being created by NN_Recovery
>>> on 10.103.5.6
>>>
>>> Seems like a new problem we discovered - any ideas what this means?
>>>
>>> -Jack
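PS: by "removing the file" I mean something along these lines (the file
name is taken from the error in your mail above; moving it to a backup
location instead of deleting it is just a suggestion, and /tmp/hlog-backup
is an example path):

  # move the stuck log aside so the bytes are at least still around:
  hadoop fs -mv /hbase/.logs/mtae6.prod.imageshack.com,60020,1287624295377/10.103.5.6%3A60020.1287672366636 /tmp/hlog-backup

  # or delete it outright, which is what you did:
  hadoop fs -rm /hbase/.logs/mtae6.prod.imageshack.com,60020,1287624295377/10.103.5.6%3A60020.1287672366636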