Just want to follow up, first to thank QwertyM, aka Harsh Chouraria, for helping 
me out on the IRC channel. Well beyond the call of duty! It's people like Harsh 
that make the HBase/Hadoop community what it is, and one of the joys of working 
with this technology. And then one follow-on question on how to recover from 
CORRUPT blocks.

The main thing I learnt (other than being careful not to install packages on all 
the regionservers/slaves at the same time, which can cause Out of Memory errors 
and crash all your java processes) is this: if your namenode is stuck in safe 
mode, and there is enough wrong with your HDFS filesystem (such as too many 
under-replicated blocks), it will not leave safe mode on its own, even though 
the namenode log keeps saying "Safe mode will be turned off automatically." 
It seems the namenode has to be taken out of safe mode before it can correct 
the problem.

I had hallucinated that the datanodes, by running their block verifications, were 
doing the work to get the namenode out of safe mode. I probably would have waited 
another few hours if Harsh hadn't helped me out and told me what probably 
everyone but me already knew:

hadoop dfsadmin -safemode leave
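
(For anyone who hits this later: rather than watching the log, you can also check 
the current state directly. This is just standard dfsadmin usage, nothing 
specific to my setup:

hadoop dfsadmin -safemode get

prints whether safe mode is ON or OFF, and "-safemode wait" blocks until the 
namenode leaves safe mode on its own; "-safemode leave" forces it off even while 
the reported-block ratio is below the threshold.)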


CURRENT QUESTION ON CORRUPT BLOCKS:
------------------------------------------------------------------

After that, the namenode did get all the under-replicated blocks replicated, but 
I ended up with about 200 blocks that fsck considered CORRUPT and/or MISSING. 
It looked like tables were being compacted when the outage occurred; otherwise I 
don't know why so many of the bad blocks are in old tables rather than in data 
being written at the time of the crash. The HDFS filesystem dates also showed 
the files as being old.
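
(For reference, something like the following should list which files the bad 
blocks belong to; with -files, fsck prints a per-file status line, so grepping 
for CORRUPT/MISSING pulls out the affected paths. Running it against / here, 
though a narrower path like the HBase root directory should work the same way:

hadoop fsck / -files -blocks | egrep 'CORRUPT|MISSING'
)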

I am not sure what the best thing to do now is, to recover the CORRUPT/MISSING 
blocks and get fsck to report everything as healthy.

Is the best thing to just do:

hadoop fsck / -move

which, as I understand it, will move the files containing the corrupt blocks into /lost+found in HDFS?
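
(Either way, I assume I would re-run a plain fsck afterwards to confirm that it 
reports the filesystem as HEALTHY:

hadoop fsck /
)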

Is there any way to recover those blocks? 

I may be able to restore them from the backup/export of all our tables that we 
did recently, and I believe I can regenerate the rest. But it would still be 
nice to know whether the blocks themselves can be recovered, in case that were 
ever the only option.

Thanks in advance.
Rob
 
On Sep 16, 2011, at 12:50 AM, Robert J Berger wrote:

> Just had an HDFS/HBase instance where all the slave/regionserver processes 
> crashed, but the namenode stayed up. I did a proper shutdown of the namenode.
> 
> After bringing Hadoop back up, the namenode is stuck in safe mode. Fsck shows 
> 235 corrupt/missing blocks out of 117280 blocks. All the slaves are logging 
> "DataBlockScanner: Verification succeeded". As far as I can tell there are no 
> errors on the datanodes.
> 
> Can I expect it to self-heal? Or do I need to do something to help it along? 
> Any way to tell how long it will take to recover if I do have to just wait?
> 
> Other than the verification messages on the datanodes, the namenode fsck 
> numbers are not changing and the namenode log continues to say:
> 
> The ratio of reported blocks 0.9980 has not reached the threshold 0.9990. 
> Safe mode will be turned off automatically.
> 
> The ratio has not changed for over an hour now.
> 
> If you happen to know the answer, please get back to me right away by email 
> or on #hadoop IRC as I'm trying to figure it out now...
> 
> Thanks!
> __________________
> Robert J Berger - CTO
> Runa Inc.
> +1 408-838-8896
> http://blog.ibd.com
> 
> 
> 

__________________
Robert J Berger - CTO
Runa Inc.
+1 408-838-8896
http://blog.ibd.com


