Re: Best practices to recover from Corrupt Namenode

2012-01-20 Thread praveenesh kumar
Thanks a lot, guys, for such an illustrative explanation. I will go through
the links you sent and get back with any doubts I have.

Thanks,
Praveenesh


Re: Best practices to recover from Corrupt Namenode

2012-01-19 Thread Sameer Farooqui
Hey Praveenesh,

Here's a good article on HDFS by some senior Yahoo!, Facebook, HortonWorks
and eBay engineers that you might find helpful:
http://www.aosabook.org/en/hdfs.html

You may already know that each block replica on a DataNode is represented
by two files in the DataNode's local, native filesystem (usually ext3). The
first file contains the data itself and the second file records the block's
metadata including checksums for the data and the generation stamp.
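
If you want to see that on disk, it looks roughly like this (a sketch only;
the directory comes from your dfs.data.dir setting, so /data/dfs/dn and the
block IDs below are placeholders):

  $ ls /data/dfs/dn/current | head
  blk_2168745052
  blk_2168745052_1010.meta
  blk_3794112921
  blk_3794112921_1012.meta
  ...
  # blk_<id> is the block data; blk_<id>_<genstamp>.meta holds the checksums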

In section 8.3.5, the article above describes a Block Scanner that runs on
each DataNode and periodically scans its block replicas and verifies that
stored checksums match the block data.

More copy+paste from the article: "Whenever a read client or a block
scanner detects a corrupt block, it notifies the NameNode. The NameNode
marks the replica as corrupt, but does not schedule deletion of the replica
immediately. Instead, it starts to replicate a good copy of the block. Only
when the good replica count reaches the replication factor of the block the
corrupt replica is scheduled to be removed. This policy aims to preserve
data as long as possible. So even if all replicas of a block are corrupt,
the policy allows the user to retrieve its data from the corrupt replicas."

Like Harsh J said in an earlier email, this doesn't sound like NameNode
corruption yet. The article also describes how the periodic block
reports (aka metadata) from the DataNode are sent to the NameNode. A block
report contains the block ID, the generation stamp and the length for each
block replica the server hosts. In the NameNode's RAM, the inodes and the
list of blocks that define the metadata of the name system are called the
*image*. The persistent record of the image stored in the NameNode's local
native filesystem is called a checkpoint. The NameNode records changes to
HDFS in a write-ahead log called the journal in its local native
filesystem.

You can check those NameNode checkpoint and journal logs for errors if you
suspect NameNode corruption.
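
For example (a sketch; the directory comes from dfs.name.dir in your
hdfs-site.xml, and the layout below is the 0.20.x/1.x-era one, so treat the
paths as placeholders):

  $ ls -l /data/dfs/nn/current
  # expect to see fsimage, edits, fstime and VERSION here
  $ grep -iE "error|exception|corrupt" \
      $HADOOP_HOME/logs/hadoop-*-namenode-*.log | tail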

If you're wondering how often the Block Scanner runs and how long it takes
to scan over the entire dataset in HDFS: "In each scan period, the block
scanner adjusts the read bandwidth in order to complete the verification in
a configurable period. If a client reads a complete block and checksum
verification succeeds, it informs the DataNode. The DataNode treats it as a
verification of the replica."
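
For reference, the knob behind that "configurable period" is (in the
0.20.x/1.x line, if I remember right) dfs.datanode.scan.period.hours in
hdfs-site.xml; when it is unset, the scanner defaults to roughly a
three-week full pass. For example:

  $ grep -A 1 "dfs.datanode.scan.period.hours" $HADOOP_HOME/conf/hdfs-site.xml
  # no output usually means you are running with the default (~3 weeks)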

The verification time of each block is stored in a human-readable log
file. At any time there are up to two files in the top-level DataNode
directory, the current and previous logs. New verification times are
appended to the current file. Correspondingly, each DataNode has an
in-memory scanning list ordered by the replica's verification time.

Can you maybe check the verification time for the blocks that went corrupt
in the log file? If you're a human you should be able to read it. Try
checking both the current and previous logs.
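
On the DataNode side, if I remember right, those logs sit in the top level
of each dfs.data.dir and are named dncp_block_verification.log.curr and
.prev; so, with placeholder paths and block ID, something like:

  $ ls /data/dfs/dn/dncp_block_verification.log.*
  $ grep "blk_2168745052" /data/dfs/dn/dncp_block_verification.log.curr \
        /data/dfs/dn/dncp_block_verification.log.prev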

To dive deeper, here is a document by Tom White/Cloudera, but it's from
2008, so a lot of it could be outdated:
http://www.cloudera.com/wp-content/uploads/2010/03/HDFS_Reliability.pdf

One good bit of info from Tom's doc is that you can view the DataNode's
Block Scanner reports at: http://datanode:50075/blockScannerReport
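
If I recall correctly, the servlet also takes a listblocks parameter for
the per-block detail, e.g.:

  $ curl "http://<datanode-host>:50075/blockScannerReport"
  $ curl "http://<datanode-host>:50075/blockScannerReport?listblocks"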

And if you could post the filesystem check output (cmd: fsck), I'm sure
someone could help you further. It would also be helpful to know which
version of Hadoop and HDFS you're running.
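
For example (the redirect path is just a suggestion):

  $ hadoop version
  $ hadoop fsck / -files -blocks -locations > /tmp/fsck-report.txt
  $ tail -n 20 /tmp/fsck-report.txt   # the summary lists missing/corrupt blocks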

Also, don't you think it's weird that all the missing blocks were those of
the outputs of your M/R jobs? The NameNode should have been distributing
them evenly across the hard drives of your cluster. If the output of the
jobs is set to replication factor = 2, then the output should have been
replicated over the network to at least one other DataNode. At least two
hard drives in the cluster would have to fail for you to lose a block
completely. HDFS should be very robust: with Yahoo!'s r=3, for a large
cluster, the probability of losing a block during one year is less than
0.005.
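
One thing worth double-checking is what replication factor the job output
was actually written with (some jobs override dfs.replication themselves);
the second column of an ls listing shows it, or run fsck on that directory
(the output path below is hypothetical):

  $ hadoop fs -ls /user/praveenesh/job-output | head
  $ hadoop fsck /user/praveenesh/job-output -files -blocks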

- Sameer



Re: Best practices to recover from Corrupt Namenode

2012-01-18 Thread praveenesh kumar
Hi everyone,
Any ideas on how to tackle this kind of situation?

Thanks,
Praveenesh



Best practices to recover from Corrupt Namenode

2012-01-16 Thread praveenesh kumar
Hi guys,

I just faced a weird situation in which one of the hard disks on a DN went
down.
Because of that, when I restarted the namenode, some of the blocks went
missing and it was saying my namenode is CORRUPT and in safe mode, which
doesn't allow you to add or delete any files on HDFS.

I know we can get out of safe mode.
The problem is how to deal with the corrupt namenode in this case: what are
the best practices?
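
(For reference, by "get out of safe mode" I mean the usual dfsadmin
commands of the 0.20.x/1.x era, as far as I know:)

  $ hadoop dfsadmin -safemode get
  $ hadoop dfsadmin -safemode leave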

In my case, I was lucky that all the missing blocks belonged to outputs of
M/R jobs I had run previously.
So I just deleted all the files with missing blocks from HDFS to go from
the CORRUPT to the HEALTHY state.
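
(For what it's worth, I believe fsck can do that cleanup too, with -move to
salvage the affected files into /lost+found or -delete to remove them; the
path below is just a placeholder:)

  $ hadoop fsck /some/output/path -move
  $ hadoop fsck /some/output/path -delete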

But had the missing blocks belonged to large input data files, deleting
those files would not have been a good solution.

So I wanted to know: what are the best practices for dealing with this kind
of problem, to go from a CORRUPT namenode to a HEALTHY one?

Thanks,
Praveenesh


Re: Best practices to recover from Corrupt Namenode

2012-01-16 Thread Harsh J
You ran into a corrupt files issue, not a namenode corruption (which generally 
refers to the fsimage or edits getting corrupted).

Did your files not have enough replication to withstand the loss of one
DN's disk? What exactly did fsck output? Did all block replicas go missing
for your files?


--
Harsh J
Customer Ops. Engineer, Cloudera



Re: Best practices to recover from Corrupt Namenode

2012-01-16 Thread praveenesh kumar
I have a replication factor of 2 because I cannot afford 3 replicas on my
cluster.
The fsck output was saying that block replicas were missing for some files,
which was marking the namenode as corrupt.
I don't have the output with me, but the issue was that block replicas were
missing. How can we tackle that?

Is there an internal mechanism for creating new block replicas if they are
found missing, or some kind of refresh command or something?


Thanks,
Praveenesh
