Thanks a lot, guys, for such an illustrative explanation. I will go through the links you sent and will get back with any doubts I have.
Thanks,
Praveenesh

On Thu, Jan 19, 2012 at 2:17 PM, Sameer Farooqui <sam...@blueplastic.com> wrote:

> Hey Praveenesh,
>
> Here's a good article on HDFS by some senior Yahoo!, Facebook, HortonWorks
> and eBay engineers that you might find helpful:
> http://www.aosabook.org/en/hdfs.html
>
> You may already know that "each block replica on a DataNode is represented
> by two files in the DataNode's local, native filesystem (usually ext3).
> The first file contains the data itself and the second file records the
> block's metadata including checksums for the data and the generation
> stamp."
>
> In section 8.3.5, the article above describes a Block Scanner that runs on
> each DataNode and "periodically scans its block replicas and verifies that
> stored checksums match the block data."
>
> More copy+paste from the article: "Whenever a read client or a block
> scanner detects a corrupt block, it notifies the NameNode. The NameNode
> marks the replica as corrupt, but does not schedule deletion of the
> replica immediately. Instead, it starts to replicate a good copy of the
> block. Only when the good replica count reaches the replication factor of
> the block the corrupt replica is scheduled to be removed. This policy aims
> to preserve data as long as possible. So even if all replicas of a block
> are corrupt, the policy allows the user to retrieve its data from the
> corrupt replicas."
>
> Like Harsh J was saying in an earlier email, this doesn't sound like
> NameNode corruption yet. The article also describes how the periodic block
> reports (aka metadata) from the DataNode are sent to the NameNode. "A
> block report contains the block ID, the generation stamp and the length
> for each block replica the server hosts." In the NameNode's RAM, "the
> inodes and the list of blocks that define the metadata of the name system
> are called the *image*. The persistent record of the image stored in the
> NameNode's local native filesystem is called a checkpoint. The NameNode
> records changes to HDFS in a write-ahead log called the journal in its
> local native filesystem."
>
> You can check those NameNode checkpoint and journal files for errors if
> you suspect NameNode corruption.
>
> If you're wondering how often the Block Scanner runs and how long it takes
> to scan over the entire dataset in HDFS: "In each scan period, the block
> scanner adjusts the read bandwidth in order to complete the verification
> in a configurable period. If a client reads a complete block and checksum
> verification succeeds, it informs the DataNode. The DataNode treats it as
> a verification of the replica."
>
> "The verification time of each block is stored in a human-readable log
> file. At any time there are up to two files in the top-level DataNode
> directory, the current and previous logs. New verification times are
> appended to the current file. Correspondingly, each DataNode has an
> in-memory scanning list ordered by the replica's verification time."
>
> Can you maybe check the verification time for the blocks that went corrupt
> in the log file? Since it's human-readable, you should be able to read it
> directly. Try checking both the current and previous logs.
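>
> If it helps, here's roughly how I'd grep that log. This is just a sketch:
> I'm assuming a 1.x-era layout, a dfs.data.dir of /data/1/dfs/dn (yours
> will differ), the verification-log file names I remember
> (dncp_block_verification.log.curr / .prev), and a made-up block ID taken
> from fsck output, so double-check all of it on your boxes:
>
>   # On the DataNode: locate the scanner's verification logs
>   $ ls /data/1/dfs/dn/current/dncp_block_verification.log.*
>
>   # When was this (hypothetical) suspect block last verified?
>   $ grep 'blk_-6206695854610836757' \
>       /data/1/dfs/dn/current/dncp_block_verification.log.curr \
>       /data/1/dfs/dn/current/dncp_block_verification.log.prev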
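>
> And to see the "two files per block replica" layout the article quotes,
> just list the data directory. Again a hedged example: the path and block
> ID are invented, and the exact subdirectory nesting varies by version:
>
>   $ ls /data/1/dfs/dn/current | head
>   blk_3241959989554411886
>   blk_3241959989554411886_1054.meta
>   subdir0
>   subdir1
>
> The .meta file carries the generation stamp in its name and holds the
> checksums (by default a CRC32 for every 512 bytes of block data, per
> io.bytes.per.checksum).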
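>
> For the NameNode side, the checkpoint and journal live under dfs.name.dir.
> A rough sketch, assuming /data/1/dfs/nn as the name directory and the
> usual daemon log location (both are guesses; adjust to your config):
>
>   # The checkpoint (fsimage) and journal (edits) on the NameNode
>   $ ls /data/1/dfs/nn/current
>   VERSION  edits  fsimage  fstime
>
>   # Scan the NameNode daemon log for complaints while you're at it
>   $ grep -iE 'corrupt|error' /var/log/hadoop/*namenode*.log | tail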
>
> To dive deeper, here is a document by Tom White/Cloudera, but it's from
> 2008, so a lot could be outdated:
> http://www.cloudera.com/wp-content/uploads/2010/03/HDFS_Reliability.pdf
>
> One good bit of info from Tom's doc is that you can view the DataNode's
> Block Scanner reports at: http://datanode:50075/blockScannerReport
>
> And if you could post the filesystem check output (from the fsck command),
> I'm sure someone could help you further. It would also be helpful to know
> which version of Hadoop and HDFS you're running.
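>
> To pull that scanner report without a browser, something like the
> following should work (50075 is the default DataNode HTTP port, and
> dn1.example.com is a placeholder for one of your DataNodes):
>
>   $ curl 'http://dn1.example.com:50075/blockScannerReport'
>
>   # Add ?listblocks for per-block verification status; the listing can
>   # be huge on a well-filled DataNode
>   $ curl 'http://dn1.example.com:50075/blockScannerReport?listblocks'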
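>
> For the fsck output, the most informative invocation I know of is below.
> These flags are standard in the 0.20/1.x line, but run "hadoop fsck" with
> no arguments to confirm what your version supports:
>
>   # Overall health, plus each file's blocks and their DataNode locations
>   $ hadoop fsck / -files -blocks -locations
>
>   # And your exact version
>   $ hadoop version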
>
> Also, don't you think it's weird that all the missing blocks were from the
> outputs of your M/R jobs? The NameNode should have been distributing them
> evenly across the hard drives of your cluster. If the output of the jobs
> is set to replication factor = 2, then the output should have been
> replicated over the network to at least one other DataNode, so at least
> two hard drives in the cluster would have to fail for you to lose a block
> completely. HDFS should be very robust. With Yahoo's r=3, "for a large
> cluster, the probability of losing a block during one year is less than
> 0.005".
>
> - Sameer
>
>
> On Wed, Jan 18, 2012 at 11:19 PM, praveenesh kumar <praveen...@gmail.com> wrote:
>
> > Hi everyone,
> > Any ideas on how to tackle this kind of situation?
> >
> > Thanks,
> > Praveenesh
> >
> > On Tue, Jan 17, 2012 at 1:02 PM, praveenesh kumar <praveen...@gmail.com> wrote:
> >
> > > I have a replication factor of 2, because I cannot afford 3 replicas
> > > on my cluster.
> > > The fsck output was saying that block replicas were missing for some
> > > files, which is what marked the NameNode as corrupt.
> > > I don't have the output with me, but the issue was that block replicas
> > > were missing. How can we tackle that?
> > >
> > > Is there an internal mechanism for creating new blocks if they are
> > > found missing, some kind of refresh command or something?
> > >
> > > Thanks,
> > > Praveenesh
> > >
> > > On Tue, Jan 17, 2012 at 12:48 PM, Harsh J <ha...@cloudera.com> wrote:
> > >
> > >> You ran into a corrupt-files issue, not a NameNode corruption (which
> > >> generally refers to the fsimage or edits getting corrupted).
> > >>
> > >> Did your files not have adequate replication, such that they could
> > >> not withstand the loss of one DN's disk? What exactly did fsck
> > >> output? Did all block replicas go missing for your files?
> > >>
> > >> On 17-Jan-2012, at 12:08 PM, praveenesh kumar wrote:
> > >>
> > >> > Hi guys,
> > >> >
> > >> > I just faced a weird situation in which one of the hard disks on a
> > >> > DN went down.
> > >> > Because of that, when I restarted the NameNode, some of the blocks
> > >> > went missing, and it was saying my NameNode is CORRUPT and in safe
> > >> > mode, which doesn't allow you to add or delete any files on HDFS.
> > >> >
> > >> > I know we can handle the safe mode part.
> > >> > The problem is how to deal with the corrupt NameNode problem in
> > >> > this case -- best practices.
> > >> >
> > >> > In my case, I was lucky that all the missing blocks belonged to
> > >> > outputs of M/R jobs I had run previously,
> > >> > so I just deleted all the files with missing blocks from HDFS to
> > >> > go from the CORRUPT state to the HEALTHY state.
> > >> >
> > >> > But had they been large input data files, deleting them would not
> > >> > have been a good solution.
> > >> >
> > >> > So I wanted to know: what are the best practices for dealing with
> > >> > this kind of problem, to go from CORRUPT NAMENODE --> HEALTHY
> > >> > NAMENODE?
> > >> >
> > >> > Thanks,
> > >> > Praveenesh
> > >>
> > >> --
> > >> Harsh J
> > >> Customer Ops. Engineer, Cloudera
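
P.S. For the archives, here is the rough recovery sequence I'm taking away
from this thread. Treat it as a sketch rather than gospel: the paths are
examples, and -move/-delete permanently give up on the affected files, so
run the read-only checks first.

  # 1. Confirm the damage and whether the NameNode is stuck in safe mode
  $ hadoop fsck / -files -blocks -locations
  $ hadoop dfsadmin -safemode get

  # 2. If the NameNode will never reach its block threshold on its own,
  #    leave safe mode manually
  $ hadoop dfsadmin -safemode leave

  # 3. For files with missing blocks: either salvage the readable parts
  #    into /lost+found, or delete the files outright (irreversible!)
  $ hadoop fsck / -move
  $ hadoop fsck / -delete

  # 4. Consider a higher replication factor for data you can't recompute
  #    (/important/input/data is a placeholder path)
  $ hadoop fs -setrep -w 3 /important/input/data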