Thanks a lot, guys, for such an illustrative explanation. I will go through the links you sent and will get back with any doubts I have.
Thanks,
Praveenesh

On Thu, Jan 19, 2012 at 2:17 PM, Sameer Farooqui <sam...@blueplastic.com> wrote:

> Hey Praveenesh,
>
> Here's a good article on HDFS by some senior Yahoo!, Facebook, HortonWorks
> and eBay engineers that you might find helpful:
> http://www.aosabook.org/en/hdfs.html
>
> You may already know that "each block replica on a DataNode is represented
> by two files in the DataNode's local, native filesystem (usually ext3).
> The first file contains the data itself and the second file records the
> block's metadata including checksums for the data and the generation
> stamp."
>
> In section 8.3.5, the article above describes a Block Scanner that runs on
> each DataNode and "periodically scans its block replicas and verifies that
> stored checksums match the block data."
>
> More copy+paste from the article: "Whenever a read client or a block
> scanner detects a corrupt block, it notifies the NameNode. The NameNode
> marks the replica as corrupt, but does not schedule deletion of the
> replica immediately. Instead, it starts to replicate a good copy of the
> block. Only when the good replica count reaches the replication factor of
> the block the corrupt replica is scheduled to be removed. This policy aims
> to preserve data as long as possible. So even if all replicas of a block
> are corrupt, the policy allows the user to retrieve its data from the
> corrupt replicas."
>
> Like Harsh J was saying in an earlier email, this doesn't sound like
> NameNode corruption yet. The article also describes how the periodic block
> reports (aka metadata) from the DataNode are sent to the NameNode. "A
> block report contains the block ID, the generation stamp and the length
> for each block replica the server hosts." In the NameNode's RAM, "the
> inodes and the list of blocks that define the metadata of the name system
> are called the *image*. The persistent record of the image stored in the
> NameNode's local native filesystem is called a checkpoint. The NameNode
> records changes to HDFS in a write-ahead log called the journal in its
> local native filesystem."
>
> You can check those NameNode checkpoint and journal files for errors if
> you suspect NameNode corruption.
>
> If you're wondering how often the Block Scanner runs and how long it takes
> to scan over the entire dataset in HDFS: "In each scan period, the block
> scanner adjusts the read bandwidth in order to complete the verification
> in a configurable period. If a client reads a complete block and checksum
> verification succeeds, it informs the DataNode. The DataNode treats it as
> a verification of the replica."
>
> "The verification time of each block is stored in a human-readable log
> file. At any time there are up to two files in the top-level DataNode
> directory, the current and previous logs. New verification times are
> appended to the current file. Correspondingly, each DataNode has an
> in-memory scanning list ordered by the replica's verification time."
>
> Can you maybe check the verification time for the blocks that went corrupt
> in the log file? Since it's human-readable, you should be able to read it
> directly. Try checking both the current and previous logs.
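>
> If it helps, here's roughly how I'd grep that log. This is just a sketch:
> I'm assuming a 1.x-era layout, a dfs.data.dir of /data/1/dfs/dn (yours
> will differ), the verification-log file names I remember
> (dncp_block_verification.log.curr / .prev), and a made-up block ID taken
> from fsck output, so double-check all of it on your boxes:
>
>   # On the DataNode: locate the scanner's verification logs
>   $ ls /data/1/dfs/dn/current/dncp_block_verification.log.*
>
>   # When was this (hypothetical) suspect block last verified?
>   $ grep 'blk_-6206695854610836757' \
>       /data/1/dfs/dn/current/dncp_block_verification.log.curr \
>       /data/1/dfs/dn/current/dncp_block_verification.log.prev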
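>
> And to see the "two files per block replica" layout the article quotes,
> just list the data directory. Again a hedged example: the path and block
> ID are invented, and the exact subdirectory nesting varies by version:
>
>   $ ls /data/1/dfs/dn/current | head
>   blk_3241959989554411886
>   blk_3241959989554411886_1054.meta
>   subdir0
>   subdir1
>
> The .meta file carries the generation stamp in its name and holds the
> checksums (by default a CRC32 for every 512 bytes of block data, per
> io.bytes.per.checksum).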
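>
> For the NameNode side, the checkpoint and journal live under dfs.name.dir.
> A rough sketch, assuming /data/1/dfs/nn as the name directory and the
> usual daemon log location (both are guesses; adjust to your config):
>
>   # The checkpoint (fsimage) and journal (edits) on the NameNode
>   $ ls /data/1/dfs/nn/current
>   VERSION  edits  fsimage  fstime
>
>   # Scan the NameNode daemon log for complaints while you're at it
>   $ grep -iE 'corrupt|error' /var/log/hadoop/*namenode*.log | tail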
>
> To dive deeper, here is a document by Tom White/Cloudera, but it's from
> 2008, so a lot could be outdated:
> http://www.cloudera.com/wp-content/uploads/2010/03/HDFS_Reliability.pdf
>
> One good bit of info from Tom's doc is that you can view the DataNode's
> Block Scanner reports at: http://datanode:50075/blockScannerReport
>
> And if you could post the filesystem check output (from the fsck command),
> I'm sure someone could help you further. It would also be helpful to know
> which version of Hadoop and HDFS you're running.
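>
> To pull that scanner report without a browser, something like the
> following should work (50075 is the default DataNode HTTP port, and
> dn1.example.com is a placeholder for one of your DataNodes):
>
>   $ curl 'http://dn1.example.com:50075/blockScannerReport'
>
>   # Add ?listblocks for per-block verification status; the listing can
>   # be huge on a well-filled DataNode
>   $ curl 'http://dn1.example.com:50075/blockScannerReport?listblocks'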
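>
> For the fsck output, the most informative invocation I know of is below.
> These flags are standard in the 0.20/1.x line, but run "hadoop fsck" with
> no arguments to confirm what your version supports:
>
>   # Overall health, plus each file's blocks and their DataNode locations
>   $ hadoop fsck / -files -blocks -locations
>
>   # And your exact version
>   $ hadoop version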
>
> Also, don't you think it's weird that all the missing blocks were from the
> outputs of your M/R jobs? The NameNode should have been distributing them
> evenly across the hard drives of your cluster. If the output of the jobs
> is set to replication factor = 2, then the output should have been
> replicated over the network to at least one other DataNode, so at least
> two hard drives in the cluster would have to fail for you to lose a block
> completely. HDFS should be very robust. With Yahoo's r=3, "for a large
> cluster, the probability of losing a block during one year is less than
> 0.005".
>
> - Sameer
>
>
> On Wed, Jan 18, 2012 at 11:19 PM, praveenesh kumar <praveen...@gmail.com> wrote:
>
> > Hi everyone,
> > Any ideas on how to tackle this kind of situation?
> >
> > Thanks,
> > Praveenesh
> >
> > On Tue, Jan 17, 2012 at 1:02 PM, praveenesh kumar <praveen...@gmail.com> wrote:
> >
> > > I have a replication factor of 2, because I cannot afford 3 replicas
> > > on my cluster.
> > > The fsck output was saying that block replicas were missing for some
> > > files, which is what marked the NameNode as corrupt.
> > > I don't have the output with me, but the issue was that block replicas
> > > were missing. How can we tackle that?
> > >
> > > Is there an internal mechanism for creating new blocks if they are
> > > found missing, some kind of refresh command or something?
> > >
> > > Thanks,
> > > Praveenesh
> > >
> > > On Tue, Jan 17, 2012 at 12:48 PM, Harsh J <ha...@cloudera.com> wrote:
> > >
> > >> You ran into a corrupt-files issue, not a NameNode corruption (which
> > >> generally refers to the fsimage or edits getting corrupted).
> > >>
> > >> Did your files not have adequate replication, such that they could
> > >> not withstand the loss of one DN's disk? What exactly did fsck
> > >> output? Did all block replicas go missing for your files?
> > >>
> > >> On 17-Jan-2012, at 12:08 PM, praveenesh kumar wrote:
> > >>
> > >> > Hi guys,
> > >> >
> > >> > I just faced a weird situation in which one of the hard disks on a
> > >> > DN went down.
> > >> > Because of that, when I restarted the NameNode, some of the blocks
> > >> > went missing, and it was saying my NameNode is CORRUPT and in safe
> > >> > mode, which doesn't allow you to add or delete any files on HDFS.
> > >> >
> > >> > I know we can handle the safe mode part.
> > >> > The problem is how to deal with the corrupt NameNode problem in
> > >> > this case -- best practices.
> > >> >
> > >> > In my case, I was lucky that all the missing blocks belonged to
> > >> > outputs of M/R jobs I had run previously,
> > >> > so I just deleted all the files with missing blocks from HDFS to
> > >> > go from the CORRUPT state to the HEALTHY state.
> > >> >
> > >> > But had they been large input data files, deleting them would not
> > >> > have been a good solution.
> > >> >
> > >> > So I wanted to know: what are the best practices for dealing with
> > >> > this kind of problem, to go from CORRUPT NAMENODE --> HEALTHY
> > >> > NAMENODE?
> > >> >
> > >> > Thanks,
> > >> > Praveenesh
> > >>
> > >> --
> > >> Harsh J
> > >> Customer Ops. Engineer, Cloudera
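
P.S. For the archives, here is the rough recovery sequence I'm taking away
from this thread. Treat it as a sketch rather than gospel: the paths are
examples, and -move/-delete permanently give up on the affected files, so
run the read-only checks first.

  # 1. Confirm the damage and whether the NameNode is stuck in safe mode
  $ hadoop fsck / -files -blocks -locations
  $ hadoop dfsadmin -safemode get

  # 2. If the NameNode will never reach its block threshold on its own,
  #    leave safe mode manually
  $ hadoop dfsadmin -safemode leave

  # 3. For files with missing blocks: either salvage the readable parts
  #    into /lost+found, or delete the files outright (irreversible!)
  $ hadoop fsck / -move
  $ hadoop fsck / -delete

  # 4. Consider a higher replication factor for data you can't recompute
  #    (/important/input/data is a placeholder path)
  $ hadoop fs -setrep -w 3 /important/input/data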