Lohit,

I ran fsck after I replaced 1 DN (with data on it) with 1 blank DN and restarted 
all daemons.
The fsck report includes this line:
    Missing replicas:              17025 (29.727087 %)

According to your explanation, this means that after I removed 1 DN, I was 
missing about 30% of the blocks, right?
Wouldn't that mean that 30% of all blocks were *only* on the 1 DN that I 
removed?  But how could that be when I have replication factor of 3?

If I run bin/hadoop balancer with my old DN back in the cluster (and the new DN 
removed), I do get the happy "The cluster is balanced" response.  Doesn't that 
mean everything is peachy, and that with a replication factor of 3, removing 
1 DN should leave only some portion of blocks under-replicated, but none 
*completely* missing from HDFS?
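For what it's worth, the percentages in the report do seem to reconcile if 
"Missing replicas" is counted against the replicas actually present, rather than 
against all blocks.  A quick back-of-the-envelope check (this is just my reading 
of the numbers, not an authoritative definition of how fsck computes them):

```python
# Figures copied from the fsck report quoted below in this thread.
# Assumption (mine): the "Missing replicas" percentage is taken against
# the replicas actually found, not against the total expected replicas.
minimally_replicated = 28802           # reported as 89.269775 % of all blocks
minimally_replicated_pct = 0.89269775
avg_replication = 1.7750744            # "Average block replication"
missing_replicas = 17025

# Recover the total block count from the minimally-replicated figure.
total_blocks = round(minimally_replicated / minimally_replicated_pct)  # ~32264

# Replicas actually present = blocks * average replication.
live_replicas = total_blocks * avg_replication                         # ~57271

missing_pct = 100.0 * missing_replicas / live_replicas
print(total_blocks, missing_pct)  # missing_pct comes out near 29.727 %
```

If that reading is right, the 29.7% is missing *replicas*, not missing *blocks*: 
the 17025 under-replicated blocks each still have at least one live copy, and 
only the separate "MISSING BLOCKS" count would be data that is gone entirely.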

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


----- Original Message ----
> From: lohit <[EMAIL PROTECTED]>
> To: core-user@hadoop.apache.org
> Sent: Friday, May 9, 2008 1:33:56 AM
> Subject: Re: Corrupt HDFS and salvaging data
> 
> Hi Otis,
> 
> Namenode has location information about all replicas of a block. When you run 
> fsck, the namenode checks for those replicas. If all replicas are missing, then 
> fsck reports the block as missing. Otherwise the block is added to the 
> under-replicated count. If you specify the -move or -delete option along with 
> fsck, files with such missing blocks are moved to /lost+found or deleted, 
> depending on the option. 
> At what point did you run the fsck command: was it after the datanodes were 
> stopped? When you run namenode -format, it deletes the directories specified 
> in dfs.name.dir. If the directory exists, it asks for confirmation. 
> 
> Thanks,
> Lohit
> 
> ----- Original Message ----
> From: Otis Gospodnetic 
> To: core-user@hadoop.apache.org
> Sent: Thursday, May 8, 2008 9:00:34 PM
> Subject: Re: Corrupt HDFS and salvaging data
> 
> Hi,
> 
> Update:
> It seems fsck reports HDFS as corrupt when a significant-enough number of 
> block replicas is missing (or something like that).
> fsck reported corrupt HDFS after I replaced 1 old DN with 1 new DN.  After I 
> restarted Hadoop with the old set of DNs, fsck stopped reporting corrupt HDFS 
> and started reporting *healthy* HDFS.
> 
> 
> I'll follow up with a re-balancing question in a separate email.
> 
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> 
> 
> ----- Original Message ----
> > From: Otis Gospodnetic 
> > To: core-user@hadoop.apache.org
> > Sent: Thursday, May 8, 2008 11:35:01 PM
> > Subject: Corrupt HDFS and salvaging data
> > 
> > Hi,
> > 
> > I have a case of a corrupt HDFS (according to bin/hadoop fsck) and I'm trying 
> > not to lose the precious data in it.  I accidentally ran bin/hadoop namenode 
> > -format on a *new DN* that I had just added to the cluster.  Is it possible 
> > for that to corrupt HDFS?  I also had to explicitly kill the DN daemons 
> > before that, because bin/stop-all.sh didn't stop them for some reason (it 
> > always did before).
> > 
> > Is there any way to salvage the data?  I have a 4-node cluster with a 
> > replication factor of 3, though fsck reports lots of under-replicated blocks:
> > 
> >   ********************************
> >   CORRUPT FILES:        3355
> >   MISSING BLOCKS:       3462
> >   MISSING SIZE:         17708821225 B
> >   ********************************
> > Minimally replicated blocks:   28802 (89.269775 %)
> > Over-replicated blocks:        0 (0.0 %)
> > Under-replicated blocks:       17025 (52.76779 %)
> > Mis-replicated blocks:         0 (0.0 %)
> > Default replication factor:    3
> > Average block replication:     1.7750744
> > Missing replicas:              17025 (29.727087 %)
> > Number of data-nodes:          4
> > Number of racks:               1
> > 
> > 
> > The filesystem under path '/' is CORRUPT
> > 
> > 
> > What can one do at this point to save the data?  If I run bin/hadoop fsck 
> > -move or -delete, will I lose some of the data?  Or will I simply end up 
> > with fewer block replicas and thus have to force re-balancing in order to 
> > get back to a "safe" number of replicas?
> > 
> > Thanks,
> > Otis
> > --
> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
