Todd, excellent info, thank you.  I use Ganglia; I will set up Nagios too,
good idea.  Just one clarification on Question 1.  What if I actually lose
all my master data dirs and have no backup on the secondary namenode, are
the data blocks on all the slaves lost in that situation?  I think GoogleFS
serializes its data blocks so they can be reassembled based on the headers
in the blocks themselves in that scenario.  Just curious whether Hadoop has
anything along those lines.

-Ryan

On Thu, Jul 23, 2009 at 3:53 PM, Todd Lipcon <t...@cloudera.com> wrote:

> On Thu, Jul 23, 2009 at 11:56 AM, Ryan Smith <ryan.justin.sm...@gmail.com>
> wrote:
>
> > I was wondering if someone could give me some answers or maybe some
> > pointers where to look in the code.  All these questions are in the same
> > vein of hard drive failure.
> >
> > Question 1: If a master (system disks/data) is lost for good, can the
> > data on all the slave nodes be recovered? Meaning, are data blocks
> > serialized and rebuildable?
> >
>
> On your namenode you should set dfs.name.dir to several directories. Best
> practice is to use at least three directories, where two are on separate
> local disk drives and the third is on an NFS mount. Additionally, if you
> have a secondary namenode running (and you should) you will end up with
> periodic checkpoints of the namespace data on that machine. So in pretty
> much any hardware failure scenario, you will be able to recover completely
> in the best case, and to a slightly old snapshot in the worst case.
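>
> For reference, a sketch of that setting in hdfs-site.xml (hadoop-site.xml
> on older releases); the paths below are just placeholders:
>
>   <property>
>     <name>dfs.name.dir</name>
>     <!-- comma-separated list: two local disks plus an NFS mount -->
>     <value>/data/1/dfs/name,/data/2/dfs/name,/mnt/nfs/dfs/name</value>
>   </property>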
>
>
> >
> > Question 2: If data blocks have different hashes, how does hadoop handle
> > which block is right during replication?
> >
>
> Blocks are checksummed on the datanode, so corruption can be detected in a
> standalone fashion.
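>
> A replica whose checksum doesn't verify is treated as corrupt and the block
> is re-replicated from a good copy. For a cluster-wide view of missing or
> corrupt blocks you can run fsck against the namenode, e.g.:
>
>   hadoop fsck / -files -blocks -locations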
>
>
> >
> > Question 3: How does hadoop handle bad sectors on a disk? For example,
> > on a RAID the controller will reject the whole disk.
> >
>
> Hadoop isn't aware of bad sectors as such - it interfaces with the disk
> through a normal local filesystem (e.g. ext3 or xfs). So those bad sectors
> will either be automatically remapped by the drive or generate IO errors on
> read. IO errors surface as IOExceptions, which I believe will trigger the
> read to be retried against another replica.
>
>
> >
> > Question 4: If I were to unplug a hot-swap drive and then reconnect it a
> > few days later, how does hadoop handle this?  I am assuming that hadoop
> > would see the missing/out-of-sync data blocks and re-balance?
> >
>
> It doesn't rebalance per se, but the reinstated blocks will be reported to
> the namenode in that datanode's next block report. If those blocks belong
> to files that have since been deleted, the namenode will ask the DN to
> delete them. If they are now over-replicated (because the lost replicas
> were already re-created elsewhere), the namenode will schedule the excess
> copies for deletion, bringing the files back to their target replication
> level.
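>
> If you do want to actively even out disk usage across datanodes after an
> event like that, the balancer is the tool for it, e.g.:
>
>   hadoop balancer -threshold 10
>
> where the threshold is the allowed deviation of each node's utilization
> from the cluster average, in percent.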
>
>
> >
> > Question 5: Can hadoop tell me when a hard drive (a data dir path) is
> > going bad? If not, any papers or docs on how to deal with drive failure
> > would be great.
> >
>
> Nope - I recommend you use SMART for this, in conjunction with some
> monitoring software like Nagios.
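>
> With smartmontools installed you can poll drive health by hand or from a
> Nagios check; the device name below is just an example:
>
>   smartctl -H /dev/sda    # overall health assessment
>   smartctl -A /dev/sda    # attributes such as Reallocated_Sector_Ct
>
> Rising reallocated or pending-sector counts are usually the earliest sign
> that a data dir's disk is on its way out.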
>
> -Todd
>
