On Thu, Jul 23, 2009 at 11:56 AM, Ryan Smith <ryan.justin.sm...@gmail.com> wrote:
> I was wondering if someone could give me some answers or maybe some
> pointers on where to look in the code. All these questions are in the
> same vein of hard drive failure.
>
> Question 1: If a master (system disks/data) is lost for good, can the
> data on all the slave nodes be recovered? Meaning, are data blocks
> serialized and rebuildable?

On your namenode you should set dfs.name.dir to several directories. Best
practice is to use at least three, where two are on separate local disk
drives and the third is on an NFS mount (a config sketch follows at the
end of this mail). Additionally, if you have a secondary namenode running
(and you should), you will end up with periodic checkpoints of the
namespace data on that machine. So in pretty much any hardware failure
scenario you will be able to recover: completely in the best case, and to
a slightly old snapshot in the worst case.

> Question 2: If data blocks have different hashes, how does hadoop
> handle which block is right during replication?

Blocks are checksummed on the datanode, so corruption can be detected in
a standalone fashion.

> Question 3: How does hadoop handle bad sectors on a disk? For example,
> on a raid, the raid will reject the whole disk.

Hadoop isn't aware of bad sectors - it interfaces with the disk through a
normal local filesystem (e.g. ext3 or xfs). So those bad sectors will
either be automatically remapped by the filesystem or generate IO errors
on read. IO errors will end up as IOExceptions, which I believe will
trigger the read to occur at another replica.

> Question 4: If I were to unplug a hot-swap drive, then I were to
> reconnect it a few days later, how does hadoop handle this? I am
> assuming that hadoop would see the missing/out of sync data blocks and
> re-balance?

It doesn't rebalance per se, but the reinstated blocks will be reported
to the namenode. If those blocks belong to files that have since been
deleted, the namenode will ask the DN to delete them. If they're
over-replicated, it will likewise bring them back down to the correct
replication level.

> Question 5: Can hadoop tell me when a hard drive (a data dir path) is
> going bad? If not, any papers or docs on how to deal with drive failure
> would be great.

Nope - I recommend you use SMART for this in conjunction with some
monitoring software like nagios.

-Todd
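P.S. To make the dfs.name.dir advice concrete, here is a rough sketch of
the hdfs-site.xml entry. The paths are made up for illustration -
substitute your own local disks and NFS mount:

    <!-- namenode metadata is written to every directory in the list -->
    <property>
      <name>dfs.name.dir</name>
      <!-- two separate local disks plus an NFS mount (hypothetical paths) -->
      <value>/data/1/dfs/nn,/data/2/dfs/nn,/mnt/nfs/dfs/nn</value>
    </property>

The namenode writes its image and edit log to each listed directory, so
losing any single copy is survivable. The secondary namenode's checkpoint
location is controlled the same way by fs.checkpoint.dir.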
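On question 2, you can also check block health yourself from the
namenode's point of view with fsck, e.g.:

    # report files, their blocks, and which datanodes hold each replica
    hadoop fsck / -files -blocks -locations

When a checksum mismatch is detected, my understanding is that the read
is served from another replica and the bad copy eventually gets
re-replicated from a good one, so fsck should show everything healthy
again after a while.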
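On question 4, two commands are handy after plugging a drive back in.
Neither is required - the namenode sorts out deleted and over-replicated
blocks on its own - but they let you watch it happen:

    # cluster summary plus per-datanode capacity and block counts
    hadoop dfsadmin -report

    # optionally even out disk usage across datanodes afterwards;
    # threshold is the allowed % deviation from the cluster average
    hadoop balancer -threshold 10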
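And on question 5, smartmontools is the usual way to ask a drive about
itself, and nagios can then wrap these checks. Device names here are
examples only:

    # overall health self-assessment from the drive
    smartctl -H /dev/sda
    # full attribute table - watch Reallocated_Sector_Ct and friends
    smartctl -A /dev/sda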