On Thu, Jul 23, 2009 at 11:56 AM, Ryan Smith <ryan.justin.sm...@gmail.com> wrote:
> I was wondering if someone could give me some answers or maybe some
> pointers on where to look in the code. All these questions are in the
> same vein of hard drive failure.
>
> Question 1: If a master (system disks/data) is lost for good, can the
> data on all the slave nodes be recovered? Meaning, are data blocks
> serialized and rebuildable?

On your namenode you should set dfs.name.dir to several directories. Best
practice is to use at least three, where two are on separate local disk
drives and the third is on an NFS mount (a config sketch follows at the
end of this mail). Additionally, if you have a secondary namenode running
(and you should), you will end up with periodic checkpoints of the
namespace data on that machine. So in pretty much any hardware failure
scenario you will be able to recover: completely in the best case, and to
a slightly old snapshot in the worst case.

> Question 2: If data blocks have different hashes, how does hadoop
> handle which block is right during replication?

Blocks are checksummed on the datanode, so corruption can be detected in
a standalone fashion.

> Question 3: How does hadoop handle bad sectors on a disk? For example,
> on a raid, the raid will reject the whole disk.

Hadoop isn't aware of bad sectors - it interfaces with the disk through a
normal local filesystem (e.g. ext3 or xfs). So those bad sectors will
either be automatically remapped by the filesystem or generate IO errors
on read. IO errors will end up as IOExceptions, which I believe will
trigger the read to occur at another replica.

> Question 4: If I were to unplug a hot-swap drive, then I were to
> reconnect it a few days later, how does hadoop handle this? I am
> assuming that hadoop would see the missing/out of sync data blocks and
> re-balance?

It doesn't rebalance per se, but the reinstated blocks will be reported
to the namenode. If those blocks belong to files that have since been
deleted, the namenode will ask the DN to delete them. If they're
over-replicated, it will likewise bring them back down to the correct
replication level.

> Question 5: Can hadoop tell me when a hard drive (a data dir path) is
> going bad? If not, any papers or docs on how to deal with drive failure
> would be great.

Nope - I recommend you use SMART for this in conjunction with some
monitoring software like nagios.

-Todd
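P.S. To make the dfs.name.dir advice concrete, here is a rough sketch of
the hdfs-site.xml entry. The paths are made up for illustration -
substitute your own local disks and NFS mount:

    <!-- namenode metadata is written to every directory in the list -->
    <property>
      <name>dfs.name.dir</name>
      <!-- two separate local disks plus an NFS mount (hypothetical paths) -->
      <value>/data/1/dfs/nn,/data/2/dfs/nn,/mnt/nfs/dfs/nn</value>
    </property>

The namenode writes its image and edit log to each listed directory, so
losing any single copy is survivable. The secondary namenode's checkpoint
location is controlled the same way by fs.checkpoint.dir.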
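On question 2, you can also check block health yourself from the
namenode's point of view with fsck, e.g.:

    # report files, their blocks, and which datanodes hold each replica
    hadoop fsck / -files -blocks -locations

When a checksum mismatch is detected, my understanding is that the read
is served from another replica and the bad copy eventually gets
re-replicated from a good one, so fsck should show everything healthy
again after a while.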
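On question 4, two commands are handy after plugging a drive back in.
Neither is required - the namenode sorts out deleted and over-replicated
blocks on its own - but they let you watch it happen:

    # cluster summary plus per-datanode capacity and block counts
    hadoop dfsadmin -report

    # optionally even out disk usage across datanodes afterwards;
    # threshold is the allowed % deviation from the cluster average
    hadoop balancer -threshold 10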
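And on question 5, smartmontools is the usual way to ask a drive about
itself, and nagios can then wrap these checks. Device names here are
examples only:

    # overall health self-assessment from the drive
    smartctl -H /dev/sda
    # full attribute table - watch Reallocated_Sector_Ct and friends
    smartctl -A /dev/sda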