On Sep 7, 2011, at 6:19 AM, Marco Cadetg wrote:
> Current situation:
> 3 slaves with each two 320GB disks in RAID 1. All the disks show high read
> errors and io throughput has gone below 5Mb/s without running any hadoop
> job. (It looks like it will fall apart soon...)

         One of the special characteristics of RAID1 (and of some implementations of 
other RAID layouts) is that the speed of the array is (essentially) the speed of 
the slowest disk.  When the drives are healthy, that speed is tremendous.  As soon 
as one disk starts to head into troubled territory, massive performance drops are 
typical.  Given Hadoop's incredible ability to grind drives into dust, this tends 
to happen more often than not. :)  This is one of the reasons why RAID isn't 
generally recommended for data nodes.
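
         The usual alternative on data nodes is plain JBOD: give the DataNode one 
directory per physical disk via dfs.data.dir and let it round-robin blocks across 
them, so a sick drive only hurts its own spindle.  A minimal sketch (the mount 
points are hypothetical, adjust to your layout):

    <!-- hdfs-site.xml: one data dir per disk, comma-separated -->
    <property>
      <name>dfs.data.dir</name>
      <value>/mnt/disk1/hdfs/data,/mnt/disk2/hdfs/data</value>
    </property>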

> What is the best way to replace the bad disks? I may be able to add another
> two machines into the mix. I can't / won't rebuild the RAID as my new disks
> will be 2TB each, so I wouldn't like to use only 320GB of them.
> 
> Is the best way to add two new nodes into the mix and then mark two other
> machines to dfs.host.exclude. And after some time I can take them out???

        That sounds reasonable with only three slaves.  Just make sure that 
fsck comes back clean and that all files have a replication factor of 3 before 
you pull anything.  Another choice is to replace one machine's drives at a time: 
swap the disks, wait until fsck comes back clean, rinse/repeat.  
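
        For the exclude-and-decommission route, the moving parts are roughly 
these (hostnames and paths below are made up, point them at your own install):

    # hdfs-site.xml should already point dfs.hosts.exclude at this file
    echo "slave1.example.com" >> /etc/hadoop/conf/dfs.exclude
    echo "slave2.example.com" >> /etc/hadoop/conf/dfs.exclude

    # tell the NameNode to re-read the exclude list and start decommissioning
    hadoop dfsadmin -refreshNodes

    # watch the nodes move to "Decommission in progress" / "Decommissioned",
    # then verify filesystem health before powering anything off
    hadoop dfsadmin -report
    hadoop fsck /

        Once fsck reports no missing or under-replicated blocks, it should be 
safe to take the excluded machines out.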
