On Thu, Oct 16, 2014 at 10:01 PM, Colin Kincaid Williams <disc...@uw.edu> wrote:
> For some reason he seems intent on resetting the bad Virtual blocks, and
> giving the drives another shot. From what he told me, nothing is under
> warranty anymore. My first suggestion was to get rid of the disks.
>
> Here's the command:
>
> /opt/dell/srvadmin/bin/omconfig storage vdisk action=clearvdbadblocks
> controller=1 vdisk=$vid
>

Well, the usefulness of this action is going to depend entirely on how
you've actually set up the virtual disks.

If you've set it up so there's only one physical disk in each vdisk
(single-disk RAID0), then the bad "virtual" block is likely going to map
to a real bad block.

If you're doing something where there are multiple disks associated with
each virtual disk (e.g. RAID1, RAID10; I can't remember whether
RAID5/RAID6 can exhibit what follows), it's possible for the virtual
device to have a bad block that is actually mapped to a good physical
block underneath. This can happen, for example, if you had a failing
drive in the vdisk and replaced it, but the controller had remapped the
bad virtual block to somewhere good. Replacing the drive with a good one
makes the controller think the bad block is still there. Dell calls it a
punctured stripe (for a better description see
http://lists.us.dell.com/pipermail/linux-poweredge/2010-December/043832.html).
In this case, the fix is clearing the virtual bad-block list with the
above command.

> I'm still curious about how hadoop blocks work. I'm assuming that each
> block is stored on one of the many mountpoints, and not divided between
> them. I know there is a tolerated volume failure option in hdfs-site.xml.
>

Correct. Each HDFS block is actually treated as a file that lives on a
regular filesystem, like ext3 or ext4. If you did an ls inside one of
your vdisks, you'd see the raw blocks that the datanode is actually
storing (there's a rough example of what that looks like at the end of
this message). You just wouldn't be able to easily tell which file a
given block belongs to, because it's named with a block id, not the
actual file name.

> Then if the operations I laid out are legitimate, specifically removing
> the drive in question and restarting the data node. The advantage being
> less re-replication and less downtime.
>

Yup. It will minimize the actual prolonged outage of the datanode
itself. You'll get a little re-replication while the datanode process is
off, but if you keep that time reasonably short, you should be fine.
When the datanode process comes back up, it will walk all of its
configured filesystems to determine which blocks it still has on disk
and report that back to the namenode. Once that happens, re-replication
will stop, because the namenode knows where those missing blocks are and
will no longer treat them as under-replicated.

Note: You'll still get some re-replication for the blocks that lived on
the drive you removed. But that's only a drive's worth of blocks, not a
whole datanode.

Travis

--
Travis Campbell
tra...@ghostar.org
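
To illustrate the point above about HDFS blocks living as plain files on
the datanode's filesystems, here is a rough sketch of what one data
directory can look like. The mount point, directory layout, and block
ids below are made up for illustration; the exact layout varies with the
Hadoop version and the dfs.data.dir / dfs.datanode.data.dir setting in
hdfs-site.xml.

  # One of the datanode's configured data directories (example path).
  $ ls /data/disk3/dfs/dn/current/finalized/
  blk_1073741825            blk_1073741826
  blk_1073741825_1001.meta  blk_1073741826_1002.meta

  # Mapping the other direction -- from a file name to its block ids and
  # the datanodes holding them -- can be done with fsck
  # ("hadoop fsck" on older releases):
  $ hdfs fsck /user/someuser/somefile -files -blocks -locations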
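
And a rough outline of the remove-the-drive-and-restart approach
discussed above. The service name and paths are assumptions on my part
and differ between distributions and Hadoop versions;
dfs.datanode.failed.volumes.tolerated is the "tolerated volume failure"
setting mentioned earlier, and the datanode's data directories are
listed in dfs.data.dir (dfs.datanode.data.dir on newer releases) in
hdfs-site.xml.

  # 1. Either raise the tolerated-failure count in hdfs-site.xml so the
  #    datanode keeps running with a dead volume:
  #        dfs.datanode.failed.volumes.tolerated = 1
  #    or remove the bad mountpoint from the data directory list.

  # 2. Stop only the datanode process; the namenode and the rest of the
  #    cluster stay up (service name depends on your packaging).
  $ sudo service hadoop-hdfs-datanode stop

  # 3. Unmount / pull the failing drive, then bring the datanode back.
  $ sudo umount /data/disk3
  $ sudo service hadoop-hdfs-datanode start

  # 4. The datanode re-scans its remaining data directories and sends a
  #    block report; re-replication then only covers the blocks that
  #    were on the removed drive.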