File loss at Nebraska

Brian Bockelman Fri, 05 Dec 2008 17:00:25 -0800

We are continuing to see a small, consistent amount of block corruption leading to file loss. We have been upgrading our cluster lately, which means we've been doing a rolling de-commissioning of our nodes (and then adding them back with more disks!).

Previously, when I've had time to investigate this very deeply, I've found issues like these:


https://issues.apache.org/jira/browse/HADOOP-4692
https://issues.apache.org/jira/browse/HADOOP-4543

I suspect that these issues cause some or all of our problems.

I also saw that one of our nodes was at 100.2% full; I think this is due to the same issue; Hadoop's actual usage of the file system is greater than the max capacity because some of the blocks were truncated.

I'd have to check with our sysadmins, but I think we've lost about 200-300 files during the upgrade process. Right now, there are about 900 chronically under-replicated blocks; in the past, that's meant the only replica is actually corrupt, and Hadoop is trying to relentlessly retransfer it, failing to, but not realizing the source is corrupt. To some extent, this whole issue is caused because we only have enough space for 2 replicas; I'd imagine that at 3 replicas, the issue would be much harder to trigger.
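
To put a rough number on the 2-versus-3 replica argument, here is a minimal sketch. It assumes every replica of a block goes bad independently with the same probability, which is a simplification (corruption during a rolling de-commission is probably correlated). The function names and the probability used are hypothetical and chosen only for illustration; they are not measured values.

```python
def block_loss_probability(p_corrupt: float, replicas: int) -> float:
    """Chance that every replica of one block is corrupt.

    Assumes replicas fail independently with probability p_corrupt,
    which is an illustrative simplification, not a measured model.
    """
    if not 0.0 <= p_corrupt <= 1.0:
        raise ValueError("p_corrupt must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return p_corrupt ** replicas


def expected_lost_blocks(total_blocks: int, p_corrupt: float, replicas: int) -> float:
    """Expected number of blocks with no good replica left."""
    return total_blocks * block_loss_probability(p_corrupt, replicas)


if __name__ == "__main__":
    p = 0.001            # hypothetical per-replica corruption probability
    blocks = 1_000_000   # hypothetical number of blocks in the cluster
    for r in (2, 3):
        lost = expected_lost_blocks(blocks, p, r)
        print(f"replicas={r}: expected lost blocks = {lost:g}")
```

Under these assumptions, each extra replica multiplies the chance of losing a block by another factor of the per-replica corruption probability, which is why a third replica would make the problem much harder to trigger.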

Any suggestions? For us, file loss is something we can deal with (not necessarily fun to deal with, of course), but it might not be the case in the future.


Brian
