We are continuing to see a small, consistent amount of block
corruption leading to file loss. We have been upgrading our cluster
lately, which means we've been doing a rolling de-commissioning of our
nodes (and then adding them back with more disks!).
Previously, when I've had time to investigate this very deeply, I've
found issues like these:
https://issues.apache.org/jira/browse/HADOOP-4692
https://issues.apache.org/jira/browse/HADOOP-4543
I suspect that these cause some or all of our problems.
I also saw that one of our nodes was at 100.2% full. I think this is
due to the same issue: Hadoop's actual usage of the file system is
greater than the max capacity because some of the blocks were truncated.
I'd have to check with our sysadmins, but I think we've lost about
200-300 files during the upgrade process. Right now, there are about
900 chronically under-replicated blocks. In the past, that has meant
the only remaining replica is actually corrupt: Hadoop relentlessly
tries to retransfer it and fails, without realizing that the source
itself is corrupt.
To some extent, this whole issue arises because we only have enough
space for 2 replicas; I'd imagine that at 3 replicas, the issue would
be much harder to trigger.
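As a back-of-the-envelope sketch of why I'd expect the extra replica to
help: assume (unrealistically) that each replica of a block gets damaged
independently with the same probability p. Then a block is lost only if
all of its replicas are damaged. The function names and the values of p
and the block count below are made up for illustration; they are not
measured on our cluster and this is not how Hadoop itself models it.

```python
def all_replicas_lost(p: float, replicas: int) -> float:
    """Probability that every replica of one block is damaged.

    Assumes independent damage with the same per-replica probability p,
    which is a simplification (a decommission can hit replicas together).
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return p ** replicas


def expected_lost_blocks(total_blocks: int, p: float, replicas: int) -> float:
    """Expected number of blocks with no intact replica left."""
    return total_blocks * all_replicas_lost(p, replicas)


if __name__ == "__main__":
    p = 1e-3                  # hypothetical per-replica damage probability
    total_blocks = 1_000_000  # hypothetical number of blocks in the cluster
    for replicas in (2, 3):
        lost = expected_lost_blocks(total_blocks, p, replicas)
        print(f"{replicas} replicas: ~{lost:g} blocks expected lost")
```

Under these assumptions, going from 2 to 3 replicas cuts the expected
loss by a factor of 1/p, which is why I'd guess the problem would be
much rarer with a third copy.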
Any suggestions? For us, file loss is something we can deal with (not
necessarily fun to deal with, of course), but that might not be the case
in the future.
Brian