[ Moving discussion to hadoop-dev.  -drc ]

Raghu Angadi wrote:
This is good validation of how important ECC memory is. Currently the HDFS client deletes a block when it notices a checksum error. Once we move to block-level CRCs, we should make the Datanode re-validate the block before deciding to delete it.

It also emphasizes how important end-to-end checksums are. Data should be checksummed as soon as possible after it is generated, before it has a chance to be corrupted.

Ideally, the initial buffer that stores the data should be small, and data should be checksummed as this initial buffer is flushed. In the current implementation, the small checksum buffer is the second buffer; the initial buffer is the larger io.buffer.size buffer. To provide maximum protection against memory errors, this order should be reversed.
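For illustration only, here is a minimal sketch of that ordering. It is not Hadoop's actual FSOutputSummer code, and the class name and buffer sizes are hypothetical; it just shows data being checksummed as a small buffer is flushed into a larger, longer-lived one, so that corruption while data sits in the big buffer is detectable.

import java.util.zip.CRC32;

/**
 * Sketch: data lands first in a small buffer and is checksummed as that
 * buffer is flushed into the larger (io.buffer.size-style) buffer, so
 * the checksum is computed before data enters the long-lived buffer.
 */
public class EarlyChecksumBuffer {
  private static final int SMALL_BUF = 512;        // checksum chunk size (assumed)
  private static final int LARGE_BUF = 64 * 1024;  // stand-in for io.buffer.size

  private final byte[] small = new byte[SMALL_BUF];
  private int smallLen = 0;

  private final byte[] large = new byte[LARGE_BUF];
  private int largeLen = 0;

  private final CRC32 crc = new CRC32();

  /** Write one byte; the small buffer is checksummed and flushed when full. */
  public void write(int b) {
    small[smallLen++] = (byte) b;
    if (smallLen == SMALL_BUF) {
      flushSmall();
    }
  }

  /** Checksum the small buffer, then copy it into the large buffer. */
  private void flushSmall() {
    crc.update(small, 0, smallLen);                // checksum computed early
    if (largeLen + smallLen > large.length) {
      flushLarge();                                // make room (sketch only)
    }
    System.arraycopy(small, 0, large, largeLen, smallLen);
    largeLen += smallLen;
    smallLen = 0;
  }

  private void flushLarge() {
    // In a real client the large buffer, plus the accumulated checksums,
    // would be sent to a datanode here.
    largeLen = 0;
  }

  public long checksumSoFar() {
    return crc.getValue();
  }
}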

This is discussed in https://issues.apache.org/jira/browse/HADOOP-928. Perhaps a new issue should be filed to reverse the order of these buffers, so that data is checksummed before entering the larger, longer-lived buffer?

Doug
