The current checksum implementation writes CRC32 values to a parallel
file. Unfortunately, these parallel files pollute the namespace; in
particular, they place a heavier burden on the HDFS namenode.
Perhaps we should consider placing checksums inline in file data. For
example, we might write the data as a sequence of fixed-size
<checksum><payload> entries. This could be implemented as a FileSystem
wrapper, ChecksummedFileSystem. The create() method would return a
stream that buffers data in small chunks, checksums each chunk as it
fills, and writes the checksum in front of the data when the buffer is
flushed. The open() method would similarly verify each checksum as the
corresponding buffer is read. The
seek() and length() methods would adjust for the interpolated checksums.
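
To make the byte layout concrete, here is a rough sketch of what the
create() path might look like. The class name, the 512-byte chunk size,
and the big-endian CRC encoding are assumptions for illustration, not a
spec.

    import java.io.IOException;
    import java.io.OutputStream;
    import java.util.zip.CRC32;

    /**
     * Sketch of the create() path: buffer up to BYTES_PER_CHECKSUM bytes,
     * then write a 4-byte CRC32 followed by the payload, producing a file
     * of <checksum><payload> entries.
     */
    public class ChecksummedOutputStream extends OutputStream {
      static final int BYTES_PER_CHECKSUM = 512;      // assumed chunk size

      private final OutputStream out;                 // underlying raw stream
      private final byte[] buf = new byte[BYTES_PER_CHECKSUM];
      private int count = 0;                          // bytes buffered so far
      private final CRC32 crc = new CRC32();

      public ChecksummedOutputStream(OutputStream out) {
        this.out = out;
      }

      public void write(int b) throws IOException {
        buf[count++] = (byte) b;
        if (count == buf.length) {
          flushChunk();                               // emit a full entry
        }
      }

      /** Write one <checksum><payload> entry for the buffered bytes. */
      private void flushChunk() throws IOException {
        if (count == 0) return;
        crc.reset();
        crc.update(buf, 0, count);
        int sum = (int) crc.getValue();
        out.write((sum >>> 24) & 0xff);               // 4-byte big-endian CRC32
        out.write((sum >>> 16) & 0xff);               // precedes the data it covers
        out.write((sum >>> 8) & 0xff);
        out.write(sum & 0xff);
        out.write(buf, 0, count);
        count = 0;
      }

      public void close() throws IOException {
        flushChunk();                                 // final, possibly short, chunk
        out.close();
      }
    }

With these assumptions, seek() would map a logical offset p to raw
offset p + 4 * (p / 512 + 1) (integer division), and length() would
subtract 4 bytes for each (possibly partial) 512-byte chunk.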
Checksummed files could have their names suffixed internally with
something like ".hcs0". Checksum processing would be skipped for files
without this suffix, for back-compatibility and interoperability.
Directory listings would be modified to remove this suffix.
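
As a sketch of the naming convention (the suffix and the helper names
here are only illustrative):

    /** Hypothetical helpers for the ".hcs0" naming convention. */
    public class ChecksumNames {
      static final String SUFFIX = ".hcs0";   // assumed: "0" is a format version

      /** Map a user-visible path to the internally stored, checksummed name. */
      public static String toInternal(String path) {
        return path + SUFFIX;
      }

      /** Pre-existing files lack the suffix and skip checksum processing. */
      public static boolean isChecksummed(String name) {
        return name.endsWith(SUFFIX);
      }

      /** Strip the suffix so directory listings show the user-visible name. */
      public static String toVisible(String name) {
        return isChecksummed(name)
          ? name.substring(0, name.length() - SUFFIX.length())
          : name;
      }
    }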
Existing checksum code in FileSystem.java could be removed, including
all 'raw' methods.
HDFS would use ChecksummedFileSystem. If block names were modified to
encode the checksum version, then datanodes could validate checksums.
(We could ensure that checksum boundaries are aligned with block
boundaries.)
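
For instance, continuing with the 4+512 byte entries assumed above, one
way to keep checksum boundaries on block boundaries is to force the
block size to a whole number of entries (the numbers and the method
name are only illustrative):

    /** Hypothetical alignment rule: a block holds a whole number of entries. */
    public class BlockAlignment {
      static final int ENTRY_SIZE = 4 + 512;   // checksum + payload, as assumed above

      /** Round a requested block size down to a multiple of the entry size. */
      public static long alignBlockSize(long requested) {
        return (requested / ENTRY_SIZE) * ENTRY_SIZE;
      }

      public static void main(String[] args) {
        // a nominal 64 MB block rounds down to 130,055 entries (67,108,380 bytes)
        System.out.println(alignBlockSize(64L * 1024 * 1024));
      }
    }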
We could have two versions of the local filesystem: one with checksums
and one without. The DFS shell could use the checksumless version for
exporting files, while MapReduce could use the checksummed version for
intermediate data.
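
A minimal sketch of that split, using plain java.io streams instead of
real FileSystem classes just to keep it self-contained (CheckedOutputStream
here only tracks a running CRC32; a real checksummed local filesystem
would use the inline framing sketched earlier):

    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.OutputStream;
    import java.util.zip.CRC32;
    import java.util.zip.CheckedOutputStream;

    public class LocalStreams {
      /** Checksumless variant, e.g. for the DFS shell exporting files. */
      public static OutputStream rawLocal(String path) throws IOException {
        return new FileOutputStream(path);
      }

      /** Checksummed variant, e.g. for MapReduce intermediate data. */
      public static OutputStream checkedLocal(String path) throws IOException {
        return new CheckedOutputStream(new FileOutputStream(path), new CRC32());
      }
    }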
S3 might or might not use this, depending on whether we think Amazon
already provides sufficient data integrity.
Thoughts?
Doug