Another option is to create a checksum file per block at the data node where the block is placed. This approach clearly separates data from checksums and does not require many changes to open(), seek() and length(). For create(), when a block is written to a data node, the data node creates a checksum file at the same time.
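A rough sketch of what that data node step could look like (the class and method names here are made up for illustration, not the actual DataNode code, and the 512-byte chunk size is just an assumption):

import java.io.*;
import java.util.zip.CRC32;

// Illustrative sketch: compute one CRC32 per fixed-size chunk of a block file
// and store the values in a sibling checksum file next to the block.
public class BlockChecksumWriter {

  private static final int BYTES_PER_CHECKSUM = 512;   // assumed chunk size

  // Writes "<dir>/.<blockFileName>.crc" containing one 4-byte CRC per chunk.
  public static void writeChecksumFile(File blockFile) throws IOException {
    File crcFile = new File(blockFile.getParentFile(),
                            "." + blockFile.getName() + ".crc");
    byte[] chunk = new byte[BYTES_PER_CHECKSUM];
    try (InputStream in = new FileInputStream(blockFile);
         DataOutputStream out =
             new DataOutputStream(new FileOutputStream(crcFile))) {
      int n;
      while ((n = in.read(chunk)) > 0) {
        CRC32 crc = new CRC32();
        crc.update(chunk, 0, n);
        out.writeInt((int) crc.getValue());             // low 32 bits of the CRC
      }
    }
  }
}

In practice the data node would presumably compute the CRCs as the block data streams in rather than re-reading the finished block file, but the on-disk result would be the same.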
We could keep the same checksum file naming convention as we have now: a checksum file is named after its block file name, with a "." prefix and a ".crc" suffix. During an upgrade, the name node removes all checksum files from its namespace, and each data node creates a checksum file per block if one does not already exist.

Hairong

-----Original Message-----
From: Doug Cutting [mailto:[EMAIL PROTECTED]
Sent: Tuesday, January 23, 2007 3:51 PM
To: [email protected]
Subject: inline checksums

The current checksum implementation writes CRC32 values to a parallel file. Unfortunately these parallel files pollute the namespace. In particular, this places a heavier burden on the HDFS namenode.

Perhaps we should consider placing checksums inline in file data. For example, we might write the data as a sequence of fixed-size <checksum><payload> entries.

This could be implemented as a FileSystem wrapper, ChecksummedFileSystem. The create() method would return a stream that uses a small buffer, checksums data as it arrives, then writes the checksums in front of the data as the buffer is flushed. The open() method could similarly check each buffer as it is read. The seek() and length() methods would adjust for the interpolated checksums.

Checksummed files could have their names suffixed internally with something like ".hcs0". Checksum processing would be skipped for files without this suffix, for back-compatibility and interoperability. Directory listings would be modified to remove this suffix.

Existing checksum code in FileSystem.java could be removed, including all 'raw' methods. HDFS would use ChecksummedFileSystem. If block names were modified to encode the checksum version, then datanodes could validate checksums. (We could ensure that checksum boundaries are aligned with block boundaries.)

We could have two versions of the local filesystem: one with checksums and one without. The DFS shell could use the checksumless version for exporting files, while MapReduce could use the checksummed version for intermediate data. S3 might use this, or might not, if we think that Amazon already provides sufficient data integrity.

Thoughts?

Doug
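For the seek() and length() adjustment in Doug's inline-checksum proposal, the arithmetic might look roughly like this (the entry layout and sizes below are assumptions for illustration, not a spec):

// Illustrative sketch: translate between logical (user-visible) offsets and raw
// on-disk offsets when data is stored as fixed-size <checksum><payload> entries.
public class InlineChecksumLayout {

  private static final int CHECKSUM_SIZE = 4;                      // CRC32
  private static final int PAYLOAD_SIZE  = 512;                    // assumed payload per entry
  private static final int ENTRY_SIZE    = CHECKSUM_SIZE + PAYLOAD_SIZE;

  // Raw file position of the byte at logical offset 'pos'; seek() would use this.
  public static long rawOffset(long pos) {
    long entry   = pos / PAYLOAD_SIZE;
    long inEntry = pos % PAYLOAD_SIZE;
    return entry * ENTRY_SIZE + CHECKSUM_SIZE + inEntry;
  }

  // Logical file length given the raw on-disk length; length() would use this.
  public static long logicalLength(long rawLen) {
    long fullEntries = rawLen / ENTRY_SIZE;
    long tail        = rawLen % ENTRY_SIZE;            // partial last entry, if any
    return fullEntries * PAYLOAD_SIZE + (tail == 0 ? 0 : tail - CHECKSUM_SIZE);
  }
}

The create() and open() wrappers would then just buffer PAYLOAD_SIZE bytes at a time, writing or verifying a CRC in front of each buffer as it is flushed or read.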
