Doug Cutting wrote:
Hairong Kuang wrote:
Another option is to create a checksum file per block at the data node where
the block is placed.

Yes, but then we'd need a separate checksum implementation for intermediate data, and for other distributed filesystems that don't already guarantee end-to-end data integrity. Also, a checksum per block would not permit checksums on randomly accessed data without re-checksumming the entire block. Finally, the checksum wouldn't be end-to-end. We really want to checksum data as close to its source as possible, then validate that checksum as close to its use as possible.
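(Illustrative sketch only, not HDFS code, just to make the random-access point concrete: with a single checksum per block, validating even a small read means re-reading and re-checksumming the whole block, whereas a checksum per fixed-size chunk, size assumed here, only touches the chunks covering the requested range.)

    // Hypothetical example: which chunk checksums a random read would need.
    public class ChunkRange {
      static final int CHUNK_SIZE = 64 * 1024;          // assumed chunk granularity

      public static void main(String[] args) {
        long readOffset = 10L * 1024 * 1024 + 123;      // arbitrary read position in the block
        int readLength = 4096;                          // arbitrary read length
        long firstChunk = readOffset / CHUNK_SIZE;
        long lastChunk = (readOffset + readLength - 1) / CHUNK_SIZE;
        // Only chunks firstChunk..lastChunk need their checksums recomputed and compared;
        // a per-block checksum would instead force re-checksumming the entire block.
        System.out.println("chunks to verify: " + firstChunk + " .. " + lastChunk);
      }
    }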

The DFS checksum need not cover the entire block. It could be maintained by clients for every 64k (as in the GFS paper), which avoids reading the whole block. This ensures that there is no data corruption on disk, inside DFS, etc. The client protocol can be extended so that the client and datanode exchange and verify checksums, and clients like map/reduce can further verify checksums received from DFSClient. This could still be end-to-end, verified hop-to-hop.
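(A minimal sketch of the per-64k idea, assuming CRC32 as the checksum; this is not DFSClient or datanode code, and the names here are made up for illustration. Each chunk can be re-verified at any hop, by the datanode before sending, by DFSClient on receipt, or by map/reduce, without re-checksumming the rest of the block.)

    import java.util.ArrayList;
    import java.util.List;
    import java.util.zip.CRC32;

    public class ChunkedChecksums {
      static final int CHUNK_SIZE = 64 * 1024;                  // 64k, as in the GFS paper

      /** Compute one CRC32 per 64k chunk as the block is written. */
      public static List<Long> checksumBlock(byte[] block) {
        List<Long> sums = new ArrayList<Long>();
        for (int off = 0; off < block.length; off += CHUNK_SIZE) {
          int len = Math.min(CHUNK_SIZE, block.length - off);
          CRC32 crc = new CRC32();
          crc.update(block, off, len);
          sums.add(crc.getValue());
        }
        return sums;
      }

      /** Re-verify a single chunk, e.g. when it is handed to the client or to map/reduce. */
      public static boolean verifyChunk(byte[] block, int chunk, List<Long> sums) {
        int off = chunk * CHUNK_SIZE;
        int len = Math.min(CHUNK_SIZE, block.length - off);
        CRC32 crc = new CRC32();
        crc.update(block, off, len);
        return crc.getValue() == sums.get(chunk).longValue();
      }
    }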

How is the above different from your original proposal of inline checksums? The only difference I see is that inline checksums move checksum management to the client, while block checksums move it to the datanode. With block checksums, every file gets the advantage of checksums.

Raghu.

Doug
