Doug Cutting wrote:
Hairong Kuang wrote:
Another option is to create a checksum file per block at the data node
where the block is placed.
Yes, but then we'd need a separate checksum implementation for
intermediate data, and for other distributed filesystems that don't
already guarantee end-to-end data integrity. Also, a checksum per block
would not allow verifying randomly read data without re-checksumming the
entire block. Finally, the checksum wouldn't be
end-to-end. We really want to checksum data as close to its source as
possible, then validate that checksum as close to its use as possible.
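To make sure we mean the same thing by end-to-end, here is a rough sketch:
the checksum is computed where the data is produced and verified only where
it is actually used, so corruption anywhere in between (client buffers,
network, DFS, local disk) is caught. The names below are just illustration,
not proposed APIs.

import java.util.zip.CRC32;

// Illustrative only: checksum at the point the data is produced,
// verify at the point it is consumed; everything in between is untrusted.
public class EndToEndChecksum {

  // Writer side: checksum the data as close to its source as possible.
  static long checksumAtSource(byte[] data) {
    CRC32 crc = new CRC32();
    crc.update(data, 0, data.length);
    return crc.getValue();
  }

  // Reader side: re-verify just before the data is actually used.
  static void verifyAtUse(byte[] data, long expected) {
    CRC32 crc = new CRC32();
    crc.update(data, 0, data.length);
    if (crc.getValue() != expected) {
      throw new RuntimeException("checksum mismatch: data corrupted between source and use");
    }
  }

  public static void main(String[] args) {
    byte[] record = "some application data".getBytes();
    long crc = checksumAtSource(record);  // at the writer
    // ... record travels through client buffers, network, datanodes, disk ...
    verifyAtUse(record, crc);             // at the reader
    System.out.println("verified end-to-end");
  }
}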
The DFS checksum need not cover the entire block. It could be maintained by
clients for every 64 KB (as in the GFS paper), which avoids reading the
whole block. This ensures that there is no data corruption on disk, inside
DFS, etc. The client protocol can be extended so that the client and
datanode exchange and verify checksums. And clients like map/reduce can
further verify the checksums received from DFSClient. This could still be
end-to-end, verified hop-to-hop.
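Roughly, this is what per-chunk checksums buy us on a random read: only the
64 KB chunks covering the requested range have to be verified, never the
whole block. (The names and the chunk size below are illustrative; this is
not the actual client/datanode protocol.)

import java.util.zip.CRC32;

// Illustrative sketch of per-chunk block checksums (64 KB as in the GFS paper).
public class ChunkedBlockChecksums {
  static final int CHUNK_SIZE = 64 * 1024;

  // Write path: one CRC32 per 64 KB chunk of the block.
  static long[] checksumBlock(byte[] block) {
    int nChunks = (block.length + CHUNK_SIZE - 1) / CHUNK_SIZE;
    long[] crcs = new long[nChunks];
    for (int i = 0; i < nChunks; i++) {
      int off = i * CHUNK_SIZE;
      int len = Math.min(CHUNK_SIZE, block.length - off);
      CRC32 crc = new CRC32();
      crc.update(block, off, len);
      crcs[i] = crc.getValue();
    }
    return crcs;
  }

  // Read path: verify only the chunks overlapping [offset, offset + length).
  static void verifyRange(byte[] block, long[] crcs, int offset, int length) {
    int first = offset / CHUNK_SIZE;
    int last = (offset + length - 1) / CHUNK_SIZE;
    for (int i = first; i <= last; i++) {
      int off = i * CHUNK_SIZE;
      int len = Math.min(CHUNK_SIZE, block.length - off);
      CRC32 crc = new CRC32();
      crc.update(block, off, len);
      if (crc.getValue() != crcs[i]) {
        throw new RuntimeException("corrupt chunk " + i);
      }
    }
  }
}

Either side of a hop (datanode or client) could recompute and compare these
CRCs, which is where the hop-to-hop verification would come from.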
How is the above different from your original proposal of inline checksums?
The only difference I see is that inline checksums move checksum management
to the client, while block checksums move it to the datanode. With block
checksums, every file gets the advantage of checksums.
Raghu.