[
https://issues.apache.org/jira/browse/HADOOP-1134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12484240
]
Raghu Angadi commented on HADOOP-1134:
--------------------------------------
Outline of requirements and decisions made:
-------------------------------------------------------------
I will include more details on some of these points as the implementation or
discussion continues. I think I can start implementing some of the parts.
Here DFS refers to the Namenode (NN) and Datanodes (DNs), and client refers to
the DFSClient.
1) Checksums are maintained end to end and stored in datanodes as metadata
for each block file. A 4-byte CRC32 is calculated for a configurable amount of
sub-block data; the default is a 4-byte CRC for each 64KB of block data.
2) DN keeps checksums for each block as a separate file (e.g. blk_<id>.crc)
that includes a header at the front of the file.
3) Checksums are calculated by the client while writing and passed on to the
DN, which verifies them before writing to disk. The DN verifies the checksum
each time it reads block data to serve to a client, and the client verifies it
as well. (A sketch of this per-chunk scheme appears after this list.)
4) The data transfer protocol between client and datanodes includes inline
checksums transmitted along with the data on the same TCP connection. The
client reads from a different replica when a checksum from a DN fails. (A
rough framing sketch also appears after this list.)
5) When a DN notices a checksum failure, it informs the namenode. The
namenode will treat this as a deleted block in the initial implementation; a
later improvement will delay deletion of the block until a new valid replica
is created.
6) The DistributedFileSystem class will not extend ChecksumFileSystem since
checksums will be integral to DFS. We could have a
ChecksumDistributedFileSystem if we need user-visible checksums.
7) Upgrade: when DFS is upgraded to this new version, the DFS cluster will
stay in safemode until all (or most of) the datanodes upgrade their local
files with checksums. This process is expected to last a couple of hours.
8) Currently each DFS file has an associated checksum stored in a ".crc"
file. During upgrade, datanodes fetch the relevant parts of the .crc files to
verify checksums for each block. Since this involves interaction with the
namenode, the namenode could be busy or even become a bottleneck for the
upgrade.
9) We haven't decided how and when to delete .crc files. They could be deleted
by a shell script as well.
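
To make points 1-3 above concrete, here is a minimal sketch, assuming a simple
sidecar layout; the class name, method names, and header fields below are
illustrative and not the actual Datanode code. It computes a 4-byte CRC32 for
every 64KB chunk, writes the values to a blk_<id>.crc file behind a short
header, and verifies a block against that file:

// Hypothetical sketch only -- names and header layout are illustrative,
// not taken from the actual patch.
import java.io.*;
import java.util.zip.CRC32;

public class BlockCrcSketch {

  static final int BYTES_PER_CRC = 64 * 1024;  // default: one CRC per 64KB
  static final short CRC_FILE_VERSION = 1;     // illustrative header field

  // Read up to buf.length bytes; returns the number actually read (0 at EOF).
  static int readChunk(InputStream in, byte[] buf) throws IOException {
    int off = 0;
    while (off < buf.length) {
      int n = in.read(buf, off, buf.length - off);
      if (n < 0) break;
      off += n;
    }
    return off;
  }

  /** Write a blk_<id>.crc sidecar file next to the block file. */
  static void writeCrcFile(File blockFile, File crcFile) throws IOException {
    CRC32 crc = new CRC32();
    byte[] buf = new byte[BYTES_PER_CRC];
    try (InputStream in = new FileInputStream(blockFile);
         DataOutputStream out =
             new DataOutputStream(new FileOutputStream(crcFile))) {
      // Header: version and bytes-per-checksum, so readers know the layout.
      out.writeShort(CRC_FILE_VERSION);
      out.writeInt(BYTES_PER_CRC);
      int n;
      while ((n = readChunk(in, buf)) > 0) {
        crc.reset();
        crc.update(buf, 0, n);
        out.writeInt((int) crc.getValue());    // 4 bytes per chunk
      }
    }
  }

  /** Re-read the block and compare each chunk against the sidecar file. */
  static boolean verify(File blockFile, File crcFile) throws IOException {
    CRC32 crc = new CRC32();
    try (InputStream in = new FileInputStream(blockFile);
         DataInputStream crcIn =
             new DataInputStream(new FileInputStream(crcFile))) {
      crcIn.readShort();                       // skip version
      byte[] buf = new byte[crcIn.readInt()];  // bytes-per-checksum from header
      int n;
      while ((n = readChunk(in, buf)) > 0) {
        crc.reset();
        crc.update(buf, 0, n);
        if ((int) crc.getValue() != crcIn.readInt()) {
          return false;                        // corruption detected
        }
      }
    }
    return true;
  }
}

In the actual DN the checksums for a new block would arrive inline from the
client (point 4) rather than being recomputed from the block file, but the
on-disk layout would be the same, and the verify() path is essentially what
periodic scanning (future enhancement 3) would do.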
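
The inline checksums of point 4 could be framed in several ways on the wire;
the following is purely an illustration and not the actual data transfer
protocol. The sender interleaves each chunk with its 4-byte CRC on the same
stream, and the receiver verifies each chunk before accepting it:

// Purely illustrative <length><data><crc32> framing, not the real protocol.
import java.io.*;
import java.util.zip.CRC32;

public class InlineChecksumFraming {

  /** Sender side: write <length><data><crc32> for one chunk. */
  static void sendChunk(DataOutputStream out, byte[] chunk, int len)
      throws IOException {
    CRC32 crc = new CRC32();
    crc.update(chunk, 0, len);
    out.writeInt(len);
    out.write(chunk, 0, len);
    out.writeInt((int) crc.getValue());
  }

  /** Receiver side: read one chunk, verify its CRC, and return the data. */
  static byte[] receiveChunk(DataInputStream in) throws IOException {
    int len = in.readInt();
    byte[] chunk = new byte[len];
    in.readFully(chunk);
    int expected = in.readInt();
    CRC32 crc = new CRC32();
    crc.update(chunk, 0, len);
    if ((int) crc.getValue() != expected) {
      throw new IOException("Checksum error in received chunk");
    }
    return chunk;
  }
}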
Future enhancements:
---------------------
1) Benchmark the CPU and I/O overhead of checksums on datanodes. Most of the
CPU overhead could be hidden by overlapping it with network and disk I/O.
Tests showed that Java CRC32 takes 5-6 microseconds for each 1MB of data.
Because of the overlap I don't expect any noticeable increase in latency from
CPU; disk I/O overhead might contribute more to latency. (A rough benchmark
sketch appears after this list.)
2) Based on benchmark tests, investigate whether an in-memory cache can be
used for CRCs.
3) Datanodes should periodically scan and verify checksums for their blocks.
4) Option to change the CRC-block size. E.g. during the upgrade, datanodes
maintain a 4-byte CRC for every 512 bytes since the ".crc" files used 512-byte
chunks. We might want to convert them to 4 bytes for every 64KB of data.
5) Namenode should delay deletion of a corrupted block until a new replica is
created.
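
For future enhancement 1, a rough micro-benchmark of raw java.util.zip.CRC32
throughput might look like the sketch below; the class name, buffer size, and
round count are arbitrary choices, and the printed numbers will of course vary
with the JVM and hardware:

// Hypothetical micro-benchmark sketch for raw CRC32 throughput.
import java.util.Random;
import java.util.zip.CRC32;

public class Crc32Bench {
  public static void main(String[] args) {
    final int MB = 1024 * 1024;
    byte[] data = new byte[64 * MB];
    new Random(0).nextBytes(data);

    CRC32 crc = new CRC32();
    // Warm up the JIT before timing.
    crc.update(data, 0, data.length);

    int rounds = 16;
    long start = System.nanoTime();
    for (int i = 0; i < rounds; i++) {
      crc.reset();
      crc.update(data, 0, data.length);
    }
    long elapsed = System.nanoTime() - start;
    // 64MB per round -> microseconds per MB of data.
    double perMbMicros = elapsed / 1000.0 / (rounds * 64.0);
    System.out.printf("CRC32: %.1f microseconds per MB%n", perMbMicros);
  }
}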
> Block level CRCs in HDFS
> ------------------------
>
> Key: HADOOP-1134
> URL: https://issues.apache.org/jira/browse/HADOOP-1134
> Project: Hadoop
> Issue Type: New Feature
> Components: dfs
> Reporter: Raghu Angadi
> Assigned To: Raghu Angadi
>
> Currently CRCs are handled at the FileSystem level and are transparent to core
> HDFS. See the recent improvement HADOOP-928 (which can add checksums to a given
> filesystem) for more about it. Though this has served us well, there are a few
> disadvantages:
> 1) This doubles the namespace in HDFS (or other filesystem implementations). In
> many cases, it nearly doubles the number of blocks. Taking the namenode out of
> CRCs would nearly double namespace performance both in terms of CPU and
> memory.
> 2) Since CRCs are transparent to HDFS, it cannot actively detect corrupted
> blocks. With block-level CRCs, the Datanode can periodically verify the checksums
> and report corruptions to the namenode so that new replicas can be created.
> We propose to have CRCs maintained for all HDFS data in much the same way as
> in GFS. I will update the jira with detailed requirements and design. This
> will include the same guarantees provided by the current implementation and
> will include an upgrade of the current data.
>