[
https://issues.apache.org/jira/browse/HADOOP-1134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12484240
]
Raghu Angadi commented on HADOOP-1134:
--------------------------------------
Outline of requirements and decisions made:
-------------------------------------------------------------
I will include more details on some of these points as the implementation or
discussion continues. I think I can start implementing some of the parts.
Here DFS refers to the Namenode (NN) and Datanodes (DNs), and client refers to
the DFSClient.
1) Checksums are maintained end to end and stored in datanodes as metadata
for each block file. A 4-byte CRC32 is calculated for a configurable amount of
sub-block data; the default is a 4-byte CRC for each 64KB of block data.
2) DN keeps checksums for each block as a separate file (e.g. blk_<id>.crc)
that includes a header at the front of the file.
3) Checksums are calculated by the client while writing and passed on to the
DN, which verifies them before writing to disk. The DN verifies the checksum
each time it reads block data to serve to a client, and the client verifies it
as well. (A sketch of this per-chunk scheme appears after this list.)
4) The data transfer protocol between client and datanodes includes inline
checksums transmitted along with the data on the same TCP connection. The
client reads from a different replica when a checksum from a DN fails. (A
rough framing sketch also appears after this list.)
5) When a DN notices a checksum failure, it informs the namenode. The
namenode will treat this as a deleted block in the initial implementation; a
later improvement will delay deletion of the block until a new valid replica
is created.
6) The DistributedFileSystem class will not extend ChecksumFileSystem since
checksums will be integral to DFS. We could have a
ChecksumDistributedFileSystem if we need user-visible checksums.
7) Upgrade: when DFS is upgraded to this new version, the DFS cluster will
stay in safemode until all (or most of) the datanodes upgrade their local
files with checksums. This process is expected to last a couple of hours.
8) Currently each DFS file has an associated checksum stored in a ".crc"
file. During upgrade, datanodes fetch the relevant parts of the .crc files to
verify checksums for each block. Since this involves interaction with the
namenode, the namenode could be busy or even become a bottleneck for the
upgrade.
9) We haven't decided how and when to delete .crc files. They could be deleted
by a shell script as well.
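
To make points 1-3 above concrete, here is a minimal sketch, assuming a simple
sidecar layout; the class name, method names, and header fields below are
illustrative and not the actual Datanode code. It computes a 4-byte CRC32 for
every 64KB chunk, writes the values to a blk_<id>.crc file behind a short
header, and verifies a block against that file:

// Hypothetical sketch only -- names and header layout are illustrative,
// not taken from the actual patch.
import java.io.*;
import java.util.zip.CRC32;

public class BlockCrcSketch {

  static final int BYTES_PER_CRC = 64 * 1024;  // default: one CRC per 64KB
  static final short CRC_FILE_VERSION = 1;     // illustrative header field

  // Read up to buf.length bytes; returns the number actually read (0 at EOF).
  static int readChunk(InputStream in, byte[] buf) throws IOException {
    int off = 0;
    while (off < buf.length) {
      int n = in.read(buf, off, buf.length - off);
      if (n < 0) break;
      off += n;
    }
    return off;
  }

  /** Write a blk_<id>.crc sidecar file next to the block file. */
  static void writeCrcFile(File blockFile, File crcFile) throws IOException {
    CRC32 crc = new CRC32();
    byte[] buf = new byte[BYTES_PER_CRC];
    try (InputStream in = new FileInputStream(blockFile);
         DataOutputStream out =
             new DataOutputStream(new FileOutputStream(crcFile))) {
      // Header: version and bytes-per-checksum, so readers know the layout.
      out.writeShort(CRC_FILE_VERSION);
      out.writeInt(BYTES_PER_CRC);
      int n;
      while ((n = readChunk(in, buf)) > 0) {
        crc.reset();
        crc.update(buf, 0, n);
        out.writeInt((int) crc.getValue());    // 4 bytes per chunk
      }
    }
  }

  /** Re-read the block and compare each chunk against the sidecar file. */
  static boolean verify(File blockFile, File crcFile) throws IOException {
    CRC32 crc = new CRC32();
    try (InputStream in = new FileInputStream(blockFile);
         DataInputStream crcIn =
             new DataInputStream(new FileInputStream(crcFile))) {
      crcIn.readShort();                       // skip version
      byte[] buf = new byte[crcIn.readInt()];  // bytes-per-checksum from header
      int n;
      while ((n = readChunk(in, buf)) > 0) {
        crc.reset();
        crc.update(buf, 0, n);
        if ((int) crc.getValue() != crcIn.readInt()) {
          return false;                        // corruption detected
        }
      }
    }
    return true;
  }
}

In the actual DN the checksums for a new block would arrive inline from the
client (point 4) rather than being recomputed from the block file, but the
on-disk layout would be the same, and the verify() path is essentially what
periodic scanning (future enhancement 3) would do.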
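
The inline checksums of point 4 could be framed in several ways on the wire;
the following is purely an illustration and not the actual data transfer
protocol. The sender interleaves each chunk with its 4-byte CRC on the same
stream, and the receiver verifies each chunk before accepting it:

// Purely illustrative <length><data><crc32> framing, not the real protocol.
import java.io.*;
import java.util.zip.CRC32;

public class InlineChecksumFraming {

  /** Sender side: write <length><data><crc32> for one chunk. */
  static void sendChunk(DataOutputStream out, byte[] chunk, int len)
      throws IOException {
    CRC32 crc = new CRC32();
    crc.update(chunk, 0, len);
    out.writeInt(len);
    out.write(chunk, 0, len);
    out.writeInt((int) crc.getValue());
  }

  /** Receiver side: read one chunk, verify its CRC, and return the data. */
  static byte[] receiveChunk(DataInputStream in) throws IOException {
    int len = in.readInt();
    byte[] chunk = new byte[len];
    in.readFully(chunk);
    int expected = in.readInt();
    CRC32 crc = new CRC32();
    crc.update(chunk, 0, len);
    if ((int) crc.getValue() != expected) {
      throw new IOException("Checksum error in received chunk");
    }
    return chunk;
  }
}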
Future enhancements:
---------------------
1) Benchmark the CPU and I/O overhead of checksums on datanodes. Most of the
CPU overhead could be hidden by overlapping it with network and disk I/O.
Tests showed that Java CRC32 takes 5-6 microseconds for each 1MB of data.
Because of the overlap I don't expect any noticeable increase in latency from
CPU; disk I/O overhead might contribute more to latency. (A rough benchmark
sketch appears after this list.)
2) Based on benchmark tests, investigate whether an in-memory cache can be
used for CRCs.
3) Datanodes should periodically scan and verify checksums for their blocks.
4) Option to change the CRC-block size. E.g. during the upgrade, datanodes
maintain a 4-byte CRC for every 512 bytes since the ".crc" files used 512-byte
chunks. We might want to convert them to 4 bytes for every 64KB of data.
5) Namenode should delay deletion of a corrupted block until a new replica is
created.
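
For future enhancement 1, a rough micro-benchmark of raw java.util.zip.CRC32
throughput might look like the sketch below; the class name, buffer size, and
round count are arbitrary choices, and the printed numbers will of course vary
with the JVM and hardware:

// Hypothetical micro-benchmark sketch for raw CRC32 throughput.
import java.util.Random;
import java.util.zip.CRC32;

public class Crc32Bench {
  public static void main(String[] args) {
    final int MB = 1024 * 1024;
    byte[] data = new byte[64 * MB];
    new Random(0).nextBytes(data);

    CRC32 crc = new CRC32();
    // Warm up the JIT before timing.
    crc.update(data, 0, data.length);

    int rounds = 16;
    long start = System.nanoTime();
    for (int i = 0; i < rounds; i++) {
      crc.reset();
      crc.update(data, 0, data.length);
    }
    long elapsed = System.nanoTime() - start;
    // 64MB per round -> microseconds per MB of data.
    double perMbMicros = elapsed / 1000.0 / (rounds * 64.0);
    System.out.printf("CRC32: %.1f microseconds per MB%n", perMbMicros);
  }
}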
> Block level CRCs in HDFS
> ------------------------
>
> Key: HADOOP-1134
> URL: https://issues.apache.org/jira/browse/HADOOP-1134
> Project: Hadoop
> Issue Type: New Feature
> Components: dfs
> Reporter: Raghu Angadi
> Assigned To: Raghu Angadi
>
> Currently CRCs are handled at the FileSystem level and are transparent to core
> HDFS. See the recent improvement HADOOP-928 (which can add checksums to a given
> filesystem) for more about it. Though this has served us well, there are a few
> disadvantages:
> 1) This doubles the namespace in HDFS (or other filesystem implementations). In
> many cases, it nearly doubles the number of blocks. Taking the namenode out of
> CRCs would nearly double namespace performance both in terms of CPU and
> memory.
> 2) Since CRCs are transparent to HDFS, it cannot actively detect corrupted
> blocks. With block-level CRCs, the Datanode can periodically verify the checksums
> and report corruptions to the namenode so that new replicas can be created.
> We propose to have CRCs maintained for all HDFS data in much the same way as
> in GFS. I will update the jira with detailed requirements and design. This
> will include the same guarantees provided by the current implementation and
> will include an upgrade of the current data.
>