[ https://issues.apache.org/jira/browse/HDFS-8430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15080672#comment-15080672 ]

Walter Su commented on HDFS-8430:
---------------------------------

bq. 1. Use CRC64 (or some other linear code) for block checksum instead of MD5.
Agreed. CRC works fine as a hash function. Our purpose is file comparison, so 
MD5 is overkill.
MD5 is 128 bits, though; I think you mean CRC128?
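
As a reference point, here is a minimal sketch of the MD5-over-CRCs composition 
that the HADOOP-3981-style checksum builds on (simplified: the real 
MD5MD5CRC32FileChecksum digests the per-block MD5s again at the file level, and 
the class name below is made up). Swapping the MD5 step for a CRC64/CRC128 is 
essentially what the linear-code proposal amounts to:

{code:java}
import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.util.zip.CRC32;

public class BlockChecksumSketch {
  // Digest the per-chunk CRC32s of one block with MD5.
  // bytesPerChecksum defaults to 512 (dfs.bytes-per-checksum).
  static byte[] md5OfChunkCrcs(byte[] blockData, int bytesPerChecksum)
      throws Exception {
    MessageDigest md5 = MessageDigest.getInstance("MD5");
    for (int off = 0; off < blockData.length; off += bytesPerChecksum) {
      CRC32 crc = new CRC32();
      crc.update(blockData, off,
          Math.min(bytesPerChecksum, blockData.length - off));
      md5.update(ByteBuffer.allocate(4).putInt((int) crc.getValue()).array());
    }
    return md5.digest();
  }
}
{code}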

bq. The datanode may compute cell CRC64s...
We may have many policies and many cell sizes. Let's say the minimal cell size 
is 64k. Do you mean calculating one CRC per 64k (instead of per the default 
value of _dfs.bytes-per-checksum_)? That does reduce network traffic, but I 
thought we could use the block metadata, which already has the CRCs, and avoid 
re-calculation.
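
A sketch of the quoted per-cell alternative, assuming 64k as the minimal 
cellSize (class and method names are hypothetical). For a 128MB block this 
yields 2048 cell CRCs instead of 262144 chunk CRCs at the default 512-byte 
_dfs.bytes-per-checksum_, which is where the network-traffic saving comes from:

{code:java}
import java.util.zip.CRC32;

public class CellCrcSketch {
  // One CRC per cell, computed over the raw cell bytes.
  // CRC32 from java.util.zip stands in, since the JDK ships no CRC64.
  static long[] perCellCrcs(byte[] blockData, int cellSize) {
    int cells = (blockData.length + cellSize - 1) / cellSize;
    long[] crcs = new long[cells];
    for (int i = 0; i < cells; i++) {
      int off = i * cellSize;
      int len = Math.min(cellSize, blockData.length - off);
      CRC32 crc = new CRC32();
      crc.update(blockData, off, len);
      crcs[i] = crc.getValue();
    }
    return crcs;
  }
}
{code}

Reusing the chunk CRCs already sitting in the block metadata, as suggested 
above, skips this extra pass over the data entirely.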

bq. Instead of sending all CRCs to the client, send all CRCs to one of the 
datanodes in a block group.
Either way, we still need to fetch all the CRCs from 6 (or 9) DNs and fix up 
the ordering, so the resulting hash value can be the same as for a replicated 
block.
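
A sketch of that reordering step, assuming cells are laid out round-robin 
across the data blocks of a group (all names here are hypothetical):

{code:java}
public class CrcReorderSketch {
  // crcsPerBlock[b][s] is the CRC of the cell stored at stripe s on data
  // block b. Logical cell j of the file lives at block j % dataBlocks,
  // stripe j / dataBlocks, so walking each stripe in block order recovers
  // the replicated-file ordering; a partial last stripe is handled by the
  // length check.
  static long[] toLogicalOrder(long[][] crcsPerBlock) {
    int total = 0;
    for (long[] crcs : crcsPerBlock) {
      total += crcs.length;
    }
    long[] logical = new long[total];
    int j = 0;
    for (int stripe = 0; j < total; stripe++) {
      for (int b = 0; b < crcsPerBlock.length && j < total; b++) {
        if (stripe < crcsPerBlock[b].length) {
          logical[j++] = crcsPerBlock[b][stripe];
        }
      }
    }
    return logical;
  }
}
{code}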

bq. The hard part would be to consider the block missing, decoding and checksum 
computing case.
Agreed.

> Erasure coding: update DFSClient.getFileChecksum() logic for stripe files
> -------------------------------------------------------------------------
>
>                 Key: HDFS-8430
>                 URL: https://issues.apache.org/jira/browse/HDFS-8430
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>    Affects Versions: HDFS-7285
>            Reporter: Walter Su
>            Assignee: Kai Zheng
>         Attachments: HDFS-8430-poc1.patch
>
>
> HADOOP-3981 introduces a distributed file checksum algorithm. It's designed 
> for replicated blocks.
> {{DFSClient.getFileChecksum()}} needs some updates so it can work for striped 
> block groups.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
