[ https://issues.apache.org/jira/browse/HDFS-8430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15080672#comment-15080672 ]

Walter Su commented on HDFS-8430:
---------------------------------

bq. 1. Use CRC64 (or some other linear code) for block checksum instead of MD5.

Agreed. CRC works fine as a hash function here; our purpose is file comparison, so MD5 is overkill. MD5 is 128 bits, though, so do you mean CRC128?

bq. The datanode may compute cell CRC64s...

We may have many policies and many cell sizes. Say the minimal cell size is 64k. Do you mean calculating one CRC per 64k (instead of per the default _dfs.bytes-per-checksum_)? That does reduce network traffic, but I thought we could use the block metadata, which already contains the CRCs, and avoid re-calculation.

bq. Instead of sending all CRCs to the client, send all CRCs to one of the datanodes in a block group.

Either way, we still need to fetch all CRCs from the 6 (or 9) DNs and change their ordering, so that the hash value comes out the same as for a replicated block.

bq. The hard part would be to consider the block missing, decoding and checksum computing case.

Agreed.

> Erasure coding: update DFSClient.getFileChecksum() logic for stripe files
> -------------------------------------------------------------------------
>
>                 Key: HDFS-8430
>                 URL: https://issues.apache.org/jira/browse/HDFS-8430
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>    Affects Versions: HDFS-7285
>            Reporter: Walter Su
>            Assignee: Kai Zheng
>         Attachments: HDFS-8430-poc1.patch
>
>
> HADOOP-3981 introduces a distributed file checksum algorithm. It's designed
> for replicated blocks.
> {{DFSClient.getFileChecksum()}} needs some updates so it can work for striped
> block groups.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
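To make the reordering point concrete, here is a minimal, hypothetical sketch (not the actual HDFS implementation, and the class and method names are invented for illustration): given the per-cell CRCs fetched from each data block of a striped block group, it interleaves them back into logical file order, assuming a round-robin striping layout where cell i of the file lives in data block (i % dataBlocks). A checksum computed over the reordered list could then match the one computed for the same data stored as replicated blocks.

```java
import java.util.ArrayList;
import java.util.List;

public class StripedCrcReorder {

    /**
     * crcsPerBlock.get(b) holds the cell CRCs stored on data block b,
     * in the order they appear within that block. Returns the CRCs in
     * logical file order, assuming round-robin striping across blocks.
     */
    static List<Long> reorderToLogical(List<List<Long>> crcsPerBlock) {
        int dataBlocks = crcsPerBlock.size();
        List<Long> logical = new ArrayList<>();
        // Walk stripe by stripe: stripe s takes the s-th cell from each
        // data block in turn. The last stripe may be partial.
        for (int stripe = 0; ; stripe++) {
            boolean any = false;
            for (int b = 0; b < dataBlocks; b++) {
                List<Long> crcs = crcsPerBlock.get(b);
                if (stripe < crcs.size()) {
                    logical.add(crcs.get(stripe));
                    any = true;
                }
            }
            if (!any) {
                break; // past the last stripe
            }
        }
        return logical;
    }

    public static void main(String[] args) {
        // 3 data blocks, 7 cells total (last stripe is partial).
        // The CRC values here are just the logical cell indices, so the
        // expected output is simply 0..6 in order.
        List<List<Long>> perBlock = List.of(
                List.of(0L, 3L, 6L),  // block 0 holds cells 0, 3, 6
                List.of(1L, 4L),      // block 1 holds cells 1, 4
                List.of(2L, 5L));     // block 2 holds cells 2, 5
        List<Long> logical = reorderToLogical(perBlock);
        if (!logical.equals(List.of(0L, 1L, 2L, 3L, 4L, 5L, 6L))) {
            throw new AssertionError("unexpected order: " + logical);
        }
        System.out.println("ok");
    }
}
```

The sketch ignores the hard cases the comment mentions (a missing block that must be decoded before its CRCs can be produced), which is where the real complexity lies.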