[ https://issues.apache.org/jira/browse/HDFS-8430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15080667#comment-15080667 ]
Kai Zheng commented on HDFS-8430:
---------------------------------

Thanks Nicholas for the correction. Yeah, I misunderstood. It's smart to adjust the algorithm on the replicated-files side to conform with striped files. The impact might be big for existing clusters, because identical replicated files would no longer produce equal checksums. To avoid that impact, how about adding a new API for the new behaviour?

In the new approach, we would need to introduce a {{cell}} size for replicated files, similar to striped files, when computing the checksum. If so, how do we determine it? When a replicated file is compared to a striped file, I guess we can use the cell size of the striped file for the replicated file. But then the cell size needs to be passed in when calling {{getFileChecksum}}, which should be fine if we introduce a new API.

I guess you want to use CRC64 to be safer against collisions than CRC32, and to keep network traffic smaller than MD5: {{64 bits x numCellsInOneBlock}} instead of {{16 bytes x numCellsInOneBlock}}. Please correct me if I missed your point. Thanks.

> Erasure coding: update DFSClient.getFileChecksum() logic for stripe files
> -------------------------------------------------------------------------
>
>                 Key: HDFS-8430
>                 URL: https://issues.apache.org/jira/browse/HDFS-8430
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>    Affects Versions: HDFS-7285
>            Reporter: Walter Su
>            Assignee: Kai Zheng
>         Attachments: HDFS-8430-poc1.patch
>
>
> HADOOP-3981 introduced a distributed file checksum algorithm. It's designed
> for replicated blocks.
> {{DFSClient.getFileChecksum()}} needs some updates so it can work for striped
> block groups.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
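To make the per-cell checksumming and the size comparison above concrete, here is a minimal sketch. It is not the HDFS implementation: the cell size (64 KB) and block size (4 MB) are hypothetical, and {{java.util.zip.CRC32}} stands in for the proposed CRC64 (which the JDK does not provide) just to illustrate computing one checksum per cell and the resulting per-block traffic of 8 bytes (CRC64) versus 16 bytes (MD5) per cell.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.zip.CRC32;

public class CellChecksumSketch {

    // Compute one CRC per fixed-size cell of a block, the way striped
    // files checksum cell by cell. CRC32 is a stand-in for CRC64 here.
    static List<Long> perCellCrcs(byte[] block, int cellSize) {
        List<Long> crcs = new ArrayList<>();
        for (int off = 0; off < block.length; off += cellSize) {
            int len = Math.min(cellSize, block.length - off);
            CRC32 crc = new CRC32();
            crc.update(block, off, len);
            crcs.add(crc.getValue());
        }
        return crcs;
    }

    public static void main(String[] args) {
        byte[] block = new byte[4 * 1024 * 1024]; // hypothetical 4 MB block
        int cellSize = 64 * 1024;                 // hypothetical 64 KB cell
        int numCells = perCellCrcs(block, cellSize).size();

        // Per-block checksum traffic: 8 bytes per CRC64 value versus
        // 16 bytes per MD5 digest, times the number of cells.
        System.out.println(numCells + " cells: CRC64 -> " + (numCells * 8)
            + " bytes, MD5 -> " + (numCells * 16) + " bytes");
    }
}
```

With 64 cells per block, this prints {{64 cells: CRC64 -> 512 bytes, MD5 -> 1024 bytes}}, i.e. half the checksum traffic, which I believe is the motivation for CRC64 over MD5-of-CRCs.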