[ 
https://issues.apache.org/jira/browse/HDFS-8430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15080667#comment-15080667
 ] 

Kai Zheng commented on HDFS-8430:
---------------------------------

Thanks Nicholas for the correction. Yeah, I misunderstood. It's smart to adjust 
the algorithm on the replicated-files side to conform with striped files. The 
impact might be big for existing clusters, though, because the checksums of 
their existing replicated files will change, so files that used to compare 
equal no longer will. To avoid that impact, how about adding a new API for the 
new behaviour? In the new approach, we would need to introduce a {{cell}} 
concept for replicated files, similar to striped files, when computing the 
checksum. If so, how do we determine the cell size? When a replicated file is 
compared to a striped file, I guess we can use the cell size of the striped 
file for the replicated file. But then the cell size needs to be passed in 
when calling {{getFileChecksum}}, which should be fine if we introduce a new API.
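To make the idea concrete, here is a minimal sketch (not the actual HDFS implementation) of a cell-based file checksum: the data is split into fixed-size cells, a CRC is computed per cell, and the per-cell CRCs are fed into a single digest. The method name {{cellChecksum}} and the use of CRC32 plus MD5 are illustrative assumptions; the point is only that the same data with the same cell size yields the same checksum regardless of whether the blocks are replicated or striped.

```java
import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.util.zip.CRC32;

public class CellChecksumSketch {
    // Hypothetical sketch: split the data into fixed-size cells (the cell
    // size would come from the striped file's layout), compute a CRC per
    // cell, then hash the stream of per-cell CRC values.
    static byte[] cellChecksum(byte[] data, int cellSize) throws Exception {
        MessageDigest digest = MessageDigest.getInstance("MD5");
        for (int off = 0; off < data.length; off += cellSize) {
            int len = Math.min(cellSize, data.length - off);
            CRC32 crc = new CRC32();
            crc.update(data, off, len);
            // Feed the 8-byte CRC value of each cell into the digest.
            digest.update(ByteBuffer.allocate(8).putLong(crc.getValue()).array());
        }
        return digest.digest();
    }

    public static void main(String[] args) throws Exception {
        byte[] data = new byte[3 * 1024];
        // Same data + same cell size => same checksum, independent of the
        // physical block layout underneath.
        byte[] a = cellChecksum(data, 1024);
        byte[] b = cellChecksum(data, 1024);
        System.out.println(java.util.Arrays.equals(a, b));
    }
}
```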

I guess you want to use CRC64 to be safer against collisions than CRC32, and 
to make the network traffic smaller than with MD5: {{8 bytes x 
numCellsInOneBlock}} instead of {{16 bytes x numCellsInOneBlock}}. Please 
correct me if I don't get your point. Thanks.
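The traffic saving is easy to check with illustrative numbers (a 128 MB block and a 1 MB cell size are assumptions here, not values from the patch): CRC64 carries 8 bytes per cell against MD5's 16, so the per-block checksum payload halves.

```java
public class ChecksumTrafficSketch {
    public static void main(String[] args) {
        // Illustrative numbers only: 128 MB block, 1 MB cell.
        long blockSize = 128L * 1024 * 1024;
        long cellSize = 1L * 1024 * 1024;
        long numCellsInOneBlock = blockSize / cellSize; // 128 cells

        // CRC64 is 8 bytes per cell; an MD5 digest is 16 bytes per cell.
        long crc64Bytes = 8 * numCellsInOneBlock;  // 1024 bytes per block
        long md5Bytes = 16 * numCellsInOneBlock;   // 2048 bytes per block
        System.out.println(crc64Bytes + " vs " + md5Bytes);
    }
}
```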

> Erasure coding: update DFSClient.getFileChecksum() logic for stripe files
> -------------------------------------------------------------------------
>
>                 Key: HDFS-8430
>                 URL: https://issues.apache.org/jira/browse/HDFS-8430
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>    Affects Versions: HDFS-7285
>            Reporter: Walter Su
>            Assignee: Kai Zheng
>         Attachments: HDFS-8430-poc1.patch
>
>
> HADOOP-3981 introduced a distributed file checksum algorithm. It's designed 
> for replicated blocks.
> {{DFSClient.getFileChecksum()}} needs some updates so it can work for striped 
> block groups.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)