[
https://issues.apache.org/jira/browse/HADOOP-3981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628814#action_12628814
]
Tsz Wo (Nicholas), SZE commented on HADOOP-3981:
------------------------------------------------
bq. Why not just use the MD5 or SHA1 of the CRCs?
MD5 requires sequential access to the data. One easy implementation of
MD5-over-CRCs is for the client to read all the CRCs from the Datanodes and
then compute MD5 over them. However, this requires reading all the first-level
CRCs, which amount to 800MB for a 100GB file. Is that too much network traffic?
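The 800MB figure can be reproduced with a small calculation. This is only a sketch: it assumes one 4-byte CRC32 per 512 bytes of data, which is not stated in this comment, and the class and method names are made up for illustration.

```java
public class CrcOverhead {
    /**
     * Bytes of first-level CRC data for a file.
     * bytesPerChecksum and crcWidth are assumptions (512 and 4 in main below).
     */
    static long crcBytes(long fileBytes, int bytesPerChecksum, int crcWidth) {
        // Round up so a trailing partial chunk still gets its own CRC.
        long chunks = (fileBytes + bytesPerChecksum - 1) / bytesPerChecksum;
        return chunks * crcWidth;
    }

    public static void main(String[] args) {
        long fileBytes = 100L << 30;              // 100 GB (binary)
        long crc = crcBytes(fileBytes, 512, 4);   // assumed: 4-byte CRC per 512 bytes
        System.out.println((crc >> 20) + " MB of CRCs"); // prints "800 MB of CRCs"
    }
}
```

Under these assumptions the CRC data is 1/128 of the file size, so the traffic grows linearly with the file.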
Raghu has a very good idea for another implementation, which computes MD5
across Datanodes as follows: the Client asks Datanode 1 (which has the first
block) to start the MD5 computation. Datanode 1 returns the intermediate state
of the MD5 computation to the Client, and the Client sends that intermediate
state to Datanode 2 (which has the second block). Datanode 2 then continues
the MD5 computation and returns its intermediate state to the Client, and so
on.
Note that this is a distributed algorithm but not a parallel one: each
Datanode has to wait for the state produced by the previous one. Another
problem for MD5 in this implementation is that there is no easy way to get the
intermediate state of an MD5 computation in Java 1.6.
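A minimal in-process sketch of the chained scheme follows. It is not a real implementation: the Datanodes are simulated by a hypothetical method `continueOn`, and a live `MessageDigest` object stands in for the intermediate state, since the standard API does not expose that state in a form that could be sent over the network. It only shows that the chained result equals MD5 over the concatenated data, and that each step depends on the previous one.

```java
import java.io.ByteArrayOutputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.List;

public class ChainedMd5 {
    static MessageDigest md5() {
        try {
            return MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is required on every Java platform
        }
    }

    /** Hypothetical Datanode step: continue the digest over locally held CRCs. */
    static MessageDigest continueOn(MessageDigest state, byte[] localCrcs) {
        state.update(localCrcs);
        return state; // stands in for "return the intermediate state to the Client"
    }

    /** Client loop: visit the Datanodes strictly in block order. */
    static byte[] chained(List<byte[]> perDatanodeCrcs) {
        MessageDigest state = md5();
        for (byte[] crcs : perDatanodeCrcs) {
            state = continueOn(state, crcs); // sequential: no step can start early
        }
        return state.digest();
    }

    /** Reference: the Client reads everything and digests it in one place. */
    static byte[] centralized(List<byte[]> perDatanodeCrcs) {
        ByteArrayOutputStream all = new ByteArrayOutputStream();
        for (byte[] crcs : perDatanodeCrcs) {
            all.write(crcs, 0, crcs.length);
        }
        return md5().digest(all.toByteArray());
    }

    public static void main(String[] args) {
        List<byte[]> crcs = Arrays.asList(new byte[] {1, 2}, new byte[] {3}, new byte[] {4, 5, 6});
        System.out.println(Arrays.equals(chained(crcs), centralized(crcs))); // prints "true"
    }
}
```

The data stays on the Datanodes and only the small state travels, but the visits cannot overlap, which is why the scheme is distributed without being parallel.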
bq. It is more appealing to have a small, fixed size checksum.
This is probably good. I will think about this.
> Need a distributed file checksum algorithm for HDFS
> ---------------------------------------------------
>
> Key: HADOOP-3981
> URL: https://issues.apache.org/jira/browse/HADOOP-3981
> Project: Hadoop Core
> Issue Type: New Feature
> Components: dfs
> Reporter: Tsz Wo (Nicholas), SZE
>
> Traditional message digest algorithms, like MD5, SHA1, etc., require reading
> the entire input message sequentially in a central location. HDFS supports
> large files of multiple terabytes. The overhead of reading the entire
> file is huge. A distributed file checksum algorithm is needed for HDFS.