[ https://issues.apache.org/jira/browse/HADOOP-1134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12484974 ]
Sameer Paranjpye commented on HADOOP-1134:
------------------------------------------
> If we cannot get old CRC data for any reason, we will generate one based on
> the local data (which could be wrong). There are two options to validate
> upgraded data (for simplicity, not all the details and error conditions are
> explained):
> 1) use old CRCs (Doug's choice)
> 2) check CRC of each replica and choose the majority (Sameer's choice)
> 3) Combination of (1) and (2). i.e. use (2) if (1) fails etc. This option is
> proposed only now.
Even if we *can* get the old CRC data, how do we know that it is not corrupt?
There are 3 copies of each CRC file, and one or more of these could be corrupt. We
need some way to ensure that we're copying correct checksum data to the
Datanode. As I said before, we can do this by comparing copies of the existing
CRC data against each other and electing a set of authorities OR by validating
checksum data that we pull against the local blocks.
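
The two checks described above could be sketched roughly as follows. This is
only an illustration, not the upgrade code: the function names, the strict
majority rule, and the 512-byte chunk size are assumptions made for the
example, and zlib's CRC-32 stands in for whatever checksum the upgrade
actually uses.

```python
import zlib
from collections import Counter


def elect_checksum(replica_crcs):
    """Return the CRC list held by a strict majority of copies, else None.

    replica_crcs: one list of per-chunk CRCs for each copy of the CRC file.
    (Hypothetical helper; "strict majority" is an assumed election rule.)
    """
    if not replica_crcs:
        return None
    counts = Counter(tuple(crcs) for crcs in replica_crcs)
    candidate, votes = counts.most_common(1)[0]
    return list(candidate) if votes * 2 > len(replica_crcs) else None


def block_crcs(block, bytes_per_checksum=512):
    """Compute per-chunk CRCs from the local block data (assumed chunk size)."""
    return [
        zlib.crc32(block[i:i + bytes_per_checksum])
        for i in range(0, len(block), bytes_per_checksum)
    ]


def choose_checksum(replica_crcs, local_block, bytes_per_checksum=512):
    """Prefer an elected majority; otherwise accept a pulled copy only if it
    matches the checksums computed from the local block."""
    elected = elect_checksum(replica_crcs)
    if elected is not None:
        return elected, "majority"
    local = block_crcs(local_block, bytes_per_checksum)
    for crcs in replica_crcs:
        if list(crcs) == local:
            return local, "validated-against-local"
    # No copy agrees with a majority or with the local data.
    return None, "unresolved"
```

When no copy wins a majority and none matches the local block, the sketch
reports the block as unresolved rather than silently generating a new CRC
from data that may itself be wrong.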
> Block level CRCs in HDFS
> ------------------------
>
> Key: HADOOP-1134
> URL: https://issues.apache.org/jira/browse/HADOOP-1134
> Project: Hadoop
> Issue Type: New Feature
> Components: dfs
> Reporter: Raghu Angadi
> Assigned To: Raghu Angadi
>
> Currently CRCs are handled at the FileSystem level and are transparent to core
> HDFS. See the recent improvement HADOOP-928 (which can add checksums to a given
> filesystem) for more about it. Though this has served us well, there are a few
> disadvantages:
> 1) This doubles the namespace in HDFS (or other filesystem implementations). In
> many cases, it nearly doubles the number of blocks. Taking the namenode out of
> CRCs would nearly double namespace performance, both in terms of CPU and
> memory.
> 2) Since CRCs are transparent to HDFS, it cannot actively detect corrupted
> blocks. With block level CRCs, the Datanode can periodically verify the checksums
> and report corruptions to the namenode so that new replicas can be created.
> We propose to have CRCs maintained for all HDFS data in much the same way as
> in GFS. I will update the jira with detailed requirements and design. This
> will include the same guarantees provided by the current implementation and will
> include an upgrade of current data.
>