[
https://issues.apache.org/jira/browse/HADOOP-1134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12485317
]
Doug Cutting commented on HADOOP-1134:
--------------------------------------
> When the HDFS client encounters a checksum error, it doesn't know whether it
> is the data or the checksum that is corrupt.
Okay, I see your point. If we import only a single replica of the checksums on
upgrade, then we'd increase the false-positive rate of checksum errors, right?
But for every false positive we'd still have a hundred real data corruptions
(the checksum data is only a tiny fraction of the data it covers, so random
corruption hits the data far more often), so I'm not sure this is a big deal.
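To make the ambiguity concrete, here's a minimal sketch (plain
java.util.zip.CRC32 standing in for HDFS's checksum machinery; the class and
method names are illustrative, not actual client code):

{code:java}
import java.util.zip.CRC32;

class ChecksumCheck {
  // When this returns false we know *something* is corrupt, but not
  // whether it's the data bytes or the stored checksum value: both
  // failure modes produce exactly the same mismatch.
  static boolean matches(byte[] data, long storedCrc) {
    CRC32 crc = new CRC32();
    crc.update(data, 0, data.length);
    return crc.getValue() == storedCrc;
  }
}
{code}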
> The system as it exists today deals with the possibility of checksum
> corruption [ ... ]
Yes, in some cases. If the corruption happened to a checksum on a datanode,
then it does. If it happened before the data reached the datanode, then it
doesn't.
So, sure, we could read all copies of the checksum data when upgrading and
vote. It adds complexity, introducing potential bugs, but with some benefit. +0
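For what it's worth, the vote itself is trivial; the complexity is in the
plumbing around it. A rough sketch, assuming the upgrade code can fetch each
replica's checksum bytes (the helper below is hypothetical, not existing code):

{code:java}
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ChecksumVote {
  // Return the checksum bytes that a strict majority of replicas agree
  // on, or null when there is no majority (e.g. all three copies differ).
  static byte[] vote(List<byte[]> replicaChecksums) {
    Map<String, Integer> counts = new HashMap<String, Integer>();
    byte[] winner = null;
    int winnerCount = 0;
    for (byte[] sums : replicaChecksums) {
      String key = Arrays.toString(sums);
      Integer prev = counts.get(key);
      int count = (prev == null) ? 1 : prev + 1;
      counts.put(key, count);
      if (count > winnerCount) {
        winnerCount = count;
        winner = sums;
      }
    }
    return (winnerCount > replicaChecksums.size() / 2) ? winner : null;
  }
}
{code}

When there is no majority we're back to the single-replica situation and would
just have to pick one, which is exactly the false-positive risk discussed above.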
However, I don't see how checking the data against the checksums during
upgrade helps much. If they don't agree, the block is probably the corrupt
one, but it could be the checksum (or both). It seems the best we can do in
this case is let the client discover the problem if/when the data is read
and, if it insists, use the unvalidated data. Or am I missing something?
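Concretely, something along these lines on the read path (ChecksumException
is the real exception type; the reader interface and method names are
hypothetical):

{code:java}
import java.io.IOException;
import org.apache.hadoop.fs.ChecksumException;

class ValidatedRead {
  // Hypothetical reader: read() verifies block CRCs and throws
  // ChecksumException on a mismatch; readRaw() skips verification.
  interface BlockReader {
    int read(byte[] buf) throws IOException;
    int readRaw(byte[] buf) throws IOException;
  }

  // Verify by default, but let a caller who insists on the raw bytes
  // read past a checksum mismatch.
  static int readBlock(BlockReader reader, byte[] buf,
                       boolean allowUnvalidated) throws IOException {
    try {
      return reader.read(buf);      // normal, validated read
    } catch (ChecksumException e) {
      if (!allowUnvalidated) {
        throw e;                    // surface the corruption
      }
      return reader.readRaw(buf);   // caller opted out of validation
    }
  }
}
{code}

That's roughly the escape hatch ChecksumFileSystem already offers with
setVerifyChecksum(false).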
> Block level CRCs in HDFS
> ------------------------
>
> Key: HADOOP-1134
> URL: https://issues.apache.org/jira/browse/HADOOP-1134
> Project: Hadoop
> Issue Type: New Feature
> Components: dfs
> Reporter: Raghu Angadi
> Assigned To: Raghu Angadi
>
> Currently CRCs are handled at the FileSystem level and are transparent to
> core HDFS. See the recent improvement HADOOP-928 (which can add checksums to
> a given filesystem) for more about it. Though this has served us well, there
> are a few disadvantages:
> 1) This doubles the namespace in HDFS (or other filesystem implementations).
> In many cases, it nearly doubles the number of blocks. Taking the namenode
> out of CRCs would nearly double namespace performance, both in terms of CPU
> and memory.
> 2) Since CRCs are transparent to HDFS, it cannot actively detect corrupted
> blocks. With block-level CRCs, the datanode can periodically verify the
> checksums and report corruptions to the namenode so that new replicas can be
> created.
> We propose to maintain CRCs for all HDFS data in much the same way as in
> GFS. I will update the jira with detailed requirements and a design. This
> will provide the same guarantees as the current implementation and will
> include an upgrade of current data.
>
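Regarding (2) in the quoted description, here's a rough sketch of what the
periodic datanode verification could look like, simplified to a single CRC per
block (a real design would keep a CRC per fixed-size chunk); all names here
are hypothetical:

{code:java}
import java.io.FileInputStream;
import java.io.IOException;
import java.util.zip.CRC32;

class BlockScanner {
  interface StoredBlock {
    String dataPath();   // local path of the block file
    long storedCrc();    // CRC recorded when the block was written
  }

  interface CorruptionReporter {
    void reportCorruptBlock(StoredBlock b);  // tell the namenode
  }

  // Recompute each block's CRC, compare with the stored value, and
  // report mismatches so a fresh replica can be made from a good copy.
  static void scan(Iterable<StoredBlock> blocks, CorruptionReporter namenode)
      throws IOException {
    byte[] buf = new byte[64 * 1024];
    for (StoredBlock b : blocks) {
      CRC32 crc = new CRC32();
      FileInputStream in = new FileInputStream(b.dataPath());
      try {
        int n;
        while ((n = in.read(buf)) > 0) {
          crc.update(buf, 0, n);
        }
      } finally {
        in.close();
      }
      if (crc.getValue() != b.storedCrc()) {
        namenode.reportCorruptBlock(b);  // triggers re-replication
      }
    }
  }
}
{code}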