[ https://issues.apache.org/jira/browse/HADOOP-1134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12484957 ]

Sameer Paranjpye commented on HADOOP-1134:
------------------------------------------

This is complicated. The general design direction appears to be the right one, 
but I think some key details need to be spelled out.

We say we're going to use the default of 512 bytes/chksum. What do we do for 
HDFS installations that use a different value in configuration? What do we do 
for installations that have used more than one value of bytes/chksum when 
generating data?
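
For reference, an upgrade tool could read the value actually in use on an 
installation rather than assume the default. A minimal sketch; 
Configuration.getInt() is the standard Hadoop config accessor, the wrapper 
class is purely illustrative:

{code:java}
import org.apache.hadoop.conf.Configuration;

public class BytesPerChecksum {
  /** Returns the configured chunk size, falling back to the 512-byte default. */
  public static int get(Configuration conf) {
    return conf.getInt("io.bytes.per.checksum", 512);
  }
}
{code}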

One option is to simply use the existing checksum data, as is, with the 
understanding that we could end up with different values of bytes/chksum across 
HDFS installations and across different files in the same installation. The 
alternative would be to re-generate checksum data on the Datanodes with 512 
bytes/chksum and validate the new checksums against the existing data.
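
To make the second option concrete, here is a rough sketch of what 
re-generation could look like: verify the block against its existing checksums 
(at whatever bytes/chksum it was written with), then write new CRC32s at 512 
bytes/chunk. The flat 4-bytes-of-CRC32-per-chunk layout and the class and file 
names are assumptions for illustration only, not the actual on-disk format.

{code:java}
// Sketch only: assumed layout is a flat sequence of 4-byte big-endian CRC32
// values, one per chunk; names are illustrative.
import java.io.*;
import java.util.zip.CRC32;

public class ChecksumRegenerator {

  static final int NEW_BYTES_PER_CHECKSUM = 512;

  /**
   * Verifies blockFile against oldSums (computed at oldBytesPerChecksum) and,
   * if every chunk matches, writes a new checksum file at 512 bytes/chunk.
   */
  public static void regenerate(File blockFile, File oldSums, int oldBytesPerChecksum,
                                File newSums) throws IOException {
    // Pass 1: validate the existing data with the old checksums.
    verify(blockFile, oldSums, oldBytesPerChecksum);

    // Pass 2: recompute checksums at the new chunk size.
    try (InputStream in = new BufferedInputStream(new FileInputStream(blockFile));
         DataOutputStream out = new DataOutputStream(
             new BufferedOutputStream(new FileOutputStream(newSums)))) {
      byte[] buf = new byte[NEW_BYTES_PER_CHECKSUM];
      int n;
      while ((n = readChunk(in, buf)) > 0) {
        CRC32 crc = new CRC32();
        crc.update(buf, 0, n);
        out.writeInt((int) crc.getValue());
      }
    }
  }

  private static void verify(File blockFile, File sums, int bytesPerChecksum)
      throws IOException {
    try (InputStream in = new BufferedInputStream(new FileInputStream(blockFile));
         DataInputStream sumIn = new DataInputStream(
             new BufferedInputStream(new FileInputStream(sums)))) {
      byte[] buf = new byte[bytesPerChecksum];
      int n;
      while ((n = readChunk(in, buf)) > 0) {
        CRC32 crc = new CRC32();
        crc.update(buf, 0, n);
        if ((int) crc.getValue() != sumIn.readInt()) {
          throw new IOException("Checksum mismatch in " + blockFile);
        }
      }
    }
  }

  /** Fills buf as far as possible; returns bytes read, or -1 at end of file. */
  private static int readChunk(InputStream in, byte[] buf) throws IOException {
    int total = 0;
    while (total < buf.length) {
      int r = in.read(buf, total, buf.length - total);
      if (r < 0) break;
      total += r;
    }
    return total == 0 ? -1 : total;
  }
}
{code}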

Do we keep the io.bytes.per.checksum configuration variable or do we kill it? 

Do we simply copy existing checksum data or do we re-generate it?

I don't think simply copying checksum data is enough, since the checksums can 
themselves be corrupt. We need some level of validation. We can compare copies 
of the checksum data against each other; if we find a majority of copies that 
match, then we treat those as authoritative. But what happens when we don't find 
a majority? Alternatively, we can re-generate checksum data on the Datanode and 
validate it against the existing data.
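
The majority comparison itself is straightforward if we treat each replica's 
checksum data as an opaque byte array; roughly (names illustrative):

{code:java}
import java.util.Arrays;
import java.util.List;

public class ChecksumVote {

  /** Returns checksum data shared by a strict majority of replicas, or null. */
  public static byte[] majority(List<byte[]> copies) {
    for (byte[] candidate : copies) {
      int votes = 0;
      for (byte[] other : copies) {
        if (Arrays.equals(candidate, other)) {
          votes++;
        }
      }
      if (votes * 2 > copies.size()) {
        return candidate;
      }
    }
    return null;  // no majority: fall back to re-generating from the block data
  }
}
{code}

When this returns null, the fallback would be re-generating checksums from the 
block data and validating against it, as described above.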

How does a Datanode discover authoritative sources of checksum data for its 
blocks?

This is presumably done with a call to the Namenode that, given a block id, 
responds with the name of a checksum file. The Datanode then reads the header, 
determines the offset and length where the specified block's checksums lie, then 
reads the checksum data and validates it. This works while the upgrade is in 
progress, but perhaps it can be extended to deal with Datanodes that join the 
system after the upgrade is complete. If a Datanode joins after a complete 
upgrade and crc file deletion, the Namenode could redirect it to other 
Datanodes that have copies of its blocks; the new Datanode can then pull 
block-level CRC files from its peers, validate its data, and perform an upgrade 
even though the .crc files are gone.
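
Roughly, the per-block flow would look like the sketch below. None of these 
types or methods exist in Hadoop today; they stand in for whatever RPC the 
Namenode (or a peer Datanode) ends up exposing, and only illustrate the 
sequence of steps: resolve the block id to a checksum file, read its header, 
locate this block's checksum range, fetch it, and validate the local block 
data against it.

{code:java}
// Sketch only: ChecksumSource and the 4-bytes-of-CRC32-per-512-byte-chunk
// layout are assumptions, not the actual HDFS interfaces or format.
import java.io.*;
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

public class BlockCrcUpgrader {

  /** Hypothetical source of checksum data (Namenode lookup plus a peer read). */
  interface ChecksumSource {
    String checksumFileFor(long blockId);                   // which .crc file covers this block
    byte[] readHeader(String checksumFile);                 // header: bytes/chksum, per-block index
    long[] checksumRangeFor(byte[] header, long blockId);   // {offset, length} within the .crc file
    byte[] readChecksums(String checksumFile, long offset, long length);
  }

  private static final int BYTES_PER_CHECKSUM = 512;        // assumed chunk size after upgrade

  private final ChecksumSource source;

  BlockCrcUpgrader(ChecksumSource source) {
    this.source = source;
  }

  /** Upgrades one block: fetch its checksum data and validate the local copy. */
  void upgradeBlock(long blockId, File blockFile) throws IOException {
    String crcFile = source.checksumFileFor(blockId);
    byte[] header = source.readHeader(crcFile);
    long[] range = source.checksumRangeFor(header, blockId);
    byte[] sums = source.readChecksums(crcFile, range[0], range[1]);
    validate(blockFile, sums);
    // On success, 'sums' would be persisted as the block-level CRC metadata.
  }

  /** Compares per-chunk CRC32s of the block against sums (4 bytes per chunk, big-endian). */
  private void validate(File blockFile, byte[] sums) throws IOException {
    try (InputStream in = new BufferedInputStream(new FileInputStream(blockFile))) {
      byte[] buf = new byte[BYTES_PER_CHECKSUM];
      ByteBuffer expected = ByteBuffer.wrap(sums);
      int n;
      while ((n = readChunk(in, buf)) > 0) {
        CRC32 crc = new CRC32();
        crc.update(buf, 0, n);
        if ((int) crc.getValue() != expected.getInt()) {
          throw new IOException("Checksum mismatch in " + blockFile);
        }
      }
    }
  }

  /** Fills buf as far as possible; returns bytes read, or -1 at end of file. */
  private static int readChunk(InputStream in, byte[] buf) throws IOException {
    int total = 0;
    while (total < buf.length) {
      int r = in.read(buf, total, buf.length - total);
      if (r < 0) break;
      total += r;
    }
    return total == 0 ? -1 : total;
  }
}
{code}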

Thoughts?



> Block level CRCs in HDFS
> ------------------------
>
>                 Key: HADOOP-1134
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1134
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: dfs
>            Reporter: Raghu Angadi
>         Assigned To: Raghu Angadi
>
> Currently CRCs are handled at the FileSystem level and are transparent to core 
> HDFS. See the recent improvement HADOOP-928 (which can add checksums to a given 
> filesystem) for more about it. Though this has served us well, there are a few 
> disadvantages:
> 1) This doubles the namespace in HDFS (or other filesystem implementations). In 
> many cases, it nearly doubles the number of blocks. Taking the namenode out of 
> CRCs would nearly double namespace performance, both in terms of CPU and 
> memory.
> 2) Since CRCs are transparent to HDFS, it cannot actively detect corrupted 
> blocks. With block-level CRCs, the Datanode can periodically verify the 
> checksums and report corruptions to the namenode so that new replicas can be 
> created.
> We propose to have CRCs maintained for all HDFS data in much the same way as 
> in GFS. I will update the jira with detailed requirements and design. This 
> will include the same guarantees provided by the current implementation and 
> will include an upgrade of current data.
>  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
