[ https://issues.apache.org/jira/browse/HADOOP-1134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12485026 ]

Sameer Paranjpye commented on HADOOP-1134:
------------------------------------------

> Sameer, does the following sum up your proposal (a refinement of option (3)
> above):
>
> 1) For each block for which a CRC is generated (in one of the ways mentioned
> below), the Datanode reports the CRC of the checksum file to the Namenode.

Not really. I don't think I mentioned the Namenode keeping CRCs of block CRCs. 

There is another issue that does not appear to have been dealt with yet: we say
that the Datanode locally generates CRCs if the .crc file for a block is
unavailable. We need to be very careful here, IMO. During registration, many,
many blocks are missing simply because they haven't been reported in yet. We
don't want the Namenode to report missing .crcs to the Datanodes until a
certain threshold of blocks has been reported; that threshold should probably
be the same as dfs.safemode.threshold.pct. A rough sketch of such a gate
follows.
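For concreteness, a minimal sketch of that gate, assuming hypothetical
counters for total and reported blocks; the class and member names below are
illustrative, not Hadoop code:

    // Illustrative sketch only: hold back "missing .crc" verdicts until the
    // reported-block fraction crosses the safe-mode threshold.
    class MissingCrcReportGate {
      private final float threshold;   // value of dfs.safemode.threshold.pct
      private final long totalBlocks;  // blocks the Namenode expects
      private long blocksReported;     // blocks reported in so far

      MissingCrcReportGate(float threshold, long totalBlocks) {
        this.threshold = threshold;
        this.totalBlocks = totalBlocks;
      }

      void onBlockReport(long blocksInReport) {
        blocksReported += blocksInReport;
      }

      /** True once enough blocks are in to trust a "missing .crc" verdict. */
      boolean mayReportMissingCrcs() {
        return totalBlocks == 0
            || (float) blocksReported / totalBlocks >= threshold;
      }
    }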

I would propose the following:

1) Extend the Namenode interface with a getChecksumAuthority(long blockId)
method. This method takes a block-id as input and responds with a <type,
authority, offset> tuple, where 'authority' is the name of the .crc file in
the Namenode and 'offset' is the offset of the specified block in the *data*
file. It throws an appropriate exception when the input block-id is
unrecognized, belongs to a .crc file, or the checksum authority is missing.
The 'type' field indicates the authority type, which is one of CRCFILE,
DATANODE or YOU; the latter two codes are used by the Namenode when it has
determined that blocks of a .crc file are missing. A sketch of this interface
follows.
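A minimal sketch of what that extension might look like, assuming the tuple is
returned as a small struct-like class (type and field names are illustrative):

    import java.io.IOException;

    // Illustrative sketch of the proposed Namenode interface extension.
    enum AuthorityType { CRCFILE, DATANODE, YOU }

    final class ChecksumAuthority {
      final AuthorityType type; // who supplies the checksums for this block
      final String authority;   // name of the .crc file (for CRCFILE)
      final long offset;        // offset of the block in the *data* file

      ChecksumAuthority(AuthorityType type, String authority, long offset) {
        this.type = type;
        this.authority = authority;
        this.offset = offset;
      }
    }

    interface NamenodeChecksumUpgrade {
      /**
       * Resolve where the checksums for blockId live. Throws when the
       * block-id is unrecognized, belongs to a .crc file, or the checksum
       * authority is missing.
       */
      ChecksumAuthority getChecksumAuthority(long blockId) throws IOException;
    }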

2) Each Datanode does the following during the upgrade (a sketch of the loop
follows the list):
  - For each block it calls getChecksumAuthority(), discovers the checksum
file, opens it for read, reads the header and discovers the bytes/checksum for
the current block
  - If getChecksumAuthority() fails, the Datanode moves on to the next block;
it will return to this block when it has run through all its remaining blocks
  - It uses the bytes/checksum and the data offset to determine where in the
.crc file the current block's checksums lie
  - It reads the checksums from one of the replicas and validates the block
data against them; if validation succeeds, the checksum data is written to
disk, the checksum upgrade for the current block is reported to the Namenode,
and it moves on to the next block
  - If validation fails, it tries the other replicas of the checksum data. If
validation fails against all checksum replicas, it arbitrarily chooses one
replica, copies checksum data from it and reports a corrupt block to the
Namenode
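A compressed sketch of that loop, reusing the types from the sketch above;
Block, Replica, CrcHeader and every abstract helper are assumptions for
illustration, not Hadoop APIs:

    import java.io.IOException;
    import java.util.Deque;
    import java.util.List;

    final class Block { long id; long length; }
    final class Replica { /* location of one replica of the .crc file */ }
    final class CrcHeader { int bytesPerChecksum; int checksumSize; }

    // Illustrative sketch of the Datanode-side upgrade pass (step 2).
    abstract class ChecksumUpgradeWorker {
      abstract ChecksumAuthority getChecksumAuthority(long id) throws IOException;
      abstract CrcHeader readCrcHeader(String crcFile) throws IOException;
      abstract List<Replica> replicasOf(String crcFile);
      abstract byte[] readChecksums(Replica r, long off, long len) throws IOException;
      abstract boolean validate(Block b, byte[] sums, int bytesPerChecksum);
      abstract void writeChecksumFile(Block b, byte[] sums) throws IOException;
      abstract void reportUpgraded(long blockId); // tell the Namenode: upgraded
      abstract void reportCorrupt(long blockId);  // tell the Namenode: corrupt

      void upgrade(Deque<Block> blocks) throws IOException {
        while (!blocks.isEmpty()) {
          Block block = blocks.removeFirst();
          ChecksumAuthority auth;
          try {
            auth = getChecksumAuthority(block.id);
          } catch (IOException e) {
            blocks.addLast(block); // revisit after the remaining blocks;
            continue;              // step 3 guarantees an eventual answer
          }
          CrcHeader header = readCrcHeader(auth.authority);
          // bytes/checksum plus the data offset locate this block's checksums
          // (the .crc header length is omitted from the arithmetic for brevity).
          long crcOff = (auth.offset / header.bytesPerChecksum) * header.checksumSize;
          boolean ok = false;
          for (Replica replica : replicasOf(auth.authority)) {
            byte[] sums = readChecksums(replica, crcOff, block.length);
            if (validate(block, sums, header.bytesPerChecksum)) {
              writeChecksumFile(block, sums); // persist, then report upgrade
              reportUpgraded(block.id);
              ok = true;
              break;
            }
          }
          if (!ok) {
            // No replica validates: keep one arbitrary copy, flag the block.
            Replica any = replicasOf(auth.authority).get(0);
            writeChecksumFile(block, readChecksums(any, crcOff, block.length));
            reportCorrupt(block.id);
          }
        }
      }
    }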

3) When the Namenode determines that the .crc file corresponding to a block is
unavailable, it chooses one of the Datanodes hosting the block as a
representative to locally generate CRCs for the block. It does so by sending
YOU in the type field when getChecksumAuthority is invoked. For the remaining
Datanodes hosting the block, the Namenode sends DATANODE in the type field and
asks them to copy CRCs from the chosen representative. A sketch of this
selection follows.
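One way the Namenode might track the representative per block; DatanodeID and
the representatives map are assumed bookkeeping, not Hadoop code:

    // Illustrative sketch of the Namenode-side choice in step 3. The first
    // Datanode to ask about a block whose .crc file is missing becomes the
    // representative and generates CRCs locally (YOU); the rest copy from it
    // (DATANODE).
    AuthorityType authorityTypeFor(long blockId, DatanodeID caller,
                                   java.util.Map<Long, DatanodeID> representatives) {
      DatanodeID rep = representatives.computeIfAbsent(blockId, id -> caller);
      return rep.equals(caller) ? AuthorityType.YOU : AuthorityType.DATANODE;
    }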

The upgrade is considered complete when dfs.replication.min replicas of all
known blocks have been transitioned to block-level CRCs; a sketch of this
check follows.

In some cases this condition will never be met, for instance because some data
blocks are MIA.
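The completion condition might be checked along these lines, where
upgradedReplicas is an assumed per-block counter fed by the Datanode reports
from step 2:

    // Illustrative sketch: the upgrade is done once every known block has at
    // least dfs.replication.min replicas with block-level CRCs.
    boolean upgradeComplete(Iterable<Long> knownBlocks, int replicationMin,
                            java.util.Map<Long, Integer> upgradedReplicas) {
      for (long blockId : knownBlocks) {
        if (upgradedReplicas.getOrDefault(blockId, 0) < replicationMin) {
          return false;
        }
      }
      return true;
    }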

During the upgrade process 'dfs -report' should indicate how many blocks have 
not been upgraded and for what reason. It should also indicate whether 
a) the upgrade is incomplete
b) the upgrade is complete
c) the upgrade is wedged because some blocks are missing

If b) or c) occurs, the sysadmin can issue a 'finishUpgrade' command to the
Namenode, which causes the .crc files to be removed and their blocks to be
marked for deletion; a sketch follows. Note that this is different from
'finalizeUpgrade', which causes state from the previous version to be
discarded. Datanodes that join the system after the upgrade is finished are
handled using 3) above.
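A sketch of what 'finishUpgrade' would do on the Namenode; the namesystem
accessors below are all assumed for illustration:

    // Illustrative sketch of the proposed 'finishUpgrade' command (distinct
    // from the existing 'finalizeUpgrade', which discards previous-version
    // state).
    void finishUpgrade(Namesystem namesystem) {
      for (FileEntry file : namesystem.listFiles()) {
        if (file.getName().endsWith(".crc")) {       // old checksum shadow file
          for (long blockId : file.getBlockIds()) {
            namesystem.markBlockForDeletion(blockId); // Datanodes purge later
          }
          namesystem.removeFromNamespace(file);       // drop the .crc entry
        }
      }
    }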

This is a more complex proposal, but the additional complexity has been 
introduced in order to provide much stronger correctness guarantees, so I feel 
that it is warranted. Comments welcome.

> Block level CRCs in HDFS
> ------------------------
>
>                 Key: HADOOP-1134
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1134
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: dfs
>            Reporter: Raghu Angadi
>         Assigned To: Raghu Angadi
>
> Currently CRCs are handled at the FileSystem level and are transparent to
> core HDFS. See the recent improvement HADOOP-928 (which can add checksums to
> a given filesystem) for more about it. Though this has served us well, there
> are a few disadvantages:
> 1) This doubles the namespace in HDFS (or other filesystem implementations).
> In many cases, it nearly doubles the number of blocks. Taking the Namenode
> out of CRCs would nearly double namespace performance, both in terms of CPU
> and memory.
> 2) Since CRCs are transparent to HDFS, it cannot actively detect corrupted
> blocks. With block-level CRCs, the Datanode can periodically verify the
> checksums and report corruptions to the Namenode so that new replicas can be
> created.
> We propose to have CRCs maintained for all HDFS data in much the same way as
> in GFS. I will update the jira with detailed requirements and design. This
> will include the same guarantees provided by the current implementation and
> will include an upgrade of current data.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
