[
https://issues.apache.org/jira/browse/HADOOP-1134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12485026
]
Sameer Paranjpye commented on HADOOP-1134:
------------------------------------------
> Sameer, does the following sum up your proposal (a refinement of option (3)
> above):
>
> 1) For each block for which a CRC is generated (in one of the ways mentioned
> below), the Datanode reports the CRC of the checksum file to the Namenode.
>
>
Not really. I don't think I mentioned the Namenode keeping CRCs of block CRCs.
There is another issue that does not appear to have been dealt with yet: we say
that the Datanode locally generates CRCs if the .crc file for a block is
unavailable. We need to be very careful here, IMO. During registration, many
blocks appear to be missing simply because they haven't been reported yet. The
Namenode should not tell Datanodes that .crc files are missing until the
fraction of reported blocks has reached a certain threshold, probably the same
value as dfs.safemode.threshold.pct.
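To make this concrete, here is a minimal sketch of the kind of gate I have in
mind (names are illustrative only, not actual Namenode code): the Namenode only
starts declaring .crc blocks missing once the fraction of reported blocks has
crossed the configured threshold, analogous to the safe-mode check.

{code:java}
// Illustrative sketch only -- not actual Namenode code.
// The Namenode refuses to declare .crc blocks missing until enough block
// reports have arrived, analogous to the safe-mode threshold check.
public class ChecksumMissingGate {

  private final long totalBlocks;  // blocks known from the image/edits
  private long reportedBlocks;     // blocks reported by Datanodes so far
  private final float threshold;   // e.g. the dfs.safemode.threshold.pct value

  public ChecksumMissingGate(long totalBlocks, float threshold) {
    this.totalBlocks = totalBlocks;
    this.threshold = threshold;
  }

  public synchronized void blockReported() {
    reportedBlocks++;
  }

  /** Only after this returns true may the Namenode tell a Datanode that a
   *  .crc file is genuinely missing (and hand out DATANODE / YOU authority). */
  public synchronized boolean mayDeclareChecksumsMissing() {
    return totalBlocks == 0
        || ((float) reportedBlocks / totalBlocks) >= threshold;
  }
}
{code}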
I would propose the following:
1) Extend the Namenode interface: add a getChecksumAuthority(long blockId)
method. It takes a block id as input and responds with a <type, authority,
offset> tuple, where 'authority' is the name of the .crc file in the Namenode
namespace and 'offset' is the offset of the specified block within the *data*
file. It throws an appropriate exception when the block id is unrecognized,
belongs to a .crc file, or the checksum authority is missing. The 'type' field
indicates the authority type, which is one of CRCFILE, DATANODE or YOU. The
latter two codes are used by the Namenode when it has determined that blocks of
a .crc file are missing. (A rough sketch of this interface follows the list
below.)
2) Each Datanode does the following during the upgrade:
- For each block it calls getChecksumAuthority(), discovers the checksum
file, opens it for read, reads the header, and discovers the bytes/checksum
value for the current block.
- If getChecksumAuthority() fails, the Datanode moves on to the next block;
it will return to this block once it has run through all its remaining blocks.
- It uses the bytes/checksum value and the data offset to determine where in
the .crc file the current block's checksums lie (the arithmetic is sketched
after this list).
- It reads the checksums from one of the replicas and validates the block
data against them. If validation succeeds, the checksum data is written to
disk, the checksum upgrade of the current block is reported to the Namenode,
and the Datanode moves on to the next block.
- If validation fails, it tries the other replicas of the checksum data. If
validation fails against all checksum replicas, it arbitrarily chooses one
replica, copies the checksum data from it, and reports a corrupt block to the
Namenode.
3) When the Namenode determines that the .crc file corresponding to a block is
unavailable, it chooses one of the Datanodes hosting the block as a
representative to locally generate CRCs for the block. It does so by sending
YOU in the type field when getChecksumAuthority is invoked. For the remaining
Datanodes hosting the block, the Namenode sends DATANODE in the type field and
asks them to copy the CRCs from the chosen representative.
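To make the shape of 1) and 2) concrete, here is a rough sketch of what the new
call and the Datanode-side arithmetic could look like. All names, the header
layout, and the 4-byte CRC size are assumptions for illustration, not a
finished design.

{code:java}
// Illustrative sketch only -- names and layout are assumptions.

/** Who is authoritative for a block's checksums during the upgrade. */
enum ChecksumAuthorityType {
  CRCFILE,   // read checksums out of the existing .crc file
  DATANODE,  // copy checksums from the chosen representative Datanode
  YOU        // this Datanode is the representative: generate CRCs locally
}

/** The <type, authority, offset> tuple returned by the Namenode. */
class ChecksumAuthority {
  ChecksumAuthorityType type;
  String authority;  // name of the .crc file (or of the representative Datanode)
  long offset;       // offset of the block within the *data* file
}

interface ChecksumUpgradeProtocol {
  /**
   * Throws an exception if blockId is unknown, belongs to a .crc file,
   * or the checksum authority cannot be resolved yet.
   */
  ChecksumAuthority getChecksumAuthority(long blockId) throws java.io.IOException;
}

/** Datanode-side arithmetic from step 2, assuming one 4-byte CRC-32 per
 *  bytesPerChecksum bytes of data and block boundaries aligned to it. */
class ChecksumOffsets {
  static final int CHECKSUM_SIZE = 4;

  /** Where the current block's checksums start inside the .crc file. */
  static long checksumOffset(long dataOffset, int bytesPerChecksum, int headerLen) {
    return headerLen + (dataOffset / bytesPerChecksum) * CHECKSUM_SIZE;
  }

  /** How many checksum bytes cover a block of the given length. */
  static long checksumLength(long blockLen, int bytesPerChecksum) {
    return ((blockLen + bytesPerChecksum - 1) / bytesPerChecksum) * CHECKSUM_SIZE;
  }
}
{code}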
The upgrade is considered complete when dfs.replication.min replicas of all
known blocks have been transitioned to block-level CRCs.
In some cases this condition will not be met, for instance because some data
blocks are missing.
During the upgrade process, 'dfs -report' should indicate how many blocks have
not been upgraded and for what reason. It should also indicate whether
a) the upgrade is incomplete
b) the upgrade is complete
c) the upgrade is wedged because some blocks are missing
If b) or c) occurs, the sysadmin can issue a 'finishUpgrade' command to the
Namenode, which causes the .crc files to be removed and their blocks marked
for deletion. Note that this is different from 'finalizeUpgrade', which causes
state from the previous version to be discarded. Datanodes that join the
system after the upgrade has finished are handled as in 3) above.
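For the reporting above, the Namenode only needs to track a small amount of
per-upgrade state. Something along these lines would be enough to drive the
a)/b)/c) distinction (purely illustrative, the names are made up):

{code:java}
// Purely illustrative: the state behind the a)/b)/c) report above.
enum CrcUpgradeState {
  INCOMPLETE,  // some blocks are still waiting to be upgraded
  COMPLETE,    // every known block has >= dfs.replication.min upgraded replicas
  WEDGED       // the remaining blocks cannot be upgraded because data is missing
}

class CrcUpgradeStatus {
  long totalBlocks;
  long upgradedBlocks;       // blocks with >= dfs.replication.min upgraded replicas
  long blocksStillExpected;  // not upgraded yet, but replicas may still be reported
  long missingBlocks;        // no usable replica of the data or its checksums

  CrcUpgradeState state() {
    if (upgradedBlocks == totalBlocks) {
      return CrcUpgradeState.COMPLETE;
    }
    return (missingBlocks > 0 && blocksStillExpected == 0)
        ? CrcUpgradeState.WEDGED
        : CrcUpgradeState.INCOMPLETE;
  }
}
{code}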
This is a more complex proposal, but the additional complexity has been
introduced in order to provide much stronger correctness guarantees, so I feel
that it is warranted. Comments welcome.
> Block level CRCs in HDFS
> ------------------------
>
> Key: HADOOP-1134
> URL: https://issues.apache.org/jira/browse/HADOOP-1134
> Project: Hadoop
> Issue Type: New Feature
> Components: dfs
> Reporter: Raghu Angadi
> Assigned To: Raghu Angadi
>
> Currently, CRCs are handled at the FileSystem level and are transparent to
> core HDFS. See the recent improvement HADOOP-928 (which can add checksums to
> a given filesystem) for more about this. Though this has served us well,
> there are a few disadvantages:
> 1) It doubles the namespace in HDFS (or other filesystem implementations). In
> many cases, it nearly doubles the number of blocks. Taking the namenode out
> of CRCs would nearly double namespace performance, both in terms of CPU and
> memory.
> 2) Since CRCs are transparent to HDFS, it cannot actively detect corrupted
> blocks. With block-level CRCs, the Datanode can periodically verify the
> checksums and report corruptions to the namenode so that new replicas can be
> created.
> We propose to have CRCs maintained for all HDFS data in much the same way as
> in GFS. I will update the jira with detailed requirements and design. This
> will include the same guarantees provided by the current implementation and
> an upgrade of existing data.
>