[ https://issues.apache.org/jira/browse/HDFS-2699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13171997#comment-13171997 ]

M. C. Srivas commented on HDFS-2699:
------------------------------------

@dhruba:

>> a block size of 4096 is too large for the CRC

>the hbase block size is 16K. The hdfs checksum size is 4K. The hdfs block size
>is 256 MB. which one r u referring to here? Can you pl explain the
>read-modify-write cycle? HDFS does mostly large sequential writes (no
>overwrites).

The CRC block size (that is, the contiguous region of the file that a single CRC 
covers). Modifying any portion of that region requires reading in the entire data 
for the region, recomputing the CRC over the entire region, and writing the entire 
region out again.
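
To make that read-modify-write concrete, here is a minimal Java sketch (illustrative 
names only, not HDFS code) of what appending into a partially filled inline-CRC chunk 
would involve: the existing bytes of the chunk are read back, combined with the new 
bytes, and the CRC is recomputed over the whole region before it is rewritten.

import java.util.zip.CRC32;

// Illustrative sketch only -- not HDFS code. Shows the read-modify-write that an
// append into a partially filled CRC chunk forces when checksums are stored inline.
public class InlineCrcAppendSketch {

    // The file currently ends mid-chunk: the last chunk holds 1K of a 4K region.
    static byte[] lastChunk = new byte[1024];
    static long lastChunkCrc = crcOf(lastChunk);

    static long crcOf(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return crc.getValue();
    }

    // Appending newData cannot just write the new bytes: the old 1K must be read
    // back, the CRC recomputed over old+new bytes, and the whole region (plus the
    // new CRC) rewritten.
    static void append(byte[] newData) {
        byte[] combined = new byte[lastChunk.length + newData.length];
        System.arraycopy(lastChunk, 0, combined, 0, lastChunk.length);
        System.arraycopy(newData, 0, combined, lastChunk.length, newData.length);
        lastChunkCrc = crcOf(combined);
        lastChunk = combined;
    }
}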


Note that it also introduces a new failure mode: data that was written safely a 
long time ago can now be deemed "corrupt" because the CRC is no longer valid after 
a minor modification made during a later append. The failure scenario is as follows:

1. A thread writes to a file and closes it. Let's say the file length is 9K.  
There are 3 CRCs embedded inline: one for 0-4K, one for 4K-8K, and one for 
8K-9K. Call the last one CRC3.

2. An append happens a few days later, extending the file from 9K to 11K. CRC3 
is now recomputed over the 3K region spanning offsets 8K-11K and written 
out as CRC3-new. But there is a crash, and the entire 3K is not written out 
cleanly (CRC3-new and some of the data are written out before the crash; all 3 
copies crash and recover).

3. A subsequent read of the region 8K-9K now fails with a CRC error, even 
though that write was stable and used to succeed before.

If this file were the HBase WAL, wouldn't this result in data loss?
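
A small, self-contained simulation of the scenario above (purely illustrative, not 
HDFS code): the old 1K at 8K-9K survives the crash byte-for-byte, but because the 
stored checksum for its region was already replaced by CRC3-new, it can no longer 
be verified.

import java.util.Arrays;
import java.util.zip.CRC32;

// Illustrative simulation of the torn-append scenario -- not HDFS code.
public class TornAppendSketch {

    static long crcOf(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] oldTail = new byte[1024];                  // offsets 8K-9K, written days ago
        Arrays.fill(oldTail, (byte) 'a');
        long crc3 = crcOf(oldTail);                       // CRC3: stable and valid

        byte[] appended = new byte[2048];                 // append extends the file to 11K
        Arrays.fill(appended, (byte) 'b');
        byte[] region = new byte[3072];                   // region 8K-11K
        System.arraycopy(oldTail, 0, region, 0, 1024);
        System.arraycopy(appended, 0, region, 1024, 2048);
        long crc3New = crcOf(region);                     // CRC3-new replaces CRC3 on disk

        // Crash: CRC3-new reaches disk, but only part of the appended data does.
        byte[] onDisk = Arrays.copyOf(region, 2048);      // 8K-10K survived, 10K-11K lost

        // The old 8K-9K bytes are untouched ...
        System.out.println(crcOf(Arrays.copyOf(onDisk, 1024)) == crc3);   // true
        // ... but the region no longer verifies against the stored CRC3-new,
        // so a read of 8K-9K is reported as a checksum error.
        System.out.println(crcOf(onDisk) == crc3New);                     // false
    }
}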


                
> Store data and checksums together in block file
> -----------------------------------------------
>
>                 Key: HDFS-2699
>                 URL: https://issues.apache.org/jira/browse/HDFS-2699
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: dhruba borthakur
>            Assignee: dhruba borthakur
>
> The current implementation of HDFS stores the data in one block file and the 
> metadata (checksum) in another block file. This means that every read from 
> HDFS actually consumes two disk iops, one to the data file and one to the 
> checksum file. This is a major problem for scaling HBase, because HBase is 
> usually bottlenecked on the number of random disk iops that the 
> storage hardware offers.
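
For reference, a minimal sketch of the two-iop read pattern described above (file 
names, offsets, and the absence of a .meta header are illustrative, not the real 
on-disk format): a positioned read has to seek in both the block file and its 
separate checksum file.

import java.io.IOException;
import java.io.RandomAccessFile;

// Illustrative sketch of the current split layout -- not DataNode code.
public class SplitChecksumReadSketch {
    static final int BYTES_PER_CHECKSUM = 4096;   // data covered by one CRC
    static final int CHECKSUM_SIZE = 4;           // CRC32 stored as 4 bytes

    static byte[] readChunk(String dir, long chunk) throws IOException {
        byte[] data = new byte[BYTES_PER_CHECKSUM];
        byte[] storedCrc = new byte[CHECKSUM_SIZE];
        try (RandomAccessFile blk = new RandomAccessFile(dir + "/blk_1001", "r");
             RandomAccessFile meta = new RandomAccessFile(dir + "/blk_1001.meta", "r")) {
            blk.seek(chunk * BYTES_PER_CHECKSUM);  // disk iop #1: the data file
            blk.readFully(data);
            meta.seek(chunk * CHECKSUM_SIZE);      // disk iop #2: the checksum file
            meta.readFully(storedCrc);
        }
        // ... verify data against storedCrc, then return ...
        return data;
    }
}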
