[ https://issues.apache.org/jira/browse/HDFS-2699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13181643#comment-13181643 ]
Scott Carey commented on HDFS-2699:
-----------------------------------

@Srivas:
bq. If you want to eventually support random-IO, then a block size of 4096 is too large for the CRC, as it will cause a read-modify-write cycle on the entire 4K. 512-bytes reduces this overhead.

With CRC hardware acceleration now available, this is not a big overhead. Without hardware acceleration, CRC32 runs at roughly 800MB/sec over 4096-byte chunks (about 200,000 chunks per second), so checksumming a 200MB/sec write stream costs roughly 25% of one core. With hardware acceleration that cost drops by a factor of 4 to 8. This is beside the point anyway: a paranoid user could configure smaller CRC chunks and test that. I'm suggesting that 4096 is a much saner default.

@Todd:
bq. Secondly, the disk manufacturers guarantee only a 512-byte atomicity on disk. Linux doing a 4K block write guarantees almost nothing wrt atomicity of that 4K write to disk. On a crash, unless you are running some sort of RAID or data-journal, there is a likelihood of the 4K block that's in-flight getting corrupted.

Actually, disk manufacturers have all moved to 4096-byte atomicity ("Advanced Format", starting with the 500GB-per-platter generation for most manufacturers; see the links at the bottom of this comment). HDFS should not aim to protect power_of_two_bytes of data with a checksum, but rather (power_of_two_bytes - checksum_size) of data, so that the hardware atomic unit (and the OS page cache) lines up exactly with one HDFS checksum chunk plus its inlined CRC (see the sketch at the end of this message).

@Srivas:
bq. 2. An append happens a few days later to extend the file from 9K to 11K. CRC3 is now recomputed for the 3K-sized region spanning offsets 8K-11K and written out as CRC3-new. But there is a crash, and the entire 3K is not all written out cleanly

This can be avoided entirely:

A. The OS and hardware can avoid partial page writes. ext4 and other filesystems can avoid them, the OS only flushes a page at a time, and current hardware writes blocks in atomic 4096-byte chunks.

B. The inlined CRC can be laid out so that a single 4096-byte OS page contains both the data and its CRC in one atomic chunk, so the CRC and its corresponding data are never split across pages.

Under those conditions the performance would be excellent, and the data safety would be higher than with the current scheme or any application-level CRC (unless the application also inlines its CRC to keep the data and its CRC from being split across pages).

About the transition to 4096-byte sectors on hard drives ("Advanced Format" disks):
http://www.zdnet.com/blog/storage/are-you-ready-for-4k-sector-drives/731
http://en.wikipedia.org/wiki/Advanced_Format
http://www.seagate.com/docs/pdf/whitepaper/tp613_transition_to_4k_sectors.pdf
http://lwn.net/Articles/322777/
http://www.anandtech.com/show/2888

> Store data and checksums together in block file
> -----------------------------------------------
>
>                 Key: HDFS-2699
>                 URL: https://issues.apache.org/jira/browse/HDFS-2699
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: dhruba borthakur
>            Assignee: dhruba borthakur
>
> The current implementation of HDFS stores the data in one block file and the metadata (checksum) in another block file. This means that every read from HDFS actually consumes two disk iops, one to the data file and one to the checksum file. This is a major problem for scaling HBase, because HBase is usually bottlenecked on the number of random disk iops that the storage hardware offers.
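For concreteness, here is a minimal sketch (plain Java, not HDFS code) of the layout argued for above: each on-disk chunk is exactly one 4096-byte page holding (4096 - 4) bytes of file data followed by the CRC32 of those bytes, so a checksum and the data it protects always land in the same atomically written page/sector. All names here (InlineCrcChunk, packChunk, verifyChunk) are invented for illustration and are not existing HDFS APIs.

{code:java}
import java.util.zip.CRC32;

/**
 * Sketch of an inlined-checksum chunk: one 4096-byte page containing
 * (4096 - 4) bytes of payload followed by the CRC32 of that payload,
 * so data and checksum are never split across pages or AF sectors.
 */
public class InlineCrcChunk {
    static final int PAGE_SIZE = 4096;                       // AF sector == OS page
    static final int CRC_SIZE = 4;                           // CRC32
    static final int DATA_PER_CHUNK = PAGE_SIZE - CRC_SIZE;  // 4092 bytes of payload

    /** Pack one page: payload first, CRC32 of the payload in the last 4 bytes. */
    static byte[] packChunk(byte[] data, int off, int len) {
        if (len > DATA_PER_CHUNK) {
            throw new IllegalArgumentException("payload larger than " + DATA_PER_CHUNK);
        }
        byte[] page = new byte[PAGE_SIZE];                    // short final chunks are zero-padded
        System.arraycopy(data, off, page, 0, len);
        CRC32 crc = new CRC32();
        crc.update(data, off, len);
        int c = (int) crc.getValue();
        // Store the CRC big-endian at a fixed offset within the page.
        page[DATA_PER_CHUNK]     = (byte) (c >>> 24);
        page[DATA_PER_CHUNK + 1] = (byte) (c >>> 16);
        page[DATA_PER_CHUNK + 2] = (byte) (c >>> 8);
        page[DATA_PER_CHUNK + 3] = (byte) c;
        return page;
    }

    /** Verify a page read back from disk; true if the payload matches its inlined CRC. */
    static boolean verifyChunk(byte[] page, int payloadLen) {
        CRC32 crc = new CRC32();
        crc.update(page, 0, payloadLen);
        int stored = ((page[DATA_PER_CHUNK] & 0xff) << 24)
                   | ((page[DATA_PER_CHUNK + 1] & 0xff) << 16)
                   | ((page[DATA_PER_CHUNK + 2] & 0xff) << 8)
                   |  (page[DATA_PER_CHUNK + 3] & 0xff);
        return stored == (int) crc.getValue();
    }
}
{code}

With a layout like this, an append that extends the last chunk rewrites only that single page, and a torn write during a crash can affect at most one page, whose CRC then fails verification on the next read.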