[ https://issues.apache.org/jira/browse/HDFS-2699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13181643#comment-13181643 ]
Scott Carey commented on HDFS-2699:
-----------------------------------

@Srivas:
bq. If you want to eventually support random-IO, then a block size of 4096 is too large for the CRC, as it will cause a read-modify-write cycle on the entire 4K. 512-bytes reduces this overhead.

With CRC hardware acceleration now available, this is not a big overhead. Without hardware acceleration, CRC32 runs at roughly 800MB/sec over 4096-byte chunks (about 200,000 chunks per second), so checksumming a 200MB/sec write stream costs roughly 25% of one core. With hardware acceleration that cost drops by a factor of 4 to 8. This is beside the point anyway: a paranoid user could configure smaller CRC chunks and test that. I'm suggesting that 4096 is a much saner default.

@Todd:
bq. Secondly, the disk manufacturers guarantee only a 512-byte atomicity on disk. Linux doing a 4K block write guarantees almost nothing wrt atomicity of that 4K write to disk. On a crash, unless you are running some sort of RAID or data-journal, there is a likelihood of the 4K block that's in-flight getting corrupted.

Actually, disk manufacturers have all moved to 4096-byte atomicity ("Advanced Format", starting with the 500GB-per-platter generation for most manufacturers; see the links at the bottom of this comment). HDFS should not aim to protect power_of_two_bytes of data with a checksum, but rather (power_of_two_bytes - checksum_size) of data, so that the hardware atomic unit (and the OS page cache) lines up exactly with one HDFS checksum chunk plus its inlined CRC (see the sketch at the end of this message).

@Srivas:
bq. 2. An append happens a few days later to extend the file from 9K to 11K. CRC3 is now recomputed for the 3K-sized region spanning offsets 8K-11K and written out as CRC3-new. But there is a crash, and the entire 3K is not all written out cleanly

This can be avoided entirely:

A. The OS and hardware can avoid partial page writes. ext4 and other filesystems can avoid them, the OS only flushes a page at a time, and current hardware writes blocks in atomic 4096-byte chunks.

B. The inlined CRC can be laid out so that a single 4096-byte OS page contains both the data and its CRC in one atomic chunk, so the CRC and its corresponding data are never split across pages.

Under those conditions the performance would be excellent, and the data safety would be higher than with the current scheme or any application-level CRC (unless the application also inlines its CRC to keep the data and its CRC from being split across pages).

About the transition to 4096-byte sectors on hard drives ("Advanced Format" disks):
http://www.zdnet.com/blog/storage/are-you-ready-for-4k-sector-drives/731
http://en.wikipedia.org/wiki/Advanced_Format
http://www.seagate.com/docs/pdf/whitepaper/tp613_transition_to_4k_sectors.pdf
http://lwn.net/Articles/322777/
http://www.anandtech.com/show/2888

> Store data and checksums together in block file
> -----------------------------------------------
>
>                 Key: HDFS-2699
>                 URL: https://issues.apache.org/jira/browse/HDFS-2699
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: dhruba borthakur
>            Assignee: dhruba borthakur
>
> The current implementation of HDFS stores the data in one block file and the metadata (checksum) in another block file. This means that every read from HDFS actually consumes two disk iops, one to the data file and one to the checksum file. This is a major problem for scaling HBase, because HBase is usually bottlenecked on the number of random disk iops that the storage hardware offers.
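For concreteness, here is a minimal sketch (plain Java, not HDFS code) of the layout argued for above: each on-disk chunk is exactly one 4096-byte page holding (4096 - 4) bytes of file data followed by the CRC32 of those bytes, so a checksum and the data it protects always land in the same atomically written page/sector. All names here (InlineCrcChunk, packChunk, verifyChunk) are invented for illustration and are not existing HDFS APIs.

{code:java}
import java.util.zip.CRC32;

/**
 * Sketch of an inlined-checksum chunk: one 4096-byte page containing
 * (4096 - 4) bytes of payload followed by the CRC32 of that payload,
 * so data and checksum are never split across pages or AF sectors.
 */
public class InlineCrcChunk {
    static final int PAGE_SIZE = 4096;                       // AF sector == OS page
    static final int CRC_SIZE = 4;                           // CRC32
    static final int DATA_PER_CHUNK = PAGE_SIZE - CRC_SIZE;  // 4092 bytes of payload

    /** Pack one page: payload first, CRC32 of the payload in the last 4 bytes. */
    static byte[] packChunk(byte[] data, int off, int len) {
        if (len > DATA_PER_CHUNK) {
            throw new IllegalArgumentException("payload larger than " + DATA_PER_CHUNK);
        }
        byte[] page = new byte[PAGE_SIZE];                    // short final chunks are zero-padded
        System.arraycopy(data, off, page, 0, len);
        CRC32 crc = new CRC32();
        crc.update(data, off, len);
        int c = (int) crc.getValue();
        // Store the CRC big-endian at a fixed offset within the page.
        page[DATA_PER_CHUNK]     = (byte) (c >>> 24);
        page[DATA_PER_CHUNK + 1] = (byte) (c >>> 16);
        page[DATA_PER_CHUNK + 2] = (byte) (c >>> 8);
        page[DATA_PER_CHUNK + 3] = (byte) c;
        return page;
    }

    /** Verify a page read back from disk; true if the payload matches its inlined CRC. */
    static boolean verifyChunk(byte[] page, int payloadLen) {
        CRC32 crc = new CRC32();
        crc.update(page, 0, payloadLen);
        int stored = ((page[DATA_PER_CHUNK] & 0xff) << 24)
                   | ((page[DATA_PER_CHUNK + 1] & 0xff) << 16)
                   | ((page[DATA_PER_CHUNK + 2] & 0xff) << 8)
                   |  (page[DATA_PER_CHUNK + 3] & 0xff);
        return stored == (int) crc.getValue();
    }
}
{code}

With a layout like this, an append that extends the last chunk rewrites only that single page, and a torn write during a crash can affect at most one page, whose CRC then fails verification on the next read.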