[ https://issues.apache.org/jira/browse/HADOOP-6297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12998702#comment-12998702 ]
Kevin J. Price commented on HADOOP-6297:
----------------------------------------

SequenceFile just compresses blocks of input into variable-size output blocks, which is different from having fixed-size output blocks. The theory is that if the compressed block size is fixed, and an even divisor of the HDFS block size, then a naive 'split at the HDFS block boundaries' will work without having to do any seeking around at the start of each mapper. Theoretically you get less start-of-mapper overhead and less reading from blocks that might not be rack-local.

I'm honestly not certain anymore that it's the best approach. I have my scheme set up using a little JNI code I threw together that provides full zlib support, and the overall performance gains over sequence files are fairly negligible. It's still functionality that's missing from the Hadoop code that would be easy to add, though.

(Oracle is finally fixing this issue in the Java zlib implementation as part of Java 7.)

> Hadoop's support for zlib library lacks support to perform flushes (Z_SYNC_FLUSH and Z_FULL_FLUSH)
> --------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-6297
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6297
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: io
>            Reporter: Kevin J. Price
>            Assignee: Kevin J. Price
>            Priority: Minor
>         Attachments: zlibpatch-0.3.patch, zlibpatch.patch
>
>
> The zlib library supports the ability to perform two types of flushes when
> deflating data. It can perform both a Z_SYNC_FLUSH, which forces all input to
> be written as output and byte-aligned and resets the Huffman coding, and it
> also supports a Z_FULL_FLUSH, which does the same thing but additionally
> resets the compression dictionary. The Hadoop wrapper for the zlib library
> does not support either of these two methods.
> Adding support should be fairly trivial: an additional deflate method that
> takes a fourth "flush" parameter, and a modification to the native C code to
> accept this fourth parameter and pass it along to the zlib library. I can
> submit a patch for this if desired.
> It should be noted that the native Sun Java API is likewise missing this
> functionality, as has been noted for over a decade here:
> http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4206909
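
For illustration only, here is a minimal sketch of what a flush-aware deflate call makes possible. It does not use the Hadoop wrapper or the attached patches. It assumes a Java 7 or later runtime, where java.util.zip.Deflater has a four-argument deflate(byte[], int, int, int) and the Deflater.FULL_FLUSH constant. The class name FullFlushDemo and the helpers deflateChunk and inflateSegment are hypothetical names chosen for this example. Each chunk is ended with a full flush, so the second segment can be inflated by a fresh Inflater that has never seen the first one. That is the property a "split at a block boundary" reader would rely on.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class FullFlushDemo {

    /** Deflates one chunk and ends it with the requested flush mode (hypothetical helper). */
    public static byte[] deflateChunk(Deflater deflater, byte[] chunk, int flush) {
        deflater.setInput(chunk);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        do {
            // Four-argument deflate: the last argument selects the flush mode.
            n = deflater.deflate(buf, 0, buf.length, flush);
            out.write(buf, 0, n);
        } while (n == buf.length); // a completely filled buffer means the flush may be unfinished
        return out.toByteArray();
    }

    /** Inflates a raw-deflate segment with a fresh Inflater, i.e. with no earlier history. */
    public static byte[] inflateSegment(byte[] segment) {
        Inflater inflater = new Inflater(true); // true = raw deflate, no zlib header or checksum
        inflater.setInput(segment);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            while (true) {
                int n = inflater.inflate(buf);
                if (n == 0) {
                    break; // no more output: the input of this segment is exhausted
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("segment is not independently decodable", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] first = "block one: the quick brown fox jumps over the lazy dog\n"
                .getBytes(StandardCharsets.UTF_8);
        byte[] second = "block two: the quick brown fox jumps over the lazy dog\n"
                .getBytes(StandardCharsets.UTF_8);

        // One raw-deflate stream; every chunk is terminated by a full flush.
        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION, true);
        byte[] seg1 = deflateChunk(deflater, first, Deflater.FULL_FLUSH);
        byte[] seg2 = deflateChunk(deflater, second, Deflater.FULL_FLUSH);
        deflater.end();

        System.out.println("segment sizes: " + seg1.length + ", " + seg2.length);
        // Decode the second segment alone, as a reader starting mid-stream would.
        System.out.print(new String(inflateSegment(seg2), StandardCharsets.UTF_8));
    }
}
```

The sketch uses raw deflate (nowrap) so that a segment carries no stream header and can be handed directly to a new Inflater. Note that the segments here are still variable-size. Producing fixed-size compressed blocks, as discussed in the comment above, would need extra logic (for example padding or tuning how much input goes into each chunk) that this example does not attempt.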