[ https://issues.apache.org/jira/browse/MAPREDUCE-469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Greg Roelofs updated MAPREDUCE-469:
-----------------------------------

    Attachment: grr-hadoop-common.dif.20100614c
                grr-hadoop-mapreduce.dif.20100614c

Almost-final gzip concatenation code (several style-related issues to deal with, but working code, both native and non-native, with no debug statements) and a halfway test case (need to get bzip2 half working).

Summary: I implemented an Inflater-based Decompressor with "manual" gzip header/trailer parsing and CRC checks, and added new getRemaining() and resetPartially() methods to the interface. I also modified DecompressorStream to support concatenated streams (decompress() and getCompressedData() methods). For backward compatibility, the default behavior is unchanged; one needs to set the new io.compression.gzip.concat config option to true to turn it on. Since bzip2 apparently changed its behavior without such a setting, perhaps this is overkill...

Anyway, this is against trunk (as of a week or two ago). I still need to check it against Yahoo's tree, deal with the FIXMEs, update my source tree(s), run test-patch, etc. Also, I haven't included the (binary) test files here; I'll do so in one of the next versions of the patch.

> Support concatenated gzip and bzip2 files
> -----------------------------------------
>
>                 Key: MAPREDUCE-469
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-469
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>            Reporter: Tom White
>            Assignee: Greg Roelofs
>         Attachments: grr-hadoop-common.dif.20100614c, grr-hadoop-mapreduce.dif.20100614c
>
>
> When running MapReduce with concatenated gzip files as input only the first
> part is read, which is confusing, to say the least. Concatenated gzip is
> described in http://www.gnu.org/software/gzip/manual/gzip.html#Advanced-usage
> and in http://www.ietf.org/rfc/rfc1952.txt. (See original report at
> http://www.nabble.com/Problem-with-Hadoop-and-concatenated-gzip-files-to21383097.html)

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
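
For readers following the summary above, here is a minimal sketch of the general technique it describes: a raw java.util.zip.Inflater combined with "manual" gzip header/trailer parsing and CRC checks, using getRemaining() to find where the next concatenated member begins. This is not the code from the attached patches. The class name ConcatGzipSketch and the method inflateAll are hypothetical, and the sketch assumes the whole input is held in one in-memory byte array rather than streamed through Hadoop's Decompressor and DecompressorStream. The resetPartially() method and the io.compression.gzip.concat option are not modelled here.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.CRC32;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

// Hypothetical illustration only; not taken from the MAPREDUCE-469 patches.
public class ConcatGzipSketch {

    // Header flag bits defined by RFC 1952.
    private static final int FHCRC = 2, FEXTRA = 4, FNAME = 8, FCOMMENT = 16;

    /** Decompresses every gzip member in data and returns the joined output. */
    public static byte[] inflateAll(byte[] data) throws IOException, DataFormatException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int pos = 0;
        while (pos < data.length) {
            pos = skipHeader(data, pos);
            Inflater inflater = new Inflater(true); // true = raw deflate, no wrapper
            CRC32 crc = new CRC32();
            inflater.setInput(data, pos, data.length - pos);
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IOException("truncated or unsupported deflate data");
                }
                crc.update(buf, 0, n);
                out.write(buf, 0, n);
            }
            // getRemaining() counts the input bytes the inflater did not consume:
            // this member's 8-byte trailer plus any members that follow it.
            pos = data.length - inflater.getRemaining();
            long size = inflater.getBytesWritten();
            inflater.end();
            if (data.length - pos < 8) {
                throw new IOException("truncated gzip trailer");
            }
            if (readLE32(data, pos) != crc.getValue()) {
                throw new IOException("CRC mismatch");
            }
            if (readLE32(data, pos + 4) != (size & 0xffffffffL)) {
                throw new IOException("size mismatch");
            }
            pos += 8; // step over the trailer; the loop continues with the next member
        }
        return out.toByteArray();
    }

    /** Validates one member header at pos and returns the offset of its deflate data. */
    private static int skipHeader(byte[] d, int pos) throws IOException {
        if (d.length - pos < 10 || (d[pos] & 0xff) != 0x1f
                || (d[pos + 1] & 0xff) != 0x8b || d[pos + 2] != 8) {
            throw new IOException("not a gzip member at offset " + pos);
        }
        int flags = d[pos + 3] & 0xff;
        int p = pos + 10; // fixed part: magic, method, flags, mtime, xfl, os
        try {
            if ((flags & FEXTRA) != 0) {
                p += 2 + ((d[p] & 0xff) | ((d[p + 1] & 0xff) << 8));
            }
            if ((flags & FNAME) != 0) {
                while (d[p++] != 0) { } // zero-terminated file name
            }
            if ((flags & FCOMMENT) != 0) {
                while (d[p++] != 0) { } // zero-terminated comment
            }
            if ((flags & FHCRC) != 0) {
                p += 2; // header CRC16 is skipped, not verified, in this sketch
            }
        } catch (ArrayIndexOutOfBoundsException e) {
            throw new IOException("truncated gzip header");
        }
        if (p > d.length) {
            throw new IOException("truncated gzip header");
        }
        return p;
    }

    /** Reads an unsigned 32-bit little-endian value. */
    private static long readLE32(byte[] d, int p) {
        return (d[p] & 0xffL) | ((d[p + 1] & 0xffL) << 8)
                | ((d[p + 2] & 0xffL) << 16) | ((d[p + 3] & 0xffL) << 24);
    }
}
```

The outer loop is what distinguishes this from reading only the first member: after each trailer check it starts a fresh Inflater at the next header instead of stopping. A streaming implementation would have to carry the unconsumed bytes across buffer refills, which is presumably the role of the new interface methods mentioned in the summary.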