[ 
https://issues.apache.org/jira/browse/MAPREDUCE-469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Roelofs updated MAPREDUCE-469:
-----------------------------------

    Attachment: grr-hadoop-common.dif.20100614c
                grr-hadoop-mapreduce.dif.20100614c

Almost-final gzip concatenation code and a half-finished test case.  The code 
works in both the native and non-native paths and has no debug statements, 
though several style-related issues remain.  The test case still needs its 
bzip2 half working.

Summary:  I implemented an Inflater-based Decompressor with "manual" gzip 
header/trailer parsing and CRC checks, and added new getRemaining() and 
resetPartially() methods to the Decompressor interface.  I also modified 
DecompressorStream (its decompress() and getCompressedData() methods) to 
support concatenated streams.  For backward compatibility, the default 
behavior is unchanged; to turn concatenation support on, set the new 
io.compression.gzip.concat config option to true.  Since bzip2 apparently 
changed its behavior without such a setting, perhaps this is overkill...
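
For readers who want to see the general idea, here is a minimal standalone 
sketch.  It is not the code in the attached patch and does not use any Hadoop 
classes; it relies only on java.util.zip.  The class name ConcatGzipSketch and 
its helper methods are made up for this example.  It parses each gzip header 
by hand (per RFC 1952), inflates the raw deflate data with Inflater, uses 
Inflater.getRemaining() to locate the trailer, checks the CRC and length, and 
then loops to the next member.  Bounds checking on truncated input is 
deliberately minimal.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import java.util.zip.DataFormatException;
import java.util.zip.GZIPOutputStream;
import java.util.zip.Inflater;

public class ConcatGzipSketch {

    /** Compresses data into one gzip member (only used to build sample input). */
    public static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(data);
        }
        return bos.toByteArray();
    }

    /** Byte-level concatenation, the equivalent of `cat a.gz b.gz`. */
    public static byte[] concat(byte[] a, byte[] b) {
        byte[] both = new byte[a.length + b.length];
        System.arraycopy(a, 0, both, 0, a.length);
        System.arraycopy(b, 0, both, a.length, b.length);
        return both;
    }

    /** Reads a little-endian unsigned 32-bit value, as used in the gzip trailer. */
    static long readUInt32LE(byte[] b, int off) {
        return (b[off] & 0xFFL) | (b[off + 1] & 0xFFL) << 8
             | (b[off + 2] & 0xFFL) << 16 | (b[off + 3] & 0xFFL) << 24;
    }

    /** Validates the fixed gzip header at pos and returns the offset of the deflate data. */
    static int skipHeader(byte[] in, int pos) throws IOException {
        if (in.length - pos < 10 || (in[pos] & 0xFF) != 0x1f
                || (in[pos + 1] & 0xFF) != 0x8b || in[pos + 2] != 8) {
            throw new IOException("not a gzip member at offset " + pos);
        }
        int flags = in[pos + 3] & 0xFF;
        pos += 10;                                  // magic, CM, FLG, MTIME, XFL, OS
        if ((flags & 0x04) != 0) {                  // FEXTRA: 2-byte length + payload
            int xlen = (in[pos] & 0xFF) | (in[pos + 1] & 0xFF) << 8;
            pos += 2 + xlen;
        }
        if ((flags & 0x08) != 0) { while (in[pos++] != 0) { } }  // FNAME, NUL-terminated
        if ((flags & 0x10) != 0) { while (in[pos++] != 0) { } }  // FCOMMENT, NUL-terminated
        if ((flags & 0x02) != 0) { pos += 2; }      // FHCRC (not verified in this sketch)
        return pos;
    }

    /** Decompresses every gzip member in the buffer, not just the first one. */
    public static byte[] decompressAll(byte[] in) throws IOException, DataFormatException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int pos = 0;
        while (pos < in.length) {
            pos = skipHeader(in, pos);
            Inflater inflater = new Inflater(true); // true = raw deflate, no zlib wrapper
            CRC32 crc = new CRC32();
            long size = 0;
            try {
                inflater.setInput(in, pos, in.length - pos);
                while (!inflater.finished()) {
                    int n = inflater.inflate(buf);
                    if (n == 0 && inflater.needsInput()) {
                        throw new IOException("truncated deflate stream");
                    }
                    crc.update(buf, 0, n);
                    out.write(buf, 0, n);
                    size += n;
                }
                // Bytes the inflater did not consume: trailer plus any later members.
                pos = in.length - inflater.getRemaining();
            } finally {
                inflater.end();
            }
            if (in.length - pos < 8) {
                throw new IOException("truncated gzip trailer");
            }
            if (readUInt32LE(in, pos) != crc.getValue()) {
                throw new IOException("CRC mismatch");
            }
            if (readUInt32LE(in, pos + 4) != (size & 0xFFFFFFFFL)) {
                throw new IOException("length mismatch");
            }
            pos += 8;                               // next member, if any, starts here
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        byte[] a = gzip("first member\n".getBytes(StandardCharsets.UTF_8));
        byte[] b = gzip("second member\n".getBytes(StandardCharsets.UTF_8));
        System.out.print(new String(decompressAll(concat(a, b)), StandardCharsets.UTF_8));
    }
}
```

The point of the sketch is the role of getRemaining(): once the inflater 
reports finished(), the unconsumed byte count tells the caller where the 
trailer (and any following member) begins, which is what makes the loop over 
members possible.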

Anyway, this is against trunk (as of a week or two ago).  I still need to check 
it against Yahoo's tree, deal with the FIXMEs, update my source tree(s), run 
test-patch, etc.  Also, I haven't included the (binary) test files here; I'll 
do so in one of the next versions of the patch.

> Support concatenated gzip and bzip2 files
> -----------------------------------------
>
>                 Key: MAPREDUCE-469
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-469
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>            Reporter: Tom White
>            Assignee: Greg Roelofs
>         Attachments: grr-hadoop-common.dif.20100614c, 
> grr-hadoop-mapreduce.dif.20100614c
>
>
> When running MapReduce with concatenated gzip files as input, only the first 
> part is read, which is confusing, to say the least. Concatenated gzip is 
> described in http://www.gnu.org/software/gzip/manual/gzip.html#Advanced-usage 
> and in http://www.ietf.org/rfc/rfc1952.txt. (See original report at 
> http://www.nabble.com/Problem-with-Hadoop-and-concatenated-gzip-files-to21383097.html)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.