[jira] Commented: (HADOOP-1823) want InputFormat for bzip2 files

Benjamin Reed (JIRA) Thu, 06 Dec 2007 10:47:07 -0800

    [ https://issues.apache.org/jira/browse/HADOOP-1823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12549129 ]


Benjamin Reed commented on HADOOP-1823:
---------------------------------------

We can't really wrap bzip, because the decompressor is constructed with a 
byte-oriented input stream, whereas bzip itself processes a bit-oriented 
stream. (Much to my dismay!) Nothing is byte aligned: the signature, block 
headers, and the compressed data do not start on byte boundaries.

There is a non-trivial price to pay for reading a bit at a time. To wrap, we 
would have to read the input as bits to find the signature, align the bits to 
bytes (shifting them into new bytes), and output a new byte stream; bzip would 
then convert those bytes back to bits and process the bit stream. I think bzip 
has enough overhead as it is.
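
To make the cost concrete, here is a rough sketch of the two extra passes a 
wrapper would need: a bit-by-bit scan for the 48-bit block signature, then a 
shift of everything after it into fresh, byte-aligned bytes. It assumes bits 
are packed most-significant-bit first and that the whole input is already in 
memory. The class and method names (BlockMagicScanner, findBlockMagic, 
realign) are made up for illustration; they are not part of the Ant bzip2 
code or of the attached jar.

```java
final class BlockMagicScanner {
    // 48-bit block signature: the digits of pi (3.14159265359) in BCD.
    static final long BLOCK_MAGIC = 0x314159265359L;
    static final long MASK_48 = 0xFFFFFFFFFFFFL;

    /**
     * Returns the bit offset of the first block signature at or after
     * startBit, or -1 if none is found. Bits are read MSB first (assumption).
     */
    public static long findBlockMagic(byte[] data, long startBit) {
        long totalBits = (long) data.length * 8;
        long window = 0;   // sliding window over the last 48 bits
        long seen = 0;     // number of bits shifted into the window so far
        for (long bit = startBit; bit < totalBits; bit++) {
            int b = data[(int) (bit >>> 3)] & 0xFF;
            int value = (b >>> (7 - (int) (bit & 7))) & 1;
            window = ((window << 1) | value) & MASK_48;
            if (++seen >= 48 && window == BLOCK_MAGIC) {
                return bit - 47;   // offset of the signature's first bit
            }
        }
        return -1;
    }

    /**
     * Copies everything from startBit onward into a new array so that
     * startBit lands on a byte boundary. This is the "shift into new bytes"
     * pass; the tail is zero-padded. Requires 0 <= startBit <= 8 * data.length.
     */
    public static byte[] realign(byte[] data, long startBit) {
        long remainingBits = (long) data.length * 8 - startBit;
        byte[] out = new byte[(int) ((remainingBits + 7) / 8)];
        int byteIndex = (int) (startBit >>> 3);
        int shift = (int) (startBit & 7);
        for (int i = 0; i < out.length; i++) {
            int hi = (data[byteIndex + i] & 0xFF) << shift;
            int lo = (shift != 0 && byteIndex + i + 1 < data.length)
                    ? (data[byteIndex + i + 1] & 0xFF) >>> (8 - shift)
                    : 0;
            out[i] = (byte) (hi | lo);
        }
        return out;
    }
}
```

Every input bit is touched once in the scan and again in the realignment, 
and the decompressor would then unpack the realigned bytes into bits a third 
time, which is the overhead described above.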

I agree that it would be great to submit a patch to Ant to add this 
functionality to bzip.

> want InputFormat for bzip2 files
> --------------------------------
>
>                 Key: HADOOP-1823
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1823
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Doug Cutting
>         Attachments: bzip2.jar
>
>
> Unlike gzip, the bzip file format supports splitting.  Compression is by 
> blocks (900k by default) and blocks are separated by a synchronization marker 
> (a 48-bit approximation of Pi).  This would permit very large compressed 
> files to be split into multiple map tasks, which is not currently possible 
> unless using a Hadoop-specific file format.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
