[ 
https://issues.apache.org/jira/browse/HADOOP-1823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12548103
 ] 

Utkarsh Srivastava commented on HADOOP-1823:
--------------------------------------------

The key reason we can't use the bzip2 library as-is is that we need to 
track position inside the BZip2 class. What comes out of a bzip2 stream is 
uncompressed bytes, so we can't measure position by counting them: the Hadoop 
split is defined in terms of the compressed position, which only the BZip2 
class can give us.
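
As a rough illustration of the position-tracking idea, here is a minimal 
sketch of a wrapper that counts bytes pulled from the underlying compressed 
file. The class name CompressedPositionStream and its getPosition() method are 
hypothetical and are not part of any bzip2 library. Note the limitation: a 
decoder sitting on top of such a wrapper may read ahead and buffer, and bzip2 
blocks are not byte-aligned, so an outside counter only approximates the 
position. That is why the tracking has to live inside the decoder class.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper: counts bytes consumed from the compressed stream. */
class CompressedPositionStream extends FilterInputStream {
    // Absolute byte offset in the compressed file of the next unread byte.
    private long position;

    CompressedPositionStream(InputStream in, long startOffset) {
        super(in);
        this.position = startOffset;
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b >= 0) {
            position++;          // one compressed byte consumed
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) {
            position += n;       // n compressed bytes consumed
        }
        return n;
    }

    @Override
    public long skip(long n) throws IOException {
        long skipped = super.skip(n);
        position += skipped;     // skipped bytes also advance the offset
        return skipped;
    }

    /** Compressed-file offset, comparable to a split's start and end. */
    long getPosition() {
        return position;
    }
}
```

A record reader could compare getPosition() against the end offset of its 
split to decide when to stop, which counting decompressed output bytes cannot 
tell it.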

Another reason is that we need specialized skipping logic: the first split 
must start reading at the beginning of the file without skipping, while every 
other split must first skip forward to the next block boundary.
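
The skipping rule can be sketched as follows. This is only an illustration 
under stated assumptions, not the code in the attached jar: the class 
Bzip2SplitStart and its methods are hypothetical, the data is held in a byte 
array for simplicity, and a block boundary is taken to be the 48-bit marker 
0x314159265359 mentioned in the issue description, searched for bit by bit 
because blocks need not start on a byte boundary.

```java
/** Hypothetical helper illustrating where a split should begin decoding. */
class Bzip2SplitStart {
    // 48-bit block marker (the "approximation of Pi" from the issue text).
    static final long BLOCK_MARKER = 0x314159265359L;
    static final long MASK_48 = (1L << 48) - 1;

    /**
     * Returns the bit offset of the first block marker that begins at or
     * after fromBit, or -1 if no marker is found.
     */
    static long nextBlockBit(byte[] data, long fromBit) {
        long window = 0;
        long totalBits = (long) data.length * 8;
        for (long bit = fromBit; bit < totalBits; bit++) {
            // Extract one bit, most significant bit of each byte first.
            int b = (data[(int) (bit >>> 3)] >>> (7 - (int) (bit & 7))) & 1;
            window = ((window << 1) | b) & MASK_48;
            // Only compare once 48 bits have been shifted in since fromBit.
            if (bit - fromBit >= 47 && window == BLOCK_MARKER) {
                return bit - 47;
            }
        }
        return -1;
    }

    /** Bit offset at which a split starting at splitStartByte begins. */
    static long startBit(byte[] data, long splitStartByte) {
        if (splitStartByte == 0) {
            return 0;            // first split: no skipping
        }
        // Any other split: skip to the next block boundary.
        return nextBlockBit(data, splitStartByte * 8);
    }
}
```

A real reader would also have to allow for the marker pattern occurring by 
chance inside compressed data; this sketch ignores that case.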

Besides, we encountered some bugs in the standard libraries that we could 
easily fix because we were maintaining our own version. For instance, in 
the version of CBZip2OutputStream available locally, if you open the output 
stream, write nothing to it, and then try to close it, it crashes. 

> want InputFormat for bzip2 files
> --------------------------------
>
>                 Key: HADOOP-1823
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1823
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Doug Cutting
>         Attachments: bzip2.jar
>
>
> Unlike gzip, the bzip file format supports splitting.  Compression is by 
> blocks (900k by default) and blocks are separated by a synchronization marker 
> (a 48-bit approximation of Pi).  This would permit very large compressed 
> files to be split into multiple map tasks, which is not currently possible 
> unless using a Hadoop-specific file format.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
