[ 
https://issues.apache.org/jira/browse/HADOOP-1823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12548375
 ] 

Doug Cutting commented on HADOOP-1823:
--------------------------------------

> The key reason why we can't use the bzip2 library as-is is that we need to 
> track position inside the BZip2 class

The bzip code does no buffering, so wouldn't it suffice to know the position of 
the input stream we pass to it?  We can pass in a stream that implements 
getPos(), such as an FSInputStream, and keep a reference to that stream, rather 
than changing the codec to support getPos(), no?
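
A minimal sketch of what I mean, using only java.io rather than Hadoop's own 
classes.  PositionTrackingStream and PositionDemo are hypothetical names made 
up for illustration, and the loop in PositionDemo merely stands in for an 
unbuffered decoder; it is not the bzip2 code.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical wrapper: counts the bytes handed to whoever reads from it,
// so the caller can ask for the position without the decoder's help.
class PositionTrackingStream extends FilterInputStream {
    private long pos;

    PositionTrackingStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b >= 0) {
            pos++;
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) {
            pos += n;
        }
        return n;
    }

    @Override
    public long skip(long n) throws IOException {
        long skipped = super.skip(n);
        pos += skipped;
        return skipped;
    }

    // mark/reset would let the count drift, so this sketch refuses them.
    @Override
    public boolean markSupported() {
        return false;
    }

    @Override
    public void mark(int readLimit) {
        // intentionally a no-op
    }

    @Override
    public void reset() throws IOException {
        throw new IOException("mark/reset not supported");
    }

    long getPos() {
        return pos;
    }
}

class PositionDemo {
    // Stand-in for an unbuffered decoder: pulls up to n bytes, one at a time,
    // then reports the position seen by the wrapper we kept a reference to.
    static long positionAfterReading(byte[] data, int n) {
        try {
            PositionTrackingStream raw =
                new PositionTrackingStream(new ByteArrayInputStream(data));
            InputStream decoderView = raw;
            for (int i = 0; i < n; i++) {
                if (decoderView.read() < 0) {
                    break;
                }
            }
            return raw.getPos();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

This only holds as long as the decoder really does not read ahead; any 
buffering between the wrapper and the decoder would make the reported 
position run ahead of what has actually been decoded.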

> Another reason is we need specialized skipping logic: if you are the first 
> split, you don't want to skip to the next block boundary, but if you are not 
> the first split, then you need to skip to the next boundary.

We could implement this outside the codec too, by scanning for the block 
boundary and then backing up six bytes (the length of the 48-bit marker), so 
that decoding starts at the marker itself.
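
Roughly like the sketch below.  BlockMarkerScanner and its method names are 
hypothetical, and it works on an in-memory byte array purely for illustration.  
One caveat: as far as I know bzip2 blocks are not byte-aligned, so the sketch 
matches the marker bit by bit (most significant bit first) and reports a bit 
offset, rather than literally stepping back six bytes.

```java
class BlockMarkerScanner {
    // The 48-bit block marker, the digits of Pi in BCD.
    static final long BLOCK_MARKER = 0x314159265359L;
    static final long MASK_48 = 0xFFFFFFFFFFFFL;

    // Returns the bit offset (from the start of data) at which the first
    // block marker found at or after byte index fromByte begins, or -1.
    static long findMarkerBitOffset(byte[] data, int fromByte) {
        long window = 0;   // the last 48 bits seen
        long bitsSeen = 0; // bits consumed since fromByte
        for (int i = fromByte; i < data.length; i++) {
            for (int bit = 7; bit >= 0; bit--) {
                window = ((window << 1) | ((data[i] >> bit) & 1)) & MASK_48;
                bitsSeen++;
                if (bitsSeen >= 48 && window == BLOCK_MARKER) {
                    // "Back up" over the marker we just consumed.
                    return (long) fromByte * 8 + bitsSeen - 48;
                }
            }
        }
        return -1;
    }

    // Assumed split rule from the quoted comment: the first split starts at
    // the beginning of the file, later splits start at the next marker.
    static long splitStartBitOffset(byte[] data, int splitStartByte) {
        if (splitStartByte == 0) {
            return 0;
        }
        return findMarkerBitOffset(data, splitStartByte);
    }
}
```

A real version would read from the stream instead of an array, and would 
also have to cope with the marker pattern turning up by chance inside 
compressed data.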

> Besides, there were some bugs that we encountered with the standard 
> libraries, that we could easily fix, since we were doing our own version.

Wouldn't it be better to submit a patch to Ant than to fork our own version?

> want InputFormat for bzip2 files
> --------------------------------
>
>                 Key: HADOOP-1823
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1823
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Doug Cutting
>         Attachments: bzip2.jar
>
>
> Unlike gzip, the bzip file format supports splitting.  Compression is by 
> blocks (900k by default) and blocks are separated by a synchronization marker 
> (a 48-bit approximation of Pi).  This would permit very large compressed 
> files to be split into multiple map tasks, which is not currently possible 
> unless using a Hadoop-specific file format.

