[
https://issues.apache.org/jira/browse/HADOOP-1823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12548103
]
Utkarsh Srivastava commented on HADOOP-1823:
--------------------------------------------
The key reason we can't use the bzip2 library as-is is that we need to track
position inside the BZip2 class. What comes out of a bzip2 stream is
uncompressed bytes, so we can't measure position by counting them: the Hadoop
split is defined in terms of the compressed position, which only the BZip2
class can give us.
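
To illustrate the position-tracking point, here is a minimal sketch, not the
code in the attached jar. CountingInputStream and getPosition are hypothetical
names. It counts compressed bytes on the input side of the decoder. This is a
simplification: a decoder that reads ahead or consumes input at bit
granularity needs the count kept inside the decoder itself, which is why the
tracking has to live in the BZip2 class.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical wrapper that counts bytes consumed from the compressed stream. */
class CountingInputStream extends FilterInputStream {
    private long position; // offset in the compressed file

    CountingInputStream(InputStream in, long startOffset) {
        super(in);
        this.position = startOffset;
    }

    @Override
    public int read() throws IOException {
        int b = in.read();
        if (b >= 0) {
            position++; // one compressed byte consumed
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = in.read(buf, off, len);
        if (n > 0) {
            position += n;
        }
        return n;
    }

    @Override
    public long skip(long n) throws IOException {
        long skipped = in.skip(n);
        position += skipped;
        return skipped;
    }

    /** Compressed-side position, comparable with a split's start and end offsets. */
    long getPosition() {
        return position;
    }
}
```

A record reader could compare getPosition() with the end offset of its split
to decide when to stop, assuming the decoder reads through this wrapper
without extra buffering.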
Another reason is that we need specialized skipping logic: if you are the
first split, you don't want to skip to the next block boundary, but if you are
not the first split, you need to skip to the next boundary before reading.
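
The skipping rule can be sketched as follows. This is an illustration under
stated assumptions, not the code in the attached jar: BlockBoundarySkipper,
skipToNextMarker and positionForSplit are hypothetical names, and the marker
is assumed to be the 48-bit block-header value 0x314159265359 (the
approximation of Pi mentioned in the issue description), with bits packed
most-significant-bit first. Blocks are not byte-aligned, so the search has to
slide a 48-bit window one bit at a time.

```java
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper illustrating first-split vs. later-split skipping. */
class BlockBoundarySkipper {
    // Assumed bzip2 block-header marker: the digits of Pi in BCD, 48 bits.
    static final long BLOCK_MAGIC = 0x314159265359L;
    static final long MASK_48 = 0xFFFFFFFFFFFFL;

    /**
     * Scans bit by bit for the block marker.
     * Returns the number of bits consumed up to and including the marker,
     * or -1 if the stream ends first. A real decoder would also keep the
     * leftover bits of the last byte read; this sketch discards them.
     */
    static long skipToNextMarker(InputStream in) throws IOException {
        long window = 0; // last 48 bits seen
        long bits = 0;   // total bits consumed
        int b;
        while ((b = in.read()) >= 0) {
            for (int i = 7; i >= 0; i--) {
                window = ((window << 1) | ((b >> i) & 1)) & MASK_48;
                bits++;
                if (bits >= 48 && window == BLOCK_MAGIC) {
                    return bits;
                }
            }
        }
        return -1;
    }

    /**
     * First split: read from the start, no skipping.
     * Any other split: skip forward to the next block boundary.
     */
    static long positionForSplit(InputStream in, boolean firstSplit)
            throws IOException {
        return firstSplit ? 0 : skipToNextMarker(in);
    }
}
```

For example, a stream holding two zero bytes followed by the six marker bytes
yields 64 from skipToNextMarker: 16 bits of padding plus the 48-bit marker.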
Besides, we encountered some bugs in the standard libraries that we could
easily fix because we were maintaining our own version. For instance, in the
version of CBZip2OutputStream available locally, if you open the output
stream, write nothing to it, and then try to close it, it crashes.
> want InputFormat for bzip2 files
> --------------------------------
>
> Key: HADOOP-1823
> URL: https://issues.apache.org/jira/browse/HADOOP-1823
> Project: Hadoop
> Issue Type: New Feature
> Components: mapred
> Reporter: Doug Cutting
> Attachments: bzip2.jar
>
>
> Unlike gzip, the bzip file format supports splitting. Compression is by
> blocks (900k by default) and blocks are separated by a synchronization marker
> (a 48-bit approximation of Pi). This would permit very large compressed
> files to be split into multiple map tasks, which is not currently possible
> unless using a Hadoop-specific file format.