[
https://issues.apache.org/jira/browse/MAPREDUCE-830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12749333#action_12749333
]
Chris Douglas commented on MAPREDUCE-830:
-----------------------------------------
(related comments in HADOOP-4012)
* Though bzip does not change it, {{getEnd}} is part of the API, so
{{LineRecordReader}} should call it.
* Since the codec has state, the API demands that {{LineRecordReader}}
synchronize on the codec before creating a splittable stream and calling
{{getStart}} and {{getEnd}} to avoid race conditions (unless a better solution
is found in HADOOP-4012).
* The default dir for unit tests is usually "/tmp", not ".".
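
To illustrate the synchronization point above, here is a minimal, self-contained sketch. It does not use the Hadoop classes: {{StatefulCodec}}, {{createSplitStream}}, {{SplitRange}} and {{openSplit}} are hypothetical stand-ins, and the block-boundary rounding is invented purely so the example has observable state. It only shows the pattern of holding the codec's lock across stream creation and the {{getStart}}/{{getEnd}} reads so that another thread cannot overwrite the offsets in between.

```java
final class CodecSyncSketch {

    /** Hypothetical stateful codec; NOT the Hadoop API. */
    static final class StatefulCodec {
        private final long blockSize;
        private long start; // state overwritten by every createSplitStream call
        private long end;

        StatefulCodec(long blockSize) {
            this.blockSize = blockSize;
        }

        /** Assumed behaviour: snap both offsets up to the next block boundary. */
        void createSplitStream(long requestedStart, long requestedEnd) {
            start = roundUp(requestedStart);
            end = roundUp(requestedEnd);
        }

        long getStart() {
            return start;
        }

        long getEnd() {
            return end;
        }

        private long roundUp(long offset) {
            return ((offset + blockSize - 1) / blockSize) * blockSize;
        }
    }

    /** Immutable copy of the adjusted offsets, safe to use after the lock is released. */
    static final class SplitRange {
        final long start;
        final long end;

        SplitRange(long start, long end) {
            this.start = start;
            this.end = end;
        }
    }

    /** What a record reader would do: create the stream and read both offsets under one lock. */
    static SplitRange openSplit(StatefulCodec codec, long start, long end) {
        synchronized (codec) {
            codec.createSplitStream(start, end);
            return new SplitRange(codec.getStart(), codec.getEnd());
        }
    }

    public static void main(String[] args) {
        StatefulCodec codec = new StatefulCodec(100);
        SplitRange r = openSplit(codec, 150, 320);
        System.out.println(r.start + " " + r.end); // prints "200 400"
    }
}
```

The sketch copies the offsets into an immutable {{SplitRange}} before releasing the lock, so later calls on the same codec from other readers cannot affect a split that is already open. Both {{getStart}} and {{getEnd}} are read, in line with the first point.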
> Providing BZip2 splitting support for Text data
> -----------------------------------------------
>
> Key: MAPREDUCE-830
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-830
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Affects Versions: 0.21.0
> Reporter: Abdul Qadeer
> Assignee: Abdul Qadeer
> Fix For: 0.21.0
>
> Attachments: MapReduce-830-version1.patch
>
>
> HADOOP-4012 (https://issues.apache.org/jira/browse/HADOOP-4012) adds
> support for splitting BZip2-compressed input at arbitrary points. This JIRA
> uses that functionality in LineRecordReader. The benefit is that if a user
> provides BZip2-compressed Text data, Hadoop will split it so that multiple
> mappers process it, allowing BZip2-compressed data to make full use of the
> cluster. Currently a BZip2-compressed Text file goes to a single mapper and
> is not split, so the enhancement in this JIRA provides splitting support and
> a considerable performance gain.