[
https://issues.apache.org/jira/browse/HADOOP-4012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662225#action_12662225
]
Chris Douglas commented on HADOOP-4012:
---------------------------------------
bq. Hudson informed about 4 core test case failures when it ran patch version
1. Two of them were arising due to my patch, while the other two were caused
by something else and went away on their own later. [snip]
Ah, I see. Sorry, I hadn't understood.
bq. Yes, the code is the same as submitted in HADOOP-4010. I think the issues
mentioned there do not affect the patch, and it looks fine. I posted it there,
but no one commented on it afterward, so I had to include it here as well for
my patch to work.
Marking this issue as _depending on_ or _blocked by_ the related issues, and
raising them as PA again, is better than creating a composite patch. If you
feel you've addressed the points made by the commenter, it's appropriate to
resubmit. There's a lot of traffic, and some responses will inevitably be
overlooked.
bq. Amongst the currently supported codecs, I think only BZip2 can decompress
a block of data. But I am not sure that BZip2 is the only one out there.
LZO used to be a counterexample, but it has since been removed (HADOOP-4874,
also why the current patch no longer applies). It doesn't support seeking from
arbitrary offsets, though. In any case: even recognizing the development
overhead, adding the new methods to the existing interfaces seems like the more
invasive approach, particularly since bzip2 is the only implementer. Given that
LineRecordReader, NLineInputFormat, LineReader, and FSInputChecker are all
heavily used, the prospect of accidentally destabilizing them to support an
optional piece of functionality is unnerving. Are there any specific reasons
why the patch's current approach is preferred? If one were to compress binary
data with the bzip2 codec, would it be splittable and/or readable?
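To make the alternative concrete, here is one hypothetical shape for a
separate, optional interface; the names are illustrative and not taken from
the patch:
{code:java}
import java.io.IOException;
import java.io.InputStream;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionInputStream;

// Hypothetical sketch: a block-capable codec opts in by implementing this,
// leaving CompressionCodec and its existing implementers untouched.
public interface SplittableCompressionCodec extends CompressionCodec {
  /**
   * Returns a stream that begins decompressing at the first compressed
   * block boundary at or after 'start' and stops once past 'end'.
   */
  CompressionInputStream createInputStream(InputStream seekableIn,
      long start, long end) throws IOException;
}
{code}
Callers could test {{codec instanceof SplittableCompressionCodec}} and fall
back to one split per file otherwise, so LineRecordReader and the other
heavily used classes keep their current code path.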
Thanks for your patience with this.
> Providing splitting support for bzip2 compressed files
> ------------------------------------------------------
>
> Key: HADOOP-4012
> URL: https://issues.apache.org/jira/browse/HADOOP-4012
> Project: Hadoop Core
> Issue Type: New Feature
> Components: io
> Reporter: Abdul Qadeer
> Assignee: Abdul Qadeer
> Attachments: Hadoop-4012-version1.patch, Hadoop-4012-version2.patch,
> Hadoop-4012-version3.patch, Hadoop-4012-version4.patch
>
>
> Hadoop assumes that if the input data is compressed, it cannot be split
> (mainly due to the limitation of many codecs, which need the whole input
> stream to decompress successfully). So in such a case, Hadoop prepares only
> one split per compressed file, where the lower split limit is at 0 and the
> upper limit is the end of the file. The consequence of this decision is
> that one compressed file goes to a single mapper. Although this circumvents
> the limitation of the codecs (as mentioned above), it substantially reduces
> the parallelism that splitting would otherwise make possible.
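> For reference, the splittability decision in the current input formats is
> roughly the following one-liner (paraphrased from
> org.apache.hadoop.mapred.TextInputFormat, whose codec factory maps file
> suffixes to codecs):
> {code:java}
> // Paraphrased: a file is treated as splittable only when no compression
> // codec matches its suffix, so every compressed input becomes one split.
> protected boolean isSplitable(FileSystem fs, Path file) {
>   return compressionCodecs.getCodec(file) == null;
> }
> {code}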
> BZip2 is a compression/decompression algorithm that compresses data in
> blocks, and these compressed blocks can later be decompressed independently
> of each other. This is an opportunity: instead of one BZip2-compressed file
> going to one mapper, we can process chunks of the file in parallel. The
> correctness criterion for such processing is that, for a bzip2-compressed
> file, each compressed block should be processed by exactly one mapper, and
> ultimately all the blocks of the file should be processed. (By processing
> we mean the actual use of the uncompressed data, coming out of the codec,
> in a mapper.)
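> To illustrate how a reader can synchronize at an arbitrary offset: bzip2
> block headers begin with the 48-bit magic number 0x314159265359. A
> simplified, hypothetical sketch of the scan (a real implementation must
> also reject chance matches inside compressed data):
> {code:java}
> import java.io.IOException;
> import java.io.InputStream;
>
> // Hypothetical sketch: find the next bzip2 block by scanning for the
> // 48-bit block-header magic. Blocks are bit-aligned, so the window
> // slides one bit at a time.
> public class Bzip2BlockFinder {
>   private static final long BLOCK_MAGIC = 0x314159265359L;
>   private static final long MASK_48 = 0xFFFFFFFFFFFFL;
>
>   public static long findNextBlockStart(InputStream in) throws IOException {
>     long window = 0;   // the last 48 bits read
>     long bitsRead = 0;
>     int b;
>     while ((b = in.read()) != -1) {
>       for (int i = 7; i >= 0; i--) {
>         window = ((window << 1) | ((b >>> i) & 1)) & MASK_48;
>         bitsRead++;
>         if (bitsRead >= 48 && window == BLOCK_MAGIC) {
>           return bitsRead - 48;   // bit offset where the magic begins
>         }
>       }
>     }
>     return -1;   // no further block in this stream
>   }
> }
> {code}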
> We are writing the code to implement this suggested functionality. Although
> we have used bzip2 as an example, we have tried to extend Hadoop's
> compression interfaces so that any other codec with the same capability as
> bzip2 could easily use the splitting support. The details of these changes
> will be posted when we submit the code.
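> As a hypothetical sketch of the consuming side, assuming the extended codec
> exposes something like createInputStream(in, start, end) (all names here are
> illustrative, not the patch's actual API):
> {code:java}
> import java.io.BufferedReader;
> import java.io.IOException;
> import java.io.InputStreamReader;
> import org.apache.hadoop.fs.FSDataInputStream;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.io.compress.CompressionInputStream;
>
> // Hypothetical usage: each mapper reads [start, start + length) of the
> // compressed file; the codec aligns to the first block at or after
> // 'start' and stops handing out data once past the end, so every
> // compressed block is consumed by exactly one mapper.
> public class SplitReaderSketch {
>   public static void readSplit(FileSystem fs, Path file,
>       SplittableCompressionCodec codec,  // hypothetical interface
>       long start, long length) throws IOException {
>     FSDataInputStream raw = fs.open(file);
>     CompressionInputStream in =
>         codec.createInputStream(raw, start, start + length);
>     BufferedReader reader = new BufferedReader(new InputStreamReader(in));
>     String line;
>     while ((line = reader.readLine()) != null) {
>       // process one record from this chunk
>     }
>     reader.close();
>   }
> }
> {code}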
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.