[ https://issues.apache.org/jira/browse/HADOOP-4012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653190#action_12653190 ]

Abdul Qadeer commented on HADOOP-4012:
--------------------------------------

{quote}
    The following change was done in this new patch. Before this change,
    getPos() was returning values one less than what it should be. Similarly,
    the available() method was returning -1 because the value of count
    becomes -1 at the end of the chunk.

Should this change be part of a separate issue, then? I'm not sure what you 
mean by "two of the 4 bugs", but bug fixes shouldn't be part of large, new 
features if the fix is unaffected by the feature.
{quote}

{color:green}
Hudson reported four core test case failures when it ran patch version 1.
Two of them were caused by my patch, while the other two were unrelated and
later went away on their own.  The new code in LineRecordReader now depends
on the getPos() method of FSDataInputStream to find the current stream
position instead of keeping track of this position itself.  And getPos()
was returning -1 at the end, when in fact it should never have gone below
0.  So I changed the getPos() and available() methods in
src/core/org/apache/hadoop/fs/FSInputChecker.java.  Since these changes
were necessary for my code to work correctly, I included them here.
{color}
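
For concreteness, here is a simplified stand-alone sketch of the kind of
clamping described above. The field names echo FSInputChecker's buffer
bookkeeping, but this is only an illustration, not the actual patch:

{code:java}
// Simplified illustration of the getPos()/available() fix.
// 'count' is the number of valid bytes in the read buffer and 'pos' is
// the read offset within it; at the end of a chunk 'count' can become -1,
// so the unread-bytes term (count - pos) must be clamped at zero.
public class ClampedPositionSketch {
    private long chunkPos = 100; // stream position at the buffer's end (example)
    private int count = -1;      // goes to -1 at the end of a chunk
    private int pos = 0;         // read offset within the buffer

    public long getPos() {
        // Without the clamp, a negative (count - pos) skews the result
        // by one byte at the end of a chunk.
        return chunkPos - Math.max(0, count - pos);
    }

    public int available() {
        // Without the clamp, this returned -1 at the end of a chunk.
        return Math.max(0, count - pos);
    }

    public static void main(String[] args) {
        ClampedPositionSketch s = new ClampedPositionSketch();
        System.out.println(s.getPos());    // 100 (no off-by-one)
        System.out.println(s.available()); // 0 (never negative)
    }
}
{code}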

{quote}
This modifies TestMultipleCacheFiles to append a newline at the end of the 
file. Why is this necessary? Is this the same problem as HADOOP-4182?
{quote}

{color:green}
Yes, it is the same issue as the one mentioned in HADOOP-4182.
{color}

{quote}
Pushing the READ_MODE abstraction (and the new createInputStream) into the 
CompressionCodec interface, particularly when only bzip supports it, is 
inappropriate. If it's applicable to codecs other than bzip, it should be a 
separate interface (extending CompressionCodec?). This would also let 
instanceof replace canDecompressSplitInput and move seekBackwards to the new 
interface. Can you describe what it means for a codec to implement this 
superset of functions?
{quote}

{color:green}
Amongst the currently supported codecs, I think only BZip2 can decompress a
block of data.  But I am not sure that BZip2 is the only one out there
(only a compression expert can tell which compression algorithms can
decompress blocks of data without needing the whole compressed file).  So
the READ_MODE abstraction provides two modes of operation.  The first is
"continuous", where a codec needs the whole compressed file to decompress
successfully.  The other is "By_Block", where a codec can decompress parts
of a compressed file and the whole file is not needed.  Such codecs can
therefore be used effectively with Hadoop-style splitting.

"canDecompressSplitInput" asks the codecs that can it decompress given a stream 
starting  at some random position in a compressed file.  Currently Hadoop 
assumes that compressed  files must be handeled by one mapper because splitting 
is not possible.  This assumption 
is not ture, atleast in the case of BZip2.

"seekBackwards" was needed for the cases when Hadoop split start lies exactly 
on a BZip2  marker.  These markers are identifications for the algorithm that a 
new block is starting.  To work the bzip2 codec correctly, the stream needs to 
move back about as much  as the size of the Marker.  Ideally this should be 
done in codec class but buffering at different level, made it difficult to move 
the stream backwards sitting in the 
CBZip2InputStream.  So "seekBackwards" means that in LineRecordReader, Hadoop 
asks the codecs that does it need to move back the stream.

Making a separate interface for BZip2-style codecs is another possibility.
I had two options: either change the current interfaces or make a new one.
I took the first option, leaving the final decision to the Hadoop
committers.  Changing it as suggested would mean a lot of changes and
testing for me, but if the committers think it must be changed and cannot
be accepted like this, then I will change it as suggested.
{color}
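
To make these shapes concrete, here is a rough sketch of the pieces
described above: the read-mode hint, the split-capability query, the
seek-back amount, and how a record reader might consult them.  The type
and method names here are illustrative stand-ins, and the bodies are
simplified, not the patch itself:

{code:java}
// Illustrative sketch only: names are stand-ins for the patch's interfaces.
import java.io.IOException;
import java.io.InputStream;

enum ReadMode {
    CONTINUOUS, // codec needs the whole compressed file to decompress
    BYBLOCK     // codec can decompress individual blocks, so splits work
}

// A codec that may support decompression from the middle of a file.
interface BlockDecompressingCodec {
    // Can this codec decompress a stream starting at an arbitrary
    // position inside a compressed file (i.e. at a split boundary)?
    boolean canDecompressSplitInput();

    // How far the stream must move back before the split start so the
    // codec can re-synchronize on a block marker (roughly the bzip2
    // block-marker size for bzip2; 0 for codecs that need no seek-back).
    long seekBackwards();

    InputStream createInputStream(InputStream in, ReadMode mode)
            throws IOException;
}

// Minimal stand-in for a positioned, seekable file stream.
abstract class SeekableStream extends InputStream {
    abstract void seek(long pos) throws IOException;
}

class SplitOpenSketch {
    // Record-reader side: position the stream for a split, backing up in
    // case the split start lands exactly on a block marker.
    static InputStream openForSplit(BlockDecompressingCodec codec,
                                    SeekableStream file, long splitStart)
            throws IOException {
        if (codec.canDecompressSplitInput()) {
            file.seek(Math.max(0, splitStart - codec.seekBackwards()));
            return codec.createInputStream(file, ReadMode.BYBLOCK);
        }
        file.seek(0); // fall back: the whole file goes to one mapper
        return codec.createInputStream(file, ReadMode.CONTINUOUS);
    }
}
{code}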

{quote}
This patch incorporates HADOOP-4010: Shouldn't this remain with the original 
JIRA? Are the issues raised there addressed in this patch?
{quote}

{color:green}
Yes, the code is the same as submitted in HADOOP-4010.  I think the issues
mentioned there do not affect the patch, and it looks fine.  I posted it
there but no one commented on it afterwards, so I had to include it here as
well for my patch to work.
{color}

{quote}
Does this add the Seekable interface to CompressionInputStream only to support 
getPos() for LineRecordReader?
{quote}

{color:green}
Yes, that is right.
{color}
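
In other words, something along these lines: getPos() delegates to the
underlying stream, and real seeking stays unsupported.  This is a sketch
under that assumption, with illustrative names, not the committed code:

{code:java}
// Sketch: a compression input stream exposing a Seekable-style getPos()
// by delegating to the underlying stream, which is all LineRecordReader
// needs to track its position in the compressed file.
import java.io.IOException;
import java.io.InputStream;

interface PositionedStream {
    long getPos() throws IOException;
}

abstract class CompressionInputStreamSketch extends InputStream {
    protected final PositionedStream in; // underlying stream knows its position

    protected CompressionInputStreamSketch(PositionedStream in) {
        this.in = in;
    }

    // Delegate: report the position of the underlying compressed stream.
    public long getPos() throws IOException {
        return in.getPos();
    }

    // Arbitrary seeking inside compressed data is not meaningful here.
    public void seek(long pos) throws IOException {
        throw new UnsupportedOperationException("seek is not supported");
    }
}
{code}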

{quote}
This affects too many core components to make the feature freeze for 0.20 (Fri)
{quote}

{color:green}
I think if release 0.20 is planned soon, we can change this issue's
affected version to the next one.
{color}

> Providing splitting support for bzip2 compressed files
> ------------------------------------------------------
>
>                 Key: HADOOP-4012
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4012
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: io
>            Reporter: Abdul Qadeer
>            Assignee: Abdul Qadeer
>         Attachments: Hadoop-4012-version1.patch, Hadoop-4012-version2.patch, 
> Hadoop-4012-version3.patch, Hadoop-4012-version4.patch
>
>
> Hadoop assumes that if the input data is compressed, it cannot be split 
> (mainly due to the limitation of many codecs, which need the whole input 
> stream to decompress successfully).  So in such a case, Hadoop prepares 
> only one split per compressed file, where the lower split limit is at 0 
> and the upper limit is the end of the file.  The consequence of this 
> decision is that one compressed file goes to a single mapper.  Although 
> this circumvents the limitation of the codecs (as mentioned above), it 
> substantially reduces the parallelism that splitting would otherwise make 
> possible.
> BZip2 is a compression/decompression algorithm which performs compression 
> on blocks of data, and these compressed blocks can later be decompressed 
> independently of each other.  This presents an opportunity: instead of one 
> BZip2-compressed file going to one mapper, we can process chunks of the 
> file in parallel.  The correctness criterion for such processing is that, 
> for a bzip2-compressed file, each compressed block should be processed by 
> only one mapper and ultimately all the blocks of the file should be 
> processed.  (By processing we mean the actual utilization of the 
> uncompressed data, coming out of the codec, in a mapper.)
> We are writing the code to implement this suggested functionality.  
> Although we have used bzip2 as an example, we have tried to extend 
> Hadoop's compression interfaces so that any other codec with the same 
> capability as bzip2 could easily use the splitting support.  The details 
> of these changes will be posted when we submit the code.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
