[
https://issues.apache.org/jira/browse/HADOOP-4012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653190#action_12653190
]
Abdul Qadeer commented on HADOOP-4012:
--------------------------------------
{quote}
The following change was done in this new patch. Before this change, getPos() was returning values one less than what it should be. Similarly, the available() method was returning -1 because the value of count becomes -1 at the end of the chunk.
Should this change be part of a separate issue, then? I'm not sure what you
mean by "two of the 4 bugs", but bug fixes shouldn't be part of large, new
features if the fix is unaffected by the feature.
{quote}
{color:green}
Hudson reported 4 core test case failures when it ran patch version 1. Two of them were caused by my patch, while the other two were unrelated and went away on their own later. The new code in LineRecordReader now depends on the getPos() method of FSDataInputStream to find the current stream position, instead of keeping track of this position itself. And getPos() was returning -1 at the end, when in fact it should never go below 0. So I changed the getPos() and available() methods in src/core/org/apache/hadoop/fs/FSInputChecker.java. Since these changes were necessary for my code to work correctly, I included them here.
{color}
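{color:green}
For illustration only, here is a minimal sketch of the kind of clamping described above. The class and field names (chunkPos, count, pos) are illustrative stand-ins for FSInputChecker's internal buffering state, not the actual patch.
{color}
{code:java}
// Illustrative sketch: a buffered stream whose position bookkeeping must not
// go negative when the buffer's count becomes -1 at the end of a chunk.
class BufferedChunkStream {
  private long chunkPos; // position of the underlying stream (end of chunk)
  private int count;     // bytes buffered; a sentinel -1 marks end of chunk
  private int pos;       // index of the next byte to serve from the buffer

  /** Current logical position; clamped so it can never be off by one. */
  public synchronized long getPos() {
    // Without Math.max, count == -1 at the end of a chunk would skew
    // the reported position by one.
    return chunkPos - Math.max(0, count - pos);
  }

  /** Bytes available without blocking; clamped so it can never be -1. */
  public synchronized int available() {
    return Math.max(0, count - pos);
  }
}
{code}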
{quote}
This modifies TestMultipleCacheFiles to append a newline at the end of the
file. Why is this necessary? Is this the same problem as HADOOP-4182?
{quote}
{color:green}
Yes, it is the same issue as the one mentioned in HADOOP-4182.
{color}
{quote}
Pushing the READ_MODE abstraction (and the new createInputStream) into the
CompressionCodec interface, particularly when only bzip supports it, is
inappropriate. If it's applicable to codecs other than bzip, it should be a
separate interface (extending CompressionCodec?). This would also let
instanceof replace canDecompressSplitInput and move seekBackwards to the new
interface. Can you describe what it means for a codec to implement this
superset of functions?
{quote}
{color:green}
Amongst the currently supported codecs, I think only BZip2 can decompress a block of data. But I am not sure that BZip2 is the only one out there (only a compression expert can tell which compression algorithms can decompress blocks of data without needing the whole compressed file). So the READ_MODE abstraction provides two modes of operation. The first one is "continuous", where a codec needs the whole compressed file to decompress successfully. The other mode is "By_Block", where a codec can decompress parts of a compressed file and the whole file is not needed. Hence such codecs can be used effectively with Hadoop-style splitting.

"canDecompressSplitInput" asks a codec whether it can decompress a stream starting at some arbitrary position in a compressed file. Currently Hadoop assumes that compressed files must be handled by one mapper because splitting is not possible. This assumption is not true, at least in the case of BZip2.

"seekBackwards" was needed for the cases when a Hadoop split start lies exactly on a BZip2 marker. These markers tell the algorithm that a new block is starting. For the bzip2 codec to work correctly, the stream needs to move back by about the size of the marker. Ideally this would be done in the codec class, but buffering at different levels made it difficult to move the stream backwards from within CBZip2InputStream. So "seekBackwards" means that in LineRecordReader, Hadoop asks the codec whether it needs the stream to be moved back.

Making a separate interface for BZip2-style codecs is another possibility (see the sketch below). I had two options: either to change the current interfaces or to make a new one. I took the first option, leaving the final decision to the Hadoop committers. But changing it as suggested would mean a lot of changes and testing for me. Still, if the committers think that it must be changed and cannot be accepted like this, then I will change it as suggested.
{color}
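{color:green}
To make that alternative concrete, here is a rough sketch of what a separate interface along the lines the reviewer suggests might look like. All names here (SplittableCompressionCodec, the READ_MODE values, the createInputStream signature) are illustrative assumptions, not the code in the attached patches.
{color}
{code:java}
import java.io.IOException;
import java.io.InputStream;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionInputStream;
import org.apache.hadoop.io.compress.Decompressor;

// Hypothetical separate interface for codecs that can decompress a split
// starting at an arbitrary offset. With this in place, an instanceof check
// replaces canDecompressSplitInput(), and any backwards seek over a block
// marker can live behind createInputStream() instead of in LineRecordReader.
public interface SplittableCompressionCodec extends CompressionCodec {

  /** CONTINUOUS: the whole file is needed. BYBLOCK: blocks decompress
   *  independently, so Hadoop-style splits are possible. */
  enum READ_MODE { CONTINUOUS, BYBLOCK }

  /**
   * Returns a stream that starts decompressing at (or just after) the
   * given split start and stops at the split end.
   */
  CompressionInputStream createInputStream(InputStream seekableIn,
      Decompressor decompressor, long start, long end, READ_MODE readMode)
      throws IOException;
}
{code}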
{quote}
This patch incorporates HADOOP-4010: Shouldn't this remain with the original
JIRA? Are the issues raised there addressed in this patch?
{quote}
{color:green}
Yes, the code is the same as submitted in HADOOP-4010. I think the issues mentioned there do not affect the patch, and it looks fine. I posted it there but no one commented on it later, so I had to include it here as well for my patch to work.
{color}
{quote}
Does this add the Seekable interface to CompressionInputStream only to support
getPos() for LineRecordReader?
{quote}
{color:green}
Yes, that is right.
{color}
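{color:green}
As a rough illustration (not the patch's exact code), the delegation LineRecordReader needs could look like the following. The class name is hypothetical, and seek() stays unsupported because random access into a compressed stream is not generally possible.
{color}
{code:java}
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import org.apache.hadoop.fs.Seekable;

// Hypothetical sketch: expose getPos() on a compressed input stream by
// delegating to the underlying Seekable stream, which is all that
// LineRecordReader needs to track its progress through the split.
class PositionedCompressionStream extends FilterInputStream implements Seekable {
  PositionedCompressionStream(InputStream in) {
    super(in);
  }

  public long getPos() throws IOException {
    return ((Seekable) in).getPos(); // position in the compressed stream
  }

  public void seek(long pos) throws IOException {
    throw new UnsupportedOperationException("seek not supported");
  }

  public boolean seekToNewSource(long targetPos) throws IOException {
    return false; // no alternative data source to switch to
  }
}
{code}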
{quote}
This affects too many core components to make the feature freeze for 0.20 (Fri)
{quote}
{color:green}
I think if release 0.20 is planned soon, we can change the affected version of this issue to the next one.
{color}
> Providing splitting support for bzip2 compressed files
> ------------------------------------------------------
>
> Key: HADOOP-4012
> URL: https://issues.apache.org/jira/browse/HADOOP-4012
> Project: Hadoop Core
> Issue Type: New Feature
> Components: io
> Reporter: Abdul Qadeer
> Assignee: Abdul Qadeer
> Attachments: Hadoop-4012-version1.patch, Hadoop-4012-version2.patch,
> Hadoop-4012-version3.patch, Hadoop-4012-version4.patch
>
>
> Hadoop assumes that if the input data is compressed, it cannot be split
> (mainly due to the limitation of many codecs that need the whole input
> stream to decompress successfully). So in such a case, Hadoop prepares only
> one split per compressed file, where the lower split limit is at 0 while the
> upper limit is the end of the file. The consequence of this decision is
> that one compressed file goes to a single mapper. Although this circumvents
> the limitation of codecs mentioned above, it reduces the parallelism
> substantially compared to what splitting would otherwise make possible.
> BZip2 is a compression/decompression algorithm that performs compression on
> blocks of data, and these compressed blocks can later be decompressed
> independently of each other. This is an opportunity: instead of one
> BZip2-compressed file going to one mapper, we can process chunks of the file
> in parallel. The correctness criterion of such processing is that, for a
> bzip2-compressed file, each compressed block should be processed by only one
> mapper and ultimately all the blocks of the file should be processed. (By
> processing we mean the actual utilization of the uncompressed data, coming
> out of the codec, in a mapper.)
> We are writing the code to implement this suggested functionality. Although
> we have used bzip2 as an example, we have tried to extend Hadoop's
> compression interfaces so that any other codec with the same capability as
> bzip2 could easily use the splitting support. The details of these changes
> will be posted when we submit the code.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.