[
https://issues.apache.org/jira/browse/HADOOP-4640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Chris Douglas updated HADOOP-4640:
----------------------------------
Status: Open (was: Patch Available)
bq. Will only skip verifying the checksums in the close method if we haven't
decompressed the whole block. That block will be verified by another split
later anyway.
The data is already decompressed, but it hasn't been read out of the codec's
buffer. Adding a new, public method instead of calculating the checksum for the
remainder of the buffered block seems like the wrong tradeoff. Something like:
{code}
public void close() throws IOException {
  // Drain the rest of the current block; the data is already decompressed,
  // it just hasn't been read out, so this forces the checksum computation.
  byte[] b = new byte[4096];
  while (!decompressor.finished()) {
    decompressor.decompress(b, 0, b.length);
  }
  super.close();
  verifyChecksums();
}
{code}
should work, right? Allocating a buffer in close is less efficient than, say,
passing the Checksum object to the codec, but it requires fewer changes to the
interfaces.
* Using a TreeSet of Long seems unnecessary when the indices are sorted. Since
the number of blocks stored in the index can be calculated from its length, a
type wrapping a long[] seems more appropriate; the member function on said type
can use Arrays::binarySearch instead of TreeSet::ceiling (see the first sketch
after this list).
* It doesn't need to be part of this patch, but it's worth noting that
splittable lzop inputs will create hot spots on the blocks storing the headers.
If this were abstracted, then the split could be annotated with the properties
of the file and the RecordReader initialized with block properties.
* The count of checksums should include both compressed and decompressed
checksums.
* Instead of {{pos + 8}} in createIndex, it would make more sense to record the
position in the stream after reading the two ints (so skipping the block uses
the more readable {{pos + compressedBlockSize + 4 * numChecksums}}).
* The only termination condition in LzoTextInputFormat::createIndex is
uncompressedBlockSize == 0. Values < 0 for uncompressedBlockSize should throw
EOFException, while values <= 0 for compressedBlockSize should throw
IOException (both checks appear in the second sketch below).
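To make the long[] point concrete, a minimal sketch (the class and method names
are illustrative, not prescribed by the patch):
{code}
/** Hypothetical wrapper around the sorted array of compressed-block offsets. */
public class LzoIndex {
  private final long[] blockPositions; // assumed sorted ascending

  public LzoIndex(long[] blockPositions) {
    this.blockPositions = blockPositions;
  }

  /**
   * Returns the smallest recorded offset >= pos, or -1 if pos is past the
   * last block; the long[] equivalent of TreeSet::ceiling.
   */
  public long findNextPosition(long pos) {
    int i = java.util.Arrays.binarySearch(blockPositions, pos);
    if (i < 0) {
      i = -i - 1; // binarySearch returns -(insertionPoint) - 1 on a miss
    }
    return i < blockPositions.length ? blockPositions[i] : -1;
  }
}
{code}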
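And a rough sketch of the createIndex loop with the suggested position
arithmetic and termination checks; the stream and field names (fileIn,
indexOut, numChecksums) are placeholders, not the patch's actual identifiers:
{code}
while (true) {
  int uncompressedBlockSize = fileIn.readInt();
  if (uncompressedBlockSize == 0) {
    break;                              // normal lzop end-of-stream marker
  } else if (uncompressedBlockSize < 0) {
    throw new EOFException("Corrupt uncompressed block size");
  }
  int compressedBlockSize = fileIn.readInt();
  if (compressedBlockSize <= 0) {
    throw new IOException("Corrupt compressed block size");
  }
  long pos = fileIn.getPos();           // position after reading the two ints
  indexOut.writeLong(pos);              // record the block offset in the index
  fileIn.seek(pos + compressedBlockSize + 4 * numChecksums);
}
{code}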
> Add ability to split text files compressed with lzo
> ---------------------------------------------------
>
> Key: HADOOP-4640
> URL: https://issues.apache.org/jira/browse/HADOOP-4640
> Project: Hadoop Core
> Issue Type: Improvement
> Components: io, mapred
> Reporter: Johan Oskarsson
> Assignee: Johan Oskarsson
> Priority: Trivial
> Fix For: 0.20.0
>
> Attachments: HADOOP-4640.patch, HADOOP-4640.patch
>
>
> Right now any file compressed with lzop will be processed by one mapper. This
> is a shame, since the lzo algorithm would be very suitable for large log files
> and similar common Hadoop data sets. The compression rate is not the best out
> there, but the decompression speed is amazing. Since lzo writes compressed
> data in blocks, it would be possible to make an input format that can split
> the files.