[ https://issues.apache.org/jira/browse/HADOOP-4640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-4640:
----------------------------------

    Status: Open  (was: Patch Available)

bq. Will only skip verifying the checksums in the close method if we haven't 
decompressed the whole block. That block will be verified by another split 
later anyway.
The data is already decompressed, but it hasn't been read out of the codec's 
buffer. Adding a new, public method instead of calculating the checksum for the 
remainder of the buffered block seems like the wrong tradeoff. Something like:
{code}
public void close() throws IOException {
  // Drain whatever is still buffered in the decompressor so the whole
  // block passes through and its checksums are accumulated.
  byte[] b = new byte[4096];
  while (!decompressor.finished()) {
    decompressor.decompress(b, 0, b.length);
  }
  super.close();
  verifyChecksums();
}
{code}
should work, right? Allocating a buffer in close() is less efficient than, say, 
passing the Checksum object to the codec, but it requires fewer changes to the 
interfaces.

* Using a {{TreeSet}} of Long seems unnecessary when the indices are sorted. 
Since the number of blocks stored in the index can be calculated from its 
length, a type wrapping a {{long[]}} seems more appropriate; the member 
function on that type can use {{Arrays::binarySearch}} instead of 
{{TreeSet::ceiling}} (see the first sketch after this list).
* It doesn't need to be part of this patch, but it's worth noting that 
splittable lzop inputs will create hot spots at the blocks storing the headers, 
since every split must re-read the file header. If this were abstracted, the 
split could be annotated with the properties of the file and the RecordReader 
initialized with those block properties.
* The count of checksums should include both compressed and decompressed 
checksums.
* Instead of {{pos + 8}} in createIndex, it would make more sense to record the 
position in the stream after reading the two ints, so that skipping the block 
uses the more readable {{pos + compressedBlockSize + 4 * numChecksums}} (see 
the loop sketch after this list).
* The only termination condition in LzoTextInputFormat::createIndex is 
{{uncompressedBlockSize == 0}}. Values < 0 for uncompressedBlockSize should 
throw EOFException, while values <= 0 for compressedBlockSize should throw 
IOException.

> Add ability to split text files compressed with lzo
> ---------------------------------------------------
>
>                 Key: HADOOP-4640
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4640
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: io, mapred
>            Reporter: Johan Oskarsson
>            Assignee: Johan Oskarsson
>            Priority: Trivial
>             Fix For: 0.20.0
>
>         Attachments: HADOOP-4640.patch, HADOOP-4640.patch
>
>
> Right now any file compressed with lzop will be processed by one mapper. This 
> is a shame since the lzo algorithm would be very suitable for large log files 
> and similar common Hadoop data sets. The compression rate is not the best out 
> there, but the decompression speed is amazing. Since lzo writes compressed 
> data in blocks it would be possible to make an input format that can split 
> the files.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
