[
https://issues.apache.org/jira/browse/HADOOP-2402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12551584
]
Chris Douglas commented on HADOOP-2402:
---------------------------------------
In the native libs, it looks like io.file.buffer.size determines the maximum
size of each copy to the OutputStream from the buffer containing the
compressed data. The GzipCodec and LzoCodec define their own properties and
defaults for the size of this native buffer (both 64k). The reasoning went:
if the caller's buffer is larger than the native lib's buffer, the write will
still block until the native buffer has been flushed to the OutputStream. If
the buffer is io.file.buffer.size (defaulting to 4k), then the codec is handed
data 4k at a time. For Lzo, this means it will compress no more than 4k at a
time, yielding even less than 20% compression.
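As an illustration (not a committed fix), buffering in front of the codec
would let each call into the native lib carry a full 64k chunk. The
BufferedOutputStream wrap and the conf/fs/path names below are assumptions
for the sketch:

    // Sketch: buffer writes so the codec sees up to 64k (its default
    // native buffer size) per call, rather than io.file.buffer.size (4k)
    // pieces arriving one at a time.
    CompressionCodec codec = ReflectionUtils.newInstance(LzoCodec.class, conf);
    OutputStream out =
        new BufferedOutputStream(codec.createOutputStream(fs.create(path)),
                                 64 * 1024);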
We could introduce a new property that sets the size of this buffer, or reuse
the property given to Gzip/Lzo, but neither option is very attractive.
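If we did go the property route, the plumbing might look like the following;
the key name io.compression.buffer.size is hypothetical, not an existing
setting:

    // Hypothetical key, shown only to mark where a shared buffer-size
    // property would plug in.
    int bufferSize = conf.getInt("io.compression.buffer.size", 64 * 1024);
    OutputStream out =
        new BufferedOutputStream(codec.createOutputStream(rawOut), bufferSize);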
LzoCodec returns a stream wrapped in a BlockCompressorStream, but that wrapper
doesn't provide any buffering; it only ensures that no more than
MAX_INPUT_SIZE (defaulting to 64k less the compression overhead) is compressed
at once. This might be a better place to add some buffering, but then the
codec would be returning a buffered stream.
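To make that alternative concrete, here is a rough sketch of buffering at that
layer. The class name and the overhead constant are illustrative; this is not
what LzoCodec does today:

    import java.io.FilterOutputStream;
    import java.io.IOException;
    import java.io.OutputStream;

    // Illustrative only: collect small writes into full blocks so the
    // wrapped BlockCompressorStream compresses MAX_INPUT_SIZE bytes at a
    // time instead of one block per TextOutputFormat write.
    class BufferedBlockStream extends FilterOutputStream {
      // Stand-in for 64k less the compression overhead mentioned above.
      private static final int MAX_INPUT_SIZE = 64 * 1024 - 4 * 1024;
      private final byte[] buf = new byte[MAX_INPUT_SIZE];
      private int count = 0;

      BufferedBlockStream(OutputStream out) { super(out); }

      @Override
      public void write(int b) throws IOException {
        if (count == buf.length) flushBuffer();
        buf[count++] = (byte) b;
      }

      @Override
      public void write(byte[] b, int off, int len) throws IOException {
        while (len > 0) {
          int n = Math.min(len, buf.length - count);
          System.arraycopy(b, off, buf, count, n);
          count += n; off += n; len -= n;
          if (count == buf.length) flushBuffer();
        }
      }

      @Override
      public void flush() throws IOException {
        flushBuffer();
        out.flush();
      }

      private void flushBuffer() throws IOException {
        if (count > 0) {
          out.write(buf, 0, count); // one full block per compress call
          count = 0;
        }
      }
    }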
> Lzo compression compresses each write from TextOutputFormat
> -----------------------------------------------------------
>
> Key: HADOOP-2402
> URL: https://issues.apache.org/jira/browse/HADOOP-2402
> Project: Hadoop
> Issue Type: Bug
> Components: io, mapred, native
> Reporter: Chris Douglas
> Fix For: 0.16.0
>
> Attachments: 2402-0.patch
>
>
> Outputting with TextOutputFormat and Lzo compression generates a file such
> that each key, tab delimiter, and value are compressed separately.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.