[
https://issues.apache.org/jira/browse/HADOOP-2402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12551584
]
Chris Douglas commented on HADOOP-2402:
---------------------------------------
In the native libs, it looks like io.file.buffer.size determines the maximum
size of each copy to the OutputStream from the buffer containing the
compressed data. The GzipCodec and LzoCodec define their own properties and
defaults for the size of this native buffer (both 64k). The reasoning went:
if the caller's buffer is larger than the native lib's buffer, the write will
still block until the native buffer has been flushed to the OutputStream. If
the buffer is io.file.buffer.size (defaulting to 4k), then the codec is handed
data 4k at a time. For Lzo, this means it will compress no more than 4k at a
time, yielding even less than 20% compression.
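As an illustration (not a committed fix), buffering in front of the codec
would let each call into the native lib carry a full 64k chunk. The
BufferedOutputStream wrap and the conf/fs/path names below are assumptions
for the sketch:

    // Sketch: buffer writes so the codec sees up to 64k (its default
    // native buffer size) per call, rather than io.file.buffer.size (4k)
    // pieces arriving one at a time.
    CompressionCodec codec = ReflectionUtils.newInstance(LzoCodec.class, conf);
    OutputStream out =
        new BufferedOutputStream(codec.createOutputStream(fs.create(path)),
                                 64 * 1024);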
We could introduce a new property that sets the size of this buffer, or reuse
the property given to Gzip/Lzo, but neither option is very attractive.
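If we did go the property route, the plumbing might look like the following;
the key name io.compression.buffer.size is hypothetical, not an existing
setting:

    // Hypothetical key, shown only to mark where a shared buffer-size
    // property would plug in.
    int bufferSize = conf.getInt("io.compression.buffer.size", 64 * 1024);
    OutputStream out =
        new BufferedOutputStream(codec.createOutputStream(rawOut), bufferSize);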
LzoCodec returns a stream wrapped in a BlockCompressorStream, but that wrapper
doesn't provide any buffering; it only ensures that no more than
MAX_INPUT_SIZE (defaulting to 64k less the compression overhead) is compressed
at once. This might be a better place to add some buffering, but then the
codec would be returning a buffered stream.
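To make that alternative concrete, here is a rough sketch of buffering at that
layer. The class name and the overhead constant are illustrative; this is not
what LzoCodec does today:

    import java.io.FilterOutputStream;
    import java.io.IOException;
    import java.io.OutputStream;

    // Illustrative only: collect small writes into full blocks so the
    // wrapped BlockCompressorStream compresses MAX_INPUT_SIZE bytes at a
    // time instead of one block per TextOutputFormat write.
    class BufferedBlockStream extends FilterOutputStream {
      // Stand-in for 64k less the compression overhead mentioned above.
      private static final int MAX_INPUT_SIZE = 64 * 1024 - 4 * 1024;
      private final byte[] buf = new byte[MAX_INPUT_SIZE];
      private int count = 0;

      BufferedBlockStream(OutputStream out) { super(out); }

      @Override
      public void write(int b) throws IOException {
        if (count == buf.length) flushBuffer();
        buf[count++] = (byte) b;
      }

      @Override
      public void write(byte[] b, int off, int len) throws IOException {
        while (len > 0) {
          int n = Math.min(len, buf.length - count);
          System.arraycopy(b, off, buf, count, n);
          count += n; off += n; len -= n;
          if (count == buf.length) flushBuffer();
        }
      }

      @Override
      public void flush() throws IOException {
        flushBuffer();
        out.flush();
      }

      private void flushBuffer() throws IOException {
        if (count > 0) {
          out.write(buf, 0, count); // one full block per compress call
          count = 0;
        }
      }
    }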
> Lzo compression compresses each write from TextOutputFormat
> -----------------------------------------------------------
>
> Key: HADOOP-2402
> URL: https://issues.apache.org/jira/browse/HADOOP-2402
> Project: Hadoop
> Issue Type: Bug
> Components: io, mapred, native
> Reporter: Chris Douglas
> Fix For: 0.16.0
>
> Attachments: 2402-0.patch
>
>
> Outputting with TextOutputFormat and Lzo compression generates a file such
> that each key, tab delimiter, and value are compressed separately.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.