Jens,
As your test shows, using a native codec won't make much sense for
small files, since the JNI overhead involved will likely outweigh
any possible gains. With all the performance improvements in Java 5
and 6, it's reasonable to ask whether the native implementation really
improves performance. I'd look at it as another option for squeezing
out some more performance if you really need to.
- Stefan
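
For reference, the pure-Java path can be exercised with the JDK alone. Here's a minimal sketch (the class name, payload size, and timing granularity are illustrative, not from Jens's benchmark) that round-trips a KB-scale buffer through GZIPOutputStream/GZIPInputStream without crossing any JNI boundary:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class VanillaGzipDemo {
    public static void main(String[] args) throws Exception {
        // Build a small gzip payload in memory, comparable to a KB-sized .gz file.
        byte[] original = new byte[4096];
        Arrays.fill(original, (byte) 'x');
        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(compressed)) {
            gz.write(original);
        }

        // Decompress with the built-in stream -- no native library, no JNI call.
        long start = System.nanoTime();
        ByteArrayOutputStream restored = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(
                new ByteArrayInputStream(compressed.toByteArray()))) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                restored.write(buf, 0, n);
            }
        }
        long micros = (System.nanoTime() - start) / 1000;

        if (!Arrays.equals(original, restored.toByteArray()))
            throw new AssertionError("round-trip mismatch");
        System.out.println("decompressed " + original.length
                + " bytes in " + micros + " us");
    }
}
```

On payloads this small, the per-file fixed costs (stream setup, JNI crossings for a native codec) dominate, which is consistent with the numbers below.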
On Sun, May 10, 2009 at 11:03 AM, Jens Riboe wrote:
> Hi,
>
> During the past week I decided to use native decompress for a Hadoop job
> (using 0.20.0). But before implementing it I decided to write a small
> benchmark just to understand how much faster (better) it was. The result
> came as a surprise:
>
> May 6, 2009 10:56:47 PM org.apache.hadoop.util.NativeCodeLoader
> INFO: Loaded the native-hadoop library
> May 6, 2009 10:56:47 PM org.apache.hadoop.io.compress.zlib.ZlibFactory
> INFO: Successfully loaded & initialized native-zlib library
> May 6, 2009 10:56:47 PM org.apache.hadoop.io.compress.CodecPool getDecompressor
> INFO: Got brand-new decompressor
> Time of Hadoop decompressor running 'small' job = 0:00:01.684 (1.684 ms/file)
> Time of Hadoop decompressor running 'large' job = 0:00:10.074 (1007.400 ms/file)
> Time of Vanilla decompressor running 'small' job = 0:00:01.340 (1.340 ms/file)
> Time of Vanilla decompressor running 'large' job = 0:00:10.094 (1009.400 ms/file)
> Hadoop vs. Vanilla [small]: 125.67%
> Hadoop vs. Vanilla [large]: 99.80%
>
> For a small file, Hadoop native decompress takes 25% longer than Java's
> built-in GZIPInputStream, and for files a few megabytes in size the speed
> difference is negligible.
>
> I wrote a blog post about it which also contains the full source code of the
> benchmark.
> http://blog.ribomation.com/2009/05/07/comparison-of-decompress-ways-in-hadoop/
>
> My questions are:
> [1] Am I missing some key information about how to correctly use native GZIP
> compress?
> I'm using codec pooling, by the way.
>
> [2] Will native decompress only pay off for files larger than 100MB or
> 1000MB?
> In my application I'm reading many KB-sized gz files from an
> external source,
> so I cannot change the compression method or the file size.
>
> [3] Has anybody experienced something similar to my result?
>
>
> Kind regards /jens
>
>
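
On question [1], the idea behind codec pooling -- create a decompressor once and reuse it across files instead of paying allocation cost per file -- can be illustrated with the JDK alone. This is a hedged sketch using java.util.zip.Inflater (the class name and payloads are illustrative; Hadoop's CodecPool wraps its own Decompressor objects, but the reuse pattern is the same):

```java
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class InflaterReuseDemo {
    public static void main(String[] args) throws Exception {
        Inflater inflater = new Inflater(); // the "pooled" instance, created once
        byte[] out = new byte[256];

        for (int file = 0; file < 3; file++) {
            // Fake a small compressed "file" (zlib format, which Inflater expects).
            byte[] input = ("payload-" + file).getBytes("US-ASCII");
            Deflater deflater = new Deflater();
            deflater.setInput(input);
            deflater.finish();
            byte[] compressed = new byte[256];
            int clen = deflater.deflate(compressed);
            deflater.end();

            // Reuse the same Inflater for every file; reset() clears its state,
            // analogous to returning a decompressor to the pool and taking it back.
            inflater.setInput(compressed, 0, clen);
            int n = inflater.inflate(out);
            inflater.reset();
            System.out.println(new String(out, 0, n, "US-ASCII"));
        }
        inflater.end();
    }
}
```

Pooling amortizes object setup, but it cannot remove the per-call JNI crossing a native codec makes, which is why reuse alone doesn't close the gap on KB-sized files.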