Hi,

During the past week I decided to use native decompression for a Hadoop job
(using 0.20.0). But before implementing it I wrote a small benchmark, just to
understand how much faster (better) it would be. The result came as a
surprise:

May 6, 2009 10:56:47 PM org.apache.hadoop.util.NativeCodeLoader <clinit>
INFO: Loaded the native-hadoop library
May 6, 2009 10:56:47 PM org.apache.hadoop.io.compress.zlib.ZlibFactory <clinit>
INFO: Successfully loaded & initialized native-zlib library
May 6, 2009 10:56:47 PM org.apache.hadoop.io.compress.CodecPool getDecompressor
INFO: Got brand-new decompressor
Time of Hadoop  decompressor running 'small' job = 0:00:01.684 (1.684 ms/file)
Time of Hadoop  decompressor running 'large' job = 0:00:10.074 (1007.400 ms/file)
Time of Vanilla decompressor running 'small' job = 0:00:01.340 (1.340 ms/file)
Time of Vanilla decompressor running 'large' job = 0:00:10.094 (1009.400 ms/file)
Hadoop vs. Vanilla [small]: 125.67%
Hadoop vs. Vanilla [large]: 99.80%

For a small file, Hadoop's native decompression takes about 25% longer to run
than Java's built-in GZIPInputStream, and for a file a few megabytes in size
the speed difference is negligible.

I wrote a blog post about it, which also contains the full source code of the
benchmark:
http://blog.ribomation.com/2009/05/07/comparison-of-decompress-ways-in-hadoop/
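
In essence, the two code paths being compared look roughly like this (a
simplified sketch, not the exact code from the post; timing, the list of
input files, and error handling are omitted):

import java.io.FileInputStream;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CodecPool;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.Decompressor;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class DecompressPaths {

    public static void main(String[] args) throws Exception {
        String gzFile = args[0];
        Configuration conf = new Configuration();
        hadoopDecompress(gzFile, conf);
        vanillaDecompress(gzFile);
    }

    // Path 1: Hadoop GzipCodec with a pooled (native, if loaded) decompressor.
    static void hadoopDecompress(String gzFile, Configuration conf) throws Exception {
        CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);
        Decompressor decompressor = CodecPool.getDecompressor(codec);
        InputStream in = codec.createInputStream(new FileInputStream(gzFile), decompressor);
        try {
            drain(in);
        } finally {
            in.close();
            CodecPool.returnDecompressor(decompressor);
        }
    }

    // Path 2: plain java.util.zip ("vanilla") decompression.
    static void vanillaDecompress(String gzFile) throws Exception {
        InputStream in = new GZIPInputStream(new FileInputStream(gzFile));
        try {
            drain(in);
        } finally {
            in.close();
        }
    }

    // Read the whole stream and throw the bytes away.
    private static void drain(InputStream in) throws Exception {
        byte[] buf = new byte[64 * 1024];
        while (in.read(buf) != -1) {
            // discard
        }
    }
}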

My questions are:
[1]  Am I missing some key piece of how to correctly use native GZIP
decompression?
        I'm using codec pooling, by the way (as in the sketch above; see also
        the native-loading check after these questions).

[2]  Does native decompression only pay off for files larger than 100 MB or
1000 MB?
        In my application I'm reading many KB-sized gz files from an external
        source, so I cannot change the compression method or the file size.

[3]  Has anybody seen results similar to mine?
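
Regarding [1]: the sanity check I use to confirm that the native path is
actually active looks roughly like this (a minimal sketch; the printouts are
just my own way of eyeballing which implementation the pool hands back):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CodecPool;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.Decompressor;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.io.compress.zlib.ZlibFactory;
import org.apache.hadoop.util.NativeCodeLoader;
import org.apache.hadoop.util.ReflectionUtils;

public class NativeCheck {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Both should print true when libhadoop and native zlib are found
        // on java.library.path.
        System.out.println("native hadoop loaded: " + NativeCodeLoader.isNativeCodeLoaded());
        System.out.println("native zlib loaded:   " + ZlibFactory.isNativeZlibLoaded(conf));

        // Print the concrete decompressor class the pool hands back for GzipCodec.
        CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);
        Decompressor decompressor = CodecPool.getDecompressor(codec);
        System.out.println("decompressor class:   "
                + (decompressor == null ? "null" : decompressor.getClass().getName()));
        if (decompressor != null) {
            CodecPool.returnDecompressor(decompressor);
        }
    }
}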


Kind regards /jens
