Hi,

During the past week I decided to use native decompression for a Hadoop job (using 0.20.0). Before implementing it, I wrote a small benchmark just to understand how much faster it would actually be. Roughly, the two code paths I am comparing look like the sketch below (simplified; the full benchmark source is in the blog post linked further down).
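(This is only a minimal sketch of the two paths, not the exact benchmark code; the class name DecompressSketch, the buffer size, and the byte-counting are placeholders of mine.)

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CodecPool;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.Decompressor;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class DecompressSketch {

    // Hadoop path: GzipCodec with a pooled Decompressor
    // (backed by native zlib when the native library is loaded).
    static long readWithHadoop(String path, Configuration conf) throws IOException {
        CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);
        Decompressor decompressor = CodecPool.getDecompressor(codec);
        InputStream in = null;
        try {
            in = codec.createInputStream(new FileInputStream(path), decompressor);
            return drain(in);
        } finally {
            if (in != null) in.close();
            CodecPool.returnDecompressor(decompressor);
        }
    }

    // Vanilla path: plain java.util.zip.GZIPInputStream.
    static long readWithVanilla(String path) throws IOException {
        InputStream in = new GZIPInputStream(new FileInputStream(path));
        try {
            return drain(in);
        } finally {
            in.close();
        }
    }

    // Read the stream to the end and return the uncompressed byte count.
    private static long drain(InputStream in) throws IOException {
        byte[] buf = new byte[64 * 1024];
        long total = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            total += n;
        }
        return total;
    }
}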
The result came as a surprise:

May 6, 2009 10:56:47 PM org.apache.hadoop.util.NativeCodeLoader <clinit>
INFO: Loaded the native-hadoop library
May 6, 2009 10:56:47 PM org.apache.hadoop.io.compress.zlib.ZlibFactory <clinit>
INFO: Successfully loaded & initialized native-zlib library
May 6, 2009 10:56:47 PM org.apache.hadoop.io.compress.CodecPool getDecompressor
INFO: Got brand-new decompressor
Time of Hadoop decompressor running 'small' job = 0:00:01.684 (1.684 ms/file)
Time of Hadoop decompressor running 'large' job = 0:00:10.074 (1007.400 ms/file)
Time of Vanilla decompressor running 'small' job = 0:00:01.340 (1.340 ms/file)
Time of Vanilla decompressor running 'large' job = 0:00:10.094 (1009.400 ms/file)
Hadoop vs. Vanilla [small]: 125.67%
Hadoop vs. Vanilla [large]: 99.80%

For the small files, Hadoop's native decompression takes about 25% longer than Java's built-in GZIPInputStream, and for files a few megabytes in size the speed difference is negligible.

I wrote a blog post about it which also contains the full source code of the benchmark:
http://blog.ribomation.com/2009/05/07/comparison-of-decompress-ways-in-hadoop/

My questions are:

[1] Am I missing some key piece of how to use native GZIP decompression correctly? I'm using codec pooling, by the way.

[2] Does native decompression only pay off for files larger than, say, 100 MB or 1000 MB? In my application I'm reading many KB-sized .gz files from an external source, so I cannot change the compression method or the file size.

[3] Has anybody experienced something similar to my result?

Kind regards
/jens