> During this month I refactor the code used for the tests and kept doing
> them with the same base mentioned above (about 92 000 files with an average
> size of 2kb),
Those are _tiny_. It seems likely to me that you're spending most of your
time on I/O related to metadata (disk seeks, directory traversal, file
open/close, codec setup/teardown, buffer-cache churn) and very little on
"real" compression or even "real" file I/O. Is any of this happening on
HDFS? If so, add network I/O and namenode overhead, too. For Hadoop, your
file sizes should start at megabytes or tens of megabytes; it really hits
its stride above that. Also, are you compressing text or binaries?

> In this address http://www.linux.ime.usp.br/~jvcoletto/compression/ I share
> the table with the results obtained in the tests, the code used in the
> tests and the results obtained in JProfiler.

In my own tests with (C) command-line tools on Linux (I've now forgotten
whether the system used fast SCSI disks or regular SATA), lzop's
decompression speed averaged 18-21 compressed MB/sec for binaries and 5-8
cMB/sec for text. gzip on the same corpus averaged 9-10 cMB/sec for binaries
and 3.5-4.5 cMB/sec for text. (Text compresses better, so the same
compressed input size means more decompressed output, and thus lower
throughput due to I/O.)

For compression, gzip ranged from 2.5-10 uncompressed MB/sec, depending on
data type and compression level. lzop is essentially two compressors: at
levels 1-6 it averaged 15-16.5 ucMB/sec regardless of input or level, while
at levels 7-9 it dropped from 3 to 1 ucMB/sec. (IOW, don't use LZO levels
above 6.)

Java interfaces will add some overhead, but since all of the codecs in
question are ultimately native C code, this should give you some idea of
which numbers are most suspect. But don't bother benchmarking anything much
below a megabyte; it's a waste of time.

Greg
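P.S. A quick way to see the per-file overhead I'm describing, sketched as a
shell script (my assumptions: a POSIX shell on Linux with `gzip`, `seq`, and
GNU `date` available; the 500 x 2 KB sizes are just chosen to mimic the
corpus under discussion):

```shell
#!/bin/sh
# Compare gzipping 500 tiny (~2 KB) files, one process per file, against
# gzipping a single ~1 MB file of the same total size. The gap between the
# two timings is dominated by per-file costs: process spawn, open/close,
# and codec setup/teardown -- not by compression work itself.
set -e
dir=$(mktemp -d)

# Create 500 files of ~2 KB each (incompressible random bytes).
for i in $(seq 1 500); do
  head -c 2048 /dev/urandom > "$dir/f$i"
done
# One file with the same total payload.
cat "$dir"/f* > "$dir/big"

# Time: one gzip invocation per tiny file.
start=$(date +%s%N)
for i in $(seq 1 500); do
  gzip -c "$dir/f$i" > /dev/null
done
tiny_ms=$(( ($(date +%s%N) - start) / 1000000 ))

# Time: one gzip invocation over the single big file.
start=$(date +%s%N)
gzip -c "$dir/big" > /dev/null
big_ms=$(( ($(date +%s%N) - start) / 1000000 ))

echo "500 x 2KB files: ${tiny_ms} ms; one ~1MB file: ${big_ms} ms"
rm -rf "$dir"
```

On any machine I've tried this kind of thing on, the per-file loop is slower
by an order of magnitude or more, which is why throughput numbers measured
on 2 KB files say very little about the codecs themselves.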