> During this month I refactored the code used for the tests and kept running
> them against the same base mentioned above (about 92 000 files with an
> average size of 2 KB),

Those are _tiny_.  It seems likely to me that you're spending most of your
time on I/O related to metadata (disk seeks, directory traversal, file open/
close, codec setup/teardown, buffer-cache churn) and very little on "real"
compression or even "real" file I/O.  Is any of this happening on HDFS?  If
so, add network I/O and namenode overhead, too.

For Hadoop, your file sizes should start at megabytes or tens of megabytes,
and it will really hit its stride above that.
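
If it helps, one way to take the per-file overhead out of the picture is to
time a codec against a single multi-megabyte buffer through Hadoop's
CompressionCodec API.  A rough sketch -- GzipCodec, the 64 MB size, and the
synthetic fill are just placeholders I picked, not anything from your setup:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.OutputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.util.ReflectionUtils;

    public class CodecThroughput {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // Stand-in codec; swap in whichever codec class you're testing.
        CompressionCodec codec =
            ReflectionUtils.newInstance(GzipCodec.class, conf);

        // One 64 MB buffer, so the timing reflects compression itself
        // rather than open/close, seeks, and codec setup per tiny file.
        byte[] data = new byte[64 * 1024 * 1024];
        for (int i = 0; i < data.length; i++) {
          data[i] = (byte) ('a' + (i % 16));  // synthetic, compressible
        }

        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        long start = System.nanoTime();
        OutputStream out = codec.createOutputStream(sink);
        out.write(data);
        out.close();
        double secs = (System.nanoTime() - start) / 1e9;

        double ucMB = data.length / (1024.0 * 1024.0);
        System.out.printf("%.1f ucMB in %.2f s = %.1f ucMB/sec (%.1f MB out)%n",
            ucMB, secs, ucMB / secs, sink.size() / (1024.0 * 1024.0));
      }
    }

Feed it a few tens of MB sliced out of your real corpus rather than the
synthetic fill, of course; the point is that one run touches one stream
instead of 92,000 tiny files.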

Also, are you compressing text or binaries?

> At http://www.linux.ime.usp.br/~jvcoletto/compression/ I share the table
> with the results obtained in the tests, the code used in the tests, and
> the results obtained in JProfiler.

In my own tests with (C) command-line tools on Linux (and I've now forgotten
whether the system used fast SCSI disks or regular SATA), lzop's decompression
speed averaged 18-21 compressed MB/sec for binaries and 5-8 cMB/sec for text.
gzip on the same corpus averaged 9-10 cMB/sec for binaries and 3.5-4.5 cMB/sec
for text.  (Text compresses better, so the same amount of compressed input
expands to more output, which means more output I/O and hence a lower
cMB/sec figure.)

For compression, gzip ranged from 2.5-10 uncompressed MB/sec, depending on
data type and compression level.  lzop is basically two compressors; for
levels 1-6, it averaged 15-16.5 ucMB/sec regardless of input or level, while
levels 7-9 dropped from 3 to 1 ucMB/sec.  (IOW, don't use LZO levels above 6.)
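
If you want to see the level dependence on your own data from the Java side,
a quick-and-dirty sweep over java.util.zip.Deflater levels covers the
zlib/gzip half of the story (it doesn't cover LZO, and the synthetic buffer
below is only a placeholder for real input):

    import java.util.zip.Deflater;

    public class DeflateLevelSweep {
      public static void main(String[] args) {
        // Placeholder input; substitute a few tens of MB of real data.
        byte[] input = new byte[32 * 1024 * 1024];
        for (int i = 0; i < input.length; i++) {
          input[i] = (byte) ('a' + (i % 26));
        }
        byte[] buf = new byte[64 * 1024];

        for (int level = 1; level <= 9; level++) {
          Deflater deflater = new Deflater(level);
          deflater.setInput(input);
          deflater.finish();

          long start = System.nanoTime();
          long produced = 0;
          while (!deflater.finished()) {
            produced += deflater.deflate(buf);
          }
          double secs = (System.nanoTime() - start) / 1e9;
          deflater.end();

          double ucMB = input.length / (1024.0 * 1024.0);
          System.out.printf("level %d: %.1f ucMB/sec, %.2f MB out%n",
              level, ucMB / secs, produced / (1024.0 * 1024.0));
        }
      }
    }

Deflater is backed by native zlib, so it should land at least in the same
ballpark as the command-line gzip figures above.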

Java interfaces will add some overhead, but since all of the codecs in question
are ultimately native C code, this should give you some idea of which numbers
are most suspect. But don't bother benchmarking anything much below a megabyte;
it's a waste of time.
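
One quick sanity check before trusting Java-side numbers: ask Hadoop whether
it actually loaded its native library, since without it GzipCodec quietly
falls back to the built-in java.util.zip implementation.  (NativeCodeLoader
and ZlibFactory are both in Hadoop core; this is just a sketch.)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.compress.zlib.ZlibFactory;
    import org.apache.hadoop.util.NativeCodeLoader;

    public class CheckNative {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // True only if libhadoop was found on java.library.path.
        System.out.println("native hadoop loaded: "
            + NativeCodeLoader.isNativeCodeLoaded());
        // True only if the native zlib bindings are usable.
        System.out.println("native zlib usable: "
            + ZlibFactory.isNativeZlibLoaded(conf));
      }
    }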

Greg
