[ 
https://issues.apache.org/jira/browse/AVRO-135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798941#action_12798941
 ] 

Scott Carey commented on AVRO-135:
----------------------------------

Let's just put deflate or gzip in here for this release.  This is the least 
amount of work, and so long as we make it 'gzip 1' equivalent it isn't that 
slow.  'gzip 1' is about 4x faster at compressing than the normal default of 
'gzip 6'; on today's CPUs 'gzip 1' typically compresses at between 35MB/sec 
and 70MB/sec.

I have some code I can contribute for Java that replaces GZIPOutputStream to 
allow control over the compression level and exposes the number of bytes 
written (compressed and uncompressed).  This can use the Hadoop optimized 
PureJavaCrc32 for best performance as well.
At this point, I am eager to use 1.3 and will write such a codec anyway if it 
is not supported (not sure if gzip or deflate, but that is a minor issue on 
~64k blocks).
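
For illustration only, here is a minimal sketch of what a gzip stream with a 
selectable level and byte counters could look like.  It is not the code 
offered above: the class name LevelGzipOutputStream and its two accessor 
methods are hypothetical, it subclasses the JDK's GZIPOutputStream rather than 
replacing it, and it keeps the JDK's built-in CRC32 instead of Hadoop's.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.util.zip.Deflater;
import java.util.zip.GZIPOutputStream;

/** Hypothetical helper (not part of Avro): gzip output with a chosen level. */
class LevelGzipOutputStream extends GZIPOutputStream {

    /** @param level a Deflater level, e.g. Deflater.BEST_SPEED for 'gzip 1' */
    LevelGzipOutputStream(OutputStream out, int level) throws IOException {
        super(out);
        // 'def' is the protected Deflater inherited from DeflaterOutputStream.
        // Setting the level before any data is written applies it to the
        // whole stream.
        def.setLevel(level);
    }

    /** Uncompressed bytes consumed by the deflater so far. */
    long uncompressedBytes() {
        return def.getBytesRead();
    }

    /** Deflate bytes produced so far; excludes the gzip header and trailer. */
    long compressedBytes() {
        return def.getBytesWritten();
    }
}
```

One caveat with this approach: close() releases the underlying Deflater, so 
the counters should be read after finish() and before close().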

LZF/LZO/FastLZ would be nice, but that is more involved.
I've been working on some pure Java LZF implementations as an experiment.  I 
chose this over FastLZ because the code was a lot easier to understand and much 
better documented (though both are lacking).  Additionally, FastLZ's site warns 
that the format may change at any time, which also kept me away.  
Short story: the JIT isn't good enough in Java 6 or Java 7 to do the right 
low level optimizations to catch up to native code yet, but I can get 
compression speeds of about 80 to 120MB/sec and decompression speeds between 
100 and 160MB/sec with it, with compression ratios just slightly worse than LZO 
but better than the C LZF code.

If FastLZ's Java library is used, it could use some performance improvements.



> add compression to data files
> -----------------------------
>
>                 Key: AVRO-135
>                 URL: https://issues.apache.org/jira/browse/AVRO-135
>             Project: Avro
>          Issue Type: New Feature
>          Components: java, spec
>            Reporter: Doug Cutting
>            Priority: Blocker
>             Fix For: 1.3.0
>
>
> We should add support for at least one compression codec to data files.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
