[ 
https://issues.apache.org/jira/browse/AVRO-135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802009#action_12802009
 ] 

Scott Carey commented on AVRO-135:
----------------------------------

bq. in the spec, we should be very clear on whether we're using gzip or 
deflate, as these are often confused. my slight preference would be for 
deflate, since it's more minimal, but i also realize that adding another level 
of CRC will give lots of folks warm fuzzy feelings.

I lean towards deflate.  We have to be very clear about what we mean by that, 
though, because there are two common interpretations of 'deflate'.  
The first is the raw compressed stream, which has no checksum or header 
(RFC 1951).  In Java, this is the 'unwrapped' (nowrap) deflate variant, and it 
is the format in which a *.zip file stores its individual entries.  Some web 
browsers call this "deflate".
The second is a deflate stream wrapped with a 2-byte header and an adler32 
checksum (RFC 1950).  This is also known as the "ZLIB" format, but some web 
browsers simply call it 'deflate'.  It has a 6-byte overhead (2-byte header, 
4-byte adler32 checksum).

Lastly, there is gzip (RFC 1952), which wraps raw deflate with a header and 
footer that typically add about 20 bytes of overhead.
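
To make the difference between the three formats concrete, here is a rough 
sketch using only java.util.zip.  The class name DeflateFormats and its helper 
methods are made up for illustration; the only thing that selects raw versus 
ZLIB output is the 'nowrap' flag on the Deflater constructor.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper class, for illustration only.
public class DeflateFormats {

    // nowrap = true  -> raw deflate stream (RFC 1951)
    // nowrap = false -> ZLIB wrapper: 2-byte header + adler32 (RFC 1950)
    public static byte[] deflate(byte[] input, boolean nowrap) throws IOException {
        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION, nowrap);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DeflaterOutputStream dos = new DeflaterOutputStream(out, deflater)) {
            dos.write(input);
        } finally {
            // A caller-supplied Deflater is not released by the stream.
            deflater.end();
        }
        return out.toByteArray();
    }

    // gzip (RFC 1952): raw deflate plus a header and a CRC32/length trailer.
    public static byte[] gzip(byte[] input) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gos = new GZIPOutputStream(out)) {
            gos.write(input);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "avro avro avro avro avro avro".getBytes(StandardCharsets.UTF_8);
        int raw = deflate(data, true).length;
        int zlib = deflate(data, false).length;
        int gz = gzip(data).length;
        System.out.println("raw deflate: " + raw + " bytes");
        System.out.println("zlib:        " + zlib + " bytes (+" + (zlib - raw) + ")");
        System.out.println("gzip:        " + gz + " bytes (+" + (gz - raw) + ")");
    }
}
```

Since all three use the same compressor at the same level here, the size 
differences printed are purely the wrapper overhead.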

In the past, I've leaned towards gzip because if a file is written in this 
format, all sorts of utilities can read it.  But we are storing compressed 
blocks within our own file format, so there is no advantage to using gzip.  
Furthermore, the Java API for gzip annoyingly removes the ability to set the 
compression level and to find out the number of bytes output.   I think that 
control over the compression level is highly important for users.
The Deflater API in Java does allow control over the compression level.
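
As a small sketch of what that control looks like (the class and method names 
are hypothetical, not anything in the patch), the level is a constructor 
argument and the Deflater reports how many bytes it produced:

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

// Hypothetical demo class, for illustration only.
public class DeflateLevels {

    // Compresses input as raw deflate at the given level and returns the
    // number of compressed bytes the Deflater reports having written.
    public static long compressedSize(byte[] input, int level) {
        Deflater deflater = new Deflater(level, true); // true = raw deflate
        try {
            deflater.setInput(input);
            deflater.finish();
            byte[] buf = new byte[4096];
            while (!deflater.finished()) {
                deflater.deflate(buf); // output discarded; only the size matters here
            }
            return deflater.getBytesWritten();
        } finally {
            deflater.end();
        }
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            sb.append("record ").append(i % 17).append(" some repetitive payload\n");
        }
        byte[] data = sb.toString().getBytes(StandardCharsets.UTF_8);

        int[] levels = {
            Deflater.NO_COMPRESSION, Deflater.BEST_SPEED,
            Deflater.DEFAULT_COMPRESSION, Deflater.BEST_COMPRESSION
        };
        System.out.println("input: " + data.length + " bytes");
        for (int level : levels) {
            System.out.println("level " + level + ": " + compressedSize(data, level) + " bytes");
        }
    }
}
```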

The 'ZLIB' deflate format has an adler32 checksum and 2 byte header (and is 
standardized), so if we want a checksum we can choose that instead of gzip.
Otherwise, the raw deflate stream, perhaps with the uncompressed size 
prepended, would be great.
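
To illustrate the last option, here is a minimal sketch of a raw deflate block 
with the uncompressed size prepended.  The framing is an assumption made up 
for this example (a 4-byte big-endian length followed by the raw stream); the 
spec could just as well pick a different size encoding.  SizedRawDeflate, 
encode and decode are hypothetical names.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical block codec, for illustration only.
public class SizedRawDeflate {

    // Assumed layout: [4-byte big-endian uncompressed length][raw deflate bytes]
    public static byte[] encode(byte[] input, int level) {
        Deflater deflater = new Deflater(level, true); // true = no header, no checksum
        try {
            deflater.setInput(input);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            out.write(ByteBuffer.allocate(4).putInt(input.length).array(), 0, 4);
            byte[] buf = new byte[4096];
            while (!deflater.finished()) {
                int n = deflater.deflate(buf);
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } finally {
            deflater.end();
        }
    }

    public static byte[] decode(byte[] block) throws DataFormatException {
        int size = ByteBuffer.wrap(block).getInt();
        byte[] result = new byte[size]; // exact allocation, thanks to the prefix
        // The Inflater javadoc asks for one extra "dummy" input byte in nowrap
        // mode; copyOfRange pads the payload with a single trailing zero.
        byte[] payload = Arrays.copyOfRange(block, 4, block.length + 1);
        Inflater inflater = new Inflater(true);
        try {
            inflater.setInput(payload);
            int off = 0;
            while (off < size) {
                int n = inflater.inflate(result, off, size - off);
                if (n == 0 && (inflater.needsInput() || inflater.finished())) {
                    throw new DataFormatException("block shorter than declared size");
                }
                off += n;
            }
            return result;
        } finally {
            inflater.end();
        }
    }
}
```

Knowing the size up front lets the reader allocate the output buffer once, but 
note that nothing in this layout detects corruption; that is the trade-off 
against the ZLIB variant.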

> add compression to data files
> -----------------------------
>
>                 Key: AVRO-135
>                 URL: https://issues.apache.org/jira/browse/AVRO-135
>             Project: Avro
>          Issue Type: New Feature
>          Components: java, spec
>            Reporter: Doug Cutting
>            Assignee: Philip Zeyliger
>            Priority: Blocker
>             Fix For: 1.3.0
>
>         Attachments: AVRO-135.patch.txt
>
>
> We should add support for at least one compression codec to data files.

