[ 
https://issues.apache.org/jira/browse/AVRO-135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Philip Zeyliger updated AVRO-135:
---------------------------------

    Attachment: AVRO-135.patch.txt

Attaching a patch that attempts gzip compression.

The patch itself introduces an enum (CompressionCodec), and introduces switch 
statements in DataFileWriter and DataFileStream.  Compression involves writing 
the original data into a buffer, then compressing that in one go into another 
buffer, and finally writing the length of that second buffer followed by its 
contents into the output stream.  For decompression, there's a filter that 
restricts the InputStream to a certain length.  I've modified TestDataFile to 
be parameterized across all possible values of codec, and added a simpler 
test that I used to hammer out the early bugs.  I've also added a 
setCompressionCodec() API to DataFileWriter.
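
A minimal sketch of the length-prefixed framing described above, not the code 
in the patch.  It assumes a fixed 4-byte length prefix, zlib-wrapped deflate 
via java.util.zip instead of gzip, and byte arrays in place of streams; the 
class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical illustration only; not part of the attached patch.
final class LengthPrefixedBlockSketch {

    /** Compresses raw in one go and returns [4-byte length][compressed bytes]. */
    static byte[] writeBlock(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(chunk);
            compressed.write(chunk, 0, n);
        }
        deflater.end();
        byte[] body = compressed.toByteArray();
        // The length is only known once the whole block has been compressed.
        return ByteBuffer.allocate(4 + body.length).putInt(body.length).put(body).array();
    }

    /** Reads one block; the decompressor sees exactly the declared length. */
    static byte[] readBlock(ByteBuffer stream) {
        int length = stream.getInt();
        byte[] body = new byte[length];
        stream.get(body);
        Inflater inflater = new Inflater();
        inflater.setInput(body);
        ByteArrayOutputStream raw = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(chunk);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalStateException("truncated block");
                }
                raw.write(chunk, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt block", e);
        } finally {
            inflater.end();
        }
        return raw.toByteArray();
    }
}
```

Because each block carries its length, a reader can consume consecutive blocks 
from the same buffer without the decompressor ever reading past a block 
boundary.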

In theory, for inflate/deflate and for gzip, we don't actually need to store 
the length, since both formats mark the end of their own stream.  (See 
http://www.gzip.org/format.txt .)  Doing without the length is irritating 
with the GZIPInputStream API, but we could fix it so that the bytes that 
weren't used in decompression are returned to the stream.  (The Inflater 
class has a getRemaining() method that helps do just that, and, in fact, it's 
used by GZIPInputStream in readTrailer().)
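
To illustrate the getRemaining() idea, here is a rough sketch that inflates 
one zlib-wrapped deflate stream from a buffer holding extra bytes after it, 
then asks the Inflater how many input bytes it did not consume.  The class 
and helper names are made up for the example, and it works on byte arrays 
rather than on a real InputStream.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical illustration only; not part of the attached patch.
final class UnusedBytesSketch {

    /** Helper: compresses raw into a single zlib-wrapped deflate stream. */
    static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        while (!deflater.finished()) {
            out.write(chunk, 0, deflater.deflate(chunk));
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Returns how many input bytes are left over once the first stream ends. */
    static int trailingBytesAfterFirstStream(byte[] input) {
        Inflater inflater = new Inflater();
        inflater.setInput(input);
        byte[] sink = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(sink);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalStateException("truncated stream");
                }
            }
            // Bytes handed to the Inflater but not needed for this stream;
            // a reader would have to push these back before the next block.
            return inflater.getRemaining();
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt stream", e);
        } finally {
            inflater.end();
        }
    }
}
```

A reader built this way would over-read into the next block and then have to 
return the leftover bytes, which is the extra read-path complexity at issue.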

Do folks have an opinion on whether we should store the length or not?  If we 
don't store it, it's possible we could avoid a memory copy or two by writing 
straight into the output buffer, but it'll complicate the read path a little 
bit.

Do folks have an opinion on gzip vs deflate?  On top of the deflate data, 
gzip costs a 10-byte header, a 4-byte CRC-32, and a 4-byte uncompressed size.
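
One way to check that overhead is to compress the same bytes with 
GZIPOutputStream and with a raw ("nowrap") Deflater and compare sizes.  This 
is a sketch under the assumption that both use the default compression 
level; the class name is hypothetical.  Note that a Deflater created without 
nowrap emits the zlib wrapper, which has its own smaller header and checksum.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration only; not part of the attached patch.
final class GzipOverheadSketch {

    /** Size of raw after gzip compression (header + deflate data + trailer). */
    static int gzipSize(byte[] raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.size();
    }

    /** Size of raw as a bare deflate stream, with no header or checksum. */
    static int rawDeflateSize(byte[] raw) {
        // nowrap = true suppresses the zlib header and Adler-32 trailer.
        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION, true);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DeflaterOutputStream df = new DeflaterOutputStream(out, deflater)) {
            df.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            deflater.end();
        }
        return out.size();
    }
}
```

The difference gzipSize(data) - rawDeflateSize(data) should come out to the 
18 bytes listed above, and it is paid once per compressed block.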

> add compression to data files
> -----------------------------
>
>                 Key: AVRO-135
>                 URL: https://issues.apache.org/jira/browse/AVRO-135
>             Project: Avro
>          Issue Type: New Feature
>          Components: java, spec
>            Reporter: Doug Cutting
>            Assignee: Philip Zeyliger
>            Priority: Blocker
>             Fix For: 1.3.0
>
>         Attachments: AVRO-135.patch.txt
>
>
> We should add support for at least one compression codec to data files.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
