[ https://issues.apache.org/jira/browse/AVRO-135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801995#action_12801995 ]

Philip Zeyliger commented on AVRO-135:
--------------------------------------

I spent some time thinking about an interface for Codec and want to digest a bit longer.

DataFileWriter gets a DatumWriter and a datum, and then uses DatumWriter.write(D datum, Encoder) to write the value. In turn, the Encoder (a BinaryEncoder in this case) writes to an OutputStream. The current approach is to encode into a ByteBufferOutputStream and, when that reaches a certain size, copy it into the final output.

I'm trying to figure out where a Codec interface fits in here. It could:

* Pretend to be an OutputStream, i.e., be used when constructing BinaryEncoder(), and offer an uncompressedSize() method as well as a writeTo() method. In other words, a replacement for ByteBufferOutputStream.
* Pretend to be an Encoder. The advantage here is that you could build a compression scheme that is schema-aware (e.g., semi-columnar or PAX-like) without re-parsing the data.

I'm leaning towards the former right now (a rough sketch of what that might look like is appended below the issue summary).

What do you mean by compress(byte[], int, int, Encoder) above?

> add compression to data files
> -----------------------------
>
>                 Key: AVRO-135
>                 URL: https://issues.apache.org/jira/browse/AVRO-135
>             Project: Avro
>          Issue Type: New Feature
>          Components: java, spec
>            Reporter: Doug Cutting
>            Assignee: Philip Zeyliger
>            Priority: Blocker
>             Fix For: 1.3.0
>
>         Attachments: AVRO-135.patch.txt
>
>
> We should add support for at least one compression codec to data files.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
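
A minimal sketch of the OutputStream-style option, just to make the shape concrete. The uncompressedSize() and writeTo() methods are the ones mentioned in the comment above; the class names, the deflate-based subclass, and the internal buffering are illustrative assumptions, not part of the attached patch.

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.zip.DeflaterOutputStream;

// Hypothetical stand-in for ByteBufferOutputStream: the Encoder writes
// uncompressed bytes here, DataFileWriter polls uncompressedSize() to
// decide when a block is full, then calls writeTo() to emit the block
// (compressed or not) to the real file stream.
abstract class Codec extends OutputStream {
  protected final ByteArrayOutputStream buffer = new ByteArrayOutputStream();

  @Override
  public void write(int b) {
    buffer.write(b);
  }

  @Override
  public void write(byte[] b, int off, int len) {
    buffer.write(b, off, len);
  }

  // Number of uncompressed bytes buffered so far.
  public int uncompressedSize() {
    return buffer.size();
  }

  // Compress the buffered block, write it to out, and reset the buffer.
  public abstract void writeTo(OutputStream out) throws IOException;
}

// Illustrative implementation on top of java.util.zip's deflate.
class DeflateCodec extends Codec {
  @Override
  public void writeTo(OutputStream out) throws IOException {
    DeflaterOutputStream deflater = new DeflaterOutputStream(out);
    buffer.writeTo(deflater);
    deflater.finish();
    buffer.reset();
  }
}

Under this shape, DataFileWriter would construct its BinaryEncoder over the Codec instead of over a ByteBufferOutputStream, and the only block-level knowledge a codec needs is "here is a run of bytes, compress it however you like."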