[ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15381561#comment-15381561 ]

Ryan Blue commented on AVRO-1704:
---------------------------------

I think this should be abstract. The format we're adding solves one set of 
use cases, but the utility methods have value beyond that. Encoding a single 
Avro record is fairly common, but implementations vary widely in quality 
because it is difficult to find the right setup of DatumWriter, 
BinaryEncoder, and ByteArrayOutputStream. Simplifying and improving 
applications that already do this is a good thing. And some of those uses, 
like the case I mentioned where we're embedding Avro in Parquet records, 
don't need the header or schema at all because that's defined in the file 
metadata.
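
For illustration, the setup those utilities would replace looks roughly like 
this (a minimal sketch using the existing public API; the record and schema 
are assumed to come from the application):

{code:java}
import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.io.EncoderFactory;

public class SingleRecordEncoding {
  // Encode one record to a byte array with the existing API. This is the
  // setup that applications tend to get wrong, e.g. by not reusing
  // encoders and streams, or by forgetting to flush.
  static byte[] encode(Schema schema, GenericRecord record) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    DatumWriter<GenericRecord> writer = new GenericDatumWriter<>(schema);
    writer.write(record, encoder);
    encoder.flush();  // easy to forget; without it the payload is truncated
    return out.toByteArray();
  }
}
{code}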

The abstraction is also useful for transitioning to the format we're defining 
here. The normal way to encode messages in Kafka is the 8-byte fingerprint 
followed by the encoded message payload. With the abstraction, you can write 
a decoder that checks for the header and deserializes the new format when it 
is present, or falls back to the old format when the header is missing. That 
would enable rolling upgrades using the same Kafka topics, rather than 
needing a hard transition.
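
A sketch of that compatibility decoder, assuming the new format starts with a 
recognizable two-byte marker (the marker value and the MessageDecoder 
signature here are assumptions for illustration, not the final API):

{code:java}
import java.io.IOException;
import java.nio.ByteBuffer;

import org.apache.avro.message.MessageDecoder;

public class CompatDecoder<D> {
  private final MessageDecoder<D> headerDecoder;  // new header-prefixed format
  private final MessageDecoder<D> legacyDecoder;  // 8-byte fingerprint + payload

  public CompatDecoder(MessageDecoder<D> headerDecoder,
                       MessageDecoder<D> legacyDecoder) {
    this.headerDecoder = headerDecoder;
    this.legacyDecoder = legacyDecoder;
  }

  public D decode(ByteBuffer buf) throws IOException {
    // Peek at the first two bytes without consuming them; 0xC3 0x01 is an
    // illustrative marker value, not necessarily the final one.
    if (buf.remaining() >= 2
        && buf.get(buf.position()) == (byte) 0xC3
        && buf.get(buf.position() + 1) == (byte) 0x01) {
      return headerDecoder.decode(buf);
    }
    return legacyDecoder.decode(buf);
  }
}
{code}

A consumer built on this can read both framings during a rolling upgrade and 
can be swapped for the plain decoder once all producers have moved over.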

I would also include the abstraction in case we want to change the format or 
introduce a new one later.

bq. I also worry that names like BinaryDatumDecoder

I've pushed a new commit that moves the classes to org.apache.avro.message 
and renames them to MessageEncoder and MessageDecoder. I used "encoder" 
instead of "reader" to contrast with DatumReader and DatumWriter, since 
there is little difference between a datum and a message (a message is just 
a datum encoded by itself).

bq. Perhaps [the reusable I/O streams] should go in the util package so they 
can be used more widely?

I've moved them there. I avoided it before so that they weren't added to the 
public API, but I think it's fine to make them available.

bq. We might also add utilities for generic & reflect, like, 
model#getMessageWriter(Schema)?

I looked at this, but then the GenericData classes would have both 
createDatumWriter and getMessageWriter, which looks confusing to me. Keeping 
the MessageEncoder above the level of the data models helps separate the 
DatumWriter from the MessageEncoder.

If we want to make instantiating these easier, then maybe a builder would be 
more appropriate. That would allow us to pass multiple writer schemas to the 
MessageDecoder.
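
Hypothetical usage, with names made up for illustration (none of this is in 
the current patch):

{code:java}
// Hypothetical builder API; the builder and its method names are
// illustrative only.
MessageDecoder<GenericRecord> decoder =
    MessageDecoder.builder(readSchema)
        .addWriterSchema(schemaV1)  // each writer schema the decoder may see
        .addWriterSchema(schemaV2)
        .build();
{code}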

> Standardized format for encoding messages with Avro
> ---------------------------------------------------
>
>                 Key: AVRO-1704
>                 URL: https://issues.apache.org/jira/browse/AVRO-1704
>             Project: Avro
>          Issue Type: Improvement
>            Reporter: Daniel Schierbeck
>            Assignee: Niels Basjes
>             Fix For: 1.9.0, 1.8.3
>
>         Attachments: AVRO-1704-2016-05-03-Unfinished.patch, 
> AVRO-1704-20160410.patch
>
>
> I'm currently using the Datafile format for encoding messages that are 
> written to Kafka and Cassandra. This seems rather wasteful:
> 1. I only encode a single record at a time, so there's no need for sync 
> markers and other metadata related to multi-record files.
> 2. The entire schema is inlined every time.
> However, the Datafile format is the only one that has been standardized, 
> meaning that I can read and write data with minimal effort across the various 
> languages in use in my organization. If there was a standardized format for 
> encoding single values that was optimized for out-of-band schema transfer, I 
> would much rather use that.
> I think the necessary pieces of the format would be:
> 1. A format version number.
> 2. A schema fingerprint type identifier, e.g. Rabin, MD5, SHA256, etc.
> 3. The actual schema fingerprint (according to the type).
> 4. Optional metadata map.
> 5. The encoded datum.
> The language libraries would implement a MessageWriter that would encode 
> datums in this format, as well as a MessageReader that, given a SchemaStore, 
> would be able to decode datums. The reader would decode the fingerprint and 
> ask its SchemaStore to return the corresponding writer's schema.
> The idea is that SchemaStore would be an abstract interface that allowed 
> library users to inject custom backends. A simple, file system based one 
> could be provided out of the box.
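
For illustration, the SchemaStore lookup described above could be as small as 
this sketch (the method name is an assumption based on the description, not a 
committed API):

{code:java}
import org.apache.avro.Schema;

// Sketch of the SchemaStore idea from the description; the method name
// is an assumption, not a committed API.
public interface SchemaStore {
  // Return the writer's schema registered for this fingerprint,
  // or null if it is unknown.
  Schema findByFingerprint(long fingerprint);
}
{code}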



