[
https://issues.apache.org/jira/browse/SAMZA-317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14118509#comment-14118509
]
Martin Kleppmann commented on SAMZA-317:
----------------------------------------
[~theduderog]: That's certainly a possibility. A binary encoding for Avro
schemas has been proposed in AVRO-251, but it has not yet been committed. Do you
think you could test that Avro patch and see how large the binary-encoded schema
turns out to be?
> Serde for Avro-encoded messages
> -------------------------------
>
> Key: SAMZA-317
> URL: https://issues.apache.org/jira/browse/SAMZA-317
> Project: Samza
> Issue Type: New Feature
> Reporter: Martin Kleppmann
>
> [Avro|http://avro.apache.org/] is a popular serialization format with several
> nice characteristics:
> - Can be read and written by many programming languages
> - Compact and fast to serialize/deserialize
> - Supports Java code generation (in a statically typed language, JSON is
> annoying to work with, because it parses into dynamically typed HashMaps;
> it's much nicer to work with objects that have real getters and setters)
> - Deep support for [schema
> evolution|http://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html],
> so you can make backwards-compatible changes to your data as application
> requirements change
> It would be nice if Samza came with an Avro serde out of the box, making it
> easy for Samza jobs to consume and produce Avro-encoded messages. If this
> serde is built on best practices, we can recommend it as a good default
> choice for many applications.
> This serde is not entirely straightforward to implement because of Avro's
> schema handling. In order to accurately deserialize an Avro-encoded message,
> you need to know the exact version of the schema with which it was
> serialized. Thus, every message needs to be tagged with a schema version
> number, and we require a schema registry which translates those version
> numbers into the schema definition (a JSON string).
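> As an illustration, here is a minimal sketch of how the serializing half of
> such a serde might tag each message, assuming the version id is simply
> prepended to the Avro binary payload (this layout and the class/method names
> are assumptions for the sketch, not a settled wire format):
> {code:java}
> import java.io.ByteArrayOutputStream;
> import java.io.IOException;
> import org.apache.avro.generic.GenericDatumWriter;
> import org.apache.avro.generic.GenericRecord;
> import org.apache.avro.io.BinaryEncoder;
> import org.apache.avro.io.EncoderFactory;
>
> // Hypothetical framing: [schema version id][Avro binary body].
> public class TaggedAvroSerializer {
>   public byte[] toBytes(GenericRecord record, byte[] schemaVersionId) throws IOException {
>     ByteArrayOutputStream out = new ByteArrayOutputStream();
>     out.write(schemaVersionId, 0, schemaVersionId.length);  // schema version tag
>     GenericDatumWriter<GenericRecord> writer =
>         new GenericDatumWriter<GenericRecord>(record.getSchema());
>     BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
>     writer.write(record, encoder);                          // Avro-encoded payload
>     encoder.flush();
>     return out.toByteArray();
>   }
> }
> {code}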
> At LinkedIn, the MD5 hash of the schema is used as the version number, and the
> mapping from version number to schema is stored in a separate schema registry
> service (which provides an HTTP API). In Samza, we could avoid the operational
> complexity of a separate schema registry service and instead use a stream
> (e.g. a Kafka topic) to store the schemas for all Samza jobs. This stream
> would similarly use the hash of the schema as the key and the schema JSON as
> the value.
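> For example, the version id could be computed as sketched below, assuming it
> is the MD5 of the schema's JSON text (Avro's SchemaNormalization parsing
> fingerprint would be an alternative that normalizes the schema first):
> {code:java}
> import java.nio.charset.StandardCharsets;
> import java.security.MessageDigest;
> import java.security.NoSuchAlgorithmException;
> import org.apache.avro.Schema;
>
> public class SchemaVersionId {
>   /** 16-byte version id: the MD5 of the schema's JSON representation (an assumption). */
>   public static byte[] md5Of(Schema schema) throws NoSuchAlgorithmException {
>     MessageDigest md5 = MessageDigest.getInstance("MD5");
>     return md5.digest(schema.toString().getBytes(StandardCharsets.UTF_8));
>   }
> }
> {code}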
> Any job that wants to consume an Avro-encoded topic would then need to first
> fully consume the schemas stream (as a [bootstrap
> stream|http://samza.incubator.apache.org/learn/documentation/0.7.0/container/streams.html#bootstrapping])
> in order to learn the complete mapping of version numbers to schemas. It
> then has all the information it needs to deserialize any messages.
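> A sketch of the deserializing half, assuming the registry built from the
> bootstrap stream is just an in-memory map from version id to parsed schema
> (the class and its method names are illustrative; a Samza Serde implementation
> could delegate to something like this):
> {code:java}
> import java.io.IOException;
> import java.util.Arrays;
> import java.util.HashMap;
> import java.util.Map;
> import org.apache.avro.Schema;
> import org.apache.avro.generic.GenericDatumReader;
> import org.apache.avro.generic.GenericRecord;
> import org.apache.avro.io.DecoderFactory;
>
> public class AvroRegistryDeserializer {
>   private final Map<String, Schema> schemasById = new HashMap<String, Schema>();
>
>   // Called for every message of the schemas stream while bootstrapping.
>   public void registerSchema(byte[] versionId, String schemaJson) {
>     schemasById.put(hex(versionId), new Schema.Parser().parse(schemaJson));
>   }
>
>   // Assumes the framing sketched above: 16-byte MD5 id followed by the Avro binary body.
>   public GenericRecord fromBytes(byte[] bytes) throws IOException {
>     Schema writerSchema = schemasById.get(hex(Arrays.copyOfRange(bytes, 0, 16)));
>     if (writerSchema == null) {
>       throw new IOException("Unknown schema version; schemas stream not caught up yet");
>     }
>     GenericDatumReader<GenericRecord> reader =
>         new GenericDatumReader<GenericRecord>(writerSchema);
>     return reader.read(null, DecoderFactory.get()
>         .binaryDecoder(bytes, 16, bytes.length - 16, null));
>   }
>
>   private static String hex(byte[] bytes) {
>     StringBuilder sb = new StringBuilder();
>     for (byte b : bytes) sb.append(String.format("%02x", b));
>     return sb.toString();
>   }
> }
> {code}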
> Any job that wants to produce Avro-encoded messages first needs to publish
> the schema version it is using to the schemas stream. Kafka log compaction
> can take care of the fact that the same schema will be submitted repeatedly.
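> Inside a Samza task, publishing the schema could be as simple as sending it to
> that stream via the MessageCollector; the stream name here is made up, and the
> "publish before first use" bookkeeping is left out of this sketch:
> {code:java}
> import org.apache.samza.system.OutgoingMessageEnvelope;
> import org.apache.samza.system.SystemStream;
> import org.apache.samza.task.MessageCollector;
>
> public class SchemaPublisher {
>   // Hypothetical compacted Kafka topic holding (schema hash -> schema JSON).
>   private static final SystemStream SCHEMAS_STREAM =
>       new SystemStream("kafka", "samza-schemas");
>
>   // Duplicate publishes are harmless: log compaction keeps one value per key.
>   public void publishSchema(MessageCollector collector, byte[] schemaHash, String schemaJson) {
>     collector.send(new OutgoingMessageEnvelope(SCHEMAS_STREAM, schemaHash, schemaJson));
>   }
> }
> {code}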
> There is a race condition: when a new schema version is introduced, any
> consumers need to first receive the new schema version from the schemas
> stream before they can decode any messages encoded with this new schema. This
> could be implemented with a custom MessageChooser, which blocks consumption
> of any stream that contains a message with an unknown schema version number,
> until the corresponding schema is received. The blocking could have a
> timeout, to limit the disruption in case a badly-behaved producer sends
> messages without publishing their schema. (Such messages cannot be
> deserialized and would have to be dropped.)
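> A rough sketch of that blocking chooser, assuming Samza's BaseMessageChooser
> helper class and a hypothetical SchemaRegistryView for looking up whether a
> message's schema version is already known (timeout handling and the
> interaction with the schemas stream itself are omitted):
> {code:java}
> import java.util.ArrayDeque;
> import java.util.Queue;
> import org.apache.samza.system.IncomingMessageEnvelope;
> import org.apache.samza.system.chooser.BaseMessageChooser;
>
> public class SchemaAwareChooser extends BaseMessageChooser {
>   private final Queue<IncomingMessageEnvelope> ready = new ArrayDeque<IncomingMessageEnvelope>();
>   private final Queue<IncomingMessageEnvelope> blocked = new ArrayDeque<IncomingMessageEnvelope>();
>   private final SchemaRegistryView registry;  // hypothetical registry lookup
>
>   public SchemaAwareChooser(SchemaRegistryView registry) {
>     this.registry = registry;
>   }
>
>   @Override
>   public void update(IncomingMessageEnvelope envelope) {
>     // Park messages whose writer schema has not been seen yet; the registry
>     // should answer true for the schemas stream and for non-Avro streams.
>     if (registry.isSchemaKnown(envelope)) {
>       ready.add(envelope);
>     } else {
>       blocked.add(envelope);
>     }
>   }
>
>   @Override
>   public IncomingMessageEnvelope choose() {
>     // Re-check parked messages: their schema may have arrived in the meantime.
>     // A real implementation would also enforce the timeout discussed above.
>     for (int i = blocked.size(); i > 0; i--) {
>       IncomingMessageEnvelope e = blocked.poll();
>       if (registry.isSchemaKnown(e)) { ready.add(e); } else { blocked.add(e); }
>     }
>     return ready.poll();  // null means "nothing to process right now"
>   }
>
>   /** Hypothetical view onto the registry built from the schemas stream. */
>   public interface SchemaRegistryView {
>     boolean isSchemaKnown(IncomingMessageEnvelope envelope);
>   }
> }
> {code}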
> Related stuff:
> - Some discussion appears in [this mailing list
> thread|http://mail-archives.apache.org/mod_mbox/incubator-samza-dev/201402.mbox/%3C1BBF24B26F58BA4A91B566B7FBEEFF27551F6C7F7D%40EXMBXC01.ms-hosting.nl%3E].
> - We should consider compatibility with
> [Camus|https://github.com/linkedin/camus], which can take Avro-encoded
> messages from Kafka topics and load them into HDFS for offline processing.
> - SAMZA-198 discusses an issue that arose in the context of dealing with
> Avro-serialized messages.
> - AVRO-1124 discusses implementing a standard schema registry service for
> Avro.