Martin Kleppmann created SAMZA-317:
--------------------------------------

             Summary: Serde for Avro-encoded messages
                 Key: SAMZA-317
                 URL: https://issues.apache.org/jira/browse/SAMZA-317
             Project: Samza
          Issue Type: New Feature
            Reporter: Martin Kleppmann


[Avro|http://avro.apache.org/] is a popular serialization format with several 
nice characteristics:

- Can be read and written by many programming languages
- Compact and fast to serialize/deserialize
- Supports Java code generation (in a statically typed language, JSON is 
annoying to work with, because it parses into dynamically typed HashMaps; it's 
much nicer to work with objects that have real getters and setters)
- Deep support for [schema evolution|http://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html], 
so you can make backwards-compatible changes to your data as application 
requirements change (a brief sketch of this follows below)
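
To make the schema evolution point concrete, here is a minimal sketch using 
Avro's GenericRecord API (the record and field names are made up for 
illustration): a reader decodes data written with an old schema using a newer 
schema that adds a field with a default value.

{code:java}
import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class EvolutionDemo {
  public static void main(String[] args) throws Exception {
    // Writer's schema (v1) and reader's schema (v2). v2 adds an optional
    // field with a default value, a backwards-compatible change.
    Schema v1 = new Schema.Parser().parse(
        "{\"type\": \"record\", \"name\": \"User\", \"fields\": [" +
        "{\"name\": \"id\", \"type\": \"long\"}]}");
    Schema v2 = new Schema.Parser().parse(
        "{\"type\": \"record\", \"name\": \"User\", \"fields\": [" +
        "{\"name\": \"id\", \"type\": \"long\"}, " +
        "{\"name\": \"email\", \"type\": [\"null\", \"string\"], " +
        "\"default\": null}]}");

    // Encode a record using the old (writer's) schema.
    GenericRecord record = new GenericData.Record(v1);
    record.put("id", 42L);
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<GenericRecord>(v1).write(record, encoder);
    encoder.flush();

    // Decode using both schemas: Avro resolves the differences and fills
    // in the default value for the field the writer didn't know about.
    BinaryDecoder decoder =
        DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
    GenericRecord decoded =
        new GenericDatumReader<GenericRecord>(v1, v2).read(null, decoder);
    System.out.println(decoded); // {"id": 42, "email": null}
  }
}
{code}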

It would be nice if Samza came with an Avro serde out of the box, making it 
easy for Samza jobs to consume and produce Avro-encoded messages. If this serde 
is built on best practices, we can recommend it as a good default choice for 
many applications.
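
Concretely, this would be an implementation of Samza's Serde interface 
(toBytes/fromBytes). The following skeleton is only a sketch of the shape of 
the thing; the class name is made up, and the two TODOs are worked out in the 
sketches below.

{code:java}
import org.apache.avro.generic.GenericRecord;
import org.apache.samza.serializers.Serde;

public class AvroSerde implements Serde<GenericRecord> {
  @Override
  public byte[] toBytes(GenericRecord record) {
    // TODO: ensure the record's schema has been published to the schemas
    // stream, then prepend the schema's version tag (e.g. its MD5 hash)
    // to the Avro binary encoding of the record.
    throw new UnsupportedOperationException("sketch only");
  }

  @Override
  public GenericRecord fromBytes(byte[] bytes) {
    // TODO: read the version tag from the front of the message, look up
    // the writer's schema in the mapping built from the schemas stream,
    // and decode the remaining bytes with it.
    throw new UnsupportedOperationException("sketch only");
  }
}
{code}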

This serde is not entirely straightforward to implement because of Avro's 
schema handling. In order to accurately deserialize an Avro-encoded message, 
you need to know the exact version of the schema with which it was serialized. 
Thus, every message needs to be tagged with a schema version number, and we 
require a schema registry which translates those version numbers into the 
schema definition (a JSON string).
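
One possible wire format (an assumption for illustration, not a settled 
design) is a fixed-size header in front of the Avro binary payload: one magic 
byte for the format itself, followed by the 16-byte MD5 hash of the writer's 
schema.

{code:java}
import java.nio.ByteBuffer;

// Hypothetical envelope: [magic byte][16-byte schema MD5][Avro payload].
public class SchemaTaggedMessage {
  private static final byte MAGIC = 0x0;
  private static final int MD5_LENGTH = 16;

  public static byte[] wrap(byte[] schemaMd5, byte[] avroPayload) {
    return ByteBuffer.allocate(1 + MD5_LENGTH + avroPayload.length)
        .put(MAGIC).put(schemaMd5).put(avroPayload).array();
  }

  public static byte[] schemaVersion(byte[] message) {
    byte[] md5 = new byte[MD5_LENGTH];
    ByteBuffer.wrap(message, 1, MD5_LENGTH).get(md5);
    return md5;
  }

  public static byte[] payload(byte[] message) {
    byte[] body = new byte[message.length - 1 - MD5_LENGTH];
    System.arraycopy(message, 1 + MD5_LENGTH, body, 0, body.length);
    return body;
  }
}
{code}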

At LinkedIn, the MD5 hash of the schema is used as the version number, and the 
mapping from version number to schema is stored in a separate schema registry 
service (which provides an HTTP API). In Samza, we could avoid the operational 
complexity of a separate schema registry service, and instead use a stream 
(e.g. a Kafka topic) for storing the schemas for all Samza jobs. This stream 
would similarly use the hash of the schema as the message key, and the schema 
JSON as the message value.
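
Computing the key could be as simple as hashing the schema's JSON text (a 
sketch; whether to hash the raw string or some canonical form of the schema 
is a detail that would need to be pinned down, since formatting differences 
would otherwise produce different keys for the same logical schema):

{code:java}
import java.security.MessageDigest;

import org.apache.avro.Schema;

public class SchemaFingerprint {
  // MD5 of the schema's JSON representation, used as the key in the
  // schemas stream and as the version tag on each message.
  public static byte[] md5(Schema schema) throws Exception {
    return MessageDigest.getInstance("MD5")
        .digest(schema.toString().getBytes("UTF-8"));
  }
}
{code}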

Any job that wants to consume an Avro-encoded topic would then need to first 
fully consume the schemas stream (as a [bootstrap stream|http://samza.incubator.apache.org/learn/documentation/0.7.0/container/streams.html#bootstrapping]) 
in order to learn the complete mapping of version numbers to schemas. It then 
has all the information it needs to deserialize any messages.
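
For example, a task could build up the version-to-schema mapping like this (a 
sketch; the stream name "schemas" and the raw byte[] key/value layout are 
assumptions, and the bootstrap behaviour comes from setting 
systems.kafka.streams.schemas.samza.bootstrap=true in the job config):

{code:java}
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

import org.apache.avro.Schema;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

public class SchemaAwareTask implements StreamTask {
  // Maps schema MD5 (wrapped for value-based equality) to parsed schema.
  private final Map<ByteBuffer, Schema> schemasByVersion =
      new HashMap<ByteBuffer, Schema>();

  public void process(IncomingMessageEnvelope envelope,
                      MessageCollector collector,
                      TaskCoordinator coordinator) throws Exception {
    if ("schemas".equals(envelope.getSystemStreamPartition().getStream())) {
      // Schemas stream: key is the MD5 of the schema, value is the
      // schema definition as a JSON string.
      ByteBuffer version = ByteBuffer.wrap((byte[]) envelope.getKey());
      Schema schema = new Schema.Parser().parse(
          new String((byte[]) envelope.getMessage(), "UTF-8"));
      schemasByVersion.put(version, schema);
    } else {
      // Data streams: read the version tag from the message, look up the
      // writer's schema in schemasByVersion, and deserialize (omitted).
    }
  }
}
{code}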

Any job that wants to produce Avro-encoded messages first needs to publish the 
schema version it is using to the schemas stream. Since the same schema will 
be submitted many times over, Kafka log compaction can deduplicate these 
submissions and keep the stream compact.
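
On the producing side, the serde could publish the schema the first time it 
is used; a sketch, reusing the hypothetical SchemaFingerprint helper from 
above and assuming the schemas stream is the Kafka topic "schemas":

{code:java}
import org.apache.avro.Schema;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;

public class SchemaPublisher {
  // The schemas topic should have log compaction enabled
  // (cleanup.policy=compact), so repeated submissions of the same
  // (hash, schema) pair are eventually deduplicated.
  private static final SystemStream SCHEMAS =
      new SystemStream("kafka", "schemas");

  public static void publish(MessageCollector collector, Schema schema)
      throws Exception {
    byte[] key = SchemaFingerprint.md5(schema);
    byte[] value = schema.toString().getBytes("UTF-8");
    collector.send(new OutgoingMessageEnvelope(SCHEMAS, key, value));
  }
}
{code}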

There is a race condition: when a new schema version is introduced, any 
consumers need to first receive the new schema version from the schemas stream 
before they can decode any messages encoded with this new schema. This could be 
implemented with a custom MessageChooser, which blocks consumption of any 
stream that contains a message with an unknown schema version number, until the 
corresponding schema is received. The blocking could have a timeout, to limit 
the disruption in case a badly-behaved producer sends messages without 
publishing their schema. (Such messages cannot be deserialized and would have 
to be dropped.)
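
A sketch of that blocking behaviour, extending Samza's BaseMessageChooser 
(the isSchemaKnown() check and the timeout handling are placeholders that 
would need to be wired up to the version-to-schema mapping):

{code:java}
import java.util.ArrayDeque;
import java.util.Queue;

import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.chooser.BaseMessageChooser;

public class SchemaBlockingChooser extends BaseMessageChooser {
  private final Queue<IncomingMessageEnvelope> ready =
      new ArrayDeque<IncomingMessageEnvelope>();
  private final Queue<IncomingMessageEnvelope> blocked =
      new ArrayDeque<IncomingMessageEnvelope>();

  public void update(IncomingMessageEnvelope envelope) {
    if (isSchemaKnown(envelope)) {
      ready.add(envelope);
    } else {
      blocked.add(envelope); // hold back until the schema arrives
    }
  }

  public IncomingMessageEnvelope choose() {
    // Re-check held-back envelopes: their schema may have arrived in the
    // meantime (a real implementation would also apply the timeout here
    // and drop undecodable messages).
    IncomingMessageEnvelope head = blocked.peek();
    if (head != null && isSchemaKnown(head)) {
      return blocked.poll();
    }
    return ready.poll(); // null tells Samza nothing is ready right now
  }

  private boolean isSchemaKnown(IncomingMessageEnvelope envelope) {
    // Placeholder: messages from the schemas stream are always known;
    // for data streams, consult the version-to-schema mapping.
    return true;
  }
}
{code}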

Related stuff:

- Some discussion appears in [this mailing list thread|http://mail-archives.apache.org/mod_mbox/incubator-samza-dev/201402.mbox/%3C1BBF24B26F58BA4A91B566B7FBEEFF27551F6C7F7D%40EXMBXC01.ms-hosting.nl%3E].
- We should consider compatibility with 
[Camus|https://github.com/linkedin/camus], which can take Avro-encoded messages 
from Kafka topics and load them into HDFS for offline processing.
- SAMZA-198 discusses an issue that arose in the context of dealing with 
Avro-serialized messages.
- AVRO-1124 discusses implementing a standard schema registry service for Avro.


