[ https://issues.apache.org/jira/browse/SAMZA-317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14118509#comment-14118509 ]

Martin Kleppmann commented on SAMZA-317:
----------------------------------------

[~theduderog]: That's certainly a possibility. A binary encoding for Avro 
schemas has been proposed on AVRO-251, but not yet committed. Do you think you 
could test that Avro patch, and see how big the binary-encoded schema turns out 
to be?

> Serde for Avro-encoded messages
> -------------------------------
>
>                 Key: SAMZA-317
>                 URL: https://issues.apache.org/jira/browse/SAMZA-317
>             Project: Samza
>          Issue Type: New Feature
>            Reporter: Martin Kleppmann
>
> [Avro|http://avro.apache.org/] is a popular serialization format with several 
> nice characteristics:
> - Can be read and written by many programming languages
> - Compact and fast to serialize/deserialize
> - Supports Java code generation (in a statically typed language, JSON is 
> annoying to work with, because it parses into dynamically typed HashMaps; 
> it's much nicer to work with objects that have real getters and setters)
> - Deep support for [schema 
> evolution|http://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html],
>  so you can make backwards-compatible changes to your data as application 
> requirements change.
>
> It would be nice if Samza came with an Avro serde out of the box, making it 
> easy for Samza jobs to consume and produce Avro-encoded messages. If this 
> serde is built on best practices, we can recommend it as a good default 
> choice for many applications.
>
> This serde is not entirely straightforward to implement because of Avro's 
> schema handling. In order to accurately deserialize an Avro-encoded message, 
> you need to know the exact version of the schema with which it was 
> serialized. Thus, every message needs to be tagged with a schema version 
> number, and we require a schema registry which translates those version 
> numbers into the schema definition (a JSON string).
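>
> As a concrete illustration, the tagging described here can be sketched with
> plain JDK classes. The wire layout used below (a 16-byte MD5 of the schema
> JSON, followed by the Avro-encoded payload) is an assumption for this
> sketch, not a format this issue prescribes:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class AvroFraming {
    // Assumed wire layout: [16-byte MD5 of schema JSON][Avro-encoded payload].
    public static byte[] schemaVersion(String schemaJson) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            return md5.digest(schemaJson.getBytes(StandardCharsets.UTF_8)); // 16 bytes
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException("MD5 not available", e);
        }
    }

    // Prepend the schema version to the serialized payload.
    public static byte[] frame(String schemaJson, byte[] avroPayload) {
        byte[] version = schemaVersion(schemaJson);
        return ByteBuffer.allocate(version.length + avroPayload.length)
                .put(version).put(avroPayload).array();
    }

    // Split a framed message back into its version tag...
    public static byte[] versionOf(byte[] message) {
        byte[] version = new byte[16];
        System.arraycopy(message, 0, version, 0, 16);
        return version;
    }

    // ...and its Avro payload.
    public static byte[] payloadOf(byte[] message) {
        byte[] payload = new byte[message.length - 16];
        System.arraycopy(message, 16, payload, 0, payload.length);
        return payload;
    }
}
```

> Any framing along these lines would work, as long as producers and
> consumers agree on the exact layout.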
> At LinkedIn, the MD5 hash of the schema is used as version number, and the 
> mapping from version number to schema is stored in a separate schema registry 
> service (which provides an HTTP API). In Samza, we could avoid the operational 
> complexity of a separate schema registry service and instead use a stream 
> (e.g. a Kafka topic) for storing the schemas for all Samza jobs. This stream 
> would similarly use the hash of the schema as the message key, and the schema 
> JSON as the value.
>
> Any job that wants to consume an Avro-encoded topic would then need to first 
> fully consume the schemas stream (as a [bootstrap 
> stream|http://samza.incubator.apache.org/learn/documentation/0.7.0/container/streams.html#bootstrapping])
>  in order to learn the complete mapping of version numbers to schemas. The 
> job then has all the information it needs to deserialize any incoming message.
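>
> The bootstrapped mapping is essentially a hash table keyed by schema
> version. A minimal sketch, assuming hex-encoded MD5 strings as keys and
> schema JSON strings as values; the Iterable here stands in for replaying
> the real schemas topic:

```java
import java.util.HashMap;
import java.util.Map;

class SchemaRegistryMap {
    // In-memory registry built by replaying the schemas stream:
    // hex(MD5-of-schema) -> schema JSON.
    private final Map<String, String> byVersion = new HashMap<>();

    // Replay every (key, value) record from the bootstrapped stream.
    // Because the value for a given hash never changes, records that log
    // compaction has not yet removed simply overwrite entries with
    // identical content, so replay order does not matter.
    public void bootstrap(Iterable<Map.Entry<String, String>> schemasStream) {
        for (Map.Entry<String, String> record : schemasStream) {
            byVersion.put(record.getKey(), record.getValue());
        }
    }

    // Look up the schema for a version tag; an unknown version means the
    // message cannot (yet) be deserialized.
    public String schemaFor(String versionHex) {
        String schema = byVersion.get(versionHex);
        if (schema == null) {
            throw new IllegalStateException("Unknown schema version: " + versionHex);
        }
        return schema;
    }
}
```

> Since the same (hash, schema) pair is idempotent, producers re-publishing
> their schema on every startup is harmless to consumers.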
> Any job that wants to produce Avro-encoded messages first needs to publish 
> the schema version it is using to the schemas stream. Kafka log compaction 
> can take care of the fact that the same schema will be submitted repeatedly.
>
> There is a race condition: when a new schema version is introduced, any 
> consumers need to first receive the new schema version from the schemas 
> stream before they can decode any messages encoded with this new schema. This 
> could be implemented with a custom MessageChooser, which blocks consumption 
> of any stream that contains a message with an unknown schema version number, 
> until the corresponding schema is received. The blocking could have a 
> timeout, to limit the disruption in case a badly-behaved producer sends 
> messages without publishing their schema. (Such messages cannot be 
> deserialized and would have to be dropped.)
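>
> The blocking behaviour could be modelled roughly as follows. This is a
> simplified stand-alone sketch, not the actual Samza MessageChooser
> interface: it holds back messages whose schema version is unknown, and
> drops them once a timeout expires:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

class BlockingSchemaChooser {
    // Simplified model of a message: a schema version tag plus arrival time.
    public static class Envelope {
        final String schemaVersion;
        final long arrivedAtMs;
        public Envelope(String schemaVersion, long arrivedAtMs) {
            this.schemaVersion = schemaVersion;
            this.arrivedAtMs = arrivedAtMs;
        }
    }

    private final Set<String> knownVersions = new HashSet<>();
    private final Deque<Envelope> pending = new ArrayDeque<>();
    private final long timeoutMs;

    public BlockingSchemaChooser(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    // Called when a schema arrives on the schemas stream.
    public void registerSchema(String version) {
        knownVersions.add(version);
    }

    // Called when a message arrives on a data stream.
    public void update(Envelope envelope) {
        pending.addLast(envelope);
    }

    // Returns the next deliverable envelope, or null if the head of the
    // queue has an unknown schema version and its timeout has not expired.
    // Envelopes that time out are dropped: they cannot be deserialized.
    public Envelope choose(long nowMs) {
        while (!pending.isEmpty()) {
            Envelope head = pending.peekFirst();
            if (knownVersions.contains(head.schemaVersion)) {
                return pending.pollFirst();
            }
            if (nowMs - head.arrivedAtMs >= timeoutMs) {
                pending.pollFirst(); // undecodable after timeout: drop
                continue;
            }
            return null; // block this stream until the schema arrives
        }
        return null;
    }
}
```

> Passing the current time into choose() keeps the sketch deterministic; a
> real implementation would consult the clock itself.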
> Related stuff:
> - Some discussion appears in [this mailing list 
> thread|http://mail-archives.apache.org/mod_mbox/incubator-samza-dev/201402.mbox/%3C1BBF24B26F58BA4A91B566B7FBEEFF27551F6C7F7D%40EXMBXC01.ms-hosting.nl%3E].
> - We should consider compatibility with 
> [Camus|https://github.com/linkedin/camus], which can take Avro-encoded 
> messages from Kafka topics and load them into HDFS for offline processing.
> - SAMZA-198 discusses an issue that arose in the context of dealing with 
> Avro-serialized messages.
> - AVRO-1124 discusses implementing a standard schema registry service for 
> Avro.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
