[jira] [Commented] (SAMZA-317) Serde for Avro-encoded messages

Martin Kleppmann (JIRA) Tue, 01 Jul 2014 09:13:33 -0700

    [ 
https://issues.apache.org/jira/browse/SAMZA-317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049002#comment-14049002
 ]


Martin Kleppmann commented on SAMZA-317:
----------------------------------------

Some implementation notes:

- Each container would have to consume all partitions of the avro-schemas 
stream. The new partition grouping functionality in SAMZA-71 will make this 
possible.
- The implementation involves not just a serde, but also a SSPGrouper, a 
MessageChooser and an additional input stream (and perhaps more). Configuring 
all of this manually would be quite fiddly. This would suggest that an "Avro 
Serde" could actually be set up as a ConfigRewriter in the job configuration, 
and the ConfigRewriter sets up all the various bits and pieces.
- If a StreamTask passes an input message unmodified to an output stream (as is 
common in a repartitioning job), we will need to decide whether the message 
should be re-encoded using the schema version compiled into the Samza job, or 
whether the exact same byte sequence should be re-emitted. If the job is 
running with an old version of the schema, re-encoding will throw away any 
fields that were added in newer schema versions, so a byte-by-byte passthrough 
may be preferable.
- An alternative approach would be to implement Avro schema handling in Kafka 
rather than in Samza. This means it wouldn't be available if Samza is used with 
other message brokers, but let's face it, Samza already relies heavily on Kafka 
semantics anyway.

> Serde for Avro-encoded messages
> -------------------------------
>
>                 Key: SAMZA-317
>                 URL: https://issues.apache.org/jira/browse/SAMZA-317
>             Project: Samza
>          Issue Type: New Feature
>            Reporter: Martin Kleppmann
>
> [Avro|http://avro.apache.org/] is a popular serialization format with several 
> nice characteristics:
> - Can be read and written by many programming languages
> - Compact and fast to serialize/deserialize
> - Supports Java code generation (in a statically typed language, JSON is 
> annoying to work with, because it parses into dynamically typed HashMaps; 
> it's much nicer to work with objects that have real getters and setters)
> - Deep support for [schema 
> evolution|http://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html],
>  so you can make backwards-compatible changes to your data as application 
> requirements change
> It would be nice if Samza came with an Avro serde out of the box, making it 
> easy for Samza jobs to consume and produce Avro-encoded messages. If this 
> serde is built on best practices, we can recommend it as a good default 
> choice for many applications.
> This serde is not entirely straightforward to implement because of Avro's 
> schema handling. In order to accurately deserialize an Avro-encoded message, 
> you need to know the exact version of the schema with which it was 
> serialized. Thus, every message needs to be tagged with a schema version 
> number, and we require a schema registry which translates those version 
> numbers into the schema definition (a JSON string).
> At LinkedIn, the MD5 hash of the schema is used as version number, and the 
> mapping from version number to schema is stored in a separate schema registry 
> service (which provides a HTTP API). In Samza, we could avoid the operational 
> complexity of a separate schema registry service, and instead use a stream 
> (e.g. a Kafka topic) for storing the schemas for all Samza jobs. It can 
> similarly use the hash of the schema as the key, and the schema JSON as the 
> value.
> Any job that wants to consume an Avro-encoded topic would then need to first 
> fully consume the schemas stream (as a [bootstrap 
> stream|http://samza.incubator.apache.org/learn/documentation/0.7.0/container/streams.html#bootstrapping])
>  in order to learn the complete mapping of version numbers to schemas. It 
> then has all the information it needs to deserialize any messages.
> Any job that wants to produce Avro-encoded messages first needs to publish 
> the schema version it is using to the schemas stream. Kafka log compaction 
> can take care of the fact that the same schema will be submitted repeatedly.
> There is a race condition: when a new schema version is introduced, any 
> consumers need to first receive the new schema version from the schemas 
> stream before they can decode any messages encoded with this new schema. This 
> could be implemented with a custom MessageChooser, which blocks consumption 
> of any stream that contains a message with an unknown schema version number, 
> until the corresponding schema is received. The blocking could have a 
> timeout, to limit the disruption in case a badly-behaved producer sends 
> messages without publishing their schema. (Such messages cannot be 
> deserialized and would have to be dropped.)
> Related stuff:
> - Some discussion appears in [this mailing list 
> thread|http://mail-archives.apache.org/mod_mbox/incubator-samza-dev/201402.mbox/%3C1BBF24B26F58BA4A91B566B7FBEEFF27551F6C7F7D%40EXMBXC01.ms-hosting.nl%3E].
> - We should consider compatibility with 
> [Camus|https://github.com/linkedin/camus], which can take Avro-encoded 
> messages from Kafka topics and load them into HDFS for offline processing.
> - SAMZA-198 discusses an issue that arose in the context of dealing with 
> Avro-serialized messages.
> - AVRO-1124 discusses implementing a standard schema registry service for 
> Avro.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Commented] (SAMZA-317) Serde for Avro-encoded messages

Reply via email to