Re: Versioning Schema's
I've done this in the past, and it worked out well. I stored the Avro schema in ZooKeeper with an integer id and prefixed each message with that id. When you register a new schema, you have to make sure that it resolves with the current version (ResolvingDecoder helps with this).
-David
On 6/13/13 4:07 AM, Shone Sadler wrote:
Thanks, Jun and Phil! Shone
On Thu, Jun 13, 2013 at 12:00 AM, Jun Rao jun...@gmail.com wrote:
Yes, we just have a customized encoder that encodes the first 4 bytes of the md5 of the schema, followed by the Avro bytes. Thanks, Jun
On Wed, Jun 12, 2013 at 9:50 AM, Shone Sadler shone.sad...@gmail.com wrote:
Jun, I like the idea of an explicit version field, if the schema can be derived from the topic name itself. The storage (say 1-4 bytes) would require less overhead than a 128-bit md5, at the added cost of managing the version number. Is it correct to assume that your applications are using two schemas then: one system-level schema to deserialize the schema id and bytes for the application message, and a second schema to deserialize those bytes with the application schema? Thanks again! Shone
On Wed, Jun 12, 2013 at 11:31 AM, Jun Rao jun...@gmail.com wrote:
Actually, currently our schema id is the md5 of the schema itself. Not fully sure how this compares with an explicit version field in the schema. Thanks, Jun
On Wed, Jun 12, 2013 at 8:29 AM, Jun Rao jun...@gmail.com wrote:
At LinkedIn, we are using option 2. Thanks, Jun
On Wed, Jun 12, 2013 at 7:14 AM, Shone Sadler shone.sad...@gmail.com wrote:
Hello everyone,
After doing some searching on the mailing list for best practices on integrating Avro with Kafka, there appear to be at least 3 options for integrating the Avro schema:
1) embedding the entire schema within the message,
2) embedding a unique identifier for the schema in the message, and
3) deriving the schema from the topic/resource name.
Option 2 appears to be the best option in terms of both efficiency and flexibility. However, from a programming perspective it complicates the solution with the need for both an envelope schema (containing a schema id and a bytes field for record data) and a message schema (containing the application-specific message fields). This requires two levels of serialization/deserialization.
Questions:
1) How are others dealing with versioning of schemas?
2) Is there a more elegant means of embedding a schema id in an Avro message (I am new to both currently ;-)?
Thanks in advance!
Shone
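The approach David describes at the top of this message (an integer id per registered schema, prefixed to every message) could be sketched roughly as follows. This is only an illustration under stated assumptions: `SchemaRegistry` is a hypothetical in-memory stand-in for the ZooKeeper store, `frame` and `unframe` are made-up helper names, the 4-byte big-endian id width is an arbitrary choice, and the compatibility check that ResolvingDecoder would help with is left out.

```python
import struct


class SchemaRegistry:
    """Hypothetical in-memory stand-in for a ZooKeeper-backed schema store."""

    def __init__(self):
        self._by_id = {}
        self._next_id = 1

    def register(self, schema_json: str) -> int:
        # A real registry would first verify that the new schema resolves
        # against the current version; that check is omitted in this sketch.
        schema_id = self._next_id
        self._by_id[schema_id] = schema_json
        self._next_id += 1
        return schema_id

    def lookup(self, schema_id: int) -> str:
        return self._by_id[schema_id]


def frame(schema_id: int, avro_bytes: bytes) -> bytes:
    """Prefix the already-serialized payload with a 4-byte big-endian id."""
    return struct.pack(">i", schema_id) + avro_bytes


def unframe(message: bytes):
    """Split a framed message into (schema_id, payload bytes)."""
    (schema_id,) = struct.unpack_from(">i", message, 0)
    return schema_id, message[4:]
```

A consumer would call `unframe`, look the id up in the registry, and only then hand the payload to an Avro decoder built for that writer schema.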
Re: Versioning Schema's
Thanks, Jun and Phil! Shone
Versioning Schema's
Hello everyone,
After doing some searching on the mailing list for best practices on integrating Avro with Kafka, there appear to be at least 3 options for integrating the Avro schema:
1) embedding the entire schema within the message,
2) embedding a unique identifier for the schema in the message, and
3) deriving the schema from the topic/resource name.
Option 2 appears to be the best option in terms of both efficiency and flexibility. However, from a programming perspective it complicates the solution with the need for both an envelope schema (containing a schema id and a bytes field for record data) and a message schema (containing the application-specific message fields). This requires two levels of serialization/deserialization.
Questions:
1) How are others dealing with versioning of schemas?
2) Is there a more elegant means of embedding a schema id in an Avro message (I am new to both currently ;-)?
Thanks in advance!
Shone
Re: Versioning Schema's
At LinkedIn, we are using option 2. Thanks, Jun
Re: Versioning Schema's
Actually, currently our schema id is the md5 of the schema itself. Not fully sure how this compares with an explicit version field in the schema. Thanks, Jun
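To make the "schema id is the md5 of the schema" idea in this message concrete, here is a minimal sketch. It is not the encoder used at LinkedIn, which the thread does not show. The `schema_fingerprint` helper is hypothetical, and normalizing the schema with sorted JSON keys is a simplification chosen for this example; it is not Avro's Parsing Canonical Form.

```python
import hashlib
import json


def schema_fingerprint(schema: dict) -> bytes:
    """Return a 16-byte md5 digest of a normalized schema definition.

    Keys are sorted and whitespace removed so that formatting differences
    do not change the id. This normalization is an assumption of the
    sketch, not Avro's Parsing Canonical Form.
    """
    normalized = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.md5(normalized.encode("utf-8")).digest()


def register(schemas_by_id: dict, schema: dict) -> bytes:
    """Store the schema under its digest and return the digest."""
    schema_id = schema_fingerprint(schema)
    schemas_by_id[schema_id] = schema
    return schema_id
```

One property of this design is that the id needs no central counter: any producer can compute it from the schema alone, whereas an explicit version number has to be assigned and managed somewhere.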
Re: Versioning Schema's
Jun, I like the idea of an explicit version field, if the schema can be derived from the topic name itself. The storage (say 1-4 bytes) would require less overhead than a 128-bit md5, at the added cost of managing the version number. Is it correct to assume that your applications are using two schemas then: one system-level schema to deserialize the schema id and bytes for the application message, and a second schema to deserialize those bytes with the application schema? Thanks again! Shone
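The overhead comparison in this message can be worked through with a few lines of arithmetic. The sketch below assumes a 4-byte integer version id (the upper end of the "1-4 bytes" range) and a full 16-byte md5; the message count is an arbitrary illustrative figure, and `id_overhead_bytes` is a hypothetical helper.

```python
import hashlib
import struct

MD5_ID_BYTES = hashlib.md5().digest_size  # full 128-bit digest, in bytes
INT_ID_BYTES = struct.calcsize(">i")      # 4-byte integer version id


def id_overhead_bytes(message_count: int, id_bytes: int) -> int:
    """Total bytes spent on schema ids across message_count messages."""
    return message_count * id_bytes


# Illustrative figure only: one million messages.
saved = (id_overhead_bytes(1_000_000, MD5_ID_BYTES)
         - id_overhead_bytes(1_000_000, INT_ID_BYTES))
```

Under these assumptions the integer id saves 12 bytes per message, so whether the saving matters depends on how small the payloads are relative to the id.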
Re: Versioning Schema's
For one of our key Kafka-based applications, we ensure that all messages in the stream have a common binary format, which includes (among other things) a version identifier and a schema identifier. The version refers to the format itself, and the schema refers to the payload, which is the data for the application itself. Because we have a small number of schemas (50-100) and we only introduce 5-10 per year, we stash the mapping from schema identifier to schema details in ZooKeeper for easy access. This does basically create 2 levels of serialization, but most of the processing we do occurs just by reading the common format, without deserializing the payload. Only specialized code has to do that extra step.
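The common-format idea in this message (a format version plus a schema identifier ahead of an opaque payload) might look something like the sketch below. The actual layout is not given in the thread, so the field widths here (1-byte format version, 2-byte schema id) and all function names are assumptions made for illustration.

```python
import struct
from typing import NamedTuple

# Hypothetical layout: 1-byte format version, 2-byte schema id, big-endian.
HEADER = struct.Struct(">BH")


class Header(NamedTuple):
    format_version: int
    schema_id: int


def pack_message(format_version: int, schema_id: int, payload: bytes) -> bytes:
    """Prepend the common header to an already-serialized payload."""
    return HEADER.pack(format_version, schema_id) + payload


def read_header(message: bytes) -> Header:
    """Read only the common header; the payload is left untouched."""
    return Header(*HEADER.unpack_from(message, 0))


def payload_of(message: bytes) -> bytes:
    """Return the payload bytes for code that needs the second decoding step."""
    return message[HEADER.size:]


def count_by_schema(messages) -> dict:
    """Example of processing that never deserializes the payload."""
    counts = {}
    for message in messages:
        schema_id = read_header(message).schema_id
        counts[schema_id] = counts.get(schema_id, 0) + 1
    return counts
```

Routing, counting, or filtering can then work from `read_header` alone, and only the specialized consumers call `payload_of` and decode with the schema looked up by id.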
Re: Versioning Schema's
Yes, we just have a customized encoder that encodes the first 4 bytes of the md5 of the schema, followed by the Avro bytes. Thanks, Jun
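The framing Jun describes in this message (the first 4 bytes of the schema's md5, followed by the Avro bytes) can be sketched as below. This is not the actual LinkedIn encoder; `short_id`, `encode`, and `decode` are hypothetical names, and the Avro serialization itself is assumed to have happened already.

```python
import hashlib

ID_LEN = 4  # number of leading md5 bytes used as the schema id


def short_id(schema_json: str) -> bytes:
    """First 4 bytes of the md5 digest of the schema text."""
    return hashlib.md5(schema_json.encode("utf-8")).digest()[:ID_LEN]


def encode(schema_json: str, avro_bytes: bytes) -> bytes:
    """Prefix the serialized record with the truncated schema digest."""
    return short_id(schema_json) + avro_bytes


def decode(message: bytes, schemas_by_id: dict):
    """Return (writer schema text, payload bytes) for a framed message."""
    schema_id, payload = message[:ID_LEN], message[ID_LEN:]
    return schemas_by_id[schema_id], payload
```

Because the prefix is written directly in front of the payload, no separate envelope schema is needed. Truncating the digest to 4 bytes does make id collisions more likely than with the full 16 bytes, so a registry built this way would presumably want to reject a new schema whose short id is already taken.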