Re: Versioning Schema's

2013-06-14 Thread David Arthur
I've done this in the past, and it worked out well. I stored the Avro schemas
in ZooKeeper, each with an integer id, and prefixed each message with that id.
When you register a new schema, you have to make sure that it resolves
with the current version (ResolvingDecoder helps with this).


-David
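
A minimal sketch of the id-prefix framing described above, not David's actual code. The 4-byte big-endian width of the id, the in-memory dict standing in for the ZooKeeper lookup, and the names `SCHEMA_REGISTRY`, `frame` and `unframe` are all assumptions made for illustration. The Avro payload is treated as opaque bytes; a real reader would pass the looked-up writer schema together with its own reader schema to Avro's schema resolution.

```python
import struct
from typing import Tuple

# Hypothetical stand-in for the ZooKeeper-backed mapping of id -> schema JSON.
SCHEMA_REGISTRY = {
    1: '{"type": "record", "name": "Event", '
       '"fields": [{"name": "id", "type": "long"}]}',
}

# Assumed prefix layout: one 4-byte big-endian signed integer.
ID_PREFIX = struct.Struct(">i")


def frame(schema_id: int, avro_bytes: bytes) -> bytes:
    """Prefix already-serialized Avro bytes with the id of their schema."""
    if schema_id not in SCHEMA_REGISTRY:
        raise KeyError("unregistered schema id: %d" % schema_id)
    return ID_PREFIX.pack(schema_id) + avro_bytes


def unframe(message: bytes) -> Tuple[int, str, bytes]:
    """Split a message into (schema id, writer schema JSON, Avro bytes)."""
    if len(message) < ID_PREFIX.size:
        raise ValueError("message shorter than the id prefix")
    (schema_id,) = ID_PREFIX.unpack_from(message)
    writer_schema = SCHEMA_REGISTRY[schema_id]
    return schema_id, writer_schema, message[ID_PREFIX.size:]
```

With this layout the consumer never needs an envelope schema: it reads the fixed-width prefix, looks up the writer schema, and only then decodes the remaining bytes.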


Re: Versioning Schema's

2013-06-13 Thread Shone Sadler
Thanks Jun and Phil!

Shone



Versioning Schema's

2013-06-12 Thread Shone Sadler
Hello everyone,

After doing some searching on the mailing list for best practices on
integrating Avro with Kafka, there appear to be at least 3 options for
integrating the Avro schema: 1) embedding the entire schema within the
message, 2) embedding a unique identifier for the schema in the message, and
3) deriving the schema from the topic/resource name.

Option 2 appears to be the best option in terms of both efficiency and
flexibility. However, from a programming perspective it complicates the
solution with the need for both an envelope schema (containing a schema
id and a bytes field for the record data) and a message schema (containing the
application-specific message fields). This requires two levels of
serialization/deserialization.

Questions:
1) How are others dealing with versioning of schemas?
2) Is there a more elegant means of embedding a schema id in an Avro
message (I am new to both currently ;-)?

Thanks in advance!

Shone


Re: Versioning Schema's

2013-06-12 Thread Jun Rao
At LinkedIn, we are using option 2.

Thanks,

Jun



Re: Versioning Schema's

2013-06-12 Thread Jun Rao
Actually, currently our schema id is the md5 of the schema itself. Not
fully sure how this compares with an explicit version field in the schema.

Thanks,

Jun



Re: Versioning Schema's

2013-06-12 Thread Shone Sadler
Jun,
I like the idea of an explicit version field, if the schema can be derived
from the topic name itself. The storage (say 1-4 bytes) would require less
overhead than a 128-bit md5, at the added cost of managing the version number.

Is it correct to assume that your applications are using two schemas then,
one system level schema to deserialize the schema id and bytes for the
application message and a second schema to deserialize those bytes with the
application schema?

Thanks again!
Shone



Re: Versioning Schema's

2013-06-12 Thread Hargett, Phil
For one of our key Kafka-based applications, we ensure that all messages in the 
stream have a common binary format, which includes (among other things) a 
version identifier and a schema identifier. The version refers to the format 
itself, and the schema refers to the payload, which is the data for the
application itself.

Because we have a small number of schemas (50-100) and we only introduce 5-10 
per year, we stash the mapping from schema identifier to schema details for 
easy access in ZooKeeper.

This does basically create 2 levels of serialization, but most of the processing
we do occurs just by reading the common format, without deserializing the payload.
Only specialized code has to do that extra step.
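
A rough sketch of a common binary format of this kind, offered only as an illustration. The message above does not give the real layout, so the header here (a 1-byte format version followed by a 2-byte schema id) and the names `wrap`, `peek` and `count_by_schema` are assumptions. It shows how generic code can work from the header alone while the payload stays serialized.

```python
import struct
from typing import Dict, Iterable, NamedTuple

# Assumed header layout: 1-byte format version, 2-byte big-endian schema id.
HEADER = struct.Struct(">BH")
FORMAT_VERSION = 1


class Envelope(NamedTuple):
    format_version: int
    schema_id: int
    payload: bytes  # still serialized; only specialized code decodes it


def wrap(schema_id: int, payload: bytes) -> bytes:
    """Put the common header in front of an already-serialized payload."""
    return HEADER.pack(FORMAT_VERSION, schema_id) + payload


def peek(message: bytes) -> Envelope:
    """Read the common header without touching the payload bytes."""
    format_version, schema_id = HEADER.unpack_from(message)
    if format_version != FORMAT_VERSION:
        raise ValueError("unsupported format version: %d" % format_version)
    return Envelope(format_version, schema_id, message[HEADER.size:])


def count_by_schema(messages: Iterable[bytes]) -> Dict[int, int]:
    """Example of generic processing that needs only the header."""
    counts: Dict[int, int] = {}
    for message in messages:
        schema_id = peek(message).schema_id
        counts[schema_id] = counts.get(schema_id, 0) + 1
    return counts
```

Keeping the format version separate from the schema id means the header itself can change later without being confused with a change to the application payload.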


Re: Versioning Schema's

2013-06-12 Thread Jun Rao
Yes, we just have a customized encoder that encodes the first 4 bytes of the md5
of the schema, followed by the Avro bytes.

Thanks,

Jun
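
A small sketch of the framing described above (first 4 bytes of the schema's md5, then the Avro bytes). This is not LinkedIn's encoder. The function names, hashing the UTF-8 schema text as-is, and the in-memory list of known schemas are assumptions for illustration. Hashing the raw text means any formatting difference in the schema changes the id, so a real system would presumably hash some canonical form.

```python
import hashlib
from typing import Iterable, Tuple

PREFIX_LEN = 4  # first 4 bytes of the 16-byte md5 digest


def schema_fingerprint(schema_json: str) -> bytes:
    """Return the truncated md5 of the schema text."""
    return hashlib.md5(schema_json.encode("utf-8")).digest()[:PREFIX_LEN]


def encode(schema_json: str, avro_bytes: bytes) -> bytes:
    """Prefix already-serialized Avro bytes with the schema fingerprint."""
    return schema_fingerprint(schema_json) + avro_bytes


def decode(message: bytes, known_schemas: Iterable[str]) -> Tuple[str, bytes]:
    """Return (writer schema JSON, Avro bytes) for a framed message.

    Raises KeyError if the fingerprint matches none of the known schemas.
    """
    by_fingerprint = {schema_fingerprint(s): s for s in known_schemas}
    fingerprint, payload = message[:PREFIX_LEN], message[PREFIX_LEN:]
    return by_fingerprint[fingerprint], payload
```

Compared with a managed version number, the hash needs no coordination to assign, at the cost of a lookup table on the consumer side and a small chance of two schemas sharing the same 4-byte prefix.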

