On 09/21/2010 05:18 AM, Thiruvalluvan M. G. wrote:
> Here is a design that would improve things a bit more. Instead of
> serializing the object against its actual schema, let's say the
> application serializes against a union schema in which the object
> type's schema is a branch. As the application evolves, the application
> simply adds a branch to the union.

Where would this union be stored? Is it only stored in the application, or is it stored with the data? I think it would be safest to somehow store it with the dataset, not in the application.

> While reading the object, the application expects one branch but the
> serialized object might be using another branch. As long as the
> branches "match", Avro would resolve correctly. The current Java
> generic writer can correctly pick the branch as long as the object's
> schema is one of the branches. The nice thing about this improved
> design is that there is no need to store a separate schema "pointer"
> along with the object. The "union-index" essentially acts as the
> pointer and it is internal to Avro.

It sounds like perhaps you're trying to optimize the size of the pointer from each stored instance to its schema. Is that correct? If so, then one might simply use a table for this. The application stores <pointer,record> pairs, but pointers need not be 16-byte checksums, but could be variable-length integers, starting from zero, that, for most applications, would always fit in a single byte.
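To make the size point concrete, here is a sketch (plain Python, not tied to any Avro library) of the zigzag varint encoding Avro uses for longs. A table index below 64 encodes to a single byte, and a union's branch index is written on the wire the same way, as a long followed by the branch's value:

```python
def zigzag_varint(n: int) -> bytes:
    """Encode an integer the way Avro encodes a long:
    zigzag mapping, then base-128 varint."""
    z = (n << 1) ^ (n >> 63)  # zigzag: small magnitudes -> small codes
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def encode_union(branch_index: int, encoded_value: bytes) -> bytes:
    """An Avro union value is the branch index (encoded as a long)
    followed by the value encoded per that branch's schema."""
    return zigzag_varint(branch_index) + encoded_value
```

So for either design, the per-datum overhead for the first 64 schemas is one byte, e.g. `encode_union(2, value)` prefixes `value` with the single byte `0x04`.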

If schemas are stored with the dataset, then they could be stored as either:
- the standalone single schema for every item in the dataset, which happens to be a union schema that's managed in a particular way, adding a new entry to the end each time an instance of a new schema is written; or
- a table of schemas, whose indices are used as pointers in each datum, with entries added when no existing entry matches a datum to be written.

The two are isomorphic. The former uses more Avro logic but feels more fragile: it's not really an arbitrary schema, but a union that takes advantage of the way that unions are serialized. The latter feels to me like a clearer description of a dataset. In either case the application must manage the table of schemas. The only operation that's simplified is that the top-level union dispatch at read, and possibly at write, would use Avro logic instead of application logic. At write time you might even be tempted to bypass Avro logic, since, in maintaining the union, you'd already know the branch, and searching for the right branch might be more costly.
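The table-of-schemas option amounts to an append-only lookup structure managed by the application. A minimal sketch, where `SchemaTable` and its method names are hypothetical (not an Avro API) and schemas are represented as canonical JSON text:

```python
class SchemaTable:
    """Append-only table mapping small integer ids to schemas,
    stored alongside the dataset rather than in the application."""

    def __init__(self):
        self._schemas = []  # index -> schema (canonical JSON text)
        self._ids = {}      # schema -> index, for fast lookup at write

    def id_for(self, schema: str) -> int:
        """Return the pointer for a schema, appending it if unseen."""
        if schema not in self._ids:
            self._ids[schema] = len(self._schemas)
            self._schemas.append(schema)
        return self._ids[schema]

    def schema_for(self, idx: int) -> str:
        """Resolve a stored pointer back to its schema at read time."""
        return self._schemas[idx]
```

Writing a datum then stores `(table.id_for(schema), record)`; since ids start from zero and only grow when a genuinely new schema appears, most datasets never need more than a one-byte pointer.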

> But there is one problem. As per the Avro specification, in order to
> "match", two schemas of the same type should have the same name. But
> two schemas with the same type and name cannot be branches within a
> union. Thus the design above will not work.

The problem with multiple union branches of the same name only arises at write time, not at read time. So, if we allowed multiple branches of the same name in a top-level union at read time, then this might work.

A way to address this might be through aliases. If, in the union, each branch but the last has a versioned name, i.e., the union is ["r0", "r1", .., "r"], then writing would work. If "r" then has aliases of ["r0", "r1", ..], then, at read time, the union would be rewritten as ["r", "r", ...], where each branch has a different definition. Currently this would fail due to the duplicate names, but if we changed it so that, in the context of alias rewrites while reading, we permit duplicate names in a top-level union, then this could work as desired.
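A rough sketch of that read-time rewrite, using plain dicts to stand in for parsed schemas. This illustrates the proposed behavior, not anything Avro currently implements, and `rewrite_union_for_read` is a hypothetical name:

```python
def rewrite_union_for_read(writer_union, reader_record):
    """Proposed alias rewrite at read time: any writer branch whose name
    matches the reader record's name or one of its aliases is replaced
    by the reader's definition. The result may contain duplicate names,
    which is what the relaxed read-time rule would have to permit."""
    names = {reader_record["name"], *reader_record.get("aliases", [])}
    return [reader_record if branch.get("name") in names else branch
            for branch in writer_union]

# Writer serialized against ["r0", "r"]; reader's "r" aliases "r0".
writer_union = [
    {"type": "record", "name": "r0", "fields": []},
    {"type": "record", "name": "r", "fields": []},
]
reader = {"type": "record", "name": "r", "aliases": ["r0"], "fields": []}

rewritten = rewrite_union_for_read(writer_union, reader)
# Both branches now carry the reader's definition of "r", so the stored
# union-index still selects a branch that resolves against the reader.
```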

Doug
