On 09/21/2010 05:18 AM, Thiruvalluvan M. G. wrote:
> Here is a design that would improve things a bit more. Instead of
> serializing the object against its actual schema, let's say the
> application serializes against a union schema in which the object
> type's schema is a branch. As the application evolves, the application
> simply adds a branch to the union.

Where would this union be stored? Is it only stored in the application, or is it stored with the data? I think it would be safest to somehow store it with the dataset, not in the application.

> While reading the object, the application expects one branch but the
> serialized object might be using another branch. As long as the
> branches "match", Avro would resolve correctly. The current Java
> generic writer can correctly pick the branch as long as the object's
> schema is one of the branches. The nice thing about this improved
> design is that there is no need to store a separate schema "pointer"
> along with the object. The "union-index" essentially acts as the
> pointer and it is internal to Avro.

It sounds like perhaps you're trying to optimize the size of the pointer from each stored instance to its schema. Is that correct? If so, then one might simply use a table for this. The application stores <pointer,record> pairs, but pointers need not be 16-byte checksums, but could be variable-length integers, starting from zero, that, for most applications, would always fit in a single byte.
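To make the size point concrete, here is a sketch (plain Python, not tied to any Avro library) of the zigzag varint encoding Avro uses for longs. A table index below 64 encodes to a single byte, and a union's branch index is written on the wire the same way, as a long followed by the branch's value:

```python
def zigzag_varint(n: int) -> bytes:
    """Encode an integer the way Avro encodes a long:
    zigzag mapping, then base-128 varint."""
    z = (n << 1) ^ (n >> 63)  # zigzag: small magnitudes -> small codes
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def encode_union(branch_index: int, encoded_value: bytes) -> bytes:
    """An Avro union value is the branch index (encoded as a long)
    followed by the value encoded per that branch's schema."""
    return zigzag_varint(branch_index) + encoded_value
```

So for either design, the per-datum overhead for the first 64 schemas is one byte, e.g. `encode_union(2, value)` prefixes `value` with the single byte `0x04`.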

If schemas are stored with the dataset, then they could be stored as either:
- the standalone single schema for every item in the dataset, which happens to be a union schema that's managed in a particular way, adding a new entry to the end each time an instance of a new schema is written; or
- a table of schemas, whose indices are used as pointers in each datum, with entries added when no existing entry matches a datum to be written.

The two are isomorphic. The former uses more Avro logic but feels more fragile: it's not really an arbitrary schema, but a union that takes advantage of the way that unions are serialized. The latter feels to me like a clearer description of a dataset. In either case the application must manage the table of schemas. The only operation that's simplified is that the top-level union dispatch at read, and possibly at write, would use Avro logic instead of application logic. At write time you might even be tempted to bypass Avro logic, since, in maintaining the union, you'd already know the branch, and searching for the right branch might be more costly.
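The table-of-schemas option amounts to an append-only lookup structure managed by the application. A minimal sketch, where `SchemaTable` and its method names are hypothetical (not an Avro API) and schemas are represented as canonical JSON text:

```python
class SchemaTable:
    """Append-only table mapping small integer ids to schemas,
    stored alongside the dataset rather than in the application."""

    def __init__(self):
        self._schemas = []  # index -> schema (canonical JSON text)
        self._ids = {}      # schema -> index, for fast lookup at write

    def id_for(self, schema: str) -> int:
        """Return the pointer for a schema, appending it if unseen."""
        if schema not in self._ids:
            self._ids[schema] = len(self._schemas)
            self._schemas.append(schema)
        return self._ids[schema]

    def schema_for(self, idx: int) -> str:
        """Resolve a stored pointer back to its schema at read time."""
        return self._schemas[idx]
```

Writing a datum then stores `(table.id_for(schema), record)`; since ids start from zero and only grow when a genuinely new schema appears, most datasets never need more than a one-byte pointer.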

> But there is one problem. As per the Avro specification, in order to
> "match", two schemas of the same type should have the same name. But
> two schemas with the same type and name cannot be branches within a
> union. Thus the design above will not work.

The problem with multiple union branches of the same name only arises at write time, not at read time. So, if we allowed multiple branches of the same name in a top-level union at read time, then this might work.

A way to address this might be through aliases. If, in the union, each branch but the last has a versioned name, i.e., the union is ["r0", "r1", .., "r"], then writing would work. If "r" then has aliases of ["r0", "r1", ..], then, at read time, the union would be rewritten as ["r", "r", ...], where each branch has a different definition. Currently this would fail due to the duplicate names, but if we changed it so that, in the context of alias rewrites while reading, we permit duplicate names in a top-level union, then this could work as desired.
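A rough sketch of that read-time rewrite, using plain dicts to stand in for parsed schemas. This illustrates the proposed behavior, not anything Avro currently implements, and `rewrite_union_for_read` is a hypothetical name:

```python
def rewrite_union_for_read(writer_union, reader_record):
    """Proposed alias rewrite at read time: any writer branch whose name
    matches the reader record's name or one of its aliases is replaced
    by the reader's definition. The result may contain duplicate names,
    which is what the relaxed read-time rule would have to permit."""
    names = {reader_record["name"], *reader_record.get("aliases", [])}
    return [reader_record if branch.get("name") in names else branch
            for branch in writer_union]

# Writer serialized against ["r0", "r"]; reader's "r" aliases "r0".
writer_union = [
    {"type": "record", "name": "r0", "fields": []},
    {"type": "record", "name": "r", "fields": []},
]
reader = {"type": "record", "name": "r", "aliases": ["r0"], "fields": []}

rewritten = rewrite_union_for_read(writer_union, reader)
# Both branches now carry the reader's definition of "r", so the stored
# union-index still selects a branch that resolves against the reader.
```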

Doug
