Scott Carey wrote:
{"name": "ipAddr", "type": [
{"name": "IPv4", "type": "fixed", "size": 4 },
{"name": "IPv6", "type": "fixed", "size": 16 }
]
}
This won't work, even though these have their own names and generate their own
classes.
That should work: if it doesn't, that's a bug. It should be possible to
include any two named types in a union, so long as their names differ.
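For example, something like this should parse without complaint (a quick
sketch against the current Java API; the enclosing "Host" record name is
invented):

import org.apache.avro.Schema;

public class UnionCheck {
  public static void main(String[] args) {
    // A union of named types is legal as long as the branch names differ.
    Schema s = new Schema.Parser().parse(
        "{\"name\": \"Host\", \"type\": \"record\", \"fields\": ["
        + " {\"name\": \"ipAddr\", \"type\": ["
        + "  {\"name\": \"IPv4\", \"type\": \"fixed\", \"size\": 4},"
        + "  {\"name\": \"IPv6\", \"type\": \"fixed\", \"size\": 16}]}]}");
    System.out.println(s.toString(true));  // would throw SchemaParseException if invalid
  }
}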
I was also confused at first about naming, and how much of it seems unnecessary from a
"just look at the text" view. For example:
{"name": "Thing", "type": "record", "fields": [
{"name": "foo", "type": "string"},
{"name": "bars", "type": "array", "items": "int"}
]
}
is invalid, but
{"name": "Thing", "type": "record", "fields": [
{"name": "foo", "type": "string"},
{"name": "bars", "type": {"type": "array", "items": "int"}}
]
}
is valid, even though from a human-readable point of view the extra
information is redundant ("I told it it was an array of ints, but it demands
to be an array of ints!" :D)
Have you looked at avrogen? This defines a more user-friendly syntax
for schemas.
I understand what it is doing, and why, for the most part, but the Spec
didn't make it clear that some things CAN'T have names -- even when they
appear in a position that requires a name. (Fields must have names; arrays
can't have names; therefore arrays can't be fields? -- half true.)
Anything can be given a name by wrapping it in a one-field record, with
no change to the binary encoded form, since records and fields have no
overhead.
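For example (a sketch; the "MD5" and "Checksum" names are invented), these
two schemas encode a 16-byte value identically:

import org.apache.avro.Schema;

public class WrapFixed {
  public static void main(String[] args) {
    // The bare fixed type: 16 raw bytes on the wire.
    Schema bare = new Schema.Parser().parse(
        "{\"name\": \"MD5\", \"type\": \"fixed\", \"size\": 16}");
    // The same 16 bytes, now reachable through the named field "md5".
    // (A second parser is used so "MD5" isn't seen as a redefinition.)
    Schema wrapped = new Schema.Parser().parse(
        "{\"name\": \"Checksum\", \"type\": \"record\", \"fields\": ["
        + " {\"name\": \"md5\", \"type\":"
        + "  {\"name\": \"MD5\", \"type\": \"fixed\", \"size\": 16}}]}");
    // Records and fields contribute no bytes to the encoding, so both
    // schemas serialize a value to the same 16 bytes.
  }
}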
Some examples with counter-examples would be a plus.
Please file a documentation bug for this. Thanks!
Also, the error message from the first one above is very confusing for
someone new: "invalid name 'array'". Um, I named it "bars"; "array" is the
type.
Please file a bug to improve this error message.
I think the first snippet above could be shorthand for the latter one,
although any ambiguity is bad for keeping schema resolution robust.
I'm hesitant to add shorthands to the JSON schema language, since it
complicates each implementation. Rather, using a higher-level language
like avrogen is probably appropriate here.
Oddly, adding naming to unions, arrays, etc. could actually _reduce_ the
verbosity of the JSON. Yes, one would "have to" name the array, but one
wouldn't be forced to create an anonymous record inside a field, since fields
must be named and almost everything in a *.avsc is a field.
Naming arrays and maps would make them less natural in most programming
languages, where arrays and maps are unnamed. And languages with runtime
type information (Java, Python, Ruby, etc.) most naturally represent unions
implicitly.
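For instance, in the Java generic API a union-typed field is just an Object
slot; the runtime class of the value selects the branch (a sketch against
the current API, reusing the invented "Host" schema from above):

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class UnionValue {
  public static void main(String[] args) {
    Schema host = new Schema.Parser().parse(
        "{\"name\": \"Host\", \"type\": \"record\", \"fields\": ["
        + " {\"name\": \"ipAddr\", \"type\": ["
        + "  {\"name\": \"IPv4\", \"type\": \"fixed\", \"size\": 4},"
        + "  {\"name\": \"IPv6\", \"type\": \"fixed\", \"size\": 16}]}]}");
    Schema ipv4 = host.getField("ipAddr").schema().getTypes().get(0);
    GenericRecord r = new GenericData.Record(host);
    // No wrapper object names the union branch; a 4-byte fixed value
    // implies the IPv4 branch at runtime.
    r.put("ipAddr", new GenericData.Fixed(ipv4, new byte[4]));
  }
}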
All of this, including AVRO-248, makes me a bit concerned about data
migration. If I have a data file serialized with a schema where unions are
unnamed, and then later upgrade, how am I to resolve those? Should an Avro
schema always contain a "version": "1.3" or something similar in the record
definition? Like namespaces, the version could be assumed to propagate down
to children unless overridden. I will want future code to be able to read
old schemas, or at minimum to break very reliably.
We will not change the schema language lightly for this very reason.
There have been very few changes to it since Avro was first proposed.
As Avro is deployed, we should make changes even more reluctantly.
An incompatible change to schemas would force an Avro 2.0 release, and
would probably also require that all non-1.x schemas then include a
"version": 2 or some such. Personally, I hope we never do this, and that
the schema language is, for better or worse, mostly fixed forever.
* In-memory data representation
Avro is very good at reducing serialized size, but doesn't optimize memory
footprint. None of this is a big deal for the typical Hadoop use case, but
for my use cases, where I want to serialize these things into BDBs or some
other key/value store, in-memory footprint is critical. Extra nested object
references can easily consume a lot of memory and reduce the effectiveness
of in-memory caching for key/value stores. Another place you would want to
ensure minimal memory use is in a map-side join.
Although something can be done to trim up the Specific API, I think that an
annotation-based approach (with ASM) will be the most flexible and powerful
in the long run. For example, a fixed-size object can just be a byte[] with
an annotation, so that ASM knows how to decorate the setter/getter to enforce
the size, what the Avro properties for the field are, etc. -- rather than
having to be its own object that inherits from an abstract fixed type. Much
other naming can collapse from objects to methods/annotations this way when
generating classes from schemas, and vice versa when generating schemas from
classes without otherwise altering them.
An annotation-based API might indeed be useful.
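Something along these lines, perhaps (purely hypothetical -- @AvroFixed is
not an existing Avro annotation, just a sketch of the idea above):

import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface AvroFixed {
  int size();
}

class Host {
  // A 16-byte fixed held as a plain byte[]. A bytecode tool such as ASM
  // could read the annotation and weave size-checking into the setter,
  // instead of requiring a generated subclass of an abstract fixed type.
  @AvroFixed(size = 16)
  byte[] ipv6;
}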
Also, have you looked at using ResolvingDecoder and ValidatingEncoder?
These, with little or no performance penalty, allow you to write code
that uses an arbitrary in-memory representation. As an example of this
style, look at the code in
https://issues.apache.org/jira/browse/AVRO-251, notably
Schema#readSchema(), writeSchema, readJson and writeJson. With
generated code, folks often need to write wrapper code to convert from a
possibly pre-existing, manually-maintained representation to the
generated representation and vice-versa. Writing to the Encoder/Decoder
API directly and using ResolvingDecoder and ValidatingEncoder requires
about the same amount of code, is just as safe, handles versioning, and
bypasses the intermediate representation altogether.
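Here is a minimal sketch of that style against the "Thing" schema above,
using the current Java API (the hand-written Thing class stands in for any
pre-existing representation):

import java.io.ByteArrayInputStream;
import java.util.ArrayList;
import java.util.List;
import org.apache.avro.Schema;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.ResolvingDecoder;

public class DirectRead {
  // An arbitrary, manually-maintained representation -- no generated code.
  static class Thing {
    String foo;
    List<Integer> bars = new ArrayList<Integer>();
  }

  static Thing read(Schema writer, Schema reader, byte[] data)
      throws Exception {
    ResolvingDecoder in = DecoderFactory.get().resolvingDecoder(
        writer, reader,
        DecoderFactory.get().binaryDecoder(new ByteArrayInputStream(data), null));
    Thing t = new Thing();
    // readFieldOrder() is where versioning is handled: fields added,
    // removed, or reordered between writer and reader are resolved here.
    for (Schema.Field f : in.readFieldOrder()) {
      if ("foo".equals(f.name())) {
        t.foo = in.readString(null).toString();
      } else if ("bars".equals(f.name())) {
        for (long i = in.readArrayStart(); i != 0; i = in.arrayNext())
          for (long j = 0; j < i; j++)
            t.bars.add(in.readInt());
      }
    }
    return t;
  }
}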
* Schema re-use
Schema re-use is a challenge. Since all the types have to be available in
the same JSON parse, some things get duplicated. I have duplicate GUID and
IpAddr named records inside different *.avsc files, for example. Some
built-in way to get re-use would be helpful. I noticed that some unit tests
seem to pre-process some includes; that looks a bit clunky. What else is
there? It would be useful if the Specific compiler took a set of files and
compiled them all, looking across the file set for named items that don't
yet exist in the file currently being processed. I suppose I could put them
all into one *.avsc in an array, but that wouldn't take long to grow past
50K of text in one file if Avro is used heavily.
Tools like avrogen should help here too, no? The JSON format is meant
to be a low-level, self-contained representation, and I don't think we
should add high-level features to it.
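That said, the Java Schema.Parser does accumulate named types across calls,
so definitions can at least be shared at parse time (a sketch; the file
names are invented):

import java.io.File;
import org.apache.avro.Schema;

public class ParseAll {
  public static void main(String[] args) throws Exception {
    Schema.Parser p = new Schema.Parser();
    p.parse(new File("Guid.avsc"));    // defines GUID
    p.parse(new File("IpAddr.avsc"));  // defines IpAddr
    // Later files may reference GUID and IpAddr without redefining them.
    Schema user = p.parse(new File("User.avsc"));
    System.out.println(user.toString(true));
  }
}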
* Building and Packaging Java
Published Maven artifacts including source would be useful. It's nice to
pull up code in Eclipse, click through to an Avro class, and see the source
without having to configure anything -- or for the debugger to chase source
code into other packages without manually finding it. Especially when
helping others on their machines (since I have the Avro source :D).
Dependencies need to be documented somewhere fairly visible. What does Avro
need for runtime usage of only the Specific API? Of the Generic API? What
extra does it need to generate source code from schemas/protocols? And what
extra does it need to generate protocols/schemas from classes via reflection?
The full "avroj" jar file is huge -- 4MB. Most of it is only needed at build time.
These are all reasonable requests. Please file bug reports for them.
Thanks,
Doug