Scott Carey wrote:
{"name": "ipAddr", "type": [
  {"name": "IPv4", "type": "fixed", "size": 4 },
  {"name": "IPv6", "type": "fixed", "size": 16 }
  ]
}
This won't work, even though these have their own names and generate their own 
classes.

That should work: if it doesn't, that's a bug. It should be possible to include any two named types in a union, so long as their names differ.

I was also confused at first about naming, and how much of it seems unnecessary from a 
"just look at the text" view.  For example:
{"name": "Thing", "type": "record", "fields": [
  {"name": "foo", "type": "string"},
  {"name": "bars", "type": "array", "items": "int"}
  ]
}
is invalid, but
{"name": "Thing", "type": "record", "fields": [
  {"name": "foo", "type": "string"},
  {"name": "bars", "type": {"type": "array", "items": "int"}}
  ]
}
is valid, even though, from a human-readable point of view, the extra information is redundant ("I told it it was an array of ints, but it demands to be an array of ints!" :D)

Have you looked at avrogen? This defines a more user-friendly syntax for schemas.

I understand what it is doing and why, for the most part, but the spec didn't make it clear that some things CAN'T have names, even when they appear in a position that requires a name. (Fields must have names, arrays can't have names, therefore arrays can't be fields? -- half true.)

Anything can be given a name by wrapping it in a one-field record, with no change to the binary encoded form, since records and fields have no overhead.
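For example, an array could be wrapped something like this (names invented for illustration):
{"name": "Bars", "type": "record", "fields": [
  {"name": "value", "type": {"type": "array", "items": "int"}}
  ]
}
Since a record's encoding is just the concatenation of its fields' encodings, anywhere "Bars" is referenced it writes exactly the same bytes as the bare array would.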

Some examples with counter-examples would be a plus.

Please file a documentation bug for this.  Thanks!

Also, the error message from the first one above is very confusing for someone new: "invalid name 'array'" -- um, I named it "bars"; "array" is the type.

Please file a bug to improve this error message.

I think the first snippet above could be shorthand for the second, although any ambiguity is bad for making schema resolution robust.

I'm hesitant to add shorthands to the JSON schema language, since it complicates each implementation. Rather, using a higher-level language like avrogen is probably appropriate here.

Oddly, adding naming to unions, arrays, etc. has the possibility of _reducing_ the verbosity of the JSON. Yes, one would "have to" name the array, but one wouldn't be forced to create an anonymous record inside a field, since fields must be named and almost everything in a *.avsc is a field.

Naming arrays and maps would make them less natural in most programming languages, where arrays and maps are unnamed. And languages with runtime typing (Java, Python, Ruby, etc.) most naturally represent unions implicitly, via that same runtime typing.
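For instance, a field with a union type like this (field name invented for illustration):
{"name": "maybeAddr", "type": ["null", "string"]}
is most naturally just a possibly-null string in such languages, with no named wrapper type in sight.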

All of this, including AVRO-248, makes me a bit concerned about data migration.  If I have a data file serialized with a schema where unions are unnamed, and then later upgrade, how am I to resolve those?  Should an Avro schema always contain a "version": "1.3" or something similar in the record definition?  Like namespaces, it could be assumed that the version propagates down to children unless overridden.  I will want future code to be able to read old schemas, or at a minimum to break very reliably.
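Hypothetically, something like this ("version" is not a defined schema attribute today):
{"name": "Thing", "type": "record", "version": "1.3", "fields": [
  {"name": "foo", "type": "string"}
  ]
}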

We will not change the schema language lightly for this very reason. There have been very few changes to it since Avro was first proposed. As Avro is deployed, we should make changes even more reluctantly.

An incompatible change to schemas would force an Avro 2.0 release, and would probably also require that all non-1.x schemas then include a "version": 2 or some such. Personally, I hope we never do this, and that the schema language is, for better or worse, for the most part fixed forever.

* In-memory data representation
    Avro is very good at reducing serialized size, but it doesn't optimize memory footprint.  None of this is a big deal for the typical Hadoop use case, but for my use cases, where I want to serialize these things into BDBs or some other key/value store, in-memory footprint is critical.  Extra nested object references can easily consume a lot of memory and reduce the effectiveness of in-memory caching for key/value stores.  Another case where you would want to make sure minimal memory is used is a map-side join.
    Although something can be done to trim up the Specific API, I think an annotations-based approach (with ASM) will be the most flexible and powerful in the long run.  For example, a fixed-size object can just be a byte[] with an annotation, so that ASM knows how to decorate the setter/getter to enforce the size, what the Avro properties for the field are, etc. -- rather than having to be its own object that inherits from an abstract fixed type.  Much other naming can collapse from objects to methods/annotations this way when generating classes from schemas, and vice versa when generating schemas from classes without otherwise altering them.

An annotation-based API might indeed be useful.

Also, have you looked at using ResolvingDecoder and ValidatingEncoder? These allow you, with little or no performance penalty, to write code that uses an arbitrary in-memory representation. As an example of this style, look at the code in https://issues.apache.org/jira/browse/AVRO-251, notably Schema#readSchema(), writeSchema, readJson and writeJson. With generated code, folks often need to write wrapper code to convert from a possibly pre-existing, manually-maintained representation to the generated representation and vice versa. Writing to the Encoder/Decoder API directly, using ResolvingDecoder and ValidatingEncoder, requires about the same amount of code, is just as safe, handles versioning, and bypasses the intermediate representation altogether.

* Schema re-use
   Schema re-use is a challenge.  Since all the types have to be available in the same JSON parse, some things get duplicated.  I have duplicate GUID and IpAddr named records inside different *.avsc files, for example.  Some built-in way to support re-use would be helpful.  I noticed that some unit tests seem to pre-process some includes.  That looks a bit clunky.  What else is there?  It would be useful if the Specific compiler took a set of files and compiled them all, looking across the file set for named items that don't yet exist in the one currently being processed.  I suppose I could put them all into one *.avsc in an array.  If Avro is used a lot, though, it won't take long for that to reach 50K-plus of text in one file.

Tools like avrogen should help here too, no? The JSON format is meant to be a low-level, self-contained representation, and I don't think we should add high-level features to it.
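Note also that, within a single parse, a named type need only be defined once and can then be referenced by name, e.g. (names invented for illustration):
[
  {"name": "Guid", "type": "fixed", "size": 16},
  {"name": "Thing", "type": "record", "fields": [
    {"name": "id", "type": "Guid"}
    ]
  }
]
so the one-big-file approach at least involves no duplication of definitions.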

* Building and Packaging Java
 Published Maven artifacts including source would be useful.  It's nice to pull up code in Eclipse, click through on an Avro class and see the source without having to configure anything -- or for the debugger to chase source code into other packages without manually finding the source.  Especially when helping others on their machines (since I have the Avro source :D).
Dependencies need to be documented somewhere fairly visible.  What does Avro need for runtime usage of only the Specific API?  Generic?  What extra does it need to generate source code from schemas/protocols?  What extra does it need to generate protocols/schemas from classes via reflection?  The full "avroj" jar file is huge -- 4MB.  Most of it is only needed at build time.

These are all reasonable requests.  Please file bug reports for them.

Thanks,

Doug
