Scott Carey wrote:
{"name": "ipAddr", "type": [
{"name": "IPv4", "type": "fixed", "size": 4 },
{"name": "IPv6", "type": "fixed", "size": 16 }
]
}
This won't work, even though these have their own names and generate their own
classes.
That should work: if it doesn't, that's a bug. It should be possible to
include any two named types in a union, so long as their names differ.
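For example, something like this should parse without complaint (a quick
sketch against the current Java API; the enclosing "Host" record name is
invented):

import org.apache.avro.Schema;

public class UnionCheck {
  public static void main(String[] args) {
    // A union of named types is legal as long as the branch names differ.
    Schema s = new Schema.Parser().parse(
        "{\"name\": \"Host\", \"type\": \"record\", \"fields\": ["
        + " {\"name\": \"ipAddr\", \"type\": ["
        + "  {\"name\": \"IPv4\", \"type\": \"fixed\", \"size\": 4},"
        + "  {\"name\": \"IPv6\", \"type\": \"fixed\", \"size\": 16}]}]}");
    System.out.println(s.toString(true));  // would throw SchemaParseException if invalid
  }
}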
I was also confused at first about naming, and how much of it seems unnecessary from a
"just look at the text" view. For example:
{"name": "Thing", "type": "record", "fields": [
{"name": "foo", "type": "string"},
{"name": "bars", "type": "array", "items": "int"}
]
}
is invalid, but
{"name": "Thing", "type": "record", "fields": [
{"name": "foo", "type": "string"},
{"name": "bars", "type": {"type": "array", "items": "int"}}
]
}
is valid, even though from a human-readable point of view the extra
information is redundant ("I told it it was an array of ints, but it demands
to be an array of ints!" :D)
Have you looked at avrogen? This defines a more user-friendly syntax
for schemas.
I understand what it is doing, and why, for the most part, but the Spec
didn't make it clear that some things CAN'T have names -- even when they
appear in a position that requires a name. (Fields must have names; arrays
can't have names; therefore arrays can't be fields? -- half true.)
Anything can be given a name by wrapping it in a one-field record, with
no change to the binary encoded form, since records and fields have no
overhead.
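For example (a sketch; the "MD5" and "Checksum" names are invented), these
two schemas encode a 16-byte value identically:

import org.apache.avro.Schema;

public class WrapFixed {
  public static void main(String[] args) {
    // The bare fixed type: 16 raw bytes on the wire.
    Schema bare = new Schema.Parser().parse(
        "{\"name\": \"MD5\", \"type\": \"fixed\", \"size\": 16}");
    // The same 16 bytes, now reachable through the named field "md5".
    // (A second parser is used so "MD5" isn't seen as a redefinition.)
    Schema wrapped = new Schema.Parser().parse(
        "{\"name\": \"Checksum\", \"type\": \"record\", \"fields\": ["
        + " {\"name\": \"md5\", \"type\":"
        + "  {\"name\": \"MD5\", \"type\": \"fixed\", \"size\": 16}}]}");
    // Records and fields contribute no bytes to the encoding, so both
    // schemas serialize a value to the same 16 bytes.
  }
}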
Some examples with counter-examples would be a plus.
Please file a documentation bug for this. Thanks!
Also, the error message from the first one above is very confusing for
someone new: "invalid name 'array'". Um, I named it "bars"; "array" is the
type.
Please file a bug to improve this error message.
I think the first snippet above could be shorthand for the latter one,
although any ambiguity is bad for keeping schema resolution robust.
I'm hesitant to add shorthands to the JSON schema language, since it
complicates each implementation. Rather, using a higher-level language
like avrogen is probably appropriate here.
Oddly, adding naming to unions, arrays, etc. could actually _reduce_ the
verbosity of the JSON. Yes, one would "have to" name the array, but one
wouldn't be forced to create an anonymous record inside a field, since fields
must be named and almost everything in a *.avsc is a field.
Naming arrays and maps would make them less natural in most programming
languages, where arrays and maps are unnamed. And languages with runtime
type information (Java, Python, Ruby, etc.) most naturally represent unions
implicitly.
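For instance, in the Java generic API a union-typed field is just an Object
slot; the runtime class of the value selects the branch (a sketch against
the current API, reusing the invented "Host" schema from above):

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class UnionValue {
  public static void main(String[] args) {
    Schema host = new Schema.Parser().parse(
        "{\"name\": \"Host\", \"type\": \"record\", \"fields\": ["
        + " {\"name\": \"ipAddr\", \"type\": ["
        + "  {\"name\": \"IPv4\", \"type\": \"fixed\", \"size\": 4},"
        + "  {\"name\": \"IPv6\", \"type\": \"fixed\", \"size\": 16}]}]}");
    Schema ipv4 = host.getField("ipAddr").schema().getTypes().get(0);
    GenericRecord r = new GenericData.Record(host);
    // No wrapper object names the union branch; a 4-byte fixed value
    // implies the IPv4 branch at runtime.
    r.put("ipAddr", new GenericData.Fixed(ipv4, new byte[4]));
  }
}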
All of this, including AVRO-248, makes me a bit concerned about data
migration. If I have a data file serialized with a schema where unions are
unnamed, and then later upgrade, how am I to resolve those? Should an Avro
schema always contain a "version": "1.3" or something similar in the record
definition? Like namespaces, the version could be assumed to propagate down
to children unless overridden. I will want future code to be able to read
old schemas, or at minimum to break very reliably.
We will not change the schema language lightly for this very reason.
There have been very few changes to it since Avro was first proposed.
As Avro is deployed, we should make changes even more reluctantly.
An incompatible change to schemas would force an Avro 2.0 release, and
would probably also require that all non-1.x schemas then include a
"version": 2 or some such. Personally, I hope we never do this, and that
the schema language is, for better or worse, mostly fixed forever.
* In-memory data representation
Avro is very good at reducing serialized size, but doesn't optimize memory
footprint. None of this is a big deal for the typical Hadoop use case, but
for my use cases, where I want to serialize these things into BDBs or some
other key/value store, in-memory footprint is critical. Extra nested object
references can easily consume a lot of memory and reduce the effectiveness
of in-memory caching for key/value stores. Another place you would want to
ensure minimal memory use is in a map-side join.
Although something can be done to trim up the Specific API, I think that an
annotation-based approach (with ASM) will be the most flexible and powerful
in the long run. For example, a fixed-size object can just be a byte[] with
an annotation, so that ASM knows how to decorate the setter/getter to enforce
the size, what the Avro properties for the field are, etc. -- rather than
having to be its own object that inherits from an abstract fixed type. Much
other naming can collapse from objects to methods/annotations this way when
generating classes from schemas, and vice versa when generating schemas from
classes without otherwise altering them.
An annotation-based API might indeed be useful.
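Something along these lines, perhaps (purely hypothetical -- @AvroFixed is
not an existing Avro annotation, just a sketch of the idea above):

import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface AvroFixed {
  int size();
}

class Host {
  // A 16-byte fixed held as a plain byte[]. A bytecode tool such as ASM
  // could read the annotation and weave size-checking into the setter,
  // instead of requiring a generated subclass of an abstract fixed type.
  @AvroFixed(size = 16)
  byte[] ipv6;
}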
Also, have you looked at using ResolvingDecoder and ValidatingEncoder?
These, with little or no performance penalty, allow you to write code
that uses an arbitrary in-memory representation. As an example of this
style, look at the code in
https://issues.apache.org/jira/browse/AVRO-251, notably
Schema#readSchema(), writeSchema, readJson and writeJson. With
generated code, folks often need to write wrapper code to convert from a
possibly pre-existing, manually-maintained representation to the
generated representation and vice-versa. Writing to the Encoder/Decoder
API directly and using ResolvingDecoder and ValidatingEncoder requires
about the same amount of code, is just as safe, handles versioning, and
bypasses the intermediate representation altogether.
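Here is a minimal sketch of that style against the "Thing" schema above,
using the current Java API (the hand-written Thing class stands in for any
pre-existing representation):

import java.io.ByteArrayInputStream;
import java.util.ArrayList;
import java.util.List;
import org.apache.avro.Schema;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.ResolvingDecoder;

public class DirectRead {
  // An arbitrary, manually-maintained representation -- no generated code.
  static class Thing {
    String foo;
    List<Integer> bars = new ArrayList<Integer>();
  }

  static Thing read(Schema writer, Schema reader, byte[] data)
      throws Exception {
    ResolvingDecoder in = DecoderFactory.get().resolvingDecoder(
        writer, reader,
        DecoderFactory.get().binaryDecoder(new ByteArrayInputStream(data), null));
    Thing t = new Thing();
    // readFieldOrder() is where versioning is handled: fields added,
    // removed, or reordered between writer and reader are resolved here.
    for (Schema.Field f : in.readFieldOrder()) {
      if ("foo".equals(f.name())) {
        t.foo = in.readString(null).toString();
      } else if ("bars".equals(f.name())) {
        for (long i = in.readArrayStart(); i != 0; i = in.arrayNext())
          for (long j = 0; j < i; j++)
            t.bars.add(in.readInt());
      }
    }
    return t;
  }
}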
* Schema re-use
Schema re-use is a challenge. Since all the types have to be available in
the same JSON parse, some things get duplicated. I have duplicate GUID and
IpAddr named records inside different *.avsc files, for example. Some
built-in way to get re-use would be helpful. I noticed that some unit tests
seem to pre-process some includes; that looks a bit clunky. What else is
there? It would be useful if the Specific compiler took a set of files and
compiled them all, looking across the file set for named items that don't
yet exist in the file currently being processed. I suppose I could put them
all into one *.avsc in an array, but that wouldn't take long to grow past
50K of text in one file if Avro is used heavily.
Tools like avrogen should help here too, no? The JSON format is meant
to be a low-level, self-contained representation, and I don't think we
should add high-level features to it.
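That said, the Java Schema.Parser does accumulate named types across calls,
so definitions can at least be shared at parse time (a sketch; the file
names are invented):

import java.io.File;
import org.apache.avro.Schema;

public class ParseAll {
  public static void main(String[] args) throws Exception {
    Schema.Parser p = new Schema.Parser();
    p.parse(new File("Guid.avsc"));    // defines GUID
    p.parse(new File("IpAddr.avsc"));  // defines IpAddr
    // Later files may reference GUID and IpAddr without redefining them.
    Schema user = p.parse(new File("User.avsc"));
    System.out.println(user.toString(true));
  }
}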
* Building and Packaging Java
Published Maven artifacts including source would be useful. It's nice to
pull up code in Eclipse, click through to an Avro class, and see the source
without having to configure anything -- or for the debugger to chase source
code into other packages without manually finding it. Especially when
helping others on their machines (since I have the Avro source :D).
Dependencies need to be documented somewhere fairly visible. What does Avro
need for runtime usage of only the Specific API? Of the Generic API? What
extra does it need to generate source code from schemas/protocols? And what
extra does it need to generate protocols/schemas from classes via reflection?
The full "avroj" jar file is huge -- 4MB. Most of it is only needed at build time.
These are all reasonable requests. Please file bug reports for them.
Thanks,
Doug