As Ryan said:

> It seems that Java implements `Only the first schema in any union can
> be used in a default value` as opposed to `Default values for union
> fields correspond to the first schema in the union` (in the example,
> it isn't a union field).

I think it's time for us to reconsider this requirement for unions. I've already customized the Avro code to make it happen.

On 2019/12/06 10:38:19, Ryan Skraba <r...@skraba.com> wrote:
> Hello! I had a Java unit test ready to go (looking at default values
> for complex types for AVRO-2636), so just reporting back (the easy
> work!):
>
> 1. In Java, the schema above is parsed without error, but when
> attempting to use the default value, it fails with a
> NullPointerException (trying to find the symbol C in E1).
>
> 2. If you were to disambiguate the symbols using the Avro JSON
> encoding ("default": [{"E1":"B"},{"E2":"A"},{"E2":"C"}]), Java fails
> while parsing the schema:
>
> org.apache.avro.AvroTypeException: Invalid default for field F:
> [{"E1":"B"},{"E2":"A"},{"E2":"C"}] not a
> {"type":"array","items":[{"type":"enum","name":"E1","symbols":["A","B"]},{"type":"enum","name":"E2","symbols":["B","A","C"]}]}
>     at org.apache.avro.Schema.validateDefault(Schema.java:1542)
>     at org.apache.avro.Schema.access$500(Schema.java:87)
>     at org.apache.avro.Schema$Field.<init>(Schema.java:523)
>     at org.apache.avro.Schema.parse(Schema.java:1649)
>     at org.apache.avro.Schema$Parser.parse(Schema.java:1396)
>     at org.apache.avro.Schema$Parser.parse(Schema.java:1384)
>
> It seems that Java implements `Only the first schema in any union can
> be used in a default value` as opposed to `Default values for union
> fields correspond to the first schema in the union` (in the example,
> it isn't a union field).
>
> Naively, I would expect any JSON encoded data to be a valid default
> value (which is not what the spec says). Does anyone know why the
> "first schema only" rule was added to the spec?
>
> Best regards, Ryan
>
>
>
> On Thu, Dec 5, 2019 at 7:01 PM Lee Hambley <lee.hamb...@gmail.com> wrote:
> >
> > Hi Rog,
> >
> > Glad my pointers were useful, the Avro spec really is a marvel.
> >
> > Regarding your follow-up question, I'm honestly not sure, interesting
> > contrived example however, and interesting that no matter how well written
> > the spec is, it can still be ambiguous.
> >
> > I found this snippet in the 1.9x docs, where I know there were some changes
> > to defaults for complex types, the 1.8 docs may be incomplete in that
> > regard. ( https://avro.apache.org/docs/1.9.0/spec.html#schema_complex )
> >
> >> Default values for union fields correspond to the first schema in the
> >> union. Default values for bytes and fixed fields are JSON strings, where
> >> Unicode code points 0-255 are mapped to unsigned 8-bit byte values 0-255.
> >
> >
> > I take `Default values for union fields correspond to the first schema in
> > the union` to mean that your default including values from the 2nd schema
> > in the union is invalid, *or* that where the member exists in the first
> > union it refers to the first union, and when not, it refers to the first
> > schema in which it _does_ exist.
> >
> > One way to find out would be to run some data through a couple of common
> > implementations, and see how they handle the resulting data, and, maybe
> > feed that back into Avro docs in the form of a PR if you come up with
> > something useful?
> >
> > Either way, I'm curious now! Let me know when you have an answer?
> >
> > Cheers,
> >
> > Lee Hambley
> > http://lee.hambley.name/
> > +49 (0) 170 298 5667
> >
> >
> > On Thu, 5 Dec 2019 at 14:07, roger peppe <rogpe...@gmail.com> wrote:
> >>
> >> On Wed, 4 Dec 2019 at 11:38, Lee Hambley <lee.hamb...@gmail.com> wrote:
> >>>
> >>> Hi Rog,
> >>>
> >>> Good question, the answer lay in the docs in the "Parsing Canonical Form
> >>> for Schemas" where it states (amongst all the other transformation rules)
> >>>
> >>>> [ORDER] Order the appearance of fields of JSON objects as follows: name,
> >>>> type, fields, symbols, items, values, size.
> >>>> For example, if an object
> >>>> has type, name, and size fields, then the name field should appear
> >>>> first, followed by the type and then the size fields.
> >>>
> >>>
> >>> (emphasis mine)
> >>>
> >>> The canonical form for schemas becomes more relevant to Avro usage when
> >>> working with a schema registry, for example, but it's a really common use-case
> >>> and I consider definition of a canonical form for schema comparisons to
> >>> be a strength of Avro compared with other serialization formats.
> >>>
> >>> -
> >>> https://avro.apache.org/docs/1.8.2/spec.html#Parsing+Canonical+Form+for+Schemas
> >>
> >>
> >> Thanks very much - I'd missed that, very helpful!
> >>
> >> Maybe you might be able to help with another part of the spec that I've
> >> been puzzling over too: default values for complex types.
> >> The spec doesn't seem to say how unions in complex types are specified
> >> when in default values.
> >>
> >> For example, consider the following schema:
> >>
> >> {
> >>   "type": "record",
> >>   "name": "R",
> >>   "fields": [
> >>     {
> >>       "name": "F",
> >>       "type": {
> >>         "type": "array",
> >>         "items": [
> >>           {
> >>             "type": "enum",
> >>             "name": "E1",
> >>             "symbols": ["A", "B"]
> >>           },
> >>           {
> >>             "type": "enum",
> >>             "name": "E2",
> >>             "symbols": ["B", "A", "C"]
> >>           }
> >>         ]
> >>       },
> >>       "default": ["A", "B", "C"]
> >>     }
> >>   ]
> >> }
> >>
> >> This seems like it should be valid according to the spec, because default
> >> value encodings don't encode the type name in enums, unlike in the JSON
> >> encoding, but in this case there seems to be no way to tell which enum types end
> >> up in the array value of the field F, because the enum symbols themselves
> >> are ambiguous.
> >>
> >> How are schema validators meant to resolve this ambiguity?
> >>
> >> cheers,
> >> rog.
> >>
> >>>
> >>> HTH,
> >>>
> >>> Lee Hambley
> >>> http://lee.hambley.name/
> >>> +49 (0) 170 298 5667
> >>>
> >>>
> >>> On Wed, 4 Dec 2019 at 12:17, roger peppe <rogpe...@gmail.com> wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>> My apologies in advance if this topic has been well discussed before -
> >>>> the mailing list search tool appears to be broken (the link points to
> >>>> the expired domain name "search-hadoop.com").
> >>>>
> >>>> I'm trying to understand about recursive types in Avro, given that the
> >>>> specification says about names:
> >>>>
> >>>>> a name must be defined before it is used ("before" in the depth-first,
> >>>>> left-to-right traversal of the JSON parse tree, where the types
> >>>>> attribute of a protocol is always deemed to come "before" the messages
> >>>>> attribute.)
> >>>>
> >>>>
> >>>> By my reading, this would make the following Avro schema invalid,
> >>>> because the name "R" will not yet be defined when it's referenced inside
> >>>> the type of the field F, because in depth-first order, the leaf is
> >>>> traversed before the root.
> >>>>
> >>>> {
> >>>>   "type": "record",
> >>>>   "fields": [
> >>>>     {"name": "F", "type": ["null", "R"]}
> >>>>   ],
> >>>>   "name": "R"
> >>>> }
> >>>>
> >>>> It seems that types like this are valid in practice (I found the above
> >>>> example in an Avro test suite), so could someone enlighten me as to how
> >>>> this is allowed, please?
> >>>>
> >>>> Thanks for any info. If I'm asking in the wrong place, please advise me
> >>>> of a better forum!
> >>>>
> >>>> rog.
> >>>>
> >>>>
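
To make the ambiguity in the quoted thread concrete, here is a minimal sketch in plain Python. It does not use the Avro library and is not how any Avro implementation is written; the function names are hypothetical, and the two enums simply mirror E1 and E2 from rog's schema. It compares the two readings discussed above: "only the first schema in the union" versus Lee's alternative of "the first schema in which the symbol exists".

```python
# Illustrative only: plain Python, not the Avro library.
# UNION mirrors the enums E1 and E2 from the schema quoted in the thread.
UNION = [
    ("E1", ["A", "B"]),
    ("E2", ["B", "A", "C"]),
]


def matching_branches(symbol, union=UNION):
    """Return the name of every enum in the union that contains `symbol`."""
    return [name for name, symbols in union if symbol in symbols]


def resolve_first_schema_only(symbol, union=UNION):
    """Hypothetical reading 1: only the first union branch may be used."""
    name, symbols = union[0]
    if symbol not in symbols:
        raise ValueError(f"{symbol!r} is not a symbol of {name}")
    return name


def resolve_first_match(symbol, union=UNION):
    """Hypothetical reading 2: use the first branch that contains the symbol."""
    branches = matching_branches(symbol, union)
    if not branches:
        raise ValueError(f"{symbol!r} matches no branch of the union")
    return branches[0]


if __name__ == "__main__":
    for sym in ["A", "B", "C"]:  # the default value from the schema
        print(sym, "candidates:", matching_branches(sym),
              "first match:", resolve_first_match(sym))
```

Under the first reading, "C" is rejected because it is not a symbol of E1, which is at least consistent with Ryan's report that Java goes looking for C in E1. Under the second reading, "A" and "B" resolve to E1 and "C" to E2, so with this default there is no way to select E2's "A" or "B".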