Re: defaults for complex types (was Re: recursive types)

Andy Le Sun, 22 Mar 2020 00:47:08 -0700

As Ryan said:

> It seems that Java implements `Only the first schema in any union can
be used in a default value` as opposed to `Default values for union
fields correspond to the first schema in the union` (in the example,
it isn't a union field).


I think it's time for us to reconsider this requirement for unions. I've
already customized the Avro code to make it happen.
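
To make the two readings concrete, here is a minimal sketch in plain Python
using the schema and default from Rog's example. It does not use the Avro
library and is not how the Java implementation works; the helper names
(matching_branches, valid_under_first_schema_rule, resolve_first_match) are
hypothetical and exist only for illustration. The "relaxed" function follows
the second interpretation Lee suggested, where a symbol refers to the first
schema in which it exists.

```python
# Illustration only: plain Python, not the Avro API. All names are hypothetical.

# The union used as the array's "items" in the example schema.
UNION = [
    {"type": "enum", "name": "E1", "symbols": ["A", "B"]},
    {"type": "enum", "name": "E2", "symbols": ["B", "A", "C"]},
]

# The default value given for field F.
DEFAULT = ["A", "B", "C"]


def matching_branches(union, symbol):
    """Return the names of every enum branch that contains the symbol."""
    return [branch["name"] for branch in union if symbol in branch["symbols"]]


def valid_under_first_schema_rule(union, default_items):
    """Strict reading: every default item must belong to the first branch."""
    first = union[0]
    return all(item in first["symbols"] for item in default_items)


def resolve_first_match(union, default_items):
    """Relaxed reading: each item takes the first branch that contains it."""
    resolved = []
    for item in default_items:
        names = matching_branches(union, item)
        if not names:
            raise ValueError(f"{item!r} matches no branch of the union")
        resolved.append((names[0], item))
    return resolved


# "A" and "B" appear in both enums, so the bare symbols are ambiguous.
ambiguity = {symbol: matching_branches(UNION, symbol) for symbol in DEFAULT}
strict_ok = valid_under_first_schema_rule(UNION, DEFAULT)
relaxed = resolve_first_match(UNION, DEFAULT)

print(ambiguity)
print(strict_ok)
print(relaxed)
```

Under the strict reading the default is rejected because "C" is not a symbol
of E1; under the relaxed reading "A" and "B" resolve to E1 and "C" to E2.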



On 2019/12/06 10:38:19, Ryan Skraba <r...@skraba.com> wrote: 
> Hello!   I had a Java unit test ready to go (looking at default values
> for complex types for AVRO-2636), so just reporting back (the easy
> work!):
> 
> 1. In Java, the schema above is parsed without error, but when
> attempting to use the default value, it fails with a
> NullPointerException (trying to find the symbol C in E1).
> 
> 2. If you were to disambiguate the symbols using the Avro JSON
> encoding ("default": [{"E1":"B"},{"E2":"A"},{"E2":"C"}]), Java fails
> while parsing the schema:
> 
> org.apache.avro.AvroTypeException: Invalid default for field F:
> [{"E1":"B"},{"E2":"A"},{"E2":"C"}] not a
> {"type":"array","items":[{"type":"enum","name":"E1","symbols":["A","B"]},{"type":"enum","name":"E2","symbols":["B","A","C"]}]}
> at org.apache.avro.Schema.validateDefault(Schema.java:1542)
> at org.apache.avro.Schema.access$500(Schema.java:87)
> at org.apache.avro.Schema$Field.<init>(Schema.java:523)
> at org.apache.avro.Schema.parse(Schema.java:1649)
> at org.apache.avro.Schema$Parser.parse(Schema.java:1396)
> at org.apache.avro.Schema$Parser.parse(Schema.java:1384)
> 
> It seems that Java implements `Only the first schema in any union can
> be used in a default value` as opposed to `Default values for union
> fields correspond to the first schema in the union` (in the example,
> it isn't a union field).
> 
> Naively, I would expect any JSON encoded data to be a valid default
> value (which is not what the spec says).  Does anyone know why the
> "first schema only" rule was added to the spec?
> 
> Best regards, Ryan
> 
> 
> 
> On Thu, Dec 5, 2019 at 7:01 PM Lee Hambley <lee.hamb...@gmail.com> wrote:
> >
> > Hi Rog,
> >
> > Glad my pointers were useful, the Avro spec really is a marvel.
> >
> > Regarding your follow-up question, I'm honestly not sure. It's an
> > interesting contrived example, however, and it's interesting that no matter
> > how well written the spec is, it can still be ambiguous.
> >
> > I found this snippet in the 1.9.x docs, where I know there were some changes
> > to defaults for complex types; the 1.8 docs may be incomplete in that
> > regard. ( https://avro.apache.org/docs/1.9.0/spec.html#schema_complex )
> >
> >> Default values for union fields correspond to the first schema in the 
> >> union. Default values for bytes and fixed fields are JSON strings, where 
> >> Unicode code points 0-255 are mapped to unsigned 8-bit byte values 0-255.
> >
> >
> > I take `Default values for union fields correspond to the first schema in
> > the union` to mean either that your default including values from the 2nd
> > schema in the union is invalid, *or* that where the member exists in the
> > first schema it refers to the first schema, and when not, it refers to the
> > first schema in which it _does_ exist.
> >
> > One way to find out would be to run some data through a couple of common 
> > implementations, and see how they handle the resulting data, and, maybe 
> > feed that back into Avro docs in the form of a PR if you come up with 
> > something useful?
> >
> > Either way, I'm curious now! Let me know when you have an answer?
> >
> > Cheers,
> >
> > Lee Hambley
> > http://lee.hambley.name/
> > +49 (0) 170 298 5667
> >
> >
> > On Thu, 5 Dec 2019 at 14:07, roger peppe <rogpe...@gmail.com> wrote:
> >>
> >> On Wed, 4 Dec 2019 at 11:38, Lee Hambley <lee.hamb...@gmail.com> wrote:
> >>>
> >>> Hi Rog,
> >>>
> >>> Good question. The answer lies in the docs, in the "Parsing Canonical Form
> >>> for Schemas" section, where it states (amongst all the other
> >>> transformation rules):
> >>>
> >>>> [ORDER] Order the appearance of fields of JSON objects as follows: name, 
> >>>> type, fields, symbols, items, values, size. For example, if an object 
> >>>> has type, name, and size fields, then the name field should appear 
> >>>> first, followed by the type and then the size fields.
> >>>
> >>>
> >>> (emphasis mine)
> >>>
> >>> The canonical form for schemas becomes more relevant to Avro usage when
> >>> working with a schema registry, for example, but it's a really common
> >>> use-case and I consider the definition of a canonical form for schema
> >>> comparisons to be a strength of Avro compared with other serialization
> >>> formats.
> >>>
> >>> - 
> >>> https://avro.apache.org/docs/1.8.2/spec.html#Parsing+Canonical+Form+for+Schemas
> >>
> >>
> >> Thanks very much - I'd missed that, very helpful!
> >>
> >> Maybe you might be able to help with another part of the spec that I've 
> >> been puzzling over too: default values for complex types.
> >> The spec doesn't seem to say how unions in complex types are specified 
> >> when in default values.
> >>
> >> For example, consider the following schema:
> >>
> >> {
> >>     "type": "record",
> >>     "name": "R",
> >>     "fields": [
> >>         {
> >>             "name": "F",
> >>             "type": {
> >>                 "type": "array",
> >>                 "items": [
> >>                     {
> >>                         "type": "enum",
> >>                         "name": "E1",
> >>                         "symbols": ["A", "B"]
> >>                     },
> >>                     {
> >>                         "type": "enum",
> >>                         "name": "E2",
> >>                         "symbols": ["B", "A", "C"]
> >>                     }
> >>                 ]
> >>             },
> >>             "default": ["A", "B", "C"]
> >>         }
> >>     ]
> >> }
> >>
> >> This seems like it should be valid according to the spec, because default 
> >> value encodings don't encode the type name in enums, unlike in the JSON 
> >> encoding, but in this case there seems to be no way to tell which enum types end
> >> up in the array value of the field F, because the enum symbols themselves 
> >> are ambiguous.
> >>
> >> How are schema validators meant to resolve this ambiguity?
> >>
> >>  cheers,
> >>     rog.
> >>
> >>>
> >>> HTH,
> >>>
> >>> Lee Hambley
> >>> http://lee.hambley.name/
> >>> +49 (0) 170 298 5667
> >>>
> >>>
> >>> On Wed, 4 Dec 2019 at 12:17, roger peppe <rogpe...@gmail.com> wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>> My apologies in advance if this topic has been well discussed before - 
> >>>> the mailing list search tool appears to be broken (the link points to 
> >>>> the expired domain name "search-hadoop.com").
> >>>>
> >>>> I'm trying to understand recursive types in Avro, given that the
> >>>> specification says about names:
> >>>>
> >>>>> a name must be defined before it is used ("before" in the depth-first, 
> >>>>> left-to-right traversal of the JSON parse tree, where the types 
> >>>>> attribute of a protocol is always deemed to come "before" the messages 
> >>>>> attribute.)
> >>>>
> >>>>
> >>>> By my reading, this would make the following Avro schema invalid, 
> >>>> because the name "R" will not yet be defined when it's referenced inside 
> >>>> the type of the field F, because in depth-first order, the leaf is 
> >>>> traversed before the root.
> >>>>
> >>>> {
> >>>>     "type": "record",
> >>>>     "fields": [
> >>>>         {"name": "F", "type": ["null", "R"]}
> >>>>     ],
> >>>>     "name": "R"
> >>>> }
> >>>>
> >>>> It seems that types like this are valid in practice (I found the above 
> >>>> example in an Avro test suite), so could someone enlighten me as to how 
> >>>> this is allowed, please?
> >>>>
> >>>> Thanks for any info. If I'm asking in the wrong place, please advise me 
> >>>> of a better forum!
> >>>>
> >>>>     rog.
> >>>>
> >>>>
>
