Re: More idiomatic JSON encoding for unions

roger peppe Tue, 14 Jan 2020 13:58:52 -0800

On Tue, 14 Jan 2020 at 19:26, Zoltan Farkas <zolyfar...@yahoo.com> wrote:


> Makes sense,
>
> We have to agree on the scope of this implementation.
>
> Right now the implementation I have in Java handles only the:
>
> union {null, [some type]} situation.
>
> Are we ok with this for a start?
>

I'm not sure that it's worth publishing a half-way solution, as if people
start using it and a fuller solution is implemented, there will be three
incompatible standards, which isn't ideal.

>
> What I see more, is to handle:
>
> 1) union {string, double}, (although we have to specify behavior for NAN,
> Positive and negative infinity);  union {string, boolean}; ….
>

My thought, as mentioned at the beginning of this thread, is to omit the
wrapping when all the members of the union encode to distinct JSON token
types (the JSON token types being: null, boolean, string, number, object
and array).

I think that we could probably leave out explicit mention of NaN and
infinity, as that's an issue with schemas too, and there's no obviously
good solution. That said, if we *did* want to solve the issue of NaN and
infinity in the future, things might get awkward with respect to this
thread's proposal, because it's likely that the only reasonable way to
solve that issue is to encode NaN and infinity as "NaN" and "±Infinity",
which means that the union ["string", "float"] becomes ambiguous if we
leave out the type name for that case.

It seems that it's not unheard-of to use a string representation for these
float values (see https://issues.apache.org/jira/browse/AVRO-1290).
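
To make that ambiguity concrete, here is a rough Python sketch. The
function name encode_bare is made up for illustration, and the string
spellings for the non-finite values are just the ones assumed above, not
anything Avro currently specifies.

```python
import math

def encode_bare(value):
    """Encode a member of the union ["string", "float"] with no type wrapper.

    Assumption: non-finite floats are written as the JSON strings
    "NaN", "Infinity" and "-Infinity".
    """
    if isinstance(value, float):
        if math.isnan(value):
            return "NaN"
        if math.isinf(value):
            return "Infinity" if value > 0 else "-Infinity"
    return value

# The float NaN and the string "NaN" produce the same JSON value,
# so a decoder cannot tell which union branch was meant.
assert encode_bare(float("nan")) == encode_bare("NaN")
```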

So perhaps we could define the format something like this:


*JSON Encoding*

Except for unions, the JSON encoding is the same as is used to encode
field default values.

The value of a union is encoded in JSON as follows:

   - if all values of the union can be distinguished *unambiguously* (see
   below), the JSON encoding is the same as is used to encode field default
   values for the type
   - otherwise it is encoded as a JSON object with one name/value pair
   whose name is the type's name and whose value is the recursively encoded
   value. For Avro's named types (record, fixed or enum) the user-specified
   name is used, for other types the type name is used.

Unambiguity is defined as follows:

An Avro value can be encoded as one of a set of JSON types:

   - null encodes as {null}
   - boolean encodes as {boolean}
   - int encodes as {number}
   - long encodes as {number}
   - float encodes as {number, string}
   - double encodes as {number, string}
   - bytes encodes as {string}
   - string encodes as {string}
   - any enum encodes as {string}
   - any map encodes as {object}
   - any record encodes as {object}

A union is considered *unambiguous* if the JSON type sets for all the
members of the union form mutually disjoint sets.

Note that float and double are considered ambiguous with respect to string
because in the future, Avro might support encoding NaN and infinity values
as strings.
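
As a rough sketch of how the rule above could be checked, here is some
Python. All of the names (JSON_TYPES, kind, type_name, is_unambiguous,
encode_union) are made up for illustration and are not part of any Avro
library. Schemas are assumed to be already-parsed JSON (strings for
primitive types, dicts for complex ones), namespaces are ignored, and the
table contains only the types listed above.

```python
from itertools import combinations

# JSON type sets copied from the list above. Types not listed there
# (for example array and fixed) are deliberately left out here too.
JSON_TYPES = {
    "null": {"null"},
    "boolean": {"boolean"},
    "int": {"number"},
    "long": {"number"},
    "float": {"number", "string"},
    "double": {"number", "string"},
    "bytes": {"string"},
    "string": {"string"},
    "enum": {"string"},
    "map": {"object"},
    "record": {"object"},
}

def kind(member):
    """Return the Avro type kind of a union member schema."""
    return member["type"] if isinstance(member, dict) else member

def type_name(member):
    """Return the wrapper key: the user-specified name if any, else the type name."""
    if isinstance(member, dict):
        return member.get("name", member["type"])
    return member

def is_unambiguous(union):
    """True if the JSON type sets of all members are mutually disjoint."""
    sets = [JSON_TYPES[kind(m)] for m in union]
    return all(a.isdisjoint(b) for a, b in combinations(sets, 2))

def encode_union(union, member, encoded_value):
    """Wrap an already-encoded value only when the union is ambiguous."""
    if is_unambiguous(union):
        return encoded_value
    return {type_name(member): encoded_value}
```

With these definitions, ["null", "string"] is unambiguous and its values
are written bare, while ["string", "float"] is ambiguous and a float
member is written wrapped, e.g. {"float": 1.5}.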

WDYT?

> 2) Make decimal an Avro first-class type. The current logical type approach is
> not natural in JSON (see https://issues.apache.org/jira/browse/AVRO-2164).
>
> For 1.9.x, 2) is probably a non-starter.
>

Yes, this sounds a bit out of scope to me. It would be nice if decimal
values were represented as a human-readable decimal number (possibly a JSON
string to survive round-trips), but that should perhaps be part of a larger
change to improve decimal support in general. Interestingly, if we were to
be able to represent decimal values as JSON numbers (for example when
they're unambiguously representable as such), that would fit fine with the
above description, because bytes would be considered ambiguous with respect
to float.

  cheers,
    rog.
