Hi Clemens,

I propose we wait a bit to give the community a chance to review your email and the points you raise.
Then, we will create the Jira accordingly.

Regards
JB

On Thu, Apr 18, 2024 at 9:20 AM Clemens Vasters <cleme...@microsoft.com> wrote:
>
> Hi JB,
>
> I have not done that yet. I’m happy to break that up into items once I get
> the sense that this is a direction we can get to a consensus on.
>
> Shall I file the whole email as a “New Feature” issue first?
>
> Thanks
> Clemens
>
> From: Jean-Baptiste Onofré <j...@nanthrax.net>
> Sent: Thursday, April 18, 2024 10:17 AM
> To: Clemens Vasters <cleme...@microsoft.com>; user@avro.apache.org
> Subject: Re: Avro JSON Encoding
>
> Hi Clemens
>
> Thanks for the detailed email.
>
> Quick question: did you already create a Jira for each improvement/issue?
>
> I will take the time to read asap.
>
> Thanks
> Regards
> JB
>
> On Thu, Apr 18, 2024 at 09:12, Clemens Vasters via user
> <user@avro.apache.org> wrote:
>
> Hi everyone,
>
> the current JSON Encoding approach severely limits interoperability with
> other JSON serialization frameworks. In my view, the JSON Encoding is only
> really useful if it acts as a bridge into and from JSON-centric applications
> and it currently gets in its own way.
>
> The current encoding being what it is, there should be an alternate mode that
> emphasizes interoperability with JSON “as-is” and allows Avro Schema to
> describe existing JSON document instances such that I can take someone’s
> existing JSON document in on one side of a piece of software and emit Avro
> binary on the other side while acting on the same schema.
>
> There are four specific issues:
>
> Binary Values
> Unions with Primitive Type Values and Enum Values
> Unions with Record Values
> DateTime
>
> One by one:
>
> 1. Binary values:
> ---------------------
>
> Binary values (fixed and bytes) are encoded as escaped Unicode literals.
> While I appreciate the creative trick, it costs 6 bytes for each encoded
> byte.
> I have a hard time finding any JSON libraries that provide a conversion
> of such strings from/to byte arrays, so this approach appears to be
> idiosyncratic for Avro’s JSON Encoding.
>
> The common way to encode binary in JSON is to use base64 encoding and that is
> widely and well supported in libraries. Base64 is 33% larger than plain
> bytes; the encoding chosen here is 500% (!) larger than plain bytes.
>
> The Avro decoder is schema-informed and it knows that a field is expected to
> hold bytes, so it’s easy to mandate base64 for the field content in the
> alternate mode.
>
> 2. Unions with Primitive Type Values and Enum Values
> ---------------------
>
> It’s common to express optionality in Avro Schema by creating a union with
> the “null” type, e.g. [“string”, “null”]. The Avro JSON Encoding opts to
> encode such unions, like any union, as { “{type}”: {value} } when the value
> is non-null.
>
> This choice ignores common practice and the fact that JSON’s values are
> dynamically typed (RFC8259 Section-3) and inherently accommodate unions. The
> conformant way to encode a value choice of null or “string” into a JSON value
> is plainly null and “string”.
>
> “foo”: null
> “foo”: “value”
>
> The “field default values” table in the Avro spec maps Avro types to the JSON
> types null, boolean, integer, number, string, object, and array, all of which
> can be encoded into and, more importantly, unambiguously decoded from a JSON
> value. The only semi-ambiguous case is integer vs. number, which is a
> convention in JSON rather than a distinct type, but any Avro serializer is
> guided by type information and can easily make that distinction.
>
> 3. Unions with Record Values
> ---------------------
>
> The JSON Encoding pattern of unions also covers “record” typed values, of
> course, and this is indeed a tricky scenario during deserialization since
> JSON does not have any built-in notion of type hints for “object” typed
> values.
>
> The problem of having to disambiguate instances of different types in a field
> value is a common one also for users of JSON Schema when using the “oneOf”
> construct, which is equivalent to Avro unions. There are two common
> strategies:
>
> - “Duck Typing”: Every conformant JSON Schema Validator determines the
> validity of a JSON node against a “oneOf” rule by testing the instance
> against all available alternative schema definitions. Validation fails if
> there is not exactly one valid match.
>
> - Discriminators: OpenAPI, for instance, mandates a “discriminator” field
> (see https://spec.openapis.org/oas/latest.html#discriminator-object) for
> disambiguating “oneOf” constructs, whereby the discriminator property is part
> of each instance. That approach informs numerous JSON serialization
> frameworks, which implement discriminators under that assumption.
>
> The Java Jackson library indeed supports the Avro JSON Encoding’s style of
> putting the discriminator into a wrapper field name (JsonTypeInfo annotation,
> JsonTypeInfo.As.WRAPPER_OBJECT). Many other frameworks only support the
> property approach, though, including the two dominant ones for .NET, Pydantic
> for Python, and others. There’s tooling like Redocly that flags that approach
> as a “mistake” (see
> https://redocly.com/docs/resources/discriminator/#property-outside-of-the-object).
>
> What that means is that most existing JSON instances with ambiguous types
> will either use property discriminators or the implementation will rely on
> duck typing as JSON Schema does for validation.
> The Avro JSON Encoding approach is rare and is also counterintuitive for
> anyone comparing the declared object structure and the JSON structure who is
> not familiar with Avro’s encoding rules. It has confused a lot of people in
> our house, for sure.
>
> Proposed is the following approach:
>
> a) add a new, optional “const” attribute that can be applied to any record
> field declaration that is of a primitive type. When present, the attribute
> causes the field to always have this value. In Avro binary encoding, the
> field is not transmitted at all, but the decoder yields it with the given
> value. In Avro JSON encoding, the field is emitted and for serialization to
> succeed for the record type, the field must be present with the given value.
>
> b) perform disambiguation of types by the same principle as JSON Schema for
> oneOf, with a performance preference for matching fields flagged with “const”
> against the incoming JSON node. When the deserializer is configured by schema
> to know what fields and values to look for, there should be no performance
> hit compared to the current approach. Deserialization fails if there is not
> one unambiguous match. That is exactly in line with what JSON Schema
> validation implementations do. JSON Schema also has a “const” construct.
> “Const” or single-valued enums are often used as discriminator helpers with
> JSON Schema’s oneOf.
>
> c) optional: add a new, optional “displayname” attribute that can hold an
> alternate name for the field without the restrictions of the “name” character
> set, so that discriminators like “$type” can be matched. A further upside of
> adding this field is that it can generally be used to match international
> characters in JSON object keys, which are obviously permitted there.
>
> 4. Date Time
> ---------------------
>
> JSON data generally leans on the RFC3339 profile of ISO8601 for dates and
> durations, not least because JSON Schema defines these choices as “format”
> variants for strings.
>
> If the incoming type of a field is a string instead of a number, JSON
> deserialization in the alternate mode should interpret the logicalTypes for
> dates as follows.
>
> “date” – RFC3339 5.6 “full-date”
> “time-millis” – RFC3339 5.6 “partial-time”
> “time-micros” – RFC3339 5.6 “partial-time”
> “timestamp-millis” – RFC3339 5.6 “date-time”
> “timestamp-micros” – RFC3339 5.6 “date-time”
> “local-timestamp-millis” – RFC3339 5.6 “date-time”, ignoring offset (but see
> RFC 3339 4.4)
> “local-timestamp-micros” – RFC3339 5.6 “date-time”, ignoring offset (but see
> RFC 3339 4.4)
> “duration” – RFC3339 Appendix A “duration”
>
> The JSON serialization in the alternate mode should have an option for, and
> default to, serializing dates as strings. Deserialization parsers MAY be
> lenient and also accept RFC1123 5.2.13 date time strings where RFC3339 5.6
> “date-time” is specified, but I’d make that an implementation choice.
>
> Best Regards
> Clemens Vasters
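
For readers following the thread, the disambiguation idea in point 3 b) of the quoted email could look roughly like the sketch below. It is only an illustration under stated assumptions: the “const” attribute is the proposal from the email and not part of Avro today, the Card and Bank schemas and the matches/resolve helpers are made-up names, and the matching is simplified to field names plus “const” values (no type checks, no optional fields).

```python
import json

# Hypothetical record schemas. The "const" attribute is the proposal from the
# email above; it is not part of the Avro specification today.
CARD = {
    "type": "record",
    "name": "Card",
    "fields": [
        {"name": "kind", "type": "string", "const": "card"},
        {"name": "number", "type": "string"},
    ],
}
BANK = {
    "type": "record",
    "name": "Bank",
    "fields": [
        {"name": "kind", "type": "string", "const": "bank"},
        {"name": "iban", "type": "string"},
    ],
}


def matches(schema, node):
    """Return True if a JSON object fits the record schema.

    Simplified: every declared field must be present, no extra keys are
    allowed, and "const" fields must carry the declared value. Field types
    are not checked.
    """
    if not isinstance(node, dict):
        return False
    fields = schema["fields"]
    if set(node) != {f["name"] for f in fields}:
        return False
    return all(node[f["name"]] == f["const"] for f in fields if "const" in f)


def resolve(branches, node):
    """Pick the single union branch that matches, as oneOf validation does."""
    hits = [s for s in branches if matches(s, node)]
    if len(hits) != 1:
        raise ValueError(f"expected exactly one matching branch, got {len(hits)}")
    return hits[0]


doc = json.loads('{"kind": "bank", "iban": "DE00 0000"}')
branch = resolve([CARD, BANK], doc)
print(branch["name"])
```

With these two hypothetical schemas, the document is assigned to the Bank branch by its “kind” value alone. An object that fits no branch, or more than one, raises an error, which mirrors the “exactly one match” rule described for oneOf.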