Thank you Clemens,

This is a very detailed set of proposals, and it looks like it would work.
I do, however, feel we'd need to define a way to handle unions with records.
Your proposal lists various options, of which the discriminator option seems
most portable to me. You mention the "displayName" proposal. I don't like
that, as it mixes data with UI elements. The discriminator option can specify
a fixed or configurable field to hold the type of the record.

Kind regards,
Oscar

--
Oscar Westra van Holthe - Kind <os...@westravanholthe.nl>

On Thu, 18 Apr 2024 at 10:12, Clemens Vasters via user
<user@avro.apache.org> wrote:

> Hi everyone,
>
> the current JSON Encoding approach severely limits interoperability with
> other JSON serialization frameworks. In my view, the JSON Encoding is only
> really useful if it acts as a bridge into and from JSON-centric
> applications and it currently gets in its own way.
>
> The current encoding being what it is, there should be an alternate mode
> that emphasizes interoperability with JSON “as-is” and allows Avro Schema
> to describe existing JSON document instances such that I can take someone’s
> existing JSON document in on one side of a piece of software and emit Avro
> binary on the other side while acting on the same schema.
>
> There are four specific issues:
>
> 1. Binary Values
> 2. Unions with Primitive Type Values and Enum Values
> 3. Unions with Record Values
> 4. DateTime
>
> One by one:
>
> 1. Binary values:
> ---------------------
>
> Binary values (fixed and bytes) are encoded as escaped unicode
> literals. While I appreciate the creative trick, it costs 6 bytes for each
> encoded byte. I have a hard time finding any JSON libraries that provide a
> conversion of such strings from/to byte arrays, so this approach appears to
> be idiosyncratic for Avro’s JSON Encoding.
>
> The common way to encode binary in JSON is to use base64 encoding and that
> is widely and well supported in libraries. Base64 is 33% larger than plain
> bytes, the encoding chosen here is 500% (!)
> larger than plain bytes.
>
> The Avro decoder is schema-informed and it knows that a field is expected
> to hold bytes, so it’s easy to mandate base64 for the field content in the
> alternate mode.
>
> 2. Unions with Primitive Type Values and Enum Values
> ---------------------
>
> It’s common to express optionality in Avro Schema by creating a union with
> the “null” type, e.g. [“string”, “null”]. The Avro JSON Encoding opts to
> encode such unions, like any union, as { “{type}”: {value} } when the value
> is non-null.
>
> This choice ignores common practice and the fact that JSON’s values are
> dynamically typed (RFC8259 Section-3
> <https://www.rfc-editor.org/rfc/rfc8259#section-3>) and inherently
> accommodate unions. The conformant way to encode a value choice of null or
> “string” into a JSON value is plainly null and “string”.
>
> “foo”: null
> “foo”: “value”
>
> The “field default values” table in the Avro spec maps Avro types to the
> JSON types null, boolean, integer, number, string, object, and array, all
> of which can be encoded into and, more importantly, unambiguously decoded
> from a JSON value. The only semi-ambiguous case is integer vs. number,
> which is a convention in JSON rather than a distinct type, but any Avro
> serializer is guided by type information and can easily make that
> distinction.
>
> 3. Unions with Record Values
> ---------------------
>
> The JSON Encoding pattern of unions also covers “record” typed values, of
> course, and this is indeed a tricky scenario during deserialization since
> JSON does not have any built-in notion of type hints for “object” typed
> values.
>
> The problem of having to disambiguate instances of different types in a
> field value is a common one also for users of JSON Schema when using the
> “oneOf” construct, which is equivalent to Avro unions.
> There are two common strategies:
>
> - “Duck Typing”: Every conformant JSON Schema Validator determines the
>   validity of a JSON node against a “oneOf” rule by testing the instance
>   against all available alternative schema definitions. Validation fails if
>   there is not exactly one valid match.
> - Discriminators: OpenAPI, for instance, mandates a “discriminator” field
>   (see https://spec.openapis.org/oas/latest.html#discriminator-object) for
>   disambiguating “oneOf” constructs, whereby the discriminator property is
>   part of each instance. That approach informs numerous JSON serialization
>   frameworks, which implement discriminators under that assumption.
>
> The Java Jackson library indeed supports the Avro JSON Encoding’s style of
> putting the discriminator into a wrapper field name (JsonTypeInfo
> annotation, JsonTypeInfo.As.WRAPPER_OBJECT). Many other frameworks only
> support the property approach, though, including the two dominant ones for
> .NET, Pydantic of Python, and others. There’s tooling like Redocly that
> flags that approach as a “mistake” (see
> https://redocly.com/docs/resources/discriminator/#property-outside-of-the-object
> ).
>
> What that means is that most existing JSON instances with ambiguous types
> will either use property discriminators or the implementation will rely on
> duck typing as JSON Schema does for validation. The Avro JSON Encoding
> approach is rare and is also counterintuitive for anyone comparing the
> declared object structure and the JSON structure who is not familiar with
> Avro’s encoding rules. It has confused a lot of people in our house, for
> sure.
>
> Proposed is the following approach:
>
> a) add a new, optional “const” attribute that can be applied to any record
> field declaration that is of a primitive type. When present, the attribute
> causes the field to always have this value.
> In Avro binary encoding, the field is not transmitted at all, but the
> decoder yields it with the given value. In Avro JSON encoding, the field is
> emitted and for serialization to succeed for the record type, the field
> must be present with the given value.
>
> b) perform disambiguation of types by the same principle as JSON Schema
> for oneOf, with a performance preference for matching fields flagged with
> “const” against the incoming JSON node. When the deserializer is configured
> by schema to know what fields and values to look for, there should be
> no performance hit compared to the current approach. Deserialization fails
> if there is not one unambiguous match. That is exactly in line with what
> JSON Schema validation implementations do. JSON Schema also has a “const”
> construct. “Const” or single-valued enums are often used as discriminator
> helpers with JSON Schema’s oneOf.
>
> c) optional: add a new, optional “displayname” attribute that can hold an
> alternate name for the field without the restrictions of the “name”
> character set, so that discriminators like “$type” can be matched. A
> further upside of adding this field is that it can generally be used to
> match international characters in JSON object keys, which are obviously
> permitted there.
>
> 4. Date Time
> ---------------------
>
> JSON data generally leans on the RFC3339 profile of ISO8601 for dates and
> durations, not least because JSON Schema defines these choices as
> “format” variants for strings.
>
> If the incoming type of a field is a string instead of a number, JSON
> deserialization in the alternate mode should interpret the logicalTypes for
> dates as follows.
>
> - “date” – RFC3339 5.6 “full-date”
> - “time-millis” – RFC3339 5.6 “date-time”
> - “time-micros” – RFC3339 5.6
>   “partial-time”
> - “timestamp-millis” – RFC3339 5.6 “date-time”
> - “timestamp-micros” – RFC3339 5.6 “date-time”
> - “local-timestamp-millis” – RFC3339 5.6 “date-time”, ignoring offset
>   (but see RFC 3339 4.4)
> - “local-timestamp-micros” – RFC3339 5.6 “date-time”, ignoring offset
>   (but see RFC 3339 4.4)
> - “duration” – RFC3339 Appendix A “duration”
>
> The JSON serialization in the alternate mode should have an option for, and
> default to, serializing dates as strings. Deserialization parsers MAY be
> lenient and also accept RFC1123 5.2.13 date time strings where RFC3339 5.6
> “date-time” is specified, but I’d make that an implementation choice.
>
> Best Regards
> Clemens Vasters
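
For illustration, here is a rough Python sketch of how the "const"-based disambiguation in points 3 a) and b) of the quoted proposal might look. It is not Avro library code. The "const" attribute is only proposed, and the schema dictionaries, the `kind` field and the helpers `fits` and `match_branch` are all made up for this example. The check is deliberately shallow: it compares key sets and const values, not full types.

```python
# Hypothetical sketch: choose the union branch for a JSON object by testing
# it against every record schema, as JSON Schema does for "oneOf".
# The "const" attribute below is the PROPOSED one; it does not exist in Avro.

CIRCLE = {
    "type": "record",
    "name": "Circle",
    "fields": [
        {"name": "kind", "type": "string", "const": "circle"},
        {"name": "radius", "type": "double"},
    ],
}

SQUARE = {
    "type": "record",
    "name": "Square",
    "fields": [
        {"name": "kind", "type": "string", "const": "square"},
        {"name": "side", "type": "double"},
    ],
}


def fits(record_schema, node):
    """Shallow test: same keys as the schema, and all const fields match."""
    if not isinstance(node, dict):
        return False
    fields = record_schema["fields"]
    # Simplification: require exactly the declared field names (no defaults).
    if set(node) != {f["name"] for f in fields}:
        return False
    # Fields flagged with "const" act as discriminators.
    return all(
        node[f["name"]] == f["const"] for f in fields if "const" in f
    )


def match_branch(union, node):
    """Return the single matching record schema, or fail if 0 or >1 match."""
    candidates = [schema for schema in union if fits(schema, node)]
    if len(candidates) != 1:
        raise ValueError(
            f"expected exactly one matching branch, found {len(candidates)}"
        )
    return candidates[0]
```

With these assumed schemas, `match_branch([CIRCLE, SQUARE], {"kind": "square", "side": 2.0})` selects the `Square` record, while an object whose `kind` is neither value matches no branch and is rejected. The discriminator stays an ordinary property of the instance, which is the portable option discussed above.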