[ https://issues.apache.org/jira/browse/AVRO-3986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated AVRO-3986: --------------------------------- Labels: pull-request-available (was: ) > "Plain JSON" encoding for Apache Avro > ------------------------------------- > > Key: AVRO-3986 > URL: https://issues.apache.org/jira/browse/AVRO-3986 > Project: Apache Avro > Issue Type: New Feature > Components: interop > Reporter: Clemens Vasters > Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Markdown version of this text: > [https://gist.github.com/clemensv/8145234add81633d4a21817b1e134a82] > - [Notational Conventions](#notational-conventions) > - [Interoperability issues of the Avro JSON Encoding with common JSON > usage](#interoperability-issues-of-the-avro-json-encoding-with-common-json-usage) > - [The "Plain JSON" encoding](#the-plain-json-encoding) > The Apache Avro project defines a JSON Encoding, which is optimized for > encoding > data in JSON, but primarily aimed at exchanging data between implementations > of > the Apache Avro specification. The choices made for this encoding severely > limit > the interoperability with other JSON serialization frameworks. This document > defines an alternate, additional mode for Avro JSON Encoders, preliminarily > named "Plain JSON", that specifically addresses identified interoperability > blockers. > While this document is a proposal for a set of new features in Apache Avro, > the > extensibility of Avro's schema model allows for the implementation of these > features separately from the Avro project. Out of the available and popular > schema languages for data exchange, Avro schema provides the cleanest > foundation > for mapping wire representations to programming language types and database > tables, which is why interoperability of Avro with the most popular text > encoding format for structured data, JSON, is very desirable. 
> With Avro's strength and focus being its binary encoding, supporting JSON is specifically desirable in interoperability scenarios where either the producer or the consumer of the encoded data uses a different JSON encoding framework, or where JSON is crafted or evaluated directly by the application.
> As most JSON document instances can be structurally described by Avro Schema, the interoperability case is for JSON data, described by Avro Schema, to be accepted by an Apache Avro messaging application, and for that data then to be forwarded onwards using Avro binary encoding. Conversely, it needs to be possible for an application to transform an Avro binary encoded data structure into JSON data that is understood by parties that expect to handle JSON. The kinds of applications requiring such transformation capabilities are stream processing frameworks, API gateways and (reverse) proxies, and integration brokers.
> The intent of this proposal is for the Avro "JsonEncoder" implementations to have a new mode parameter, accepting an enumeration choice out of the options "Avro JSON" (AVRO_JSON, AvroJson, etc.), which is Avro's default JSON encoding, and "Plain JSON" (PLAIN_JSON, PlainJson, etc.). The rules for the "Plain JSON" mode are described herein.
> The "Plain JSON" mode is a selector for enabling a set of features that are described below. Implementations MAY also choose to make these features individually selectable for the "Avro JSON" mode, for instance letting the user use the "Avro JSON" mode primarily but opting into the binary data handling or date-time handling features described here. However, the "Plain JSON" mode that combines these features MUST be implemented to ensure interoperability.
> Notational Conventions > The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", > "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be > interpreted as described in RFC 2119. > Interoperability issues of the Avro JSON Encoding with common JSON usage > There are several distinct issues in the Avro JSON Encoding that cause > conflicts > with common usage of JSON and many serialization frameworks. It needs to be > emphasized that none of these issues are conformance issues with the JSON > specification (RFC8259), but rather stem from the JSON specification's > inherent > limitations. JSON does not define binary data, date or time types. JSON also > has > no concept of a type-hint for data structures (i.e. objects), which would > allow > serialization frameworks to establish an unambiguous mapping between a data > type > in a programming language or schema and the encoded type in JSON. > There are, however, commonly used conventions to address these shortcomings of > the core JSON specification: > - Binary data: Binary data is commonly encoded using the base64 encoding and > stored in string-typed values. > - Date and time data: Date and time data is commonly encoded using the > RFC3339 > profile of ISO8601 and stored in string-typed values. > - Type hints: In its native type system, JSON value types are distinguished > by > notation where 'null' values, strings, numbers, booleans, arrays, and objects > are identifiable through the syntax. While JSON has no further data type > concepts, several serialization frameworks and even some standards leaning on > JSON (e.g. OpenAPI) introduce the notion of a "discriminator" property, which > is inside the encoded object and unambiguously identifies the type such > that the decoding stage can instantiate and populate the correct type > in cases where multiple candidate types exist. 
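> These three conventions are directly expressible with standard-library tools in most languages; the following Python sketch is illustrative only and not part of the proposal:

```python
import base64
from datetime import datetime, timezone

# Binary data: base64-encode bytes into a JSON-safe string (RFC 4648).
blob = base64.b64encode(b"\xde\xad\xbe\xef").decode("ascii")

# Date and time: an RFC 3339 "date-time" string via ISO 8601 formatting.
stamp = datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc).isoformat()

# Type hint: an inline discriminator property alongside the payload fields.
hinted = {"type": "customer", "name": "Alice"}

print(blob)   # 3q2+7w==
print(stamp)  # 2024-01-02T03:04:05+00:00
```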
> On each of these items, the Avro JSON encoding's choices are in direct conflict with predominant practice:
> - Binary data: Binary data is encoded in strings using Unicode escape sequences (example: "\u00DE\u00AD\u00BE\u00EF"), which leads to a 500% overhead relative to the encoded bytes, versus a 33% overhead when using Base64.
> - Date and time data: Avro handles date and time as logical types, extending either long or int, using the UNIX epoch as the baseline. Durations are expressed using a bespoke data structure. As there are no handling rules for logical types in the JSON encoding, the encoded results are therefore epoch numbers without annotations like time zone offsets.
> - Type-hints: Whenever types can be ambiguous in Avro, which is the case with type unions, the Avro JSON encoding prescribes encoding the value wrapped inside an object with a single property where the property's name is the type name, e.g. `"myprop": {"string": "value"}`. 'null' values are encoded as 'null', e.g. `"myprop": null`. For primitive types, this is in conflict with JSON's native type model that already makes the distinction syntactically. For object types (Avro records), the wrapper is in conflict with standing practice where the discriminator is inlined.
> In addition, there are three general limitations of Avro's type and schema model that result in potential interoperability blockers:
> - Avro represents decimal numeric types as a logical type annotating `fixed` or `bytes`, which results in an encoded byte sequence in the JSON encoding that cannot be interpreted without the Avro schema and is therefore undecipherable for regular JSON consumers.
> - `name` fields in Avro are limited to a character set that can be easily mapped to most programming languages and databases, but JSON object keys are not.
> - JSON documents may have top-level arrays and maps, while Avro schemas only allow `record` and `enum` as independent types and therefore at the top level of a schema.
> As a consequence, the current implementations of the Avro JSON Encoding do not interoperate well with "plain JSON" as input and often do not yield useful plain JSON as output. There is a "happy path" on which the Avro JSON Encoding does line up with common usage, but it is easy to stray from it.
> The Plain JSON encoding
> The Plain JSON encoding mode of Apache Avro consists of a combination of 7 distinct features that are defined in this section. The design is grounded in the relevant IETF RFCs and provides the broadest interoperability with common usage of JSON, while preserving type integrity and precision in all cases where the Avro Schema is known to the decoding party.
> The features are designed to be orthogonal and can be implemented separately.
> - [1: Alternate names](#feature-1-alternate-names)
> - [2: Avro `binary` and `fixed` type data encoding](#feature-2-avro-binary-and-fixed-type-data-encoding)
> - [3: Avro `decimal` logical type data encoding](#feature-3-avro-decimal-logical-type-data-encoding)
> - [4: Avro time, date, and duration logical types](#feature-4-avro-time-date-and-duration-logical-types)
> - [5: Handling unions with primitive type values and enum values](#feature-5-handling-unions-with-primitive-type-values-and-enum-values)
> - [6: Handling unions of record values and of maps](#feature-6-handling-unions-of-record-values-and-of-maps)
> - [7: Document root records](#feature-7-document-root-records)
> Features 2, 3, 4, and 5 are trivial to implement on all platforms and frameworks that handle JSON. Features 1 and 7 are hints that enable the JSON encoder and decoder to handle JSON data that does not conform to Avro's naming and structure constraints.
Feature 6 provides a mechanism to handle unions of record types that is aligned with common JSON encoding frameworks and JSON Schema's "oneOf" type composition.
> Feature 1: Alternate names
> JSON objects allow for keys with arbitrary Unicode strings, with the only restriction being uniqueness of keys within an object. Uniqueness is a "SHOULD" rule in [RFC8259, Section 4](https://www.rfc-editor.org/rfc/rfc8259#section-4), which is interpreted as REQUIRED for this specification since it is common practice.
> The character set permitted for Avro names is constrained by the regular expression `[A-Za-z_][A-Za-z0-9_]*`, which poses an interoperability problem with JSON, especially in scenarios where internationalization is a concern. While English is the dominant language in most developer scenarios, metadata might be defined by end-users and in their own language. It is also fairly common for JSON object keys to contain word-separator characters other than '_', and keys may well start with a number.
> As the Avro project will presumably want to avoid introducing schema attributes that are JSON-specific and will want to use new schema constructs for additional needs as they arise, the alternate names feature introduces a map of alternate names of which the plain JSON feature reserves a key:
> `altnames` map
> Wherever Avro Schema requires a `name` field, an `altnames` map MAY be defined alongside the `name` field, which provides a map of alternate names. Those names may be local-language identifiers, display names, or names that contain characters disallowed in Avro. The map key identifies the context in which the alternate name is used.
> This specification reserves the `json` key in the `altnames` map.
> > A display-name feature might reserve `display:{IANA-subtag}` as keys. This
> > assumed convention is used in the following example just for illustration of the
> > `altnames` feature.
> Assume the following JSON input document with German-language keys that represents a row in a commercial order document:
> ```JSON
> {
>   "Artikelschlüssel": "1234",
>   "Stückzahl": 42,
>   "Größe": "Extragroß"
> }
> ```
> Without the alternate names feature, the Avro schema would not be able to match the keys in the JSON document since `ü` and `ß` are not allowed. With the alternate names feature, the schema can be defined as follows:
> ```JSON
> {
>   "type": "record",
>   "namespace": "com.example",
>   "name": "Article",
>   "fields": [
>     {
>       "name": "articleKey",
>       "type": "string",
>       "altnames": { "json": "Artikelschlüssel", "display:de": "Artikelschlüssel", "display:en": "Article Key" }
>     },
>     {
>       "name": "quantity",
>       "type": "int",
>       "altnames": { "json": "Stückzahl", "display:de": "Stückzahl", "display:en": "Quantity" }
>     },
>     {
>       "name": "size",
>       "type": "sizeEnum",
>       "altnames": { "json": "Größe", "display:de": "Größe", "display:en": "Size" }
>     }
>   ]
> }
> ```
> When the JSON encoder or decoder processes a named item, it MUST use the value from the `altnames` entry with the `json` key as the name for the corresponding JSON element, when present.
> `altsymbols` map
> The `altsymbols` map is a similar feature to `altnames`, but it provides alternate names for enum symbols. As with `altnames`, the `altsymbols` map key identifies the context in which the alternate name is used. The values of the `altsymbols` map are maps where the keys are symbols as defined in the `symbols` field and the values are the corresponding alternate names.
> Any symbol key present in the `altsymbols` map MUST exist in the `symbols` field. Symbols in the `symbols` field MAY be omitted from the `altsymbols` map.
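> In sketch form, the key translation a plain-JSON decoder performs for `altnames` might look like this (a hypothetical helper, not an Avro library API):

```python
import json

def plain_json_keys_to_avro(record_schema: dict, document: dict) -> dict:
    """Map plain-JSON object keys to Avro field names via altnames["json"]."""
    key_map = {}
    for field in record_schema["fields"]:
        # Fall back to the Avro name when no "json" alternate name is given.
        json_name = field.get("altnames", {}).get("json", field["name"])
        key_map[json_name] = field["name"]
    return {key_map.get(key, key): value for key, value in document.items()}

schema = {
    "type": "record", "name": "Article",
    "fields": [
        {"name": "articleKey", "type": "string",
         "altnames": {"json": "Artikelschlüssel"}},
        {"name": "quantity", "type": "int",
         "altnames": {"json": "Stückzahl"}},
    ],
}
doc = json.loads('{"Artikelschlüssel": "1234", "Stückzahl": 42}')
print(plain_json_keys_to_avro(schema, doc))
# {'articleKey': '1234', 'quantity': 42}
```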
> ```JSON
> {
>   "type": "enum",
>   "name": "sizeEnum",
>   "symbols": ["S", "M", "L", "XL"],
>   "altsymbols": {
>     "json": { "S": "Klein", "M": "Mittel", "L": "Groß", "XL": "Extragroß" },
>     "display:en": { "S": "Small", "M": "Medium", "L": "Large", "XL": "Extra Large" }
>   }
> }
> ```
> When the JSON encoder or decoder processes an enum symbol, it MUST use the value from the `altsymbols` entry with the `json` key as the string representing the enum value, when present.
> Feature 2: Avro `binary` and `fixed` type data encoding
> When encoding data typed with the Avro `binary` or `fixed` types, the byte sequence is encoded into and from Base64-encoded string values, conforming with IETF RFC4648, Section 4.
> Feature 3: Avro `decimal` logical type data encoding
> When encoding data typed with the Avro logical `decimal` type, the numeric value is encoded into and from a JSON `number` value. JSON numbers are represented as text and do not lose precision as IEEE754 floating points do.
> When using a JSON library to implement the encoding, decimal values MUST NOT be converted through an IEEE floating point type (e.g. double or float in most programming languages) but MUST use the native decimal data type.
> Feature 4: Avro time, date, and duration logical types
> When encoding data typed with one of Avro's logical data types for dates and times, the data is encoded into and from a JSON `string` value, which is an expression as defined in IETF RFC3339.
> Specifically, the logical types are mapped to the grammar elements defined in RFC3339 as shown in the following table:
> |logicalType|RFC3339 grammar element|
> |------------------------|-------------------------------------------------------------|
> |`date`|RFC3339 5.6. “full-date”|
> |`time-millis`|RFC3339 5.6. “partial-time”|
> |`time-micros`|RFC3339 5.6. “partial-time”|
> |`timestamp-millis`|RFC3339 5.6. “date-time”|
> |`timestamp-micros`|RFC3339 5.6. “date-time”|
> |`local-timestamp-millis`|RFC3339 5.6. “date-time”, ignoring offset (note RFC 3339 4.4)|
> |`local-timestamp-micros`|RFC3339 5.6. “date-time”, ignoring offset (note RFC 3339 4.4)|
> |`duration`|RFC3339 Appendix A “duration”|
> Feature 5: Handling unions with primitive type values and enum values
> Unions of primitive types and of enum values are handled through JSON values' (RFC8259, Section 3) ability to reflect variable types.
> Given a type union of `[string, null]` and a string value "test", an encoded field named "example" is encoded as `"example": null` or `"example": "test"`. For null-valued fields, the JSON encoder MAY omit the field entirely. During decoding, missing fields are set to null. If a default value is defined for the field, decoding MUST set the field value to the default value.
> For a type union of `[string, int]` and the string value "2" or the int value 2, an encoded field named "example" is encoded as `"example": "2"` or `"example": 2`.
> For a type union of `[null, myEnum]` with myEnum being an enum type having symbols "test1" and "test2", an encoded field named "example" is encoded as `"example": null`, `"example": "test1"`, or `"example": "test2"`.
> Instances of unions of primitive types with arrays, records, or maps can also be distinguished through the JSON grammar and type model. Unions of multiple records are discussed in Feature 6 below.
> For completeness, these are the updated type mappings of Avro types to JSON types for the plain JSON encoding.
> |Avro type|JSON type|Notes|
> |------------|---------|-----------------------------------------------------------------------------------|
> |null|null|The field MAY be omitted|
> |boolean|boolean| |
> |int,long|integer| |
> |float,double|number| |
> |bytes|string|Base64 string, see [Feature 2](#feature-2-avro-binary-and-fixed-type-data-encoding)|
> |string|string| |
> |record|object| |
> |enum|string| |
> |array|array| |
> |map|object| |
> |fixed|string|Base64 string, see [Feature 2](#feature-2-avro-binary-and-fixed-type-data-encoding)|
> |date/time|string|See [Feature 4](#feature-4-avro-time-date-and-duration-logical-types)|
> |UUID|string| |
> |decimal|number|See [Feature 3](#feature-3-avro-decimal-logical-type-data-encoding)|
> Feature 6: Handling unions of record values and of maps
> As discussed in the overview, JSON does not have an inherent concept of a type-hint that allows distinguishing object data types. Indeed, it has no concept of constraining and further specifying the `object` type at all.
> The JSON Schema project has defined a schema language specifically for JSON data and provides a type concept for `object`. In JSON interoperability scenarios, JSON Schema, or frameworks that infer their type concepts from JSON Schema, will often play a role on the producer or consumer side due to its popularity.
> JSON Schema is primarily a schema model that serves to validate JSON documents. Its "oneOf" type composition construct is equivalent in function to Avro's union concept. Out of a choice of multiple type options, exactly one option MUST match the JSON element that is being validated, otherwise the validation fails. Any implementation of a JSON Schema validator must therefore be able to test the given JSON element against all available options and then determine the matching type option.
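> In sketch form, this trial-matching strategy might look like the following (a simplified structural check against required fields, standing in for full schema-driven decoding; the helper name is hypothetical):

```python
def match_record_branch(branches, obj):
    """Return the single union branch whose required fields all appear in obj.

    Simplified sketch: a branch "matches" when every field without a null
    option is present; real decoding would also type-check each value.
    """
    matches = []
    for branch in branches:
        required = {
            f["name"] for f in branch["fields"]
            if not (isinstance(f["type"], list) and "null" in f["type"])
        }
        if required <= set(obj):
            matches.append(branch["name"])
    if len(matches) != 1:
        raise ValueError(f"ambiguous or no match: {matches}")
    return matches[0]

customer = {"type": "record", "name": "CustomerRecord", "fields": [
    {"name": "name", "type": "string"},
    {"name": "age", "type": "int"},
    {"name": "customerId", "type": "string"}]}
employee = {"type": "record", "name": "EmployeeRecord", "fields": [
    {"name": "name", "type": "string"},
    {"name": "age", "type": "int"},
    {"name": "employeeId", "type": "string"}]}

print(match_record_branch([customer, employee],
                          {"name": "Alice", "age": 42, "customerId": "1234"}))
# CustomerRecord
```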
Any implementation of a schema-driven decoder can use the same strategy to select which type to instantiate and populate.
> JSON Schema does not define a type-hint for this purpose, but makes it the schema designer's task to create type definitions that are structurally distinct such that the "oneOf" test always yields exactly one of the types when given JSON element instances. Schema designers then occasionally resort to introducing their own type-hints by either defining a discriminator property with a single-value `enum` or with a `const` value, where the discriminator property name is the same across the type options, but the values of the `enum` or `const` are different. We will lean on this practice in the following.
> Type structure matching
> Consider the following Avro schema with a type union of two record types:
> ```JSON
> {
>   "type": "record",
>   "name": "ContactList",
>   "fields": [
>     {
>       "name": "contacts",
>       "type": {
>         "type": "array",
>         "items": [
>           {
>             "type": "record",
>             "name": "CustomerRecord",
>             "fields": [
>               {"name": "name", "type": "string"},
>               {"name": "age", "type": "int"},
>               {"name": "customerId", "type": "string"}
>             ]
>           },
>           {
>             "type": "record",
>             "name": "EmployeeRecord",
>             "fields": [
>               {"name": "name", "type": "string"},
>               {"name": "age", "type": "int"},
>               {"name": "employeeId", "type": "string"}
>             ]
>           }
>         ]
>       }
>     }
>   ]
> }
> ```
> Now consider the following JSON document:
> ```JSON
> {
>   "contacts": [
>     {"name": "Alice", "age": 42, "customerId": "1234"},
>     {"name": "Bob", "age": 43, "employeeId": "5678"}
>   ]
> }
> ```
> We can clearly distinguish the two record types by the presence of the respectively required `customerId` or `employeeId` field.
> When decoding a type union, the JSON decoder MUST test the JSON element against all available type options. A JSON element matches if it can be correctly and completely decoded given the type-union candidate schema, including all applicable nested or referenced definitions.
If more than one of the options matches, decoding MUST fail. The JSON decoder MUST select the type option that matches the JSON element and instantiate and populate the corresponding type.
> For performance reasons, it is highly desirable to avoid having to test a JSON element against all possible type options in a union and instead to have a single property that can be tested first and short-circuits the type matching process. We discuss that next.
> Discriminator property
> When we assume the Avro schema to be slightly different, we might end up with an ambiguity that is not as easy to resolve. Let the `employeeId` and `customerId` fields be optional in the schema above, both typed as `["string", "null"]`.
> When we now consider the following JSON document, we cannot decide on the type and decoding will fail:
> ```JSON
> {
>   "contacts": [
>     {"name": "Alice", "age": 42},
>     {"name": "Bob", "age": 43}
>   ]
> }
> ```
> To resolve this ambiguity, we can introduce a discriminator property that clearly identifies the type of the record.
> Rather than introducing a schema attribute that is specific to JSON, we introduce a new Avro schema attribute `const` that defines a constant value for the field it is defined on.
> The value of the `const` field MUST match the field type. The value of the field MUST always match the `const` value. During decoding, decoding MUST fail if the field value is not equal to the `const` value. This rule ensures the function of `const` as a discriminator. The `const` field is only allowed on fields of primitive types and enum types.
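> A decoder can exploit `const` fields to short-circuit the trial matching; in sketch form (a hypothetical helper, not an Avro library API):

```python
def match_by_discriminator(branches, obj):
    """Select a union branch via fields carrying a `const` value.

    Sketch only: per the rule above, decoding MUST fail when a const
    field's encoded value disagrees with the schema's constant.
    """
    for branch in branches:
        consts = {f["name"]: f["const"] for f in branch["fields"] if "const" in f}
        if consts and all(obj.get(name) == value for name, value in consts.items()):
            return branch["name"]
    raise ValueError("no branch with a matching const discriminator")

customer = {"name": "CustomerRecord", "fields": [
    {"name": "name", "type": "string"},
    {"name": "type", "type": "string", "const": "customer"}]}
employee = {"name": "EmployeeRecord", "fields": [
    {"name": "name", "type": "string"},
    {"name": "type", "type": "string", "const": "employee"}]}

print(match_by_discriminator([customer, employee], {"name": "Bob", "type": "employee"}))
# EmployeeRecord
```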
> Consider this Avro schema:
> ```JSON
> {
>   "type": "record",
>   "name": "ContactList",
>   "fields": [
>     {
>       "name": "contacts",
>       "type": {
>         "type": "array",
>         "items": [
>           {
>             "type": "record",
>             "name": "CustomerRecord",
>             "fields": [
>               {"name": "name", "type": "string"},
>               {"name": "age", "type": "int"},
>               {"name": "customerId", "type": ["string", "null"]},
>               {"name": "type", "type": "string", "const": "customer"}
>             ]
>           },
>           {
>             "type": "record",
>             "name": "EmployeeRecord",
>             "fields": [
>               {"name": "name", "type": "string"},
>               {"name": "age", "type": "int"},
>               {"name": "employeeId", "type": ["string", "null"]},
>               {"name": "type", "type": "string", "const": "employee"}
>             ]
>           }
>         ]
>       }
>     }
>   ]
> }
> ```
> The JSON document MUST now include the discriminator:
> ```JSON
> {
>   "contacts": [
>     {"name": "Alice", "age": 42, "type": "customer"},
>     {"name": "Bob", "age": 43, "type": "employee"}
>   ]
> }
> ```
> The `const` field MAY otherwise be used for any other purpose. The binary decoder MAY skip encoding and decoding a field with a `const` attribute and instead always return the constant value for the field, similar to how the `default` field is handled. The `const` value overrides the `default` value. During encoding, the binary encoder SHOULD check that the field value matches the `const` value and MAY fail encoding if it does not.
> Feature 7: Document root records
> Avro schemas are defined as a single record or enum type at the top level or as a top-level type union. JSON documents, however, may have top-level arrays and maps. Without changing the fundamental Avro schema model, the plain JSON encoding mode uses an annotation on `array` and `map` types defined inside `record` types to allow for top-level arrays and maps in the JSON document.
> The annotation is a boolean flag named `root` that is set to `true` on one record field's array or map type. The `root` flag is only defined for `array` and `map` types.
If the `root` flag is present and has the value `true`, the enclosing `record` type MUST have exactly this one field.
> Given a JSON document with a top-level array like this:
> ```JSON
> [
>   {"name": "Alice", "age": 42},
>   {"name": "Bob", "age": 43}
> ]
> ```
> The Avro schema would be defined as follows:
> ```JSON
> {
>   "type": "record",
>   "name": "PersonDocument",
>   "fields": [
>     {
>       "name": "persons",
>       "type": {
>         "type": "array",
>         "root": true,
>         "items": {
>           "type": "record",
>           "name": "PersonRecord",
>           "fields": [
>             {"name": "name", "type": "string"},
>             {"name": "age", "type": "int"}
>           ]
>         }
>       }
>     }
>   ]
> }
> ```
> When the JSON decoder encounters a top-level array or map, it MUST match the array or map to the field with the `root` flag set to `true`. When the `root` flag is present on a field, the JSON encoder MUST yield the encoding of the field as the encoding of the entire record. The JSON encoder MUST fail if the `root` flag is set to `true` and there is more than one field in the record.
> When such a record type is used as a field type inside another record, it is consequently always represented equivalently to a `map` or `array` type in the JSON document.
> In [type structure matching](#type-structure-matching) scenarios, a `root` flag set on a `map` type causes the record type to be a candidate for the type matching of JSON `object` values. The `root` flag on an `array` type causes the record type to be a candidate for the type matching of JSON `array` values.
> The Avro binary encoding is not functionally affected by this feature, but the structural constraint imposed by the `root` flag MAY be enforced by the encoder.
-- This message was sent by Atlassian Jira (v8.20.10#820010)