Clemens Vasters created AVRO-3986:
-------------------------------------
Summary: "Plain JSON" encoding for Apache Avro
Key: AVRO-3986
URL: https://issues.apache.org/jira/browse/AVRO-3986
Project: Apache Avro
Issue Type: New Feature
Components: interop
Reporter: Clemens Vasters
Markdown version of this text: https://gist.github.com/clemensv/8145234add81633d4a21817b1e134a82

- [Notational Conventions](#notational-conventions)
- [Interoperability issues of the Avro JSON Encoding with common JSON usage](#interoperability-issues-of-the-avro-json-encoding-with-common-json-usage)
- [The "Plain JSON" encoding](#the-plain-json-encoding)

The Apache Avro project defines a JSON Encoding, which is optimized for encoding data in JSON but primarily aimed at exchanging data between implementations of the Apache Avro specification. The choices made for this encoding severely limit interoperability with other JSON serialization frameworks. This document defines an alternate, additional mode for Avro JSON encoders, preliminarily named "Plain JSON", that specifically addresses the identified interoperability blockers. While this document is a proposal for a set of new features in Apache Avro, the extensibility of Avro's schema model allows for the implementation of these features separately from the Avro project.

Of the available and popular schema languages for data exchange, Avro Schema provides the cleanest foundation for mapping wire representations to programming language types and database tables, which is why interoperability of Avro with the most popular text encoding format for structured data, JSON, is very desirable. With Avro's strength and focus being its binary encoding, supporting JSON is specifically desirable in interoperability scenarios where either the producer or the consumer of the encoded data uses a different JSON encoding framework, or where JSON is crafted or evaluated directly by the application.

As most JSON document instances can be structurally described by Avro Schema, the interoperability case is for JSON data, described by Avro Schema, to be accepted by an Apache Avro messaging application, and for that data then to be forwarded onwards using the Avro binary encoding.
Conversely, it must be possible for an application to transform an Avro binary encoded data structure into JSON data that is understood by parties that expect to handle JSON. The kinds of applications requiring such transformation capabilities are stream processing frameworks, API gateways and (reverse) proxies, and integration brokers.

The intent of this proposal is for the Avro "JsonEncoder" implementations to have a new mode parameter, accepting an enumeration choice between "Avro JSON" (AVRO_JSON, AvroJson, etc.), which is Avro's default JSON encoding, and "Plain JSON" (PLAIN_JSON, PlainJson, etc.). The rules for the "Plain JSON" mode are described herein.

The "Plain JSON" mode is a selector that enables a set of features described below. Implementations MAY also choose to make these features individually selectable for the "Avro JSON" mode, for instance letting the user use the "Avro JSON" mode primarily while opting into the binary data handling or date-time handling features described here. However, the "Plain JSON" mode that combines these features MUST be implemented to ensure interoperability.

## Notational Conventions

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

## Interoperability issues of the Avro JSON Encoding with common JSON usage

There are several distinct issues in the Avro JSON Encoding that cause conflicts with common usage of JSON and many serialization frameworks. It needs to be emphasized that none of these issues are conformance issues with the JSON specification (RFC 8259); rather, they stem from the JSON specification's inherent limitations. JSON does not define binary data, date, or time types. JSON also has no concept of a type-hint for data structures (i.e.
objects), which would allow serialization frameworks to establish an unambiguous mapping between a data type in a programming language or schema and the encoded type in JSON. There are, however, commonly used conventions to address these shortcomings of the core JSON specification:

- Binary data: Binary data is commonly encoded using the Base64 encoding and stored in string-typed values.
- Date and time data: Date and time data is commonly encoded using the RFC 3339 profile of ISO 8601 and stored in string-typed values.
- Type hints: In its native type system, JSON value types are distinguished by notation, where 'null' values, strings, numbers, booleans, arrays, and objects are identifiable through the syntax. While JSON has no further data type concepts, several serialization frameworks and even some standards leaning on JSON (e.g. OpenAPI) introduce the notion of a "discriminator" property, which sits inside the encoded object and unambiguously identifies the type, such that the decoding stage can instantiate and populate the correct type in cases where multiple candidate types exist.

On each of these items, the Avro JSON encoding's choices are in direct conflict with predominant practice:

- Binary data: Binary data is encoded in strings using Unicode escape sequences (example: "\u00DE\u00AD\u00BE\u00EF"), which leads to a 500% overhead compared to the encoded bytes, vs. a 33% overhead when using Base64.
- Date and time data: Avro handles date and time as logical types, extending either long or int, using the UNIX epoch as the baseline. Durations are expressed using a bespoke data structure. As there are no handling rules for logical types in the JSON encoding, the encoded results are therefore epoch numbers without annotations like time zone offsets.
- Type hints: Whenever types can be ambiguous in Avro, which is the case with type unions, the Avro JSON encoding prescribes encoding the value wrapped inside an object with a single property whose name is the type name, e.g. `"myprop": {"string": "value"}`. 'null' values are encoded as 'null', e.g. `"myprop": null`. For primitive types, this is in conflict with JSON's native type model, which already makes the distinction syntactically. For object types (Avro records), the wrapper is in conflict with standing practice where the discriminator is inlined.

In addition, there are three general limitations of Avro's type and schema model that result in potential interoperability blockers:

- Avro represents decimal numeric types as a logical type annotating `fixed` or `bytes`, which results in an encoded byte sequence in the JSON encoding that cannot be interpreted without the Avro schema and is therefore undecipherable for regular JSON consumers.
- `name` fields in Avro are limited to a character set that can be easily mapped to most any programming language and database, but JSON object keys are not.
- JSON documents may have top-level arrays and maps, while Avro schemas only allow `record` and `enum` as independent types and therefore at the top level of a schema.

As a consequence, the current implementations of the Avro JSON Encoding do not interoperate well with "plain JSON" as input and often do not yield useful plain JSON as output. There is a "happy path" on which the Avro JSON Encoding does line up with common usage, but it is easy to stray from it.

## The "Plain JSON" encoding

The Plain JSON encoding mode of Apache Avro consists of a combination of seven distinct features that are defined in this section. The design is grounded in the relevant IETF RFCs and provides the broadest interoperability with common usage of JSON, while preserving type integrity and precision in all cases where the Avro Schema is known to the decoding party.
The features are designed to be orthogonal and can be implemented separately.

- [1: Alternate names](#feature-1-alternate-names)
- [2: Avro `binary` and `fixed` type data encoding](#feature-2-avro-binary-and-fixed-type-data-encoding)
- [3: Avro `decimal` logical type data encoding](#feature-3-avro-decimal-logical-type-data-encoding)
- [4: Avro time, date, and duration logical types](#feature-4-avro-time-date-and-duration-logical-types)
- [5: Handling unions with primitive type values and enum values](#feature-5-handling-unions-with-primitive-type-values-and-enum-values)
- [6: Handling unions of record values and of maps](#feature-6-handling-unions-of-record-values-and-of-maps)
- [7: Document root records](#feature-7-document-root-records)

Features 2, 3, 4, and 5 are trivial to implement on all platforms and frameworks that handle JSON. Features 1 and 7 are hints that enable the JSON encoder and decoder to handle JSON data that does not conform to Avro's naming and structure constraints. Feature 6 provides a mechanism for handling unions of record types that is aligned with common JSON encoding frameworks and JSON Schema's "oneOf" type composition.

### Feature 1: Alternate names

JSON objects allow keys with arbitrary Unicode strings, the only restriction being uniqueness of keys within an object. Uniqueness is a "SHOULD" rule in [RFC 8259, Section 4](https://www.rfc-editor.org/rfc/rfc8259#section-4), which is interpreted as REQUIRED for this specification since it is common practice. The character set permitted for Avro names is constrained by the regular expression `[A-Za-z_][A-Za-z0-9_]*`, which poses an interoperability problem with JSON, especially in scenarios where internationalization is a concern. While English is the dominant language in most developer scenarios, metadata might be defined by end-users and in their own language.
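A quick check shows which JSON keys fall outside that name constraint (a sketch; the example keys are chosen purely for illustration):

```python
import re

# The Avro name constraint quoted above.
AVRO_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

keys = ["Artikelschlüssel", "order-id", "2ndAddress", "valid_name"]

# Keys that an Avro `name` field could not carry verbatim.
rejected = [k for k in keys if not AVRO_NAME.fullmatch(k)]
print(rejected)  # ['Artikelschlüssel', 'order-id', '2ndAddress']
```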
It is also fairly common for JSON object keys to contain word-separator characters other than '_', and keys may well start with a number. As the Avro project will presumably want to avoid introducing schema attributes that are JSON-specific and will want to use new schema constructs for additional needs as they arise, the alternate names feature introduces a map of alternate names, of which the plain JSON feature reserves one key:

#### `altnames` map

Wherever Avro Schema requires a `name` field, an `altnames` map MAY be defined alongside the `name` field, providing a map of alternate names. Those names may be local-language identifiers, display names, or names that contain characters disallowed in Avro. The map key identifies the context in which the alternate name is used. This specification reserves the `json` key in the `altnames` map.

> A display-name feature might reserve `display:{IANA-subtag}` as keys. This
> assumed convention is used in the following example just for illustration of
> the `altnames` feature.

Assume the following JSON input document with German-language keys that represents a row in a commercial order document:

```JSON
{
  "Artikelschlüssel": "1234",
  "Stückzahl": 42,
  "Größe": "Extragroß"
}
```

Without the alternate names feature, the Avro schema would not be able to match the keys in the JSON document, since `ü` and `ß` are not allowed in Avro names.
With the alternate names feature, the schema can be defined as follows:

```JSON
{
  "type": "record",
  "namespace": "com.example",
  "name": "Article",
  "fields": [
    {
      "name": "articleKey",
      "type": "string",
      "altnames": {
        "json": "Artikelschlüssel",
        "display:de": "Artikelschlüssel",
        "display:en": "Article Key"
      }
    },
    {
      "name": "quantity",
      "type": "int",
      "altnames": {
        "json": "Stückzahl",
        "display:de": "Stückzahl",
        "display:en": "Quantity"
      }
    },
    {
      "name": "size",
      "type": "sizeEnum",
      "altnames": {
        "json": "Größe",
        "display:de": "Größe",
        "display:en": "Size"
      }
    }
  ]
}
```

When the JSON encoder or decoder processes a named item, it MUST use the value from the `altnames` entry with the `json` key, when present, as the name of the corresponding JSON element.

#### `altsymbols` map

The `altsymbols` map is a feature similar to `altnames`, but it provides alternate names for enum symbols. As with `altnames`, the `altsymbols` map key identifies the context in which the alternate name is used. The values of the `altsymbols` map are maps where the keys are symbols as defined in the `symbols` field and the values are the corresponding alternate names. Any symbol key present in the `altsymbols` map MUST exist in the `symbols` field. Symbols in the `symbols` field MAY be omitted from the `altsymbols` map.

```JSON
{
  "type": "enum",
  "name": "sizeEnum",
  "symbols": ["S", "M", "L", "XL"],
  "altsymbols": {
    "json": {
      "S": "Klein",
      "M": "Mittel",
      "L": "Groß",
      "XL": "Extragroß"
    },
    "display:en": {
      "S": "Small",
      "M": "Medium",
      "L": "Large",
      "XL": "Extra Large"
    }
  }
}
```

When the JSON encoder or decoder processes an enum symbol, it MUST use the value from the `altsymbols` entry with the `json` key, when present, as the string representing the enum value.
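The name lookup for `altnames` (and, analogously, `altsymbols`) can be sketched as follows; `json_key` is a hypothetical helper for illustration, not part of any Avro API:

```python
def json_key(field: dict) -> str:
    # Use the reserved 'json' entry of the altnames map when present,
    # otherwise fall back to the Avro field name.
    return field.get("altnames", {}).get("json", field["name"])

field = {
    "name": "articleKey",
    "type": "string",
    "altnames": {"json": "Artikelschlüssel", "display:en": "Article Key"},
}
print(json_key(field))                             # Artikelschlüssel
print(json_key({"name": "plain", "type": "int"}))  # plain
```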
### Feature 2: Avro `binary` and `fixed` type data encoding

When encoding data typed with the Avro `binary` or `fixed` types, the byte sequence is encoded into and from Base64-encoded string values, conforming with IETF RFC 4648, Section 4.

### Feature 3: Avro `decimal` logical type data encoding

When encoding data typed with the Avro logical `decimal` type, the numeric value is encoded into and from a JSON `number` value. JSON numbers are represented as text and do not lose precision as IEEE 754 floating-point numbers do. When using a JSON library to implement the encoding, decimal values MUST NOT be converted through an IEEE floating-point type (e.g. double or float in most programming languages) but must use the native decimal data type.

### Feature 4: Avro time, date, and duration logical types

When encoding data typed with one of Avro's logical data types for dates and times, the data is encoded into and from a JSON `string` value, which is an expression as defined in IETF RFC 3339. Specifically, the logical types are mapped to grammar elements defined in RFC 3339 as shown in the following table:

|logicalType|RFC 3339 grammar element|
|------------------------|--------------------------------------------------------------|
|`date`|RFC 3339 5.6 "full-date"|
|`time-millis`|RFC 3339 5.6 "partial-time"|
|`time-micros`|RFC 3339 5.6 "partial-time"|
|`timestamp-millis`|RFC 3339 5.6 "date-time"|
|`timestamp-micros`|RFC 3339 5.6 "date-time"|
|`local-timestamp-millis`|RFC 3339 5.6 "date-time", ignoring offset (note RFC 3339 4.4)|
|`local-timestamp-micros`|RFC 3339 5.6 "date-time", ignoring offset (note RFC 3339 4.4)|
|`duration`|RFC 3339 Appendix A "duration"|

### Feature 5: Handling unions with primitive type values and enum values

Unions of primitive types and of enum values are handled through JSON values' (RFC 8259, Section 3) ability to reflect variable types.
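The rules of Features 2, 3, and 4 can be illustrated together in one short sketch (the byte string and epoch value are arbitrary examples; `parse_float=Decimal` stands in for "use the native decimal type"):

```python
import base64
import json
from datetime import datetime, timedelta, timezone
from decimal import Decimal

# Feature 2: bytes/fixed values travel as Base64 strings (RFC 4648, Section 4).
print(base64.b64encode(bytes([0xDE, 0xAD, 0xBE, 0xEF])).decode("ascii"))
# 3q2+7w==

# Feature 3: route JSON numbers through a decimal type, never through an
# IEEE 754 float, so no digits are lost.
doc = json.loads('{"price": 19.9999999999999999}', parse_float=Decimal)
print(doc["price"])                  # 19.9999999999999999
print(float("19.9999999999999999"))  # 20.0 -- what the float detour would do

# Feature 4: timestamp-millis (epoch milliseconds) -> RFC 3339 "date-time".
millis = 1700000000123
ts = datetime(1970, 1, 1, tzinfo=timezone.utc) + timedelta(milliseconds=millis)
print(ts.isoformat(timespec="milliseconds"))  # 2023-11-14T22:13:20.123+00:00
```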
Given a type union of `[string, null]` and a string value "test", an encoded field named "example" is encoded as `"example": null` or `"example": "test"`. For null-valued fields, the JSON encoder MAY omit the field entirely. During decoding, missing fields are set to null. If a default value is defined for the field, decoding MUST set the field value to the default value.

For a type union of `[string, int]` and the string value "2" or the int value 2, an encoded field named "example" is encoded as `"example": "2"` or `"example": 2`.

For a type union of `[null, myEnum]`, with myEnum being an enum type having symbols "test1" and "test2", an encoded field named "example" is encoded as `"example": null`, `"example": "test1"`, or `"example": "test2"`.

Instances of unions of primitive types with arrays and records or maps can also be distinguished through the JSON grammar and type model. Unions of multiple records are discussed in Feature 6 below.

For completeness, these are the updated type mappings of Avro types to JSON types for the plain JSON encoding:

|Avro type|JSON type|Notes|
|------------|---------|-----------------------------------------------------------------------------------|
|null|null|The field MAY be omitted|
|boolean|boolean| |
|int, long|integer| |
|float, double|number| |
|bytes|string|Base64 string, see [Feature 2](#feature-2-avro-binary-and-fixed-type-data-encoding)|
|string|string| |
|record|object| |
|enum|string| |
|array|array| |
|map|object| |
|fixed|string|Base64 string, see [Feature 2](#feature-2-avro-binary-and-fixed-type-data-encoding)|
|date/time|string|See [Feature 4](#feature-4-avro-time-date-and-duration-logical-types)|
|UUID|string| |
|decimal|number|See [Feature 3](#feature-3-avro-decimal-logical-type-data-encoding)|

### Feature 6: Handling unions of record values and of maps

As discussed in the overview, JSON does not have an inherent concept of a type-hint that allows distinguishing object data types.
Indeed, it has no concept of constraining or further specifying the `object` type at all. The JSON Schema project has defined a schema language specifically for JSON data and provides a type concept for `object`. In JSON interoperability scenarios, JSON Schema, or frameworks that infer their type concepts from JSON Schema, will often play a role on the producer or consumer side due to its popularity.

JSON Schema is primarily a schema model that serves to validate JSON documents. Its "oneOf" type composition construct is equivalent in function to Avro's union concept. Out of a choice of multiple type options, exactly one option MUST match the JSON element that is being validated; otherwise the validation fails. Any implementation of a JSON Schema validator must therefore be able to test the given JSON element against all available options and then determine the matching type option. Any implementation of a schema-driven decoder can use the same strategy to select which type to instantiate and populate.

JSON Schema does not define a type-hint for this purpose, but makes it the schema designer's task to create type definitions that are structurally distinct, such that the "oneOf" test always yields exactly one of the types when given JSON element instances. Schema designers therefore occasionally resort to introducing their own type-hints by defining a discriminator property with a single-value `enum` or with a `const` value, where the discriminator property name is the same across the type options but the values of the `enum` or `const` differ. We will lean on this practice in the following.
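The test-all-options strategy just described can be sketched in a much simplified form; `matches` is a hypothetical helper that only checks field presence, whereas a real decoder would validate nested and referenced types as well. The two record shapes are condensed from the contact-list example:

```python
def matches(obj: dict, record_schema: dict) -> bool:
    # A candidate matches if the object decodes completely against it:
    # no unknown keys, and every non-nullable field present.
    names = {f["name"] for f in record_schema["fields"]}
    if not set(obj) <= names:
        return False
    for f in record_schema["fields"]:
        nullable = isinstance(f["type"], list) and "null" in f["type"]
        if not nullable and f["name"] not in obj:
            return False
    return True

customer = {"name": "CustomerRecord", "fields": [
    {"name": "name", "type": "string"},
    {"name": "age", "type": "int"},
    {"name": "customerId", "type": "string"}]}
employee = {"name": "EmployeeRecord", "fields": [
    {"name": "name", "type": "string"},
    {"name": "age", "type": "int"},
    {"name": "employeeId", "type": "string"}]}

elem = {"name": "Alice", "age": 42, "customerId": "1234"}
hits = [s["name"] for s in (customer, employee) if matches(elem, s)]
print(hits)  # ['CustomerRecord'] -- exactly one match, else decoding must fail
```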
### Type structure matching

Consider the following Avro schema with a type union of two record types:

```JSON
{
  "type": "record",
  "name": "ContactList",
  "fields": [
    {
      "name": "contacts",
      "type": {
        "type": "array",
        "items": [
          {
            "type": "record",
            "name": "CustomerRecord",
            "fields": [
              {"name": "name", "type": "string"},
              {"name": "age", "type": "int"},
              {"name": "customerId", "type": "string"}
            ]
          },
          {
            "type": "record",
            "name": "EmployeeRecord",
            "fields": [
              {"name": "name", "type": "string"},
              {"name": "age", "type": "int"},
              {"name": "employeeId", "type": "string"}
            ]
          }
        ]
      }
    }
  ]
}
```

Now consider the following JSON document:

```JSON
{
  "contacts": [
    {"name": "Alice", "age": 42, "customerId": "1234"},
    {"name": "Bob", "age": 43, "employeeId": "5678"}
  ]
}
```

We can clearly distinguish the two record types by the presence of the respectively required `customerId` or `employeeId` field.

When decoding a type union, the JSON decoder MUST test the JSON element against all available type options. A JSON element matches if it can be correctly and completely decoded given the type-union candidate schema, including all applicable nested or referenced definitions. If more than one of the options matches, decoding MUST fail. The JSON decoder MUST select the type option that matches the JSON element and instantiate and populate the corresponding type.

For performance reasons, it is highly desirable to avoid having to test a JSON element against all possible type options in a union and instead have a single property that can be tested first and short-circuits the type matching process. We discuss that next.

### Discriminator property

If we assume the Avro schema to be slightly different, we might end up with an ambiguity that is not as easy to resolve. Let the `employeeId` and `customerId` fields be optional in the schema above, both typed as `["string", "null"]`.
When we now consider the following JSON document, we cannot decide on the type and decoding will fail:

```JSON
{
  "contacts": [
    {"name": "Alice", "age": 42},
    {"name": "Bob", "age": 43}
  ]
}
```

To resolve this ambiguity, we can introduce a discriminator property that clearly identifies the type of the record. Instead of introducing a schema attribute that is specific to JSON, we introduce a new Avro schema attribute `const` that defines a constant value for the field it is defined on. The value of the `const` attribute must match the field type. The value of the field MUST always match the `const` value. During decoding, decoding MUST fail if the field value is not equal to the `const` value. This rule ensures the function of `const` as a discriminator. The `const` attribute is only allowed on fields of primitive types and enum types.

Consider this Avro schema:

```JSON
{
  "type": "record",
  "name": "ContactList",
  "fields": [
    {
      "name": "contacts",
      "type": {
        "type": "array",
        "items": [
          {
            "type": "record",
            "name": "CustomerRecord",
            "fields": [
              {"name": "name", "type": "string"},
              {"name": "age", "type": "int"},
              {"name": "customerId", "type": ["string", "null"]},
              {"name": "type", "type": "string", "const": "customer"}
            ]
          },
          {
            "type": "record",
            "name": "EmployeeRecord",
            "fields": [
              {"name": "name", "type": "string"},
              {"name": "age", "type": "int"},
              {"name": "employeeId", "type": ["string", "null"]},
              {"name": "type", "type": "string", "const": "employee"}
            ]
          }
        ]
      }
    }
  ]
}
```

The JSON document MUST now include the discriminator:

```JSON
{
  "contacts": [
    {"name": "Alice", "age": 42, "type": "customer"},
    {"name": "Bob", "age": 43, "type": "employee"}
  ]
}
```

The `const` attribute MAY otherwise be used for any other purpose. The binary encoder and decoder MAY skip encoding and decoding a field with a `const` attribute and instead always return the constant value for the field, similar to how the `default` field is handled. The `const` value overrides the `default` value.
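The short-circuit that a `const` discriminator enables can be sketched as follows (`select_branch` is a simplified, hypothetical helper; the schemas are trimmed to the discriminator field):

```python
def select_branch(obj, candidates):
    # Test the single const-valued discriminator field first instead of
    # structurally matching the element against every candidate record.
    for schema in candidates:
        for f in schema["fields"]:
            if "const" in f and obj.get(f["name"]) == f["const"]:
                return schema["name"]
    return None  # fall back to full structural matching

customer = {"name": "CustomerRecord", "fields": [
    {"name": "type", "type": "string", "const": "customer"}]}
employee = {"name": "EmployeeRecord", "fields": [
    {"name": "type", "type": "string", "const": "employee"}]}

print(select_branch({"name": "Bob", "age": 43, "type": "employee"},
                    [customer, employee]))  # EmployeeRecord
```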
During encoding, the binary encoder SHOULD check that the field value matches the `const` value and MAY fail encoding if it does not.

### Feature 7: Document root records

Avro schemas are defined as a single record or enum type at the top level or as a top-level type union. JSON documents, however, may have top-level arrays and maps. Without changing the fundamental Avro schema model, the plain JSON encoding mode uses an annotation on `array` and `map` types defined inside `record` types to allow for top-level arrays and maps in the JSON document.

The annotation is a boolean flag named `root` that is set to `true` on one record field's array or map type. The `root` flag is only defined for `array` and `map` types. If the `root` flag is present and has the value `true`, the enclosing `record` type MUST have exactly this one field.

Given a JSON document with a top-level array like this:

```JSON
[
  {"name": "Alice", "age": 42},
  {"name": "Bob", "age": 43}
]
```

the Avro schema would be defined as follows:

```JSON
{
  "type": "record",
  "name": "PersonDocument",
  "fields": [
    {
      "name": "persons",
      "type": {
        "type": "array",
        "root": true,
        "items": {
          "type": "record",
          "name": "PersonRecord",
          "fields": [
            {"name": "name", "type": "string"},
            {"name": "age", "type": "int"}
          ]
        }
      }
    }
  ]
}
```

When the JSON decoder encounters a top-level array or map, it MUST match the array or map to the field with the `root` flag set to `true`. When the `root` flag is present on a field, the JSON encoder MUST yield the encoding of the field as the encoding of the entire record. The JSON encoder MUST fail if the `root` flag is set to `true` and there is more than one field in the record. When such a record type is used as a field type inside another record, it is consequently always represented equivalently to a `map` or `array` type in the JSON document.
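Root-record unwrapping on the encoder side can be sketched as follows (`encode_root` is a minimal, hypothetical helper, reduced to the single-field rule; the schema repeats the person-document example with its item type abbreviated):

```python
def encode_root(record_value: dict, record_schema: dict):
    # If one field's array/map type carries "root": true, the encoding of
    # that field becomes the encoding of the entire record.
    fields = record_schema["fields"]
    for f in fields:
        t = f["type"]
        if isinstance(t, dict) and t.get("root") is True:
            if len(fields) != 1:
                raise ValueError('"root": true requires exactly one field')
            return record_value[f["name"]]
    return record_value

schema = {"type": "record", "name": "PersonDocument", "fields": [
    {"name": "persons",
     "type": {"type": "array", "root": True, "items": "PersonRecord"}}]}

doc = encode_root({"persons": [{"name": "Alice", "age": 42},
                               {"name": "Bob", "age": 43}]}, schema)
print(doc)  # [{'name': 'Alice', 'age': 42}, {'name': 'Bob', 'age': 43}]
```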
In [type structure matching](#type-structure-matching) scenarios, a `root` flag set on a `map` type causes the record type to be a candidate for the type matching of JSON `object` values, and a `root` flag on an `array` type causes the record type to be a candidate for the type matching of JSON `array` values. The Avro binary encoding is not functionally affected by this feature, but the structural constraint imposed by the `root` flag MAY be enforced by the encoder.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)