Clemens Vasters created AVRO-3986:
-------------------------------------
Summary: "Plain JSON" encoding for Apache Avro
Key: AVRO-3986
URL: https://issues.apache.org/jira/browse/AVRO-3986
Project: Apache Avro
Issue Type: New Feature
Components: interop
Reporter: Clemens Vasters
Markdown version of this text: https://gist.github.com/clemensv/8145234add81633d4a21817b1e134a82

- [Notational Conventions](#notational-conventions)
- [Interoperability issues of the Avro JSON Encoding with common JSON usage](#interoperability-issues-of-the-avro-json-encoding-with-common-json-usage)
- [The "Plain JSON" encoding](#the-plain-json-encoding)

The Apache Avro project defines a JSON Encoding, which is optimized for encoding data in JSON but primarily aimed at exchanging data between implementations of the Apache Avro specification. The choices made for this encoding severely limit interoperability with other JSON serialization frameworks. This document defines an alternate, additional mode for Avro JSON encoders, preliminarily named "Plain JSON", that specifically addresses the identified interoperability blockers. While this document is a proposal for a set of new features in Apache Avro, the extensibility of Avro's schema model allows for the implementation of these features separately from the Avro project.

Of the available and popular schema languages for data exchange, Avro Schema provides the cleanest foundation for mapping wire representations to programming language types and database tables, which is why interoperability of Avro with the most popular text encoding format for structured data, JSON, is very desirable. With Avro's strength and focus being its binary encoding, supporting JSON is specifically desirable in interoperability scenarios where either the producer or the consumer of the encoded data uses a different JSON encoding framework, or where JSON is crafted or evaluated directly by the application.

As most JSON document instances can be structurally described by Avro Schema, the interoperability case is for JSON data, described by Avro Schema, to be accepted by an Apache Avro messaging application, and for that data then to be forwarded onwards using the Avro binary encoding.
Conversely, it must be possible for an application to transform an Avro binary encoded data structure into JSON data that is understood by parties that expect to handle JSON. The kinds of applications requiring such transformation capabilities are stream processing frameworks, API gateways and (reverse) proxies, and integration brokers.

The intent of this proposal is for the Avro "JsonEncoder" implementations to have a new mode parameter, accepting an enumeration choice between "Avro JSON" (AVRO_JSON, AvroJson, etc.), which is Avro's default JSON encoding, and "Plain JSON" (PLAIN_JSON, PlainJson, etc.). The rules for the "Plain JSON" mode are described herein.

The "Plain JSON" mode is a selector that enables a set of features described below. Implementations MAY also choose to make these features individually selectable for the "Avro JSON" mode, for instance letting the user use the "Avro JSON" mode primarily while opting into the binary data handling or date-time handling features described here. However, the "Plain JSON" mode that combines these features MUST be implemented to ensure interoperability.

## Notational Conventions

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

## Interoperability issues of the Avro JSON Encoding with common JSON usage

There are several distinct issues in the Avro JSON Encoding that cause conflicts with common usage of JSON and many serialization frameworks. It needs to be emphasized that none of these issues are conformance issues with the JSON specification (RFC 8259); rather, they stem from the JSON specification's inherent limitations. JSON does not define binary data, date, or time types. JSON also has no concept of a type-hint for data structures (i.e.
objects), which would allow serialization frameworks to establish an unambiguous mapping between a data type in a programming language or schema and the encoded type in JSON. There are, however, commonly used conventions to address these shortcomings of the core JSON specification:

- Binary data: Binary data is commonly encoded using the Base64 encoding and stored in string-typed values.
- Date and time data: Date and time data is commonly encoded using the RFC 3339 profile of ISO 8601 and stored in string-typed values.
- Type hints: In its native type system, JSON value types are distinguished by notation, where 'null' values, strings, numbers, booleans, arrays, and objects are identifiable through the syntax. While JSON has no further data type concepts, several serialization frameworks and even some standards leaning on JSON (e.g. OpenAPI) introduce the notion of a "discriminator" property, which sits inside the encoded object and unambiguously identifies the type, such that the decoding stage can instantiate and populate the correct type in cases where multiple candidate types exist.

On each of these items, the Avro JSON encoding's choices are in direct conflict with predominant practice:

- Binary data: Binary data is encoded in strings using Unicode escape sequences (example: "\u00DE\u00AD\u00BE\u00EF"), which leads to a 500% overhead compared to the encoded bytes, vs. a 33% overhead when using Base64.
- Date and time data: Avro handles date and time as logical types, extending either long or int, using the UNIX epoch as the baseline. Durations are expressed using a bespoke data structure. As there are no handling rules for logical types in the JSON encoding, the encoded results are therefore epoch numbers without annotations like time zone offsets.
- Type hints: Whenever types can be ambiguous in Avro, which is the case with type unions, the Avro JSON encoding prescribes encoding the value wrapped inside an object with a single property whose name is the type name, e.g. `"myprop": {"string": "value"}`. 'null' values are encoded as 'null', e.g. `"myprop": null`. For primitive types, this is in conflict with JSON's native type model, which already makes the distinction syntactically. For object types (Avro records), the wrapper is in conflict with standing practice where the discriminator is inlined.

In addition, there are three general limitations of Avro's type and schema model that result in potential interoperability blockers:

- Avro represents decimal numeric types as a logical type annotating `fixed` or `bytes`, which results in an encoded byte sequence in the JSON encoding that cannot be interpreted without the Avro schema and is therefore undecipherable for regular JSON consumers.
- `name` fields in Avro are limited to a character set that can be easily mapped to most any programming language and database, but JSON object keys are not.
- JSON documents may have top-level arrays and maps, while Avro schemas only allow `record` and `enum` as independent types and therefore at the top level of a schema.

As a consequence, the current implementations of the Avro JSON Encoding do not interoperate well with "plain JSON" as input and often do not yield useful plain JSON as output. There is a "happy path" on which the Avro JSON Encoding does line up with common usage, but it is easy to stray from it.

## The "Plain JSON" encoding

The Plain JSON encoding mode of Apache Avro consists of a combination of seven distinct features that are defined in this section. The design is grounded in the relevant IETF RFCs and provides the broadest interoperability with common usage of JSON, while preserving type integrity and precision in all cases where the Avro Schema is known to the decoding party.
The features are designed to be orthogonal and can be implemented separately.

- [1: Alternate names](#feature-1-alternate-names)
- [2: Avro `binary` and `fixed` type data encoding](#feature-2-avro-binary-and-fixed-type-data-encoding)
- [3: Avro `decimal` logical type data encoding](#feature-3-avro-decimal-logical-type-data-encoding)
- [4: Avro time, date, and duration logical types](#feature-4-avro-time-date-and-duration-logical-types)
- [5: Handling unions with primitive type values and enum values](#feature-5-handling-unions-with-primitive-type-values-and-enum-values)
- [6: Handling unions of record values and of maps](#feature-6-handling-unions-of-record-values-and-of-maps)
- [7: Document root records](#feature-7-document-root-records)

Features 2, 3, 4, and 5 are trivial to implement on all platforms and frameworks that handle JSON. Features 1 and 7 are hints that enable the JSON encoder and decoder to handle JSON data that does not conform to Avro's naming and structure constraints. Feature 6 provides a mechanism for handling unions of record types that is aligned with common JSON encoding frameworks and JSON Schema's "oneOf" type composition.

### Feature 1: Alternate names

JSON objects allow keys with arbitrary Unicode strings, the only restriction being uniqueness of keys within an object. Uniqueness is a "SHOULD" rule in [RFC 8259, Section 4](https://www.rfc-editor.org/rfc/rfc8259#section-4), which is interpreted as REQUIRED for this specification since it is common practice. The character set permitted for Avro names is constrained by the regular expression `[A-Za-z_][A-Za-z0-9_]*`, which poses an interoperability problem with JSON, especially in scenarios where internationalization is a concern. While English is the dominant language in most developer scenarios, metadata might be defined by end-users and in their own language.
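A quick check shows which JSON keys fall outside that name constraint (a sketch; the example keys are chosen purely for illustration):

```python
import re

# The Avro name constraint quoted above.
AVRO_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

keys = ["Artikelschlüssel", "order-id", "2ndAddress", "valid_name"]

# Keys that an Avro `name` field could not carry verbatim.
rejected = [k for k in keys if not AVRO_NAME.fullmatch(k)]
print(rejected)  # ['Artikelschlüssel', 'order-id', '2ndAddress']
```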
It is also fairly common for JSON object keys to contain word-separator characters other than '_', and keys may well start with a number. As the Avro project will presumably want to avoid introducing schema attributes that are JSON-specific and will want to use new schema constructs for additional needs as they arise, the alternate names feature introduces a map of alternate names, of which the plain JSON feature reserves one key:

#### `altnames` map

Wherever Avro Schema requires a `name` field, an `altnames` map MAY be defined alongside the `name` field, providing a map of alternate names. Those names may be local-language identifiers, display names, or names that contain characters disallowed in Avro. The map key identifies the context in which the alternate name is used. This specification reserves the `json` key in the `altnames` map.

> A display-name feature might reserve `display:{IANA-subtag}` as keys. This
> assumed convention is used in the following example just for illustration of
> the `altnames` feature.

Assume the following JSON input document with German-language keys that represents a row in a commercial order document:

```JSON
{
  "Artikelschlüssel": "1234",
  "Stückzahl": 42,
  "Größe": "Extragroß"
}
```

Without the alternate names feature, the Avro schema would not be able to match the keys in the JSON document, since `ü` and `ß` are not allowed in Avro names.
With the alternate names feature, the schema can be defined as follows:

```JSON
{
  "type": "record",
  "namespace": "com.example",
  "name": "Article",
  "fields": [
    {
      "name": "articleKey",
      "type": "string",
      "altnames": {
        "json": "Artikelschlüssel",
        "display:de": "Artikelschlüssel",
        "display:en": "Article Key"
      }
    },
    {
      "name": "quantity",
      "type": "int",
      "altnames": {
        "json": "Stückzahl",
        "display:de": "Stückzahl",
        "display:en": "Quantity"
      }
    },
    {
      "name": "size",
      "type": "sizeEnum",
      "altnames": {
        "json": "Größe",
        "display:de": "Größe",
        "display:en": "Size"
      }
    }
  ]
}
```

When the JSON encoder or decoder processes a named item, it MUST use the value from the `altnames` entry with the `json` key, when present, as the name of the corresponding JSON element.

#### `altsymbols` map

The `altsymbols` map is a feature similar to `altnames`, but it provides alternate names for enum symbols. As with `altnames`, the `altsymbols` map key identifies the context in which the alternate name is used. The values of the `altsymbols` map are maps where the keys are symbols as defined in the `symbols` field and the values are the corresponding alternate names. Any symbol key present in the `altsymbols` map MUST exist in the `symbols` field. Symbols in the `symbols` field MAY be omitted from the `altsymbols` map.

```JSON
{
  "type": "enum",
  "name": "sizeEnum",
  "symbols": ["S", "M", "L", "XL"],
  "altsymbols": {
    "json": {
      "S": "Klein",
      "M": "Mittel",
      "L": "Groß",
      "XL": "Extragroß"
    },
    "display:en": {
      "S": "Small",
      "M": "Medium",
      "L": "Large",
      "XL": "Extra Large"
    }
  }
}
```

When the JSON encoder or decoder processes an enum symbol, it MUST use the value from the `altsymbols` entry with the `json` key, when present, as the string representing the enum value.
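The name lookup for `altnames` (and, analogously, `altsymbols`) can be sketched as follows; `json_key` is a hypothetical helper for illustration, not part of any Avro API:

```python
def json_key(field: dict) -> str:
    # Use the reserved 'json' entry of the altnames map when present,
    # otherwise fall back to the Avro field name.
    return field.get("altnames", {}).get("json", field["name"])

field = {
    "name": "articleKey",
    "type": "string",
    "altnames": {"json": "Artikelschlüssel", "display:en": "Article Key"},
}
print(json_key(field))                             # Artikelschlüssel
print(json_key({"name": "plain", "type": "int"}))  # plain
```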
### Feature 2: Avro `binary` and `fixed` type data encoding

When encoding data typed with the Avro `binary` or `fixed` types, the byte sequence is encoded into and from Base64-encoded string values, conforming with IETF RFC 4648, Section 4.

### Feature 3: Avro `decimal` logical type data encoding

When encoding data typed with the Avro logical `decimal` type, the numeric value is encoded into and from a JSON `number` value. JSON numbers are represented as text and do not lose precision as IEEE 754 floating-point numbers do. When using a JSON library to implement the encoding, decimal values MUST NOT be converted through an IEEE floating-point type (e.g. double or float in most programming languages) but must use the native decimal data type.

### Feature 4: Avro time, date, and duration logical types

When encoding data typed with one of Avro's logical data types for dates and times, the data is encoded into and from a JSON `string` value, which is an expression as defined in IETF RFC 3339. Specifically, the logical types are mapped to grammar elements defined in RFC 3339 as shown in the following table:

|logicalType|RFC 3339 grammar element|
|------------------------|--------------------------------------------------------------|
|`date`|RFC 3339 5.6 "full-date"|
|`time-millis`|RFC 3339 5.6 "partial-time"|
|`time-micros`|RFC 3339 5.6 "partial-time"|
|`timestamp-millis`|RFC 3339 5.6 "date-time"|
|`timestamp-micros`|RFC 3339 5.6 "date-time"|
|`local-timestamp-millis`|RFC 3339 5.6 "date-time", ignoring offset (note RFC 3339 4.4)|
|`local-timestamp-micros`|RFC 3339 5.6 "date-time", ignoring offset (note RFC 3339 4.4)|
|`duration`|RFC 3339 Appendix A "duration"|

### Feature 5: Handling unions with primitive type values and enum values

Unions of primitive types and of enum values are handled through JSON values' (RFC 8259, Section 3) ability to reflect variable types.
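The rules of Features 2, 3, and 4 can be illustrated together in one short sketch (the byte string and epoch value are arbitrary examples; `parse_float=Decimal` stands in for "use the native decimal type"):

```python
import base64
import json
from datetime import datetime, timedelta, timezone
from decimal import Decimal

# Feature 2: bytes/fixed values travel as Base64 strings (RFC 4648, Section 4).
print(base64.b64encode(bytes([0xDE, 0xAD, 0xBE, 0xEF])).decode("ascii"))
# 3q2+7w==

# Feature 3: route JSON numbers through a decimal type, never through an
# IEEE 754 float, so no digits are lost.
doc = json.loads('{"price": 19.9999999999999999}', parse_float=Decimal)
print(doc["price"])                  # 19.9999999999999999
print(float("19.9999999999999999"))  # 20.0 -- what the float detour would do

# Feature 4: timestamp-millis (epoch milliseconds) -> RFC 3339 "date-time".
millis = 1700000000123
ts = datetime(1970, 1, 1, tzinfo=timezone.utc) + timedelta(milliseconds=millis)
print(ts.isoformat(timespec="milliseconds"))  # 2023-11-14T22:13:20.123+00:00
```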
Given a type union of `[string, null]` and a string value "test", an encoded field named "example" is encoded as `"example": null` or `"example": "test"`. For null-valued fields, the JSON encoder MAY omit the field entirely. During decoding, missing fields are set to null. If a default value is defined for the field, decoding MUST set the field value to the default value.

For a type union of `[string, int]` and the string value "2" or the int value 2, an encoded field named "example" is encoded as `"example": "2"` or `"example": 2`.

For a type union of `[null, myEnum]`, with myEnum being an enum type having symbols "test1" and "test2", an encoded field named "example" is encoded as `"example": null`, `"example": "test1"`, or `"example": "test2"`.

Instances of unions of primitive types with arrays and records or maps can also be distinguished through the JSON grammar and type model. Unions of multiple records are discussed in Feature 6 below.

For completeness, these are the updated type mappings of Avro types to JSON types for the plain JSON encoding:

|Avro type|JSON type|Notes|
|------------|---------|-----------------------------------------------------------------------------------|
|null|null|The field MAY be omitted|
|boolean|boolean| |
|int, long|integer| |
|float, double|number| |
|bytes|string|Base64 string, see [Feature 2](#feature-2-avro-binary-and-fixed-type-data-encoding)|
|string|string| |
|record|object| |
|enum|string| |
|array|array| |
|map|object| |
|fixed|string|Base64 string, see [Feature 2](#feature-2-avro-binary-and-fixed-type-data-encoding)|
|date/time|string|See [Feature 4](#feature-4-avro-time-date-and-duration-logical-types)|
|UUID|string| |
|decimal|number|See [Feature 3](#feature-3-avro-decimal-logical-type-data-encoding)|

### Feature 6: Handling unions of record values and of maps

As discussed in the overview, JSON does not have an inherent concept of a type-hint that allows distinguishing object data types.
Indeed, it has no concept of constraining or further specifying the `object` type at all. The JSON Schema project has defined a schema language specifically for JSON data and provides a type concept for `object`. In JSON interoperability scenarios, JSON Schema, or frameworks that infer their type concepts from JSON Schema, will often play a role on the producer or consumer side due to its popularity.

JSON Schema is primarily a schema model that serves to validate JSON documents. Its "oneOf" type composition construct is equivalent in function to Avro's union concept. Out of a choice of multiple type options, exactly one option MUST match the JSON element that is being validated; otherwise the validation fails. Any implementation of a JSON Schema validator must therefore be able to test the given JSON element against all available options and then determine the matching type option. Any implementation of a schema-driven decoder can use the same strategy to select which type to instantiate and populate.

JSON Schema does not define a type-hint for this purpose, but makes it the schema designer's task to create type definitions that are structurally distinct, such that the "oneOf" test always yields exactly one of the types when given JSON element instances. Schema designers therefore occasionally resort to introducing their own type-hints by defining a discriminator property with a single-value `enum` or with a `const` value, where the discriminator property name is the same across the type options but the values of the `enum` or `const` differ. We will lean on this practice in the following.
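The test-all-options strategy just described can be sketched in a much simplified form; `matches` is a hypothetical helper that only checks field presence, whereas a real decoder would validate nested and referenced types as well. The two record shapes are condensed from the contact-list example:

```python
def matches(obj: dict, record_schema: dict) -> bool:
    # A candidate matches if the object decodes completely against it:
    # no unknown keys, and every non-nullable field present.
    names = {f["name"] for f in record_schema["fields"]}
    if not set(obj) <= names:
        return False
    for f in record_schema["fields"]:
        nullable = isinstance(f["type"], list) and "null" in f["type"]
        if not nullable and f["name"] not in obj:
            return False
    return True

customer = {"name": "CustomerRecord", "fields": [
    {"name": "name", "type": "string"},
    {"name": "age", "type": "int"},
    {"name": "customerId", "type": "string"}]}
employee = {"name": "EmployeeRecord", "fields": [
    {"name": "name", "type": "string"},
    {"name": "age", "type": "int"},
    {"name": "employeeId", "type": "string"}]}

elem = {"name": "Alice", "age": 42, "customerId": "1234"}
hits = [s["name"] for s in (customer, employee) if matches(elem, s)]
print(hits)  # ['CustomerRecord'] -- exactly one match, else decoding must fail
```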
### Type structure matching

Consider the following Avro schema with a type union of two record types:

```JSON
{
  "type": "record",
  "name": "ContactList",
  "fields": [
    {
      "name": "contacts",
      "type": {
        "type": "array",
        "items": [
          {
            "type": "record",
            "name": "CustomerRecord",
            "fields": [
              {"name": "name", "type": "string"},
              {"name": "age", "type": "int"},
              {"name": "customerId", "type": "string"}
            ]
          },
          {
            "type": "record",
            "name": "EmployeeRecord",
            "fields": [
              {"name": "name", "type": "string"},
              {"name": "age", "type": "int"},
              {"name": "employeeId", "type": "string"}
            ]
          }
        ]
      }
    }
  ]
}
```

Now consider the following JSON document:

```JSON
{
  "contacts": [
    {"name": "Alice", "age": 42, "customerId": "1234"},
    {"name": "Bob", "age": 43, "employeeId": "5678"}
  ]
}
```

We can clearly distinguish the two record types by the presence of the respectively required `customerId` or `employeeId` field.

When decoding a type union, the JSON decoder MUST test the JSON element against all available type options. A JSON element matches if it can be correctly and completely decoded given the type-union candidate schema, including all applicable nested or referenced definitions. If more than one of the options matches, decoding MUST fail. The JSON decoder MUST select the type option that matches the JSON element and instantiate and populate the corresponding type.

For performance reasons, it is highly desirable to avoid having to test a JSON element against all possible type options in a union and instead have a single property that can be tested first and short-circuits the type matching process. We discuss that next.

### Discriminator property

If we assume the Avro schema to be slightly different, we might end up with an ambiguity that is not as easy to resolve. Let the `employeeId` and `customerId` fields be optional in the schema above, both typed as `["string", "null"]`.
When we now consider the following JSON document, we cannot decide on the type and decoding will fail:

```JSON
{
  "contacts": [
    {"name": "Alice", "age": 42},
    {"name": "Bob", "age": 43}
  ]
}
```

To resolve this ambiguity, we can introduce a discriminator property that clearly identifies the type of the record. Instead of introducing a schema attribute that is specific to JSON, we introduce a new Avro schema attribute `const` that defines a constant value for the field it is defined on. The value of the `const` attribute must match the field type. The value of the field MUST always match the `const` value. During decoding, decoding MUST fail if the field value is not equal to the `const` value. This rule ensures the function of `const` as a discriminator. The `const` attribute is only allowed on fields of primitive types and enum types.

Consider this Avro schema:

```JSON
{
  "type": "record",
  "name": "ContactList",
  "fields": [
    {
      "name": "contacts",
      "type": {
        "type": "array",
        "items": [
          {
            "type": "record",
            "name": "CustomerRecord",
            "fields": [
              {"name": "name", "type": "string"},
              {"name": "age", "type": "int"},
              {"name": "customerId", "type": ["string", "null"]},
              {"name": "type", "type": "string", "const": "customer"}
            ]
          },
          {
            "type": "record",
            "name": "EmployeeRecord",
            "fields": [
              {"name": "name", "type": "string"},
              {"name": "age", "type": "int"},
              {"name": "employeeId", "type": ["string", "null"]},
              {"name": "type", "type": "string", "const": "employee"}
            ]
          }
        ]
      }
    }
  ]
}
```

The JSON document MUST now include the discriminator:

```JSON
{
  "contacts": [
    {"name": "Alice", "age": 42, "type": "customer"},
    {"name": "Bob", "age": 43, "type": "employee"}
  ]
}
```

The `const` attribute MAY otherwise be used for any other purpose. The binary encoder and decoder MAY skip encoding and decoding a field with a `const` attribute and instead always return the constant value for the field, similar to how the `default` field is handled. The `const` value overrides the `default` value.
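The short-circuit that a `const` discriminator enables can be sketched as follows (`select_branch` is a simplified, hypothetical helper; the schemas are trimmed to the discriminator field):

```python
def select_branch(obj, candidates):
    # Test the single const-valued discriminator field first instead of
    # structurally matching the element against every candidate record.
    for schema in candidates:
        for f in schema["fields"]:
            if "const" in f and obj.get(f["name"]) == f["const"]:
                return schema["name"]
    return None  # fall back to full structural matching

customer = {"name": "CustomerRecord", "fields": [
    {"name": "type", "type": "string", "const": "customer"}]}
employee = {"name": "EmployeeRecord", "fields": [
    {"name": "type", "type": "string", "const": "employee"}]}

print(select_branch({"name": "Bob", "age": 43, "type": "employee"},
                    [customer, employee]))  # EmployeeRecord
```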
During encoding, the binary encoder SHOULD check that the field value matches the `const` value and MAY fail encoding if it does not.

### Feature 7: Document root records

Avro schemas are defined as a single record or enum type at the top level or as a top-level type union. JSON documents, however, may have top-level arrays and maps. Without changing the fundamental Avro schema model, the plain JSON encoding mode uses an annotation on `array` and `map` types defined inside `record` types to allow for top-level arrays and maps in the JSON document.

The annotation is a boolean flag named `root` that is set to `true` on one record field's array or map type. The `root` flag is only defined for `array` and `map` types. If the `root` flag is present and has the value `true`, the enclosing `record` type MUST have exactly this one field.

Given a JSON document with a top-level array like this:

```JSON
[
  {"name": "Alice", "age": 42},
  {"name": "Bob", "age": 43}
]
```

the Avro schema would be defined as follows:

```JSON
{
  "type": "record",
  "name": "PersonDocument",
  "fields": [
    {
      "name": "persons",
      "type": {
        "type": "array",
        "root": true,
        "items": {
          "type": "record",
          "name": "PersonRecord",
          "fields": [
            {"name": "name", "type": "string"},
            {"name": "age", "type": "int"}
          ]
        }
      }
    }
  ]
}
```

When the JSON decoder encounters a top-level array or map, it MUST match the array or map to the field with the `root` flag set to `true`. When the `root` flag is present on a field, the JSON encoder MUST yield the encoding of the field as the encoding of the entire record. The JSON encoder MUST fail if the `root` flag is set to `true` and there is more than one field in the record. When such a record type is used as a field type inside another record, it is consequently always represented equivalently to a `map` or `array` type in the JSON document.
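Root-record unwrapping on the encoder side can be sketched as follows (`encode_root` is a minimal, hypothetical helper, reduced to the single-field rule; the schema repeats the person-document example with its item type abbreviated):

```python
def encode_root(record_value: dict, record_schema: dict):
    # If one field's array/map type carries "root": true, the encoding of
    # that field becomes the encoding of the entire record.
    fields = record_schema["fields"]
    for f in fields:
        t = f["type"]
        if isinstance(t, dict) and t.get("root") is True:
            if len(fields) != 1:
                raise ValueError('"root": true requires exactly one field')
            return record_value[f["name"]]
    return record_value

schema = {"type": "record", "name": "PersonDocument", "fields": [
    {"name": "persons",
     "type": {"type": "array", "root": True, "items": "PersonRecord"}}]}

doc = encode_root({"persons": [{"name": "Alice", "age": 42},
                               {"name": "Bob", "age": 43}]}, schema)
print(doc)  # [{'name': 'Alice', 'age': 42}, {'name': 'Bob', 'age': 43}]
```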
In [type structure matching](#type-structure-matching) scenarios, a `root` flag set on a `map` type causes the record type to be a candidate for the type matching of JSON `object` values, and a `root` flag on an `array` type causes the record type to be a candidate for the type matching of JSON `array` values. The Avro binary encoding is not functionally affected by this feature, but the structural constraint imposed by the `root` flag MAY be enforced by the encoder.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)