[DISCUSS] JSON Canonical Extension Type

Pradeep Gollakota Thu, 17 Nov 2022 15:58:22 -0800

Hi folks!

I put together this specification for canonicalizing the JSON type in Arrow.


## Introduction
JSON is a widely used text-based data interchange format. There are many
use cases where a user has a column whose contents are a JSON-encoded
string. BigQuery's [JSON Type][1] and Parquet’s [JSON Logical Type][2] are
two such examples.

The JSON specification is defined in [RFC-8259][3]. However, many of the
most popular parsers support non-standard extensions. Examples of
non-standard extensions to JSON include comments, unquoted keys, and
trailing commas.

## Extension Specification
* The name of the extension is `arrow.json`
* The storage type of the extension is `utf8`
* The extension type has no parameters
* The metadata MUST be either empty or a valid JSON object
    - There is no canonical metadata
    - Implementations MAY include implementation-specific metadata by using
      a namespaced key. For example `{"google.bigquery": {"my": "metadata"}}`
* Implementations...
    - MUST produce valid UTF-8 encoded text
    - SHOULD produce valid standard JSON
    - MAY produce valid non-standard JSON
    - MUST support parsing standard JSON
    - MAY support parsing non-standard JSON
    - SHOULD pass through contents that they do not understand

## Forward compatibility
In the future we might allow this logical type to annotate a byte storage
type with a different text encoding. Implementations consuming JSON
logical types should therefore verify that the storage type is `utf8`
before interpreting the data.
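
A minimal sketch of such a check, assuming the consumer has the storage type available as a string name (the function name and the string representation are illustrative assumptions, not part of the proposal):

```python
# Storage types this hypothetical consumer knows how to decode.
SUPPORTED_STORAGE_TYPES = {"utf8"}


def ensure_supported_storage(storage_type: str) -> None:
    """Raise if the extension annotates a storage type we cannot decode."""
    if storage_type not in SUPPORTED_STORAGE_TYPES:
        raise NotImplementedError(
            f"arrow.json with storage type {storage_type!r} is not supported"
        )
```

Failing loudly here means data in a future encoding is not silently misread as UTF-8.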

[1]: https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#json_type
[2]: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#json
[3]: https://datatracker.ietf.org/doc/html/rfc8259
