Hi folks!
I put together this specification for canonicalizing the JSON type in Arrow.
## Introduction
JSON is a widely used, text-based data interchange format. There are many
use cases where a user has a column whose contents are JSON-encoded
strings. BigQuery's [JSON Type][1] and Parquet's [JSON Logical Type][2] are
two such examples.
JSON is specified in [RFC 8259][3]. However, many of the most popular
parsers support non-standard extensions, such as comments, unquoted keys,
and trailing commas.
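
As a rough illustration of the difference, the sketch below uses Python's standard `json` module as a stand-in for a strict parser. The helper names (`is_standard_json`, `_reject_constant`) are made up for this example, and the strictness is only approximate: by default the module accepts `NaN` and `Infinity`, so the sketch rejects them explicitly.

```python
import json


def _reject_constant(name):
    # json.loads accepts NaN/Infinity/-Infinity by default; RFC 8259 does not.
    raise ValueError(f"non-standard JSON constant: {name}")


def is_standard_json(text: str) -> bool:
    """Return True if `text` parses under the (approximately) strict rules."""
    try:
        json.loads(text, parse_constant=_reject_constant)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return True


samples = {
    "standard": '{"a": [1, 2, 3]}',
    "comment": '{"a": 1} // note',
    "unquoted key": "{a: 1}",
    "trailing comma": "[1, 2,]",
}
results = {name: is_standard_json(text) for name, text in samples.items()}
```

With this check, only the `standard` sample is accepted; the other three each use one of the extensions mentioned above.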
## Extension Specification
* The name of the extension is `arrow.json`
* The storage type of the extension is `utf8`
* The extension type has no parameters
* The metadata MUST be either empty or a valid JSON object
  - There is no canonical metadata
  - Implementations MAY include implementation-specific metadata by using
    a namespaced key. For example `{"google.bigquery": {"my": "metadata"}}`
* Implementations...
  - MUST produce valid UTF-8 encoded text
  - SHOULD produce valid standard JSON
  - MAY produce valid non-standard JSON
  - MUST support parsing standard JSON
  - MAY support parsing non-standard JSON
  - SHOULD pass through contents that they do not understand
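
The metadata rule above ("either empty or a valid JSON object") could be checked along these lines. This is only a sketch under some assumptions: the serialized metadata is taken to be a Python `str`, the function name `is_valid_metadata` is hypothetical, and Python's `json` module stands in for whatever parser an implementation actually uses.

```python
import json


def _reject_constant(name):
    # Python's json module accepts NaN/Infinity by default; treat them as invalid.
    raise ValueError(f"non-standard JSON constant: {name}")


def is_valid_metadata(serialized: str) -> bool:
    """Return True if the extension metadata is empty or a JSON object."""
    if serialized == "":
        return True
    try:
        value = json.loads(serialized, parse_constant=_reject_constant)
    except ValueError:
        return False
    # Arrays, strings, numbers, etc. are valid JSON but not JSON objects.
    return isinstance(value, dict)
```

For instance, the empty string and `{"google.bigquery": {"my": "metadata"}}` pass this check, while `[1]` and `{a: 1}` do not.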
## Forward compatibility
In the future we might allow this logical type to annotate a byte storage
type with a different text encoding. Implementations consuming JSON
logical types should therefore verify that the storage type is `utf8`
rather than assume it.
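
A consumer-side guard might look roughly like the following. It is a minimal sketch, not tied to any Arrow library: the storage type is passed in as a plain string name and the column as a list of optional strings, both of which are simplifications for illustration. In a real implementation the storage type would come from the Arrow schema, and `parse_json_values` is a hypothetical name.

```python
import json


def parse_json_values(storage_type: str, values):
    """Parse a column of JSON text, refusing storage types we do not know."""
    # Only `utf8` is defined today; fail loudly on anything else instead of
    # guessing the text encoding of a future storage type.
    if storage_type != "utf8":
        raise TypeError(f"unsupported storage type for arrow.json: {storage_type}")
    # Nulls in the column stay null; everything else is parsed as JSON text.
    return [None if v is None else json.loads(v) for v in values]
```
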
[1]: https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#json_type
[2]: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#json
[3]: https://datatracker.ietf.org/doc/html/rfc8259