Hi folks! I put together this specification for canonicalizing the JSON type in Arrow.

## Introduction

JSON is a widely used text-based data interchange format. There are many use cases where a user has a column whose contents are a JSON-encoded string. BigQuery's [JSON Type][1] and Parquet’s [JSON Logical Type][2] are two such examples.

The JSON specification is defined in [RFC-8259][3]. However, many of the most popular parsers support non-standard extensions. Examples of non-standard extensions to JSON include comments, unquoted keys, and trailing commas.

## Extension Specification

* The name of the extension is `arrow.json`
* The storage type of the extension is `utf8`
* The extension type has no parameters
* The metadata MUST be either empty or a valid JSON object
  - There is no canonical metadata
  - Implementations MAY include implementation-specific metadata by using a namespaced key. For example `{"google.bigquery": {"my": "metadata"}}`
* Implementations...
  - MUST produce valid UTF-8 encoded text
  - SHOULD produce valid standard JSON
  - MAY produce valid non-standard JSON
  - MUST support parsing standard JSON
  - MAY support parsing non-standard JSON
  - SHOULD pass through contents that they do not understand

## Forward compatibility

In the future we might allow this logical type to annotate a byte storage type with a different text encoding. Implementations consuming JSON logical types should therefore verify the storage type rather than assume it is `utf8`.

[1]: https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#json_type
[2]: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#json
[3]: https://datatracker.ietf.org/doc/html/rfc8259
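
To make the rules in the Extension Specification a bit more concrete, here is a rough sketch of how a consumer might check them using only Python's standard library. It is not part of the proposal and does not use any Arrow API; the function names `is_valid_metadata` and `classify_value` and the three classification labels are made up for illustration. One assumption worth flagging: Python's `json` module accepts `NaN` and `Infinity` by default, which RFC 8259 does not allow, so the sketch rejects them explicitly.

```python
import json


def is_valid_metadata(serialized: bytes) -> bool:
    """Check the metadata rule: empty, or a valid JSON object."""
    if serialized == b"":
        return True
    try:
        parsed = json.loads(serialized.decode("utf-8"))
    except (UnicodeDecodeError, ValueError):
        return False
    # A JSON object decodes to a Python dict; arrays and scalars do not qualify.
    return isinstance(parsed, dict)


def _reject_constant(name: str):
    # Called by the json module for NaN, Infinity and -Infinity,
    # none of which are standard JSON.
    raise ValueError(f"non-standard JSON constant: {name}")


def classify_value(raw: bytes) -> str:
    """Classify one stored value (hypothetical labels, for illustration only).

    Returns "invalid-utf8", "standard", or "not-standard-json".
    """
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return "invalid-utf8"  # violates the MUST on UTF-8
    try:
        json.loads(text, parse_constant=_reject_constant)
    except ValueError:
        # Possibly non-standard JSON (comments, trailing commas, ...).
        # A consumer following the SHOULD would pass this through unchanged.
        return "not-standard-json"
    return "standard"
```

For example, `classify_value(b'{"a": 1,}')` falls into the pass-through bucket because of the trailing comma, while `is_valid_metadata(b'{"google.bigquery": {"my": "metadata"}}')` accepts the namespaced metadata shown above.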