Hello,

Sorry this hasn't gotten much attention recently. I just brought this up at the Arrow community meeting, as I'd like to revive it.
It looks like there is a draft implementation up already [1]. I'm generally supportive of this, but I have a few questions:

1. Would we be able to make this extension type work on top of any of the string types, including Utf8, LargeUtf8, and the (under consideration [2]) StringView types?
2. Does this imply a potential canonical extension type for every text-based data format, such as HOCON, XML, and so on?

If we agree JSON is special, I think it's fine to have its own extension type. On the other hand, it might be worth considering a generic extension type for serialized data that is parameterized by the media type ("application/json" in this case). This doesn't preclude the possibility of building an extension type class / struct within Arrow implementations that is specific to JSON; I don't think there's any hard rule that there has to be a 1-1 correspondence between extension types in the format and the concrete data structures in libraries.

Best,

Will Jones

[1] https://github.com/apache/arrow/pull/13901
[2] https://lists.apache.org/thread/w88tpz76ox8h3rxkjl4so6rg3f1rv7wt

On Thu, Dec 1, 2022 at 12:23 AM Antoine Pitrou <anto...@python.org> wrote:

>
> HOCON is a superset of JSON, so I'm not sure making it an extension type
> based on JSON would be a good idea.
>
>
> On 1 Dec 2022 at 06:23, Micah Kornfield wrote:
> >>
> >> Can a logical extension be based on another logical extension?
> >
> > Potentially, but this is mostly an implementation detail; each type
> > should have its own specification IMO.
> >
> > HOCON support might be nice..
> >
> > I'm not sure if this is common enough to warrant a canonical type within
> > Arrow but you are welcome to propose something if you would like.
> >
> > Cheers,
> > Micah
> >
> > On Mon, Nov 28, 2022 at 11:55 AM Lee, David
> > <david....@blackrock.com.invalid> wrote:
> >
> >> Can a logical extension be based on another logical extension?
> >>
> >> HOCON support might be nice..
> >>
> >> -----Original Message-----
> >> From: Micah Kornfield <emkornfi...@gmail.com>
> >> Sent: Monday, November 28, 2022 11:50 AM
> >> To: dev@arrow.apache.org
> >> Subject: Re: [DISCUSS] JSON Canonical Extension Type
> >>
> >> This seems like a reasonable definition to me. Since there hasn't been
> >> much feedback, I think maybe following through an implementation + this
> >> description in a PR would be the next steps. If there isn't further
> >> feedback on this, once the PR is up we can try to vote (which might
> >> bring up some more feedback, but hopefully wouldn't cause too much
> >> implementation churn).
> >>
> >> Thanks,
> >> Micah
> >>
> >> On Thu, Nov 17, 2022 at 3:58 PM Pradeep Gollakota
> >> <pgollak...@google.com.invalid> wrote:
> >>
> >>> Hi folks!
> >>>
> >>> I put together this specification for canonicalizing the JSON type in
> >>> Arrow.
> >>>
> >>> ## Introduction
> >>> JSON is a widely used text-based data interchange format. There are
> >>> many use cases where a user has a column whose contents are a JSON
> >>> encoded string. BigQuery's [JSON Type][1] and Parquet’s [JSON Logical
> >>> Type][2] are two such examples.
> >>>
> >>> The JSON specification is defined in [RFC-8259][3]. However, many of
> >>> the most popular parsers support non-standard extensions. Examples of
> >>> non-standard extensions to JSON include comments, unquoted keys,
> >>> trailing commas, etc.
> >>>
> >>> ## Extension Specification
> >>> * The name of the extension is `arrow.json`
> >>> * The storage type of the extension is `utf8`
> >>> * The extension type has no parameters
> >>> * The metadata MUST be either empty or a valid JSON object
> >>>   - There is no canonical metadata
> >>>   - Implementations MAY include implementation-specific metadata by
> >>>     using a namespaced key. For example `{"google.bigquery": {"my":
> >>>     "metadata"}}`
> >>> * Implementations...
> >>>   - MUST produce valid UTF-8 encoded text
> >>>   - SHOULD produce valid standard JSON
> >>>   - MAY produce valid non-standard JSON
> >>>   - MUST support parsing standard JSON
> >>>   - MAY support parsing non standard JSON
> >>>   - SHOULD pass through contents that they do not understand
> >>>
> >>> ## Forward compatibility
> >>> In the future we might allow this logical type to annotate a byte
> >>> storage type with a different text encoding. Implementations
> >>> consuming JSON logical types should verify this.
> >>>
> >>> [1]: https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#json_type
> >>> [2]: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#json
> >>> [3]: https://datatracker.ietf.org/doc/html/rfc8259
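
For anyone skimming the quoted proposal, here is a rough sketch of how a consumer might check two of its rules: that the extension metadata is either empty or a JSON object, and that each value is valid UTF-8 that parses as JSON. This is only an illustration using Python's standard `json` module, not the draft implementation and not an Arrow API; the names `EXTENSION_NAME`, `check_metadata`, and `parse_value` are made up for this example. Note that `json.loads` also accepts a few non-standard tokens such as `NaN`, which the proposal permits under "MAY support parsing non standard JSON".

```python
import json

# Hypothetical constant mirroring the proposed extension name.
EXTENSION_NAME = "arrow.json"


def check_metadata(serialized: str) -> dict:
    """Return the parsed metadata; it MUST be empty or a JSON object."""
    if serialized == "":
        return {}
    parsed = json.loads(serialized)  # raises ValueError on invalid JSON
    if not isinstance(parsed, dict):
        raise ValueError("extension metadata must be a JSON object")
    return parsed


def parse_value(raw: bytes):
    """Decode one stored value as UTF-8, then parse it as JSON."""
    # Decode explicitly so that non-UTF-8 bytes are rejected instead of
    # being auto-detected as another encoding by json.loads.
    text = raw.decode("utf-8")  # raises UnicodeDecodeError (a ValueError)
    return json.loads(text)


if __name__ == "__main__":
    print(check_metadata('{"google.bigquery": {"my": "metadata"}}'))
    print(parse_value(b'{"a": [1, 2]}'))
```

A generic, media-type-parameterized extension type would presumably need the same kind of checks, with the parser chosen from the media type instead of being fixed to JSON.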