Re: [DISCUSS] JSON Canonical Extension Type
Hello,

Sorry this hasn't gotten much attention recently. I just brought this up at the Arrow community meeting, as I'd like to revive it. It looks like there is a draft implementation up already [1].

I'm generally supportive of this, but I have a few questions:

1. Would we be able to make this extension type work on top of any of the string types, including Utf8, LargeUtf8, and the (under consideration [2]) StringView types?
2. Does this imply a potential canonical extension type for every text-based data format, such as HOCON, XML, and so on? If we agree JSON is special, I think it's fine for it to have its own extension type. On the other hand, it might be worth considering a generic extension type for serialized data that is parameterized by the media type ("application/json" in this case). This doesn't preclude building an extension type class / struct within Arrow implementations that is specific to JSON; I don't think there's any hard rule that there has to be a 1-1 correspondence between extension types in the format and the concrete data structures in libraries.

Best,
Will Jones

[1] https://github.com/apache/arrow/pull/13901
[2] https://lists.apache.org/thread/w88tpz76ox8h3rxkjl4so6rg3f1rv7wt

On Thu, Dec 1, 2022 at 12:23 AM Antoine Pitrou wrote:
> HOCON is a superset of JSON, so I'm not sure making it an extension type
> based on JSON would be a good idea.
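The second question in the message above (a generic extension type parameterized by media type, usable on top of any string storage type) can be sketched in plain Python. This is only an illustration under stated assumptions: the `SerializedType` class, the `json_type` helper, and the storage names are hypothetical and are not part of Arrow or of the proposal; the storage names simply mirror the Utf8, LargeUtf8, and StringView types mentioned in the message.

```python
import json
from dataclasses import dataclass

# Hypothetical storage names mirroring Utf8, LargeUtf8, and StringView.
STRING_STORAGE_TYPES = ("utf8", "large_utf8", "string_view")


@dataclass(frozen=True)
class SerializedType:
    """Hypothetical generic extension type for serialized text data."""

    media_type: str        # the type parameter, e.g. "application/json"
    storage: str = "utf8"  # underlying string storage type

    def __post_init__(self) -> None:
        # Only string-like storage types are accepted in this sketch.
        if self.storage not in STRING_STORAGE_TYPES:
            raise ValueError(f"unsupported storage type: {self.storage}")

    def serialize(self) -> str:
        # The type parameter travels as a small JSON document.
        return json.dumps({"media_type": self.media_type})

    @classmethod
    def deserialize(cls, storage: str, serialized: str) -> "SerializedType":
        # Rebuild the type from its storage type and serialized parameter.
        params = json.loads(serialized)
        return cls(media_type=params["media_type"], storage=storage)


def json_type(storage: str = "utf8") -> SerializedType:
    """JSON-specific convenience wrapper over the generic type."""
    return SerializedType("application/json", storage)
```

A library could still expose a dedicated JSON class on top of such a generic type, which is the point about there being no required 1-1 correspondence between format-level types and library data structures.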
Re: [DISCUSS] JSON Canonical Extension Type
HOCON is a superset of JSON, so I'm not sure making it an extension type based on JSON would be a good idea.

On 01/12/2022 at 06:23, Micah Kornfield wrote:
>> Can a logical extension be based on another logical extension?
>
> Potentially, but this is mostly an implementation detail; each type should
> have its own specification IMO.
>
>> HOCON support might be nice..
>
> I'm not sure if this is common enough to warrant a canonical type within
> Arrow, but you are welcome to propose something if you would like.
>
> Cheers,
> Micah
Re: [DISCUSS] JSON Canonical Extension Type
On Mon, Nov 28, 2022 at 11:55 AM Lee, David wrote:
> Can a logical extension be based on another logical extension?

Potentially, but this is mostly an implementation detail; each type should have its own specification IMO.

> HOCON support might be nice..

I'm not sure if this is common enough to warrant a canonical type within Arrow, but you are welcome to propose something if you would like.

Cheers,
Micah
RE: [DISCUSS] JSON Canonical Extension Type
Can a logical extension be based on another logical extension?

HOCON support might be nice..

-----Original Message-----
From: Micah Kornfield
Sent: Monday, November 28, 2022 11:50 AM
To: dev@arrow.apache.org
Subject: Re: [DISCUSS] JSON Canonical Extension Type

This seems like a reasonable definition to me. Since there hasn't been much feedback, I think following through with an implementation plus this description in a PR would be the next step. If there isn't further feedback on this, once the PR is up we can try to vote (which might bring up some more feedback, but hopefully wouldn't cause too much implementation churn).

Thanks,
Micah
Re: [DISCUSS] JSON Canonical Extension Type
This seems like a reasonable definition to me. Since there hasn't been much feedback, I think following through with an implementation plus this description in a PR would be the next step. If there isn't further feedback on this, once the PR is up we can try to vote (which might bring up some more feedback, but hopefully wouldn't cause too much implementation churn).

Thanks,
Micah

On Thu, Nov 17, 2022 at 3:58 PM Pradeep Gollakota wrote:
> Hi folks!
>
> I put together this specification for canonicalizing the JSON type in
> Arrow.
>
> ## Introduction
> JSON is a widely used text-based data interchange format. There are many
> use cases where a user has a column whose contents are a JSON-encoded
> string. BigQuery's [JSON Type][1] and Parquet’s [JSON Logical Type][2] are
> two such examples.
>
> The JSON specification is defined in [RFC-8259][3]. However, many of the
> most popular parsers support non-standard extensions. Examples of
> non-standard extensions to JSON include comments, unquoted keys, and
> trailing commas.
>
> ## Extension Specification
> * The name of the extension is `arrow.json`
> * The storage type of the extension is `utf8`
> * The extension type has no parameters
> * The metadata MUST be either empty or a valid JSON object
>   - There is no canonical metadata
>   - Implementations MAY include implementation-specific metadata by using
>     a namespaced key. For example `{"google.bigquery": {"my": "metadata"}}`
> * Implementations...
>   - MUST produce valid UTF-8 encoded text
>   - SHOULD produce valid standard JSON
>   - MAY produce valid non-standard JSON
>   - MUST support parsing standard JSON
>   - MAY support parsing non-standard JSON
>   - SHOULD pass through contents that they do not understand
>
> ## Forward compatibility
> In the future we might allow this logical type to annotate a byte storage
> type with a different text encoding. Implementations consuming JSON
> logical types should verify this.
>
> [1]: https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#json_type
> [2]: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#json
> [3]: https://datatracker.ietf.org/doc/html/rfc8259
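The metadata rule in the quoted specification ("MUST be either empty or a valid JSON object") can be checked in a few lines of Python. This is a sketch rather than part of the proposal: the function name `is_valid_json_extension_metadata` is hypothetical, and it assumes the serialized extension metadata arrives as raw bytes. Python's `json` module accepts the non-standard constants `NaN` and `Infinity` by default, so the sketch rejects them explicitly to stay within RFC 8259.

```python
import json


def _reject_constant(name: str) -> None:
    # Called by json.loads for NaN, Infinity, and -Infinity, which are
    # not valid in standard JSON.
    raise ValueError(f"non-standard JSON constant: {name}")


def is_valid_json_extension_metadata(metadata: bytes) -> bool:
    """Return True if metadata is empty or a standard JSON object."""
    if metadata == b"":
        return True
    try:
        # Decoding enforces valid UTF-8; parsing enforces valid JSON.
        parsed = json.loads(metadata.decode("utf-8"),
                            parse_constant=_reject_constant)
    except ValueError:
        # UnicodeDecodeError and json.JSONDecodeError both subclass ValueError.
        return False
    # Arrays, strings, and numbers are valid JSON but not JSON objects.
    return isinstance(parsed, dict)
```

With this check, the namespaced example from the specification, `{"google.bigquery": {"my": "metadata"}}`, is accepted, while a top-level array or an object with a trailing comma is rejected.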