Hi Pradeep,

Thanks for filing this PR!

Before merging this PR, I think we should discuss a bit what a canonical extension type is and how it gets standardized. I'll start a separate discussion thread.

Regards

Antoine.


On 16/08/2022 at 22:40, Pradeep Gollakota wrote:
Hi all,

I've created a pull request introducing a canonical extension type as
discussed in this thread. https://github.com/apache/arrow/pull/13901

Thanks for all the input!

On Wed, Aug 3, 2022 at 10:46 AM Antoine Pitrou <anto...@python.org> wrote:



On 03/08/2022 at 16:19, Lee, David wrote:

There are probably two ways to approach this:

1. Physically store the JSON as a UTF-8 string, or
2. Physically store the JSON as nested lists and structs.

This works if all JSON values follow a predefined schema, which is not
necessarily the case.

In any case, this proposal is about the former approach (store the JSON
as a UTF-8 string). Arrow already supports the latter approach if your
use case is amenable to it.
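
For illustration, here is a minimal pyarrow sketch of both representations
(the field names and sample data are made up, not part of the proposal):

    import json
    import pyarrow as pa

    docs = [{"name": "a", "tags": [1, 2]}, {"name": "b", "tags": []}]

    # Former approach: each JSON document stored as a UTF-8 string.
    as_strings = pa.array([json.dumps(d) for d in docs], type=pa.utf8())

    # Latter approach: the same data as nested structs/lists, which only
    # works when every value follows a predefined schema.
    as_structs = pa.array(docs, type=pa.struct([
        ("name", pa.utf8()),
        ("tags", pa.list_(pa.int64())),
    ]))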

Regards

Antoine.





   This is more complicated, and ideally this method would also support
including JSON Schemas (https://json-schema.org/) to help address missing
values and round-trip conversions.


On Aug 2, 2022, at 11:23 PM, Lee, David <david....@blackrock.com.invalid>
wrote:



While I do like having a JSON type, adding processing functionality,
especially around compute capabilities, might be limiting.

Arrow already supports nested lists and structs, which can cover JSON
structures while offering vectorized processing. JSON should only be a
logical representation of what Arrow physically supports today.

A bad example is Snowflake's semi-structured data support. They have a
Java engine for tabular data and a JavaScript engine for JSON data. The JS
engine is a second-class citizen that requires a lot of compute to
string-parse JSON data before the JSON content can be filtered, sorted,
aggregated, etc.
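
For example (a small pyarrow sketch, field names made up), the struct
representation can be filtered with vectorized compute functions directly,
whereas a string column would first have to be parsed:

    import pyarrow as pa
    import pyarrow.compute as pc

    docs = pa.array(
        [{"name": "a", "score": 3}, {"name": "b", "score": 7}],
        type=pa.struct([("name", pa.utf8()), ("score", pa.int64())]))

    scores = docs.field("score")               # direct column access
    high = docs.filter(pc.greater(scores, 5))  # no string parsing needed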


On Aug 2, 2022, at 11:38 AM, Wes McKinney <wesmck...@gmail.com> wrote:



I should add that since Parquet has JSON, BSON, and UUID types (UUID
being just a simple fixed-size binary), having the extension types so
that the metadata flows through accurately to Parquet would be a net
benefit:


https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L342

Implementing JSON (and BSON and UUID if we want them) as extension
types and restricting JSON to UTF-8 sounds good to me.
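
As a rough sketch of what that could look like with pyarrow's extension
type machinery (the class and the "example.json" extension name below are
placeholders, not the eventual canonical definition):

    import pyarrow as pa

    class JsonType(pa.ExtensionType):
        # JSON stored as a plain utf8 array; no extra parameters.
        def __init__(self):
            super().__init__(pa.utf8(), "example.json")

        def __arrow_ext_serialize__(self):
            return b""

        @classmethod
        def __arrow_ext_deserialize__(cls, storage_type, serialized):
            return cls()

    pa.register_extension_type(JsonType())

    storage = pa.array(['{"a": 1}', '{"b": [2, 3]}'], type=pa.utf8())
    arr = pa.ExtensionArray.from_storage(JsonType(), storage)

Since the extension name and metadata travel with the schema, a Parquet
writer could in principle map such a column onto Parquet's JSON logical
type.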

On Tue, Aug 2, 2022 at 12:43 AM Micah Kornfield <emkornfi...@gmail.com>
wrote:


2. What do we do about different non-utf8 encodings? There does not
appear
to be a consensus yet on this point. One option is to only allow utf8
encoding and force implementers to convert non-utf8 to utf8. Second
option
is to allow all encodings and capture the encoding in the metadata (I'm
leaning towards this option).


Allowing non-utf8 encodings adds complexity for everyone. Disallowing
them only adds complexity for the tiny minority of producers of non-utf8
JSON.


I'd also add that if we only allow the extension on utf8 today, it would
be a forward/backward compatible change to allow parameterizing the
extension for the bytes type by encoding if we wanted to support it in the
future. Parquet also only supports UTF-8 [1] for its logical JSON type.

[1] https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#json
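
To illustrate the forward-compatible parameterization mentioned above
(purely hypothetical; the names and the encoding parameter are only for
illustration), the encoding could live in the serialized extension
metadata:

    import json
    import pyarrow as pa

    class ParameterizedJsonType(pa.ExtensionType):
        # Hypothetical: binary storage plus an "encoding" parameter.
        # Restricting it to "utf-8" today would still leave room for a
        # compatible extension later.
        def __init__(self, encoding="utf-8"):
            self.encoding = encoding
            super().__init__(pa.binary(), "example.json")

        def __arrow_ext_serialize__(self):
            return json.dumps({"encoding": self.encoding}).encode()

        @classmethod
        def __arrow_ext_deserialize__(cls, storage_type, serialized):
            params = json.loads(serialized.decode()) if serialized else {}
            return cls(params.get("encoding", "utf-8"))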

On Mon, Aug 1, 2022 at 11:39 PM Antoine Pitrou <anto...@python.org>
wrote:


On 01/08/2022 at 22:53, Pradeep Gollakota wrote:
Thanks for all the great feedback.

To move forward, we seem to need decisions on the following:

1. Whether to use Arrow extension types or first-class types. The consensus is
building towards using Arrow extension types.

+1

2. What do we do about different non-utf8 encodings? There does not
appear
to be a consensus yet on this point. One option is to only allow utf8
encoding and force implementers to convert non-utf8 to utf8. Second
option
is to allow all encodings and capture the encoding in the metadata (I'm
leaning towards this option).

Allowing non-utf8 encodings adds complexity for everyone. Disallowing
them only adds complexity for the tiny minority of producers of non-utf8
JSON.

3. What do we do about the different formats of JSON (string, BSON,
UBJSON,
etc.)?

There are no "different formats of JSON". BSON etc. are unrelated
formats.

Regards

Antoine.


