Le 03/08/2022 à 16:19, Lee, David a écrit :

There are probably two ways to approach this.

Physically store the json as a UTF8 string

Or

Physically store the json as nested lists and structs.

This works if all JSON values follow a predefined schema, which is not necessarily the case.

In any case, this proposal is about the former approach (store the JSON as a UTF8 string). Arrow already supports the latter approach if your use case is amenable to it.

Regards

Antoine.





This is more complicated and ideally this method would also support including json schemas to help address missing values and round trip conversions. https://json-schema.org/

Sent from my iPad

On Aug 2, 2022, at 11:23 PM, Lee, David <david....@blackrock.com.invalid> wrote:

External Email: Use caution with links and attachments


While I do like having a json type, adding processing functionality especially 
around compute capabilities might be limiting.

Arrow already supports nested lists and structs which can cover json structures 
while offering vectorized processing. Json should only be a logical 
representation of what arrow physically supports today.

A bad example is Snowflake semi structured data support. They have a Java 
engine for tabular data and a JavaScript engine for json data. The JS engine is 
a second class citizen that requires a lot of compute to string parse json data 
before json content can be filtered, sorted, aggregated, etc..

Sent from my iPad

On Aug 2, 2022, at 11:38 AM, Wes McKinney <wesmck...@gmail.com> wrote:

External Email: Use caution with links and attachments


I should add that since Parquet has JSON, BSON, and UUID types, that
while UUID is just a simple fixed sized binary, that having the
extension types so that the metadata flows through accurately to
Parquet would be net beneficial:

https://urldefense.com/v3/__https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift*L342__;Iw!!KSjYCgUGsB4!aTjWsSjJoE1gN7iM84QJUDoTt3F1A9BBpaLGscg9jYN26Eohr9bN8y0ccxgI8S3zLfGUjXBV2ewE9myKupUW$

Implementing JSON (and BSON and UUID if we want them) as extension
types and restricting JSON to UTF-8 sounds good to me.

On Tue, Aug 2, 2022 at 12:43 AM Micah Kornfield <emkornfi...@gmail.com> wrote:


2. What do we do about different non-utf8 encodings? There does not
appear
to be a consensus yet on this point. One option is to only allow utf8
encoding and force implementers to convert non-utf8 to utf8. Second
option
is to allow all encodings and capture the encoding in the metadata (I'm
leaning towards this option).


Allowing non-utf8 encodings adds complexity for everyone. Disallowing
them only adds complexity for the tiny minority of producers of non-utf8
JSON.


I'd also add that if we only allow extension on utf8 today, it would be a
forward/backward compatible change to allow parameterizing the extension
for bytes type by encoding if we wanted to support it in the future.
Parquet also only supports UTF-8 [1] for its logical JSON type.

[1]
https://urldefense.com/v3/__https://github.com/apache/parquet-format/blob/master/LogicalTypes.md*json__;Iw!!KSjYCgUGsB4!aTjWsSjJoE1gN7iM84QJUDoTt3F1A9BBpaLGscg9jYN26Eohr9bN8y0ccxgI8S3zLfGUjXBV2ewE9sNlK7dP$

On Mon, Aug 1, 2022 at 11:39 PM Antoine Pitrou <anto...@python.org> wrote:


Le 01/08/2022 à 22:53, Pradeep Gollakota a écrit :
Thanks for all the great feedback.

To proceed forward, we seem to need decisions around the following:

1. Whether to use arrow extensions or first class types. The consensus is
building towards using arrow extensions.

+1

2. What do we do about different non-utf8 encodings? There does not
appear
to be a consensus yet on this point. One option is to only allow utf8
encoding and force implementers to convert non-utf8 to utf8. Second
option
is to allow all encodings and capture the encoding in the metadata (I'm
leaning towards this option).

Allowing non-utf8 encodings adds complexity for everyone. Disallowing
them only adds complexity for the tiny minority of producers of non-utf8
JSON.

3. What do we do about the different formats of JSON (string, BSON,
UBJSON,
etc.)?

There are no "different formats of JSON". BSON etc. are unrelated formats.

Regards

Antoine.



This message may contain information that is confidential or privileged. If you 
are not the intended recipient, please advise the sender immediately and delete 
this message. See 
http://www.blackrock.com/corporate/compliance/email-disclaimers for further 
information.  Please refer to 
http://www.blackrock.com/corporate/compliance/privacy-policy for more 
information about BlackRock’s Privacy Policy.


For a list of BlackRock's office addresses worldwide, see 
http://www.blackrock.com/corporate/about-us/contacts-locations.

© 2022 BlackRock, Inc. All rights reserved.

Reply via email to