I think, from a compute perspective, one would just cast before doing anything, so you wouldn't need much beyond parse and unparse. For example, if you have a JSON document and you want to know the largest value of $.weather.temperature, you could do:
MAX(STRUCT_FIELD(PARSE_JSON("json_col"), "weather.temperature"))

You could maybe add support for a JSONPath-aware parsing mechanism, so then you could do something like:

MAX(PARSE_JSON("json_col", "$.weather.temperature"))

On Wed, Aug 3, 2022 at 4:20 AM Lee, David <david....@blackrock.com.invalid> wrote:
>
> There are probably two ways to approach this.
>
> Physically store the JSON as a UTF-8 string,
>
> or
>
> physically store the JSON as nested lists and structs. This is more
> complicated, and ideally this method would also support including JSON
> schemas to help address missing values and round-trip conversions:
> https://json-schema.org/
>
> Sent from my iPad
>
> On Aug 2, 2022, at 11:23 PM, Lee, David <david....@blackrock.com.invalid> wrote:
>
> While I do like having a JSON type, adding processing functionality,
> especially around compute capabilities, might be limiting.
>
> Arrow already supports nested lists and structs, which can cover JSON
> structures while offering vectorized processing. JSON should only be a
> logical representation of what Arrow physically supports today.
>
> A bad example is Snowflake's semi-structured data support. They have a
> Java engine for tabular data and a JavaScript engine for JSON data. The
> JS engine is a second-class citizen that requires a lot of compute to
> string-parse JSON data before JSON content can be filtered, sorted,
> aggregated, etc.
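To make the compute example above concrete, here is a minimal pure-Python sketch of what MAX(STRUCT_FIELD(PARSE_JSON("json_col"), "weather.temperature")) would evaluate to, using the stdlib json module. The helper names `struct_field` and `max_json_field` are hypothetical, chosen to mirror the proposed functions; they are not existing Arrow compute kernels.

```python
import json

# Hypothetical sketch of MAX(STRUCT_FIELD(PARSE_JSON(col), "weather.temperature")).
# PARSE_JSON / STRUCT_FIELD / MAX are the illustrative names from the email,
# not functions that exist in Arrow today.

def struct_field(doc, dotted_path):
    """Walk a parsed JSON document along a dotted field path."""
    for key in dotted_path.split("."):
        doc = doc[key]
    return doc

def max_json_field(json_col, dotted_path):
    """Parse each JSON string in the column, extract the field, take the max."""
    return max(struct_field(json.loads(s), dotted_path) for s in json_col)

json_col = [
    '{"weather": {"temperature": 21.5}}',
    '{"weather": {"temperature": 30.1}}',
    '{"weather": {"temperature": 17.0}}',
]
print(max_json_field(json_col, "weather.temperature"))  # 30.1
```

A JSONPath-aware PARSE_JSON("json_col", "$.weather.temperature") would fuse the parse and field extraction into one step, avoiding materializing the full document.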
> Sent from my iPad
>
> On Aug 2, 2022, at 11:38 AM, Wes McKinney <wesmck...@gmail.com> wrote:
>
> I should add that since Parquet has JSON, BSON, and UUID types (and
> while UUID is just a simple fixed-size binary), having the extension
> types so that the metadata flows through accurately to Parquet would be
> net beneficial:
>
> https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L342
>
> Implementing JSON (and BSON and UUID if we want them) as extension
> types and restricting JSON to UTF-8 sounds good to me.
>
> On Tue, Aug 2, 2022 at 12:43 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
>
> > > 2. What do we do about different non-utf8 encodings? There does not
> > > appear to be a consensus yet on this point. One option is to only
> > > allow utf8 encoding and force implementers to convert non-utf8 to
> > > utf8. Second option is to allow all encodings and capture the
> > > encoding in the metadata (I'm leaning towards this option).
> >
> > Allowing non-utf8 encodings adds complexity for everyone. Disallowing
> > them only adds complexity for the tiny minority of producers of
> > non-utf8 JSON.
>
> I'd also add that if we only allow extension on utf8 today, it would be
> a forward/backward compatible change to allow parameterizing the
> extension for bytes type by encoding if we wanted to support it in the
> future. Parquet also only supports UTF-8 [1] for its logical JSON type.
> [1] https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#json
>
> On Mon, Aug 1, 2022 at 11:39 PM Antoine Pitrou <anto...@python.org> wrote:
>
> > On 01/08/2022 at 22:53, Pradeep Gollakota wrote:
> > > Thanks for all the great feedback.
> > >
> > > To proceed forward, we seem to need decisions around the following:
> > >
> > > 1. Whether to use Arrow extensions or first-class types. The
> > > consensus is building towards using Arrow extensions.
> >
> > +1
> >
> > > 2. What do we do about different non-utf8 encodings? There does not
> > > appear to be a consensus yet on this point. One option is to only
> > > allow utf8 encoding and force implementers to convert non-utf8 to
> > > utf8. Second option is to allow all encodings and capture the
> > > encoding in the metadata (I'm leaning towards this option).
> >
> > Allowing non-utf8 encodings adds complexity for everyone. Disallowing
> > them only adds complexity for the tiny minority of producers of
> > non-utf8 JSON.
> >
> > > 3. What do we do about the different formats of JSON (string, BSON,
> > > UBJSON, etc.)?
> >
> > There are no "different formats of JSON". BSON etc. are unrelated
> > formats.
> >
> > Regards
> >
> > Antoine.