I think, from a compute perspective, one would just cast before doing anything, so you wouldn't need much beyond parse and unparse. For example, if you have a JSON document and you want to know the largest value of $.weather.temperature, you could do:
MAX(STRUCT_FIELD(PARSE_JSON("json_col"), "weather.temperature"))

You could maybe add support for a JSONPath-aware parsing mechanism, so then you could do something like:

MAX(PARSE_JSON("json_col", "$.weather.temperature"))

On Wed, Aug 3, 2022 at 4:20 AM Lee, David <david....@blackrock.com.invalid> wrote:
>
> There are probably two ways to approach this.
>
> Physically store the JSON as a UTF-8 string,
>
> or
>
> physically store the JSON as nested lists and structs. This is more
> complicated, and ideally this method would also support including JSON
> schemas to help address missing values and round-trip conversions:
> https://json-schema.org/
>
> Sent from my iPad
>
> On Aug 2, 2022, at 11:23 PM, Lee, David <david....@blackrock.com.invalid> wrote:
>
> While I do like having a JSON type, adding processing functionality,
> especially around compute capabilities, might be limiting.
>
> Arrow already supports nested lists and structs, which can cover JSON
> structures while offering vectorized processing. JSON should only be a
> logical representation of what Arrow physically supports today.
>
> A bad example is Snowflake's semi-structured data support. They have a
> Java engine for tabular data and a JavaScript engine for JSON data. The
> JS engine is a second-class citizen that requires a lot of compute to
> string-parse JSON data before JSON content can be filtered, sorted,
> aggregated, etc.
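To make the compute example above concrete, here is a minimal pure-Python sketch of what MAX(STRUCT_FIELD(PARSE_JSON("json_col"), "weather.temperature")) would evaluate to, using the stdlib json module. The helper names `struct_field` and `max_json_field` are hypothetical, chosen to mirror the proposed functions; they are not existing Arrow compute kernels.

```python
import json

# Hypothetical sketch of MAX(STRUCT_FIELD(PARSE_JSON(col), "weather.temperature")).
# PARSE_JSON / STRUCT_FIELD / MAX are the illustrative names from the email,
# not functions that exist in Arrow today.

def struct_field(doc, dotted_path):
    """Walk a parsed JSON document along a dotted field path."""
    for key in dotted_path.split("."):
        doc = doc[key]
    return doc

def max_json_field(json_col, dotted_path):
    """Parse each JSON string in the column, extract the field, take the max."""
    return max(struct_field(json.loads(s), dotted_path) for s in json_col)

json_col = [
    '{"weather": {"temperature": 21.5}}',
    '{"weather": {"temperature": 30.1}}',
    '{"weather": {"temperature": 17.0}}',
]
print(max_json_field(json_col, "weather.temperature"))  # 30.1
```

A JSONPath-aware PARSE_JSON("json_col", "$.weather.temperature") would fuse the parse and field extraction into one step, avoiding materializing the full document.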
> Sent from my iPad
>
> On Aug 2, 2022, at 11:38 AM, Wes McKinney <wesmck...@gmail.com> wrote:
>
> I should add that since Parquet has JSON, BSON, and UUID types (and
> while UUID is just a simple fixed-size binary), having the extension
> types so that the metadata flows through accurately to Parquet would be
> net beneficial:
>
> https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L342
>
> Implementing JSON (and BSON and UUID if we want them) as extension
> types and restricting JSON to UTF-8 sounds good to me.
>
> On Tue, Aug 2, 2022 at 12:43 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
>
> > > 2. What do we do about different non-utf8 encodings? There does not
> > > appear to be a consensus yet on this point. One option is to only
> > > allow utf8 encoding and force implementers to convert non-utf8 to
> > > utf8. Second option is to allow all encodings and capture the
> > > encoding in the metadata (I'm leaning towards this option).
> >
> > Allowing non-utf8 encodings adds complexity for everyone. Disallowing
> > them only adds complexity for the tiny minority of producers of
> > non-utf8 JSON.
>
> I'd also add that if we only allow extension on utf8 today, it would be
> a forward/backward compatible change to allow parameterizing the
> extension for bytes type by encoding if we wanted to support it in the
> future. Parquet also only supports UTF-8 [1] for its logical JSON type.
> [1] https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#json
>
> On Mon, Aug 1, 2022 at 11:39 PM Antoine Pitrou <anto...@python.org> wrote:
>
> > On 01/08/2022 at 22:53, Pradeep Gollakota wrote:
> > > Thanks for all the great feedback.
> > >
> > > To proceed forward, we seem to need decisions around the following:
> > >
> > > 1. Whether to use Arrow extensions or first-class types. The
> > > consensus is building towards using Arrow extensions.
> >
> > +1
> >
> > > 2. What do we do about different non-utf8 encodings? There does not
> > > appear to be a consensus yet on this point. One option is to only
> > > allow utf8 encoding and force implementers to convert non-utf8 to
> > > utf8. Second option is to allow all encodings and capture the
> > > encoding in the metadata (I'm leaning towards this option).
> >
> > Allowing non-utf8 encodings adds complexity for everyone. Disallowing
> > them only adds complexity for the tiny minority of producers of
> > non-utf8 JSON.
> >
> > > 3. What do we do about the different formats of JSON (string, BSON,
> > > UBJSON, etc.)?
> >
> > There are no "different formats of JSON". BSON etc. are unrelated
> > formats.
> >
> > Regards
> >
> > Antoine.