Re: [ARROW-17255] Logical JSON type in Arrow

2022-08-17 Thread Antoine Pitrou
Hi Pradeep, Thanks for filing this PR! Before merging this PR, I think we should discuss a bit what a canonical extension type is, and how it gets standardized. I'll make a separate discussion thread. Regards Antoine. On 16/08/2022 at 22:40, Pradeep Gollakota wrote: Hi all, I've

Re: [ARROW-17255] Logical JSON type in Arrow

2022-08-03 Thread Antoine Pitrou
On 03/08/2022 at 16:19, Lee, David wrote: There are probably two ways to approach this. Physically store the json as a UTF8 string Or Physically store the json as nested lists and structs. This works if all JSON values follow a predefined schema, which is not necessarily the case.
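To illustrate the schema concern raised above, here is a minimal sketch in plain Python (no Arrow API is used). The sample documents and the helper name `shape` are made up for illustration; the point is only that two valid JSON values can disagree on structure, so a single nested list/struct type cannot hold both.

```python
import json

def shape(value):
    """Return a rough structural description of a parsed JSON value."""
    if isinstance(value, dict):
        return {key: shape(val) for key, val in value.items()}
    if isinstance(value, list):
        return [shape(item) for item in value]
    # Scalars are described by their Python type name (int, str, ...).
    return type(value).__name__

# Hypothetical documents: both are valid JSON, but "id" and "tags"
# have different types in each one.
docs = [
    '{"id": 1, "tags": ["a", "b"]}',
    '{"id": "x-2", "tags": {"primary": "a"}}',
]

shapes = [shape(json.loads(doc)) for doc in docs]

# False here: the documents do not share one predefined schema.
same_schema = all(s == shapes[0] for s in shapes)
```

Storing each document as a UTF8 string sidesteps this, since no common schema is required up front.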

Re: [ARROW-17255] Logical JSON type in Arrow

2022-08-03 Thread Weston Pace
I think, from a compute perspective, one would just cast before doing anything. So you wouldn't need much beyond parse and unparse. For example, if you have a JSON document and you want to know the largest value of $.weather.temperature then you could do...
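A rough illustration of the parse-then-compute idea above, written in plain Python rather than with Arrow kernels. This is not the code elided from the message; the sample column and the helper name `max_temperature` are assumptions made for the sketch, and every document is assumed to be a JSON object.

```python
import json

# Hypothetical column: each element is one JSON document stored as a string,
# with None standing in for a null slot.
docs = [
    '{"weather": {"temperature": 21.5}}',
    '{"weather": {"temperature": 30.1}}',
    '{"weather": {}}',
    None,
]

def max_temperature(column):
    """Parse each JSON string, then take the max of $.weather.temperature."""
    temps = []
    for raw in column:
        if raw is None:
            continue  # skip nulls
        parsed = json.loads(raw)  # the "parse" (cast) step
        value = parsed.get("weather", {}).get("temperature")
        if value is not None:
            temps.append(value)
    # The aggregation runs on ordinary typed values, not on JSON text.
    return max(temps) if temps else None

result = max_temperature(docs)
```

Once the documents are parsed, the aggregation is ordinary typed computation, which is why little JSON-specific compute would be needed beyond parse and unparse.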

Re: [ARROW-17255] Logical JSON type in Arrow

2022-08-03 Thread Lee, David
There are probably two ways to approach this. Physically store the json as a UTF8 string Or Physically store the json as nested lists and structs. This is more complicated and ideally this method would also support including json schemas to help address missing values and round trip

Re: [ARROW-17255] Logical JSON type in Arrow

2022-08-03 Thread Lee, David
While I do like having a json type, adding processing functionality especially around compute capabilities might be limiting. Arrow already supports nested lists and structs which can cover json structures while offering vectorized processing. Json should only be a logical representation of

Re: [ARROW-17255] Logical JSON type in Arrow

2022-08-02 Thread Wes McKinney
I should add that, since Parquet has JSON, BSON, and UUID types, having the extension types so that the metadata flows through accurately to Parquet would be net beneficial, even though UUID is just a simple fixed-size binary:

Re: [ARROW-17255] Logical JSON type in Arrow

2022-08-02 Thread Micah Kornfield
> 2. What do we do about different non-utf8 encodings? There does not appear to be a consensus yet on this point. One option is to only allow utf8 encoding and force implementers to convert non-utf8 to utf8. Second option is to allow all encodings and capture the encoding in

Re: [ARROW-17255] Logical JSON type in Arrow

2022-08-02 Thread Antoine Pitrou
On 01/08/2022 at 22:53, Pradeep Gollakota wrote: Thanks for all the great feedback. To proceed forward, we seem to need decisions around the following: 1. Whether to use arrow extensions or first class types. The consensus is building towards using arrow extensions. +1 2. What do we do

Re: [ARROW-17255] Logical JSON type in Arrow

2022-08-01 Thread Micah Kornfield
> It would be reasonable to restrict JSON to utf8, and tell people they need to transcode in the rare cases where some obnoxious software outputs utf16-encoded JSON. +1 I think this aligns with the latest JSON RFC [1] as well. Sounds good to me too. +1 on the canonical extension type
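The transcoding step mentioned above could look roughly like the following sketch, using only the Python standard library. The function name `to_utf8_json` and the sample payload are hypothetical; nothing here is an Arrow API.

```python
import json

def to_utf8_json(raw: bytes, source_encoding: str = "utf-16") -> bytes:
    """Decode JSON bytes from another encoding and re-encode them as UTF-8."""
    text = raw.decode(source_encoding)  # "utf-16" consumes a leading BOM
    json.loads(text)                    # raises ValueError if not valid JSON
    return text.encode("utf-8")

# Hypothetical producer output: the same document, but UTF-16 encoded.
utf16_payload = '{"city": "Zürich"}'.encode("utf-16")

utf8_payload = to_utf8_json(utf16_payload)
```

The caller does the conversion once at ingestion, so anything downstream can assume the JSON column holds UTF-8 only.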

Re: [ARROW-17255] Logical JSON type in Arrow

2022-07-30 Thread Antoine Pitrou
On 30/07/2022 at 01:02, Wes McKinney wrote: I think either path: * Canonical extension type * First-class type in the Type union in Flatbuffers would be OK. The canonical extension type option is the preferable path here, I think, because it allows Arrow implementations without any special

Re: [ARROW-17255] Logical JSON type in Arrow

2022-07-30 Thread David Li
wrote: > This seems like a common-enough data type that having a first-class logical type would be a good idea (perhaps even more so than UUID!). Compute engines would be able to implement kernels that provide

Re: [ARROW-17255] Logical JSON type in Arrow

2022-07-30 Thread Neal Richardson
e engines would be able to implement kernels that provide manipulations of JSON data similar to what you can do with jq or GraphQL. > On Fri, Jul 29, 2022 at 1:43 PM Pradeep Gollakota wro

Re: [ARROW-17255] Logical JSON type in Arrow

2022-07-29 Thread Wes McKinney
irst-class logical type would be a good idea (perhaps even more so than UUID!). Compute engines would be able to implement kernels that provide manipulations of JSON data similar to what you can do with jq or GraphQL.

Re: [ARROW-17255] Logical JSON type in Arrow

2022-07-29 Thread Micah Kornfield
GraphQL. > On Fri, Jul 29, 2022 at 1:43 PM Pradeep Gollakota wrote: > Hi Team! I filed ARROW-17255 to support the JSON logical type in Arrow. Initially I'm only interested in C++ support that wraps a string. I imagine that as

Re: [ARROW-17255] Logical JSON type in Arrow

2022-07-29 Thread Wes McKinney
at 1:43 PM Pradeep Gollakota wrote: > Hi Team! I filed ARROW-17255 to support the JSON logical type in Arrow. Initially I'm only interested in C++ support that wraps a string. I imagine that as Arrow and Parquet get more sophisticated, we might want to do more