Hi Pradeep,
Thanks for filing this PR!
Before merging this PR, I think we should discuss a bit what a canonical
extension type is, and how it gets standardized. I'll make a separate
discussion thread.
Regards
Antoine.
On 16/08/2022 at 22:40, Pradeep Gollakota wrote:
Hi all,
I've
On 03/08/2022 at 16:19, Lee, David wrote:
> There are probably two ways to approach this.
> Physically store the json as a UTF8 string
> Or
> Physically store the json as nested lists and structs.
This works if all JSON values follow a predefined schema, which is not
necessarily the case.
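To illustrate the point, here is a small stdlib-only sketch (the documents and the `common_keys` helper are hypothetical): a fixed struct schema can only rely on fields present in every document, and free-form JSON quickly violates that.

```python
import json

def common_keys(docs):
    """Return the set of top-level keys shared by every parsed document,
    i.e. the only fields a fixed struct schema could safely rely on."""
    parsed = [json.loads(d) for d in docs]
    keys = set(parsed[0])
    for p in parsed[1:]:
        keys &= set(p)
    return keys

# Documents that follow one schema map cleanly onto a struct...
uniform = ['{"a": 1, "b": 2}', '{"a": 3, "b": 4}']
print(sorted(common_keys(uniform)))  # ['a', 'b']

# ...but free-form JSON often does not: here only "a" survives.
mixed = ['{"a": 1, "b": 2}', '{"a": 3, "c": [1, 2]}']
print(sorted(common_keys(mixed)))    # ['a']
```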
I think, from a compute perspective, one would just cast before doing
anything. So you wouldn't need much beyond parse and unparse. For
example, if you have a JSON document and you want to know the largest
value of $.weather.temperature then you could do...
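The cast-then-compute flow described above can be sketched in plain Python (the documents are made up, and stdlib `json` stands in for an Arrow parse kernel):

```python
import json

# Hypothetical documents; $.weather.temperature is the path from the example.
docs = [
    '{"weather": {"temperature": 21.5}}',
    '{"weather": {"temperature": 30.1}}',
    '{"weather": {"temperature": 18.0}}',
]

# "Cast" step: parse each JSON string into native structures...
parsed = [json.loads(d) for d in docs]

# ...then run an ordinary compute kernel over the extracted values.
max_temp = max(p["weather"]["temperature"] for p in parsed)
print(max_temp)  # 30.1
```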
There are probably two ways to approach this:
1. Physically store the json as a UTF8 string, or
2. Physically store the json as nested lists and structs. This is more
complicated, and ideally this method would also support including json
schemas to help address missing values and round trips.
While I do like having a json type, adding processing functionality, especially
around compute capabilities, might be limiting.
Arrow already supports nested lists and structs, which can cover json structures
while offering vectorized processing. Json should only be a logical
representation of the underlying data.
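The two physical options can be contrasted with a stdlib-only sketch, where plain Python dicts and lists stand in for Arrow's struct and list arrays (the data is made up):

```python
import json

# Option 1: each row holds the raw JSON text.
json_rows = [
    '{"name": "a", "temps": [20, 21]}',
    '{"name": "b", "temps": [18, 25]}',
]

# Option 2: the same data shredded into nested lists and structs,
# stored column-wise the way Arrow's struct/list types would hold it.
columns = {
    "name":  ["a", "b"],
    "temps": [[20, 21], [18, 25]],
}

# With JSON strings, every access re-parses each document...
max_json = max(max(json.loads(r)["temps"]) for r in json_rows)

# ...while the columnar form exposes the values directly.
max_col = max(max(t) for t in columns["temps"])

print(max_json, max_col)  # 25 25
```

The columnar form is what makes vectorized processing possible, at the cost of requiring a schema up front.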
I should add that since Parquet has JSON, BSON, and UUID types (and while UUID
is just a simple fixed-size binary), having the extension types so that the
metadata flows through accurately to Parquet would be net beneficial:
>
> > 2. What do we do about different non-utf8 encodings? There does not
> > appear to be a consensus yet on this point. One option is to only allow
> > utf8 encoding and force implementers to convert non-utf8 to utf8. Second
> > option is to allow all encodings and capture the encoding in the
> > metadata.
On 01/08/2022 at 22:53, Pradeep Gollakota wrote:
> Thanks for all the great feedback.
> To proceed forward, we seem to need decisions around the following:
> 1. Whether to use arrow extensions or first class types. The consensus is
> building towards using arrow extensions.

+1

> 2. What do we do about different non-utf8 encodings? There does not appear
> to be a consensus yet on this point. One option is to only allow utf8
> encoding and force implementers to convert non-utf8 to utf8. Second option
> is to allow all encodings and capture the encoding in the metadata.
>
> It would be reasonable to restrict JSON to utf8, and tell people they
> need to transcode in the rare cases where some obnoxious software
> outputs utf16-encoded JSON.
+1 I think this aligns with the latest JSON RFC [1] as well.
Sounds good to me too. +1 on the canonical extension type.
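Restricting the extension to utf8 pushes transcoding to the ingest boundary; a minimal stdlib-only sketch of that one-time step (the payload is made up):

```python
import json

# Hypothetical payload: JSON text that some producer emitted as UTF-16.
utf16_payload = '{"weather": {"temperature": 21.5}}'.encode("utf-16")

# Transcode once at the boundary, then store/ship UTF-8 everywhere.
utf8_payload = utf16_payload.decode("utf-16").encode("utf-8")

# The transcoded bytes parse as ordinary UTF-8 JSON.
doc = json.loads(utf8_payload)
print(doc["weather"]["temperature"])  # 21.5
```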
On 30/07/2022 at 01:02, Wes McKinney wrote:
> I think either path:
> * Canonical extension type
> * First-class type in the Type union in Flatbuffers
> would be OK. The canonical extension type option is the preferable
> path here, I think, because it allows Arrow implementations without
> any special
> > This seems like a common-enough data type that having a first-class
> > logical type would be a good idea (perhaps even more so than UUID!).
> > Compute engines would be able to implement kernels that provide
> > manipulations of JSON data similar to what you can do with jq or
> > GraphQL.
> >
> > On Fri, Jul 29, 2022 at 1:43 PM Pradeep Gollakota wrote:
> > >
> > > Hi Team!
> > >
> > > I filed ARROW-17255 to support the JSON logical type in Arrow. Initially
> > > I'm only interested in C++ support that wraps a string. I imagine that as
> > > Arrow and Parquet get more sophisticated, we might want to do more