Hi Pradeep,
Thanks for filing this PR!
Before merging this PR, I think we should discuss a bit what a canonical
extension type is, and how it gets standardized. I'll make a separate
discussion thread.
Regards
Antoine.
On 16/08/2022 at 22:40, Pradeep Gollakota wrote:
Hi all,
I've cre
On 03/08/2022 at 16:19, Lee, David wrote:
There are probably two ways to approach this.
Physically store the json as a UTF8 string
Or
Physically store the json as nested lists and structs.
This works if all JSON values follow a predefined schema, which is not
necessarily the case.
I
I think, from a compute perspective, one would just cast before doing
anything. So you wouldn't need much beyond parse and unparse. For
example, if you have a JSON document and you want to know the largest
value of $.weather.temperature then you could do...
MAX(STRUCT_FIELD(PARSE_JSON("json_col"
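The parse-then-extract idea described above can be sketched in plain Python. This is only an illustration under stated assumptions: `json_col` and `extract_temperature` are hypothetical names, the standard `json` module stands in for a JSON parse kernel, and nothing here is an Arrow API.

```python
import json

# Hypothetical column of JSON documents physically stored as UTF-8 strings.
json_col = [
    '{"weather": {"temperature": 21.5}}',
    '{"weather": {"temperature": 30.0}}',
    '{"weather": {}}',  # document without the field
    None,               # null row
]


def extract_temperature(doc):
    """Parse one document and return $.weather.temperature, or None if absent."""
    if doc is None:
        return None
    parsed = json.loads(doc)
    if not isinstance(parsed, dict):
        return None
    weather = parsed.get("weather")
    if not isinstance(weather, dict):
        return None
    return weather.get("temperature")


# "Cast" (parse) every row first, then run an ordinary aggregate over the result.
temps = [extract_temperature(doc) for doc in json_col]
max_temperature = max(t for t in temps if t is not None)
print(max_temperature)
```

Rows where the path does not resolve come back as nulls and are skipped by the aggregate, which is roughly how a cast followed by a null-aware MAX would behave.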
There are probably two ways to approach this.
Physically store the json as a UTF8 string
Or
Physically store the json as nested lists and structs. This is more complicated
and ideally this method would also support including json schemas to help
address missing values and round trip conversio
While I do like having a json type, adding processing functionality especially
around compute capabilities might be limiting.
Arrow already supports nested lists and structs which can cover json structures
while offering vectorized processing. Json should only be a logical
representation of wh
I should add that Parquet has JSON, BSON, and UUID types. Even though
UUID is just a simple fixed-size binary, having the extension types so
that the metadata flows through accurately to Parquet would be net
beneficial:
https://github.com/apache/parquet-format/blob/master/src/main/thrif
>
> > 2. What do we do about different non-utf8 encodings? There does not appear
> > to be a consensus yet on this point. One option is to only allow utf8
> > encoding and force implementers to convert non-utf8 to utf8. Second option
> > is to allow all encodings and capture the encoding in the
On 01/08/2022 at 22:53, Pradeep Gollakota wrote:
Thanks for all the great feedback.
To proceed forward, we seem to need decisions around the following:
1. Whether to use arrow extensions or first class types. The consensus is
building towards using arrow extensions.
+1
2. What do we do
>
> It would be reasonable to restrict JSON to utf8, and tell people they
> need to transcode in the rare cases where some obnoxious software
> outputs utf16-encoded JSON.
+1 I think this aligns with the latest JSON RFC [1] as well.
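For the rare producer that emits UTF-16, the transcoding step suggested above is small. A minimal sketch using only the Python standard library; `to_utf8_json` is a hypothetical helper name, and the source encoding is assumed to be known by the caller:

```python
import json


def to_utf8_json(raw: bytes, source_encoding: str = "utf-16") -> bytes:
    """Decode JSON bytes from source_encoding and re-encode them as UTF-8."""
    text = raw.decode(source_encoding)
    json.loads(text)  # raises ValueError if the payload is not valid JSON
    return text.encode("utf-8")


# Simulate a producer that writes UTF-16 (Python prepends a byte order mark).
utf16_doc = '{"city": "Zürich"}'.encode("utf-16")
utf8_doc = to_utf8_json(utf16_doc)
print(utf8_doc.decode("utf-8"))
```

Validating during the transcode keeps malformed documents out of the column, at the cost of one extra parse per value.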
Sounds good to me too. +1 on the canonical extension type option
On 30/07/2022 at 01:02, Wes McKinney wrote:
I think either path:
* Canonical extension type
* First-class type in the Type union in Flatbuffers
would be OK. The canonical extension type option is the preferable
path here, I think, because it allows Arrow implementations without
any special
I filed ARROW-17268 [1] for the JSON parse/extract/serialize kernels. (Though
probably this would get broken up across multiple tickets.)
[1]: https://issues.apache.org/jira/browse/ARROW-17268
-David
On Sat, Jul 30, 2022, at 11:06, Neal Richardson wrote:
> Sounds good to me too. +1 on the canon
Sounds good to me too. +1 on the canonical extension type option; maybe it
should end up as a first-class type, but I'd like to see us try it without
first and see what that tells us about the path for having an extension
type get promoted to being a first-class type. This is something that has
bee
I think either path:
* Canonical extension type
* First-class type in the Type union in Flatbuffers
would be OK. The canonical extension type option is the preferable
path here, I think, because it allows Arrow implementations without
any special handling for JSON to allow the data to pass throug
Just to be clear, I think we are referring to a "well known"/canonical
extension type [1] here? I'd also be in favor of this. (Disclaimer: I'm a
colleague of Pradeep's.)
[1] https://arrow.apache.org/docs/format/Columnar.html#extension-types
On Fri, Jul 29, 2022 at 3:19 PM Wes McKinney wrote:
> T
This seems like a common-enough data type that having a first-class
logical type would be a good idea (perhaps even more so than UUID!).
Compute engines would be able to implement kernels that provide
manipulations of JSON data similar to what you can do with jq or
GraphQL.
On Fri, Jul 29, 2022 at