I filed ARROW-17268 [1] for the JSON parse/extract/serialize kernels. (Though probably this would get broken up across multiple tickets.)
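For a sense of what one of those kernels might do, here is a rough sketch in Python using only the standard library. The name `json_extract` and its exact semantics are illustrative assumptions, not Arrow's actual compute API:

```python
import json

# Hypothetical sketch (not Arrow's actual API) of a json_extract-style
# compute kernel over a column of JSON strings: pull out one top-level
# key per row, propagating nulls.
def json_extract(column, key):
    out = []
    for doc in column:
        if doc is None:
            out.append(None)  # null in, null out
        else:
            out.append(json.loads(doc).get(key))
    return out

col = ['{"a": 1, "b": "x"}', '{"a": 2}', None]
print(json_extract(col, "a"))  # [1, 2, None]
```

A real kernel would of course operate on Arrow arrays and support nested paths (more like jq), but the null-propagation shape would look the same.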
[1]: https://issues.apache.org/jira/browse/ARROW-17268

-David

On Sat, Jul 30, 2022, at 11:06, Neal Richardson wrote:
> Sounds good to me too. +1 on the canonical extension type option; maybe it
> should end up as a first-class type, but I'd like to see us try it as an
> extension type first and see what that tells us about the path for having
> an extension type get promoted to a first-class type. This is something
> that has been discussed in principle before, but I don't think we've
> worked out what it would look like in practice.
>
> I spoke with someone at the RStudio conference this week who requested
> this type as well. Relatedly, there is a gap in the C++ library: we don't
> have compute functions for JSON parsing and serializing; JSON handling
> lives only in the JSON file reader (and in test utilities etc.). So if you
> get data that has a column of JSON strings, you can't do anything with it
> (unless both my memory and grep fail me).
>
> Neal
>
> On Fri, Jul 29, 2022 at 7:03 PM Wes McKinney <wesmck...@gmail.com> wrote:
>
>> I think either path:
>>
>> * Canonical extension type
>> * First-class type in the Type union in Flatbuffers
>>
>> would be OK. The canonical extension type option is the preferable
>> path here, I think, because it allows Arrow implementations without
>> any special handling for JSON to let the data pass through as Binary
>> or String. Implementations like C++ could see the extension type
>> metadata and construct an instance of arrow::Type::JSON / JsonArray,
>> etc., but when it gets serialized back to Parquet or Arrow IPC it
>> looks like binary/string (since JSON can be utf-16/utf-32, right?)
>> with additional field metadata.
>>
>> On Fri, Jul 29, 2022 at 5:56 PM Pradeep Gollakota
>> <pgollak...@google.com.invalid> wrote:
>> >
>> > Thanks Micah!
>> >
>> > That's certainly one option we could use. It would likely be easier to
>> > implement at the outset. I wonder if something like arrow::json() would
>> > open up more options down the line.
>> >
>> > This brings up an interesting question of whether Parquet logical types
>> > should have a 1:1 mapping with Arrow logical types. Would we also want
>> > an arrow::bson()? I wouldn't think so. Maybe
>> > arrow::json({encoding=string/bson})? I'm not sure which would be better
>> > if we want to enable compute engines to manipulate the JSON data.
>> >
>> > On Fri, Jul 29, 2022 at 6:38 PM Micah Kornfield <emkornfi...@gmail.com>
>> > wrote:
>> >
>> > > Just to be clear, I think we are referring to a "well known"/canonical
>> > > extension type [1] here? I'd also be in favor of this. (Disclaimer:
>> > > I'm a colleague of Pradeep's.)
>> > >
>> > > [1] https://arrow.apache.org/docs/format/Columnar.html#extension-types
>> > >
>> > > On Fri, Jul 29, 2022 at 3:19 PM Wes McKinney <wesmck...@gmail.com>
>> > > wrote:
>> > >
>> > > > This seems like a common-enough data type that having a first-class
>> > > > logical type would be a good idea (perhaps even more so than UUID!).
>> > > > Compute engines would be able to implement kernels that provide
>> > > > manipulations of JSON data similar to what you can do with jq or
>> > > > GraphQL.
>> > > >
>> > > > On Fri, Jul 29, 2022 at 1:43 PM Pradeep Gollakota
>> > > > <pgollak...@google.com.invalid> wrote:
>> > > > >
>> > > > > Hi Team!
>> > > > >
>> > > > > I filed ARROW-17255 to support the JSON logical type in Arrow.
>> > > > > Initially I'm only interested in C++ support that wraps a string.
>> > > > > I imagine that as Arrow and Parquet get more sophisticated, we
>> > > > > might want to do more interesting things (shredding?) with the
>> > > > > JSON.
>> > > > >
>> > > > > David mentioned that there have been discussions around other
>> > > > > "common" extensions like UUID. Is this something the community
>> > > > > would be interested in? My goal at the moment is to be able to
>> > > > > export data from BigQuery to Parquet with the correct LogicalType
>> > > > > set in the exported files.
>> > > > >
>> > > > > Thanks!
>> > > > > Pradeep
>> > > >
>> > >
>> >
>> >
>> > --
>> > Pradeep
>>
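P.S. To make the pass-through Wes describes (and Pradeep's arrow::json({encoding=...}) question) concrete, here is a rough Python sketch of the field-metadata convention from the extension-type spec. The extension name "arrow.json" and the "encoding" parameter key are assumptions for illustration, not settled names:

```python
import json

# Sketch of the pass-through idea: a JSON extension column is just a
# string/binary column whose field metadata carries the extension name,
# per the Arrow extension-type convention. Implementations that don't
# recognize the name simply see a plain string field. The name
# "arrow.json" and the "encoding" key are illustrative assumptions.
def make_json_field_metadata(encoding="utf-8"):
    return {
        b"ARROW:extension:name": b"arrow.json",
        # A parameter like encoding can ride in the serialized metadata
        # payload instead of requiring a second type (e.g. arrow::bson()).
        b"ARROW:extension:metadata": json.dumps(
            {"encoding": encoding}).encode(),
    }

def is_json_extension(field_metadata):
    return field_metadata.get(b"ARROW:extension:name") == b"arrow.json"

meta = make_json_field_metadata()
print(is_json_extension(meta))  # True
```

The point of the sketch is that nothing here needs a new entry in the Type union: a consumer that understands the metadata can promote the column, and everyone else round-trips it as string/binary.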