I filed ARROW-17268 [1] for the JSON parse/extract/serialize kernels. (Though 
probably this would get broken up across multiple tickets.)

[1]: https://issues.apache.org/jira/browse/ARROW-17268
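
For illustration only, here is a rough sketch of what parse/extract/serialize could mean over a column of JSON strings. It uses plain Python lists and the standard json module rather than Arrow arrays; the names json_parse, json_extract and json_serialize are hypothetical and are not existing Arrow compute functions.

```python
import json
from typing import Any, List, Optional, Sequence, Union

PathStep = Union[str, int]  # object key or array index


def json_parse(column: Sequence[Optional[str]]) -> List[Any]:
    """Parse each JSON string; null slots (None) pass through as None."""
    return [None if s is None else json.loads(s) for s in column]


def json_extract(parsed: Sequence[Any], path: Sequence[PathStep]) -> List[Any]:
    """Follow `path` into each parsed value; yield None where it does not resolve."""
    out = []
    for value in parsed:
        for step in path:
            if isinstance(value, dict) and isinstance(step, str) and step in value:
                value = value[step]
            elif (isinstance(value, list) and isinstance(step, int)
                  and -len(value) <= step < len(value)):
                value = value[step]
            else:
                value = None
                break
        out.append(value)
    return out


def json_serialize(values: Sequence[Any]) -> List[Optional[str]]:
    """Serialize each value back to a JSON string.

    Simplification: a JSON null and a null slot are both None here, so both
    come back as a null slot. A real kernel would track validity separately.
    """
    return [None if v is None else json.dumps(v) for v in values]


column = ['{"a": {"b": [1, 2]}}', None, '{"a": 5}']
parsed = json_parse(column)
second_b = json_extract(parsed, ["a", "b", 1])
a_as_json = json_serialize(json_extract(parsed, ["a"]))
```

Keeping the three steps separate mirrors the idea that the work could be split across multiple tickets.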

-David

On Sat, Jul 30, 2022, at 11:06, Neal Richardson wrote:
> Sounds good to me too. +1 on the canonical extension type option; maybe it
> should end up as a first-class type, but I'd like to see us try it without
> first and see what that tells us about the path for having an extension
> type get promoted to being a first-class type. This is something that has
> been discussed in principle before, but I don't know we've worked out what
> it would look like in practice.
>
> I spoke with someone at the RStudio conference this week who requested this
> type as well. Relatedly, there is a gap in the C++ library where we don't
> have compute functions for JSON parsing and serializing; it's only in the
> JSON file reader (and in test utilities etc.). So if you get data that has
> a column of JSON strings, you can't do anything with it (unless both my
> memory and grep fail me).
>
> Neal
>
> On Fri, Jul 29, 2022 at 7:03 PM Wes McKinney <wesmck...@gmail.com> wrote:
>
>> I think either path:
>>
>> * Canonical extension type
>> * First-class type in the Type union in Flatbuffers
>>
>> would be OK. The canonical extension type option is the preferable
>> path here, I think, because it allows Arrow implementations without
>> any special handling for JSON to allow the data to pass through as
>> Binary or String. Implementations like C++ could see the extension
>> type metadata and construct an instance of arrow::Type::JSON /
>> JsonArray, etc., but when it gets serialized back to Parquet or Arrow
>> IPC it looks like binary/string (since JSON can be utf-16/utf-32,
>> right?) with additional field metadata.
>>
>> On Fri, Jul 29, 2022 at 5:56 PM Pradeep Gollakota
>> <pgollak...@google.com.invalid> wrote:
>> >
>> > Thanks Micah!
>> >
>> > That's certainly one option we could use. It would likely be easier to
>> > implement at the outset. I wonder if something like arrow::json() would
>> > open up more options down the line.
>> >
>> > This brings up an interesting question of whether Parquet logical types
>> > should have a 1:1 mapping with Arrow logical types. Would we also want an
>> > arrow::bson()? I wouldn't think so. Maybe
>> > arrow::json({encoding=string/bson})? I'm not sure which would be better
>> > if we want to enable compute engines to manipulate the JSON data.
>> >
>> > On Fri, Jul 29, 2022 at 6:38 PM Micah Kornfield <emkornfi...@gmail.com>
>> > wrote:
>> >
>> > > Just to be clear, I think we are referring to a "well known"/canonical
>> > > extension type [1] here? I'd also be in favor of this. (Disclaimer:
>> > > I'm a colleague of Pradeep's.)
>> > >
>> > > [1] https://arrow.apache.org/docs/format/Columnar.html#extension-types
>> > >
>> > >
>> > > On Fri, Jul 29, 2022 at 3:19 PM Wes McKinney <wesmck...@gmail.com>
>> > > wrote:
>> > >
>> > > > This seems like a common-enough data type that having a first-class
>> > > > logical type would be a good idea (perhaps even more so than UUID!).
>> > > > Compute engines would be able to implement kernels that provide
>> > > > manipulations of JSON data similar to what you can do with jq or
>> > > > GraphQL.
>> > > >
>> > > > On Fri, Jul 29, 2022 at 1:43 PM Pradeep Gollakota
>> > > > <pgollak...@google.com.invalid> wrote:
>> > > > >
>> > > > > Hi Team!
>> > > > >
>> > > > > I filed ARROW-17255 to support the JSON logical type in Arrow.
>> > > > > Initially I'm only interested in C++ support that wraps a string.
>> > > > > I imagine that as Arrow and Parquet get more sophisticated, we might
>> > > > > want to do more interesting things (shredding?) with the JSON.
>> > > > >
>> > > > > David mentioned that there have been discussions around other
>> > > > > "common" extensions like UUID. Is this something that the community
>> > > > > would be interested in? My goal at the moment is to be able to export
>> > > > > data from BigQuery to Parquet with the correct LogicalType set in the
>> > > > > exported files.
>> > > > > Thanks!
>> > > > > Pradeep
>> > > >
>> > >
>> >
>> >
>> > --
>> > Pradeep
>>
