I should add that Parquet has JSON, BSON, and UUID logical types. Even
though UUID is just a fixed-size binary, having extension types so that
the metadata flows through accurately to Parquet would be a net benefit:

https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L342

Implementing JSON (and BSON and UUID if we want them) as extension
types and restricting JSON to UTF-8 sounds good to me.
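
To make the UTF-8-only option concrete, here is a rough sketch of what
it would ask of a producer holding non-UTF-8 JSON: decode, validate, and
hand over text for utf8 storage. The extension name "example.json" and
both helper functions are hypothetical, made up for illustration; only
the two "ARROW:extension:*" metadata keys come from the Arrow format.

```python
import json

# Field-metadata keys Arrow uses to mark a field as an extension type.
EXT_NAME_KEY = "ARROW:extension:name"
EXT_META_KEY = "ARROW:extension:metadata"

# Hypothetical extension name, chosen only for this illustration.
JSON_EXT_NAME = "example.json"


def to_utf8_json(raw: bytes, source_encoding: str = "utf-8") -> str:
    """Decode producer bytes and check that they parse as JSON.

    The returned str is what would be written to a utf8 storage column.
    """
    text = raw.decode(source_encoding)  # UnicodeDecodeError on bad input
    json.loads(text)                    # json.JSONDecodeError if not JSON
    return text


def json_field_metadata() -> dict:
    """Metadata for the utf8 storage field.

    The payload is empty because only UTF-8 is allowed; an encoding
    parameter could be added here later without breaking readers.
    """
    return {EXT_NAME_KEY: JSON_EXT_NAME, EXT_META_KEY: ""}


# A producer with UTF-16 JSON converts once, at the boundary.
utf16_doc = '{"city": "Zürich"}'.encode("utf-16")
stored = to_utf8_json(utf16_doc, source_encoding="utf-16")
```

The conversion cost lands only on the producer that has non-UTF-8 data;
every consumer can assume utf8 storage.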

On Tue, Aug 2, 2022 at 12:43 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
>
> > > 2. What do we do about different non-utf8 encodings? There does not appear
> > > to be a consensus yet on this point. One option is to only allow utf8
> > > encoding and force implementers to convert non-utf8 to utf8. Second option
> > > is to allow all encodings and capture the encoding in the metadata (I'm
> > > leaning towards this option).
> >
> > Allowing non-utf8 encodings adds complexity for everyone. Disallowing
> > them only adds complexity for the tiny minority of producers of non-utf8
> > JSON.
>
> I'd also add that if we only allow extension on utf8 today, it would be a
> forward/backward compatible change to allow parameterizing the extension
> for bytes type by encoding if we wanted to support it in the future.
> Parquet also only supports UTF-8 [1] for its logical JSON type.
>
> [1]
> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#json
>
> On Mon, Aug 1, 2022 at 11:39 PM Antoine Pitrou <anto...@python.org> wrote:
>
> >
> > Le 01/08/2022 à 22:53, Pradeep Gollakota a écrit :
> > > Thanks for all the great feedback.
> > >
> > > To proceed forward, we seem to need decisions around the following:
> > >
> > > 1. Whether to use arrow extensions or first class types. The consensus is
> > > building towards using arrow extensions.
> >
> > +1
> >
> > > 2. What do we do about different non-utf8 encodings? There does not appear
> > > to be a consensus yet on this point. One option is to only allow utf8
> > > encoding and force implementers to convert non-utf8 to utf8. Second option
> > > is to allow all encodings and capture the encoding in the metadata (I'm
> > > leaning towards this option).
> >
> > Allowing non-utf8 encodings adds complexity for everyone. Disallowing
> > them only adds complexity for the tiny minority of producers of non-utf8
> > JSON.
> >
> > > 3. What do we do about the different formats of JSON (string, BSON, UBJSON,
> > > etc.)?
> >
> > There are no "different formats of JSON". BSON etc. are unrelated formats.
> >
> > Regards
> >
> > Antoine.
> >
