I should add that Parquet has JSON, BSON, and UUID types. Although UUID is just a simple fixed-size binary, having the extension types, so that the metadata flows through accurately to Parquet, would be net beneficial:
https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L342

Implementing JSON (and BSON and UUID if we want them) as extension types and restricting JSON to UTF-8 sounds good to me.

On Tue, Aug 2, 2022 at 12:43 AM Micah Kornfield <emkornfi...@gmail.com> wrote:

> > > 2. What do we do about different non-utf8 encodings? There does not appear
> > > to be a consensus yet on this point. One option is to only allow utf8
> > > encoding and force implementers to convert non-utf8 to utf8. Second option
> > > is to allow all encodings and capture the encoding in the metadata (I'm
> > > leaning towards this option).
> >
> > Allowing non-utf8 encodings adds complexity for everyone. Disallowing
> > them only adds complexity for the tiny minority of producers of non-utf8
> > JSON.
>
> I'd also add that if we only allow extension on utf8 today, it would be a
> forward/backward compatible change to allow parameterizing the extension
> for bytes type by encoding if we wanted to support it in the future.
> Parquet also only supports UTF-8 [1] for its logical JSON type.
>
> [1] https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#json
>
> On Mon, Aug 1, 2022 at 11:39 PM Antoine Pitrou <anto...@python.org> wrote:
>
> > On 01/08/2022 at 22:53, Pradeep Gollakota wrote:
> > > Thanks for all the great feedback.
> > >
> > > To proceed forward, we seem to need decisions around the following:
> > >
> > > 1. Whether to use arrow extensions or first class types. The consensus is
> > > building towards using arrow extensions.
> >
> > +1
> >
> > > 2. What do we do about different non-utf8 encodings? There does not appear
> > > to be a consensus yet on this point. One option is to only allow utf8
> > > encoding and force implementers to convert non-utf8 to utf8. Second option
> > > is to allow all encodings and capture the encoding in the metadata (I'm
> > > leaning towards this option).
> >
> > Allowing non-utf8 encodings adds complexity for everyone. Disallowing
> > them only adds complexity for the tiny minority of producers of non-utf8
> > JSON.
> >
> > > 3. What do we do about the different formats of JSON (string, BSON, UBJSON,
> > > etc.)?
> >
> > There are no "different formats of JSON". BSON etc. are unrelated formats.
> >
> > Regards
> >
> > Antoine.
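
For illustration, the UTF-8-only option discussed in this thread pushes the conversion onto the producer. A minimal sketch of what that could look like, using only the Python standard library, is below. The helper name `to_utf8_json` is hypothetical and is not part of any Arrow or Parquet API; how the resulting bytes would be attached to an extension type is left out.

```python
import json


def to_utf8_json(raw: bytes, source_encoding: str) -> bytes:
    """Transcode JSON text from source_encoding to UTF-8.

    Hypothetical producer-side helper: it validates the payload on the way
    through so that only well-formed JSON, encoded as UTF-8, is stored.
    """
    # Raises UnicodeDecodeError if raw is not valid in source_encoding.
    text = raw.decode(source_encoding)
    # Raises json.JSONDecodeError (a ValueError) if text is not valid JSON.
    json.loads(text)
    return text.encode("utf-8")


# A producer holding UTF-16 JSON converts it once, before writing.
utf16_doc = '{"name": "café"}'.encode("utf-16")
utf8_doc = to_utf8_json(utf16_doc, "utf-16")
```

With this approach the cost of non-UTF-8 input is paid once by the producer that has it, and consumers can assume UTF-8 without inspecting any encoding metadata.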