Those all seem to be C++ locations.  If we want to define
cross-implementation "Well Known Extension Types", then it seems we
would want some kind of language-independent agreement (it could just
be a markdown file, but there may be some advantage to having
something programmatically consumable) describing:

* The name of the extension type (to go in ARROW:extension:name)
* A description of the extension type and how it should be used
* The storage type of the extension type
* The format and meaning of the content that will go into
ARROW:extension:metadata

I think (but am not sure) that, since these are metadata keys, we are
supposed to stick to printable ASCII for values (for backwards
compatibility).
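
To make the "programmatically consumable" idea a bit more concrete,
here is a rough sketch of what one entry could look like and how it
would map onto the two metadata keys.  Only the key names come from
the format; the ExtensionTypeSpec class, its field names, the
"example.tensor" name and the printable-ASCII check are all my own
assumptions for illustration, not anything that exists today.

```python
from dataclasses import dataclass

# The two field-metadata keys named above.
NAME_KEY = "ARROW:extension:name"
METADATA_KEY = "ARROW:extension:metadata"


@dataclass(frozen=True)
class ExtensionTypeSpec:
    """One hypothetical registry entry; the field names are illustrative."""

    name: str             # value that goes in ARROW:extension:name
    description: str      # what the type is and how it should be used
    storage_type: str     # storage type, spelled here as free-form text
    metadata_format: str  # format and meaning of ARROW:extension:metadata


def is_printable_ascii(text: str) -> bool:
    """True if every character is in the printable ASCII range 0x20..0x7E."""
    return all(0x20 <= ord(ch) <= 0x7E for ch in text)


def field_metadata(spec: ExtensionTypeSpec, serialized_metadata: str) -> dict:
    """Build the key/value pairs a field would carry for this extension type.

    The ASCII restriction is the (unconfirmed) assumption from this email,
    enforced here only to show where such a check could live.
    """
    for value in (spec.name, serialized_metadata):
        if not is_printable_ascii(value):
            raise ValueError(f"not printable ASCII: {value!r}")
    return {NAME_KEY: spec.name, METADATA_KEY: serialized_metadata}


# Hypothetical entry; "example.tensor" is a made-up name.
TENSOR_SPEC = ExtensionTypeSpec(
    name="example.tensor",
    description="Multidimensional array, one tensor per value",
    storage_type="binary",
    metadata_format='JSON object with "type" and "shape" keys',
)

example = field_metadata(TENSOR_SPEC, '{"type": "int8", "shape": [4, 5]}')
```

Whether the entries live in markdown, JSON or something else is a
separate question; the point is only that the four items above fit in
a small, language-neutral record.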

For example, in the docs, we currently have this little blurb about a
theoretical tensor extension type:

> tensor (multidimensional array) stored as Binary values and
> having serialized metadata indicating the data type and shape
> of each value. This could be JSON like {'type': 'int8', 'shape':
> [4, 5]} for a 4x5 cell tensor.
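
As a sketch of what consuming that metadata might involve (this is my
reading of the blurb, not an agreed format): the snippet below writes
the example with double quotes so that it is valid JSON, and the
ITEM_SIZE table and function names are assumptions made up for this
email.

```python
import json
from math import prod

# Assumed byte widths for a few type names; not part of any spec.
ITEM_SIZE = {"int8": 1, "int16": 2, "int32": 4, "int64": 8}


def serialize_tensor_metadata(dtype: str, shape) -> str:
    """Serialize the data type and shape as a JSON string."""
    return json.dumps({"type": dtype, "shape": list(shape)})


def parse_tensor_metadata(serialized: str):
    """Parse the JSON metadata back into (dtype, shape), with basic checks."""
    meta = json.loads(serialized)
    dtype, shape = meta["type"], meta["shape"]
    if dtype not in ITEM_SIZE:
        raise ValueError(f"unknown type name: {dtype!r}")
    if not all(isinstance(dim, int) and dim >= 0 for dim in shape):
        raise ValueError(f"invalid shape: {shape!r}")
    return dtype, tuple(shape)


def expected_value_size(serialized: str) -> int:
    """Number of bytes each Binary value should hold for this metadata."""
    dtype, shape = parse_tensor_metadata(serialized)
    return ITEM_SIZE[dtype] * prod(shape)


metadata = serialize_tensor_metadata("int8", (4, 5))
```

For the 4x5 int8 case, each Binary value would then be expected to
hold 4 * 5 * 1 = 20 bytes.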

In my mind this file would be somewhat analogous to the way that
schema.fbs is the cross-implementation "ground truth" for our logical
types.

Then the C++ implementation would be free to put the implementation
wherever it sees fit (I'd vote for arrow/cpp/extensions, but a separate
repo is probably OK; I'm -1 on arrow/extensions/...).

On Thu, Jan 20, 2022 at 3:20 PM Rok Mihevc <rok.mih...@gmail.com> wrote:
>
> To continue the ExtensionType part of this thread - I would like to
> add TensorArray [1] as an ExtensionType to Arrow but we have not yet
> agreed on an "official" location for "Well Known Extension Types".
>
> Where could we put these? Some suggestions:
>
> * implementation folders (e.g. arrow/cpp/extensions/tensor_array.h)
> * extensions folder (e.g. arrow/extensions/cpp/tensor_array.h)
> * separate repo (e.g. github.com/apache/arrow_extensions/cpp/tensor_array.h)
>
> I'd be happy to also gather other Well Known Extension Types into one
> location if this moves forward.
>
> Rok
>
> [1] https://github.com/apache/arrow/pull/8510#issuecomment-991150389
>
> On Sat, May 1, 2021 at 12:12 PM Andrew Lamb <al...@influxdata.com> wrote:
> >
> > I agree with others on this thread. Thanks for writing this down Micah
> >
> > On Fri, Apr 30, 2021 at 11:16 AM Antoine Pitrou <anto...@python.org> wrote:
> >
> > >
> > > I concur with both what Wes and Micah said.
> > >
> > > As for temporal types, they have wide-spread use and their semantics
> > > require dedicated treatment for arithmetic and conversion, so it's
> > > helpful to define dedicated types for them, as opposed to mere 
> > > annotations.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > > > On 30/04/2021 at 16:40, Wes McKinney wrote:
> > > > > I agree that the bar for adding new types to the Type union in
> > > > > Schema.fbs should be quite high going forward. Using extension
> > > > > types increasingly for adding specializations of built-in types
> > > > > will mean less burden for implementations to simply "propagate
> > > > > forward" this data (by preserving the extra metadata) even if they
> > > > > don't understand what it does. It would be nice, therefore, to put
> > > > > us on a path to expanding our set of "official" extension types
> > > > > (which would include things like JSON or UUID) since some
> > > > > libraries may choose to implement convenience containers for these
> > > > > for usability.
> > > >
> > > > > On Fri, Apr 30, 2021 at 9:22 AM Brian Hulette <bhule...@apache.org> wrote:
> > > >
> > > >> +1 this looks good to me.
> > > >>
> > > > >> My only concern is with criteria #3 "Is the underlying encoding of the
> > > > >> type already semantically supported by a type?". I think this is a good
> > > > >> criteria, but it's inconsistent with the current spec. By that criteria
> > > > >> some existing types (Timestamp, Time, Duration, Date) should be well
> > > > >> known extension types, right?
> > > > >>
> > > > >> Perhaps we should explicitly indicate these types are grandfathered in
> > > > >> [1] because they existed before extension types, to avoid tension with
> > > > >> this criteria.
> > > >>
> > > >> Brian
> > > >>
> > > >> [1] https://en.wikipedia.org/wiki/Grandfather_clause
> > > >>
> > > > >> On Thu, Apr 29, 2021 at 9:13 PM Jorge Cardoso Leitão <jorgecarlei...@gmail.com> wrote:
> > > >>
> > > >>> Thanks for writing this.
> > > >>>
> > > >>> I agree. That is a good decision tree. +1
> > > >>>
> > > >>> Best,
> > > >>> Jorge
> > > >>>
> > > >>>
> > > > >>> On Thu, Apr 29, 2021 at 6:08 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
> > > >>>
> > > > >>>> The discussion around adding another interval type to the Schema.fbs
> > > > >>>> raises the issue of when do we decide to add a new type to the
> > > > >>>> Schema.fbs vs using other means (primarily extension types [1]).
> > > > >>>>
> > > > >>>> A few criteria come to mind that could help decide (feedback welcome):
> > > >>>>
> > > > >>>> 1.  Is the type a new parameterization of an existing type?
> > > > >>>>     - If yes, and we believe the parameterization is useful and can be
> > > > >>>>       done in a forward/backward compatible manner then we would
> > > > >>>>       update Schema.fbs.
> > > > >>>>
> > > > >>>> 2.  Does the type itself have its own specification for processing
> > > > >>>>     (e.g. JSON, BSON, Thrift, Avro, Protobuf)?
> > > > >>>>     - If yes, we would NOT add them to Schema.fbs.  I think this would
> > > > >>>>       potentially yield too many new types.
> > > > >>>>
> > > > >>>> 3.  Is the underlying encoding of the type already semantically
> > > > >>>>     supported by a type? (e.g. if we want to encode physical lengths
> > > > >>>>     like meters these can be represented by an integer).
> > > > >>>>     - If yes, we would NOT update the specification.  This seems like
> > > > >>>>       the exact use-case that extension types are meant for.
> > > >>>>
> > > >>>> * How does this apply to Interval? *
> > > > >>>> Interval extends an existing type in the specification and multiple
> > > > >>>> "packed fields" cannot be easily communicated with the current version
> > > > >>>> of the specification.  Hence, I feel comfortable making the addition
> > > > >>>> to Schema.fbs.
> > > >>>>
> > > >>>> * What does this mean for other common types? *
> > > >>>>
> > > > >>>> I think as types come up that are very common but we don't want to add
> > > > >>>> to the Schema.fbs we should invest in formalizing them as "Well Known"
> > > > >>>> Extension types.  In this scenario, we would update the specification
> > > > >>>> to include how to specify the extension type metadata (and still
> > > > >>>> require at least two libraries support the Extension type before
> > > > >>>> inclusion as "Well Known").
> > > >>>>
> > > >>>> * Practical implications *
> > > >>>>
> > > > >>>> I think this means the type system in Schema.fbs is mostly closed
> > > > >>>> (i.e. there is a high bar for adding new types). One potentially
> > > > >>>> useful type to have would be a "packed struct" that supports
> > > > >>>> something similar to python struct library [2].  I think this would
> > > > >>>> likely cover many extension type use-cases.
> > > >>>>
> > > >>>> Thoughts?
> > > >>>>
> > > >>>> -Micah
> > > >>>>
> > > >>>> [1]
> > > https://arrow.apache.org/docs/format/Columnar.html#extension-types
> > > >>>> [2] https://docs.python.org/3/library/struct.html
> > > >>>>
> > > >>>
> > > >>
> > > >
> > >
