To continue the ExtensionType part of this thread - I would like to add TensorArray [1] as an ExtensionType to Arrow but we have not yet agreed on an "official" location for "Well Known Extension Types".
Where could we put these? Some suggestions:

* implementation folders (e.g. arrow/cpp/extensions/tensor_array.h)
* extensions folder (e.g. arrow/extensions/cpp/tensor_array.h)
* separate repo (e.g. github.com/apache/arrow_extensions/cpp/tensor_array.h)

I'd be happy to also gather other Well Known Extension Types into one location if this moves forward.

Rok

[1] https://github.com/apache/arrow/pull/8510#issuecomment-991150389

On Sat, May 1, 2021 at 12:12 PM Andrew Lamb <[email protected]> wrote:
>
> I agree with others on this thread. Thanks for writing this down Micah
>
> On Fri, Apr 30, 2021 at 11:16 AM Antoine Pitrou <[email protected]> wrote:
>
> >
> > I concur with both what Wes and Micah said.
> >
> > As for temporal types, they have wide-spread use and their semantics
> > require dedicated treatment for arithmetic and conversion, so it's
> > helpful to define dedicated types for them, as opposed to mere annotations.
> >
> > Regards
> >
> > Antoine.
> >
> >
> > On 30/04/2021 at 16:40, Wes McKinney wrote:
> > > I agree that the bar for adding new types to the Type union in Schema.fbs
> > > should be quite high going forward. Using extension types increasingly for
> > > adding specializations of built-in types will mean less burden for
> > > implementations to simply "propagate forward" this data (by preserving the
> > > extra metadata) even if they don't understand what it does. It would be
> > > nice, therefore, to put us on a path to expanding our set of "official"
> > > extension types (which would include things like JSON or UUID) since some
> > > libraries may choose to implement convenience containers for these for
> > > usability.
> > >
> > > On Fri, Apr 30, 2021 at 9:22 AM Brian Hulette <[email protected]> wrote:
> > >
> > >> +1 this looks good to me.
> > >>
> > >> My only concern is with criteria #3 "Is the underlying encoding of the
> > >> type already semantically supported by a type?".
> > >> I think this is a good
> > >> criteria, but it's inconsistent with the current spec. By that criteria
> > >> some existing types (Timestamp, Time, Duration, Date) should be well known
> > >> extension types, right?
> > >>
> > >> Perhaps we should explicitly indicate these types are grandfathered in [1]
> > >> because they existed before extension types, to avoid tension with this
> > >> criteria.
> > >>
> > >> Brian
> > >>
> > >> [1] https://en.wikipedia.org/wiki/Grandfather_clause
> > >>
> > >> On Thu, Apr 29, 2021 at 9:13 PM Jorge Cardoso Leitão <[email protected]> wrote:
> > >>
> > >>> Thanks for writing this.
> > >>>
> > >>> I agree. That is a good decision tree. +1
> > >>>
> > >>> Best,
> > >>> Jorge
> > >>>
> > >>>
> > >>> On Thu, Apr 29, 2021 at 6:08 PM Micah Kornfield <[email protected]> wrote:
> > >>>
> > >>>> The discussion around adding another interval type to the Schema.fbs raises
> > >>>> the issue of when do we decide to add a new type to the Schema.fbs vs using
> > >>>> other means (primarily extension types [1]).
> > >>>>
> > >>>> A few criteria come to mind that could help decide (feedback welcome):
> > >>>>
> > >>>> 1. Is the type a new parameterization of an existing type?
> > >>>>     - If Yes, and we believe the parameterization is useful and can be done
> > >>>>       in a forward/backward compatible manner then we would update Schema.fbs.
> > >>>>
> > >>>> 2. Does the type itself have its own specification for processing (e.g.
> > >>>>    JSON, BSON, Thrift, Avro, Protobuf)?
> > >>>>     - If yes, we would NOT add them to Schema.fbs. I think this would
> > >>>>       potentially yield too many new types.
> > >>>>
> > >>>> 3. Is the underlying encoding of the type already semantically supported
> > >>>>    by a type? (e.g. if we want to encode physical lengths like meters these
> > >>>>    can be represented by an integer).
> > >>>>     - If yes, we would NOT update the specification.
> > >>>>       This seems like the
> > >>>>       exact use-case that extension types are meant for.
> > >>>>
> > >>>> * How does this apply to Interval? *
> > >>>> Interval extends an existing type in the specification and multiple "packed
> > >>>> fields" cannot be easily communicated with the current version of the
> > >>>> specification. Hence, I feel comfortable making the addition to Schema.fbs
> > >>>>
> > >>>> * What does this mean for other common types? *
> > >>>>
> > >>>> I think as types come up that are very common but we don't want to add to
> > >>>> the Schema.fbs we should invest in formalizing them as "Well Known"
> > >>>> Extension types. In this scenario, we would update the specification to
> > >>>> include how to specify the extension type metadata (and still require at
> > >>>> least two libraries support the Extension type before inclusion as "Well
> > >>>> Known").
> > >>>>
> > >>>> * Practical implications *
> > >>>>
> > >>>> I think this means the type system in Schema.fbs is mostly closed (i.e.
> > >>>> there is a high bar for adding new types). One potentially useful type to
> > >>>> have would be a "packed struct" that supports something similar to python
> > >>>> struct library [2]. I think this would likely cover many extension type
> > >>>> use-cases.
> > >>>>
> > >>>> Thoughts?
> > >>>>
> > >>>> -Micah
> > >>>>
> > >>>> [1] https://arrow.apache.org/docs/format/Columnar.html#extension-types
> > >>>> [2] https://docs.python.org/3/library/struct.html
> > >>>>
> > >>>
> > >>
> >
>
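
P.S. For anyone skimming the quoted thread: the "packed struct" idea in Micah's message can be illustrated with a minimal sketch using the Python struct library [2]. The layout below (int32 months, int32 days, int64 nanoseconds, little-endian, no padding) and the names INTERVAL_FORMAT, pack_interval and unpack_interval are assumptions made up for this example; none of them come from the Arrow specification or from the thread.

```python
import struct

# Hypothetical layout for an interval made of three "packed fields":
# int32 months, int32 days, int64 nanoseconds.
# "<" selects little-endian with standard sizes and no padding.
INTERVAL_FORMAT = "<iiq"
INTERVAL_WIDTH = struct.calcsize(INTERVAL_FORMAT)  # bytes per packed value


def pack_interval(months: int, days: int, nanoseconds: int) -> bytes:
    """Pack the three fields into one fixed-width binary value."""
    return struct.pack(INTERVAL_FORMAT, months, days, nanoseconds)


def unpack_interval(buf: bytes) -> tuple:
    """Recover (months, days, nanoseconds) from a packed value."""
    return struct.unpack(INTERVAL_FORMAT, buf)
```

One way to carry such values might be a fixed-size binary storage column of width INTERVAL_WIDTH, with the format string recorded in the extension type metadata. That pairing is only a guess at how a "packed struct" could work, not something agreed on here.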
