Thank you for opening up the discussion here! I agree that attaching semantics as metadata and/or documenting them in a central repository is an unreasonable burden to put on extension type authors and Arrow implementations in general.

I also agree that operations other than filter/take/concatenate should error by default: just because a storage type happens to be an integer doesn't necessarily mean that arithmetic (for example) is meaningful. An extension type implementing a bitpacked uint64 such as an S2 cell or H3 index, for instance, would produce an invalid value for "plus one" or "times three".

For query engines and/or implementations with extensive compute capability like Arrow C++, it is useful to be able to leverage that capability for extension types: for the S2/H3 index example, it would be very cool to allow a group_by + aggregate to "just work" (since ==/hash *is* valid for this example), although I don't imagine it's a development priority for anybody right now. I agree with Antoine that implementations should be able to choose how/if extension type authors can leverage other capabilities of the engine.

If this is pursued further, it might be worth checking out a particularly successful extensible vector system implemented in R via the vctrs package ( https://vctrs.r-lib.org/ ). "vector" class authors can implement one or more S3 methods (i.e., traits):

- vec_proxy(x) (get me the storage array)
- vec_ptype2(type1, type2) (given two types, get me a type that I can cast both to or error)
- vec_cast(x, type) (perform a lossless cast to type or error)
- vec_proxy_equal(x) (get me a storage array where == does the right thing)
- vec_proxy_order(x) (get me a storage array that sorts in the correct order)
- vec_math(op, x) (perform unary math ops like sum)
- vec_arith(op, lhs, rhs) (perform binary math ops like +, -, etc.)

Cheers!

-dewey

On Wed, Dec 13, 2023 at 12:39 PM Benjamin Kietzman <bengil...@gmail.com> wrote:
>
> The main problem I see with adding properties to ExtensionType is I'm not
> sure where that information would reside.
> Allowing type authors to declare
> in which ways the type is equivalent (or not) to its storage is appealing,
> but it seems to need an official extension field like
> ARROW:extension:semantics. Otherwise I think each extension type's
> semantics would need to be maintained within every implementation as well
> as in a central reference (probably in Columnar.rst), which seems
> unreasonable to expect of extension type authors. I'm also skeptical that
> useful information could be packed into an ARROW:extension:semantics field;
> even if the type can declare that ordering-as-with-storage is invalid I
> don't think it'd be feasible to specify the correct ordering.
>
> If we cannot attach this information to extension types, the question
> becomes which defaults are most reasonable for engines and how can the
> engine most usefully be configured outside those defaults. My own
> preference would be to refuse operations other than selection or
> casting-to-storage, with a runtime extensible registry of allowed implicit
> casts. This will allow users of the engine to configure their extension
> types as they need, and the error message raised when an implicit
> cast-to-storage is not allowed could include the suggestion to register the
> implicit cast. For applications built against a specific engine, this
> approach would allow recovering much of the advantage of attaching
> properties to an ExtensionType by including registration of implicit casts
> in the ExtensionType's initialization.
>
> On Wed, Dec 13, 2023 at 10:46 AM Benjamin Kietzman <bengil...@gmail.com>
> wrote:
>
> > Hello all,
> >
> > Recently, a PR to arrow c++ [1] was opened to allow implicit casting from
> > any extension type to its storage type in acero. This raises questions
> > about the validity of applying operations to an extension array's storage.
> > For example, some extension type authors may intend different ordering for
> > arrays of their new type than would be applied to the array's storage or
> > may not intend for the type to participate in arithmetic even though its
> > storage could.
> >
> > Suggestions/observations from discussion on that PR included:
> > - Extension types could provide general semantic description of storage
> > type equivalence [2], so that a flag on the extension type enables ordering
> > by storage but disables arithmetic on it
> > - Compute functions or kernels could be augmented with a filter declaring
> > which extension types are supported [3].
> > - Currently arrow-rs considers extension types metadata only [4], so all
> > kernels treat extension arrays equivalently to their storage.
> > - Currently arrow c++ only supports explicitly casting from an extension
> > type to its storage (and from storage to ext), so any operation can be
> > performed on an extension array's storage but it requires opting in.
> >
> > Sincerely,
> > Ben Kietzman
> >
> > [1] https://github.com/apache/arrow/pull/39200
> > [2] https://github.com/apache/arrow/pull/39200#issuecomment-1852307954
> > [3] https://github.com/apache/arrow/pull/39200#issuecomment-1852676161
> > [4] https://github.com/apache/arrow/pull/39200#issuecomment-1852881651
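
As a rough illustration of the "error by default, opt in per capability" idea discussed in this thread, here is a minimal Python sketch. Everything in it (ExtensionVector, CellIndex, proxy_equal, arith, group_count) is a hypothetical name invented for the example and only loosely modelled on the vctrs hooks listed in the reply; none of it is an Arrow or vctrs API, and a plain Python list stands in for the storage array.

```python
from collections import defaultdict


class UnsupportedOperation(TypeError):
    """Raised when an extension type has not opted in to an operation."""


class ExtensionVector:
    """Hypothetical base class: values live in a plain list (the 'storage')."""

    def __init__(self, storage):
        self.storage = list(storage)

    # Selection-style operations are always safe, so they work by default.
    def take(self, indices):
        return type(self)(self.storage[i] for i in indices)

    # Opt-in hooks, loosely modelled on vec_proxy_equal() and vec_arith().
    # The defaults refuse, so nothing falls through to storage implicitly.
    def proxy_equal(self):
        raise UnsupportedOperation(
            f"{type(self).__name__} does not define equality"
        )

    def arith(self, op, other):
        raise UnsupportedOperation(
            f"{type(self).__name__} does not support '{op}'"
        )


class CellIndex(ExtensionVector):
    """Stand-in for a bit-packed uint64 index such as an S2 cell or H3 index."""

    def proxy_equal(self):
        # ==/hash on the storage integers is meaningful for this type.
        return self.storage


def group_count(keys):
    """Count rows per key using only the equality proxy."""
    counts = defaultdict(int)
    for key in keys.proxy_equal():
        counts[key] += 1
    return dict(counts)


cells = CellIndex([10, 42, 10, 7])
print(group_count(cells))  # {10: 2, 42: 1, 7: 1}

try:
    cells.arith("+", 1)  # arithmetic was never opted in to
except UnsupportedOperation as exc:
    print(exc)  # CellIndex does not support '+'
```

In this sketch the group-by works because CellIndex opted in to equality, while "plus one" still errors because the type never implemented arith. A real engine would presumably dispatch through its own kernel registry rather than through methods on the array.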