I also like these equivalence traits. In addition to being easy for extension type authors to specify when registering an extension type in Arrow C++, implementations that allow registration, like pyarrow and arrow/R, would be able to specify them easily. Implementing methods, compute functions, or overloads to handle this (as vctrs does with vec_proxy_equal, which often just returns its input) would have performance implications, since the methods might have to be defined in R or Python.
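To make the idea concrete, here is a minimal sketch of how a function could declare the traits it requires and how a dispatcher could test them against what an extension type declares. The bit values are copied from the draft quoted further down; `ExtensionTypeSketch`, `UuidLikeType` and `CanUseStorage` are hypothetical names made up for illustration and are not Arrow C++ API.

```cpp
#include <cstdint>

// Trait bits, copied from the draft quoted later in this thread.
constexpr uint32_t kEquality = 1;
constexpr uint32_t kOrdering = 2;
constexpr uint32_t kSelection = 4;
constexpr uint32_t kArithmetic = 8;
constexpr uint32_t kCasting = 16;

// Hypothetical stand-in for an extension type; NOT the real arrow::ExtensionType.
class ExtensionTypeSketch {
 public:
  virtual ~ExtensionTypeSketch() = default;
  // Default from the draft: storage may stand in for the extension type
  // for equality testing/hashing and for selections.
  virtual uint32_t storage_equivalence() const { return kEquality | kSelection; }
};

// Hypothetical UUID-like type that additionally opts into ordering by storage.
class UuidLikeType : public ExtensionTypeSketch {
 public:
  uint32_t storage_equivalence() const override {
    return kEquality | kSelection | kOrdering;
  }
};

// A function may run on the storage only if every trait it requires
// is present in the mask declared by the extension type.
inline bool CanUseStorage(const ExtensionTypeSketch& type, uint32_t required) {
  return (type.storage_equivalence() & required) == required;
}
```

With this shape, a sort kernel would ask for `kOrdering` and a hash-join key would ask for `kEquality`; neither needs to know the concrete extension type, which I take to be the appeal of the proposal.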
It may also be worth adding a compute function for "force storage" (a no-op for anything except an extension array), which is maybe safer than a cast (which implies, I think, some logical equivalence between the input and the result). That would let a user work around a situation where the extension type author didn't handle a case that the user expected to work.

Cheers!

-dewey

On Fri, Dec 15, 2023 at 3:13 AM Jin Shang <shangjin1...@gmail.com> wrote:
>
> I'm in favor of Antoine's proposal of storage equivalence traits[1]. For
> the sake of clarity I'll paste it here:
>
> > I would suggest we perhaps need a more general semantic description of
> > storage type equivalence.
> >
> > Draft:
> >
> > class ExtensionType {
> >  public:
> >   // Storage equivalence for equality testing and hashing
> >   static constexpr uint32_t kEquality = 1;
> >   // Storage equivalence for ordered comparisons
> >   static constexpr uint32_t kOrdering = 2;
> >   // Storage equivalence for selections (filter, take, etc.)
> >   static constexpr uint32_t kSelection = 4;
> >   // Storage equivalence for arithmetic
> >   static constexpr uint32_t kArithmetic = 8;
> >   // Storage equivalence for explicit casts
> >   static constexpr uint32_t kCasting = 16;
> >   // Storage equivalence for all operations
> >   static constexpr uint32_t kAny = std::numeric_limits<uint32_t>::max();
> >
> >   // By default, an extension type can be implicitly handled as its storage type
> >   // for selections, equality testing and hashing.
> >   virtual uint32_t storage_equivalence() const { return kEquality | kSelection; }
> >
>
> I think this is well balanced between convenience and safety. The default
> option ensures the "normal" operations like take, group-by, unique... just
> work, and extension type authors can easily opt into additional functions.
>
> It also requires minimum engineering efforts. Each function only needs to
> specify what traits it requires, rather than the actual types.
>
> BTW I've checked every C++ compute function and I think the only traits
> missing here are one for string operations, and one for generation such as
> `random`.
>
> [1] https://github.com/apache/arrow/pull/39200#issuecomment-1852307954
>
> Best,
> Jin
>
> On Thu, Dec 14, 2023 at 10:04 PM Weston Pace <weston.p...@gmail.com> wrote:
>
> > I agree engines can use their own strategy. Requiring explicit casts is
> > probably ok as long as it is well documented but I think I lean slightly
> > towards implicitly falling back to the storage type. I do think
> > people still shy away from extension types. Adding the extension type to
> > an implicit cast registry is another hurdle to their use, albeit a small
> > one.
> >
> > Substrait has a similar consideration for extension types. They can be
> > declared "inherits" (meaning the storage type can be used implicitly in
> > compute functions) or "separate" (meaning the storage type cannot be used
> > implicitly in compute functions). This would map nicely to an Arrow
> > metadata field.
> >
> > Unfortunately, I think the truth is more nuanced than a simple
> > separate/inherits flag. Take UUID for example (everyone's favorite fixed
> > size binary extension type). We would definitely want to implicitly reuse
> > the hash, equality, and sorting functions.
> >
> > However, for other functions it gets trickier. Imagine you have a
> > `replace_slice` function. Should it return a new UUID (change some bytes
> > in a UUID and you have a new UUID) or not (once you start changing bytes in
> > a UUID you no longer have a UUID). Or what if there was a `slice`
> > function. This function should either be prohibited (you can't slice a
> > UUID) or should return a fixed size binary string (you can still slice it
> > but you no longer have a UUID).
> >
> > Given the complication I think users will always need to carefully consider
> > each use of an extension function no matter how smart a system is.
> > I'm not convinced any metadata exists that could define the right
> > approach in a consistent number of cases. This means our choice is
> > whether we force users to explicitly declare each such decision or we
> > just trust that they are doing the proper consideration when they design
> > their plan. I'm not sure there is a right answer. One can point to the
> > vast diversity of ways that programming languages have approached
> > implicit vs explicit integer casts.
> >
> > My last concern is that we rely on compute functions in operators other
> > than project/filter. For example, to use a column as a key for a hash-join
> > we need to be able to compute the hash value and calculate equality. To
> > use a column as a key for sorting we need an ordering function. These are
> > places where it might be unexpected for users to insert explicit casts. An
> > engine would need to make sure the error message in these cases was very
> > clear.
> >
> > On Wed, Dec 13, 2023 at 12:54 PM Antoine Pitrou <anto...@python.org>
> > wrote:
> >
> > >
> > > Hi,
> > >
> > > For now, I would suggest that each implementation decides on their own
> > > strategy, because we don't have a clear idea of which is better (and
> > > extension types are probably not getting a lot of use yet).
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > > On 13/12/2023 at 17:39, Benjamin Kietzman wrote:
> > > > The main problem I see with adding properties to ExtensionType is I'm
> > > > not sure where that information would reside. Allowing type authors to
> > > > declare in which ways the type is equivalent (or not) to its storage is
> > > > appealing, but it seems to need an official extension field like
> > > > ARROW:extension:semantics.
> > > > Otherwise I think each extension type's semantics would need to be
> > > > maintained within every implementation as well as in a central
> > > > reference (probably in Columnar.rst), which seems unreasonable to
> > > > expect of extension type authors. I'm also skeptical that useful
> > > > information could be packed into an ARROW:extension:semantics field;
> > > > even if the type can declare that ordering-as-with-storage is invalid I
> > > > don't think it'd be feasible to specify the correct ordering.
> > > >
> > > > If we cannot attach this information to extension types, the question
> > > > becomes which defaults are most reasonable for engines and how can the
> > > > engine most usefully be configured outside those defaults. My own
> > > > preference would be to refuse operations other than selection or
> > > > casting-to-storage, with a runtime extensible registry of allowed
> > > > implicit casts. This will allow users of the engine to configure their
> > > > extension types as they need, and the error message raised when an
> > > > implicit cast-to-storage is not allowed could include the suggestion to
> > > > register the implicit cast. For applications built against a specific
> > > > engine, this approach would allow recovering much of the advantage of
> > > > attaching properties to an ExtensionType by including registration of
> > > > implicit casts in the ExtensionType's initialization.
> > > >
> > > > On Wed, Dec 13, 2023 at 10:46 AM Benjamin Kietzman <bengil...@gmail.com>
> > > > wrote:
> > > >
> > > >> Hello all,
> > > >>
> > > >> Recently, a PR to arrow c++ [1] was opened to allow implicit casting from
> > > >> any extension type to its storage type in acero. This raises questions
> > > >> about the validity of applying operations to an extension array's storage.
> > > >> For example, some extension type authors may intend different ordering for
> > > >> arrays of their new type than would be applied to the array's storage or
> > > >> may not intend for the type to participate in arithmetic even though its
> > > >> storage could.
> > > >>
> > > >> Suggestions/observations from discussion on that PR included:
> > > >> - Extension types could provide general semantic description of storage
> > > >>   type equivalence [2], so that a flag on the extension type enables
> > > >>   ordering by storage but disables arithmetic on it
> > > >> - Compute functions or kernels could be augmented with a filter declaring
> > > >>   which extension types are supported [3].
> > > >> - Currently arrow-rs considers extension types metadata only [4], so all
> > > >>   kernels treat extension arrays equivalently to their storage.
> > > >> - Currently arrow c++ only supports explicitly casting from an extension
> > > >>   type to its storage (and from storage to ext), so any operation can be
> > > >>   performed on an extension array's storage but it requires opting in.
> > > >>
> > > >> Sincerely,
> > > >> Ben Kietzman
> > > >>
> > > >> [1] https://github.com/apache/arrow/pull/39200
> > > >> [2] https://github.com/apache/arrow/pull/39200#issuecomment-1852307954
> > > >> [3] https://github.com/apache/arrow/pull/39200#issuecomment-1852676161
> > > >> [4] https://github.com/apache/arrow/pull/39200#issuecomment-1852881651
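
For comparison with the traits approach, the runtime-extensible registry of allowed implicit casts that Ben describes above might look roughly like the sketch below. Everything in it is an assumption made for illustration: `OpKind`, `ImplicitCastRegistry`, `Allow` and `Check` are hypothetical names, not part of Acero or Arrow C++, and keying on the extension name plus an operation category is just one possible design.

```cpp
#include <set>
#include <string>
#include <utility>

// Hypothetical operation categories an engine might distinguish.
enum class OpKind { kSelection, kEquality, kOrdering, kArithmetic };

// Hypothetical registry of (extension name, operation kind) pairs for which
// an implicit cast to storage has been explicitly allowed at runtime.
class ImplicitCastRegistry {
 public:
  void Allow(const std::string& extension_name, OpKind kind) {
    allowed_.emplace(extension_name, kind);
  }

  // Returns an empty string when the operation may run on the storage.
  // Otherwise returns an error message that suggests registering the cast.
  std::string Check(const std::string& extension_name, OpKind kind) const {
    // Selection is always permitted in this sketch, matching the proposed default.
    if (kind == OpKind::kSelection ||
        allowed_.count(std::make_pair(extension_name, kind)) > 0) {
      return "";
    }
    return "implicit cast to storage is not allowed for extension type '" +
           extension_name + "'; register the implicit cast to enable it";
  }

 private:
  std::set<std::pair<std::string, OpKind>> allowed_;
};
```

An application built against a specific engine could call `Allow` once while initializing its extension type, which is how this approach recovers most of what per-type properties would offer.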