I also like these equivalence traits. In addition to being easy for
extension type authors to specify when registering an extension type
in Arrow C++, they would be easy to specify from implementations that
allow registration, like pyarrow and arrow/R. Implementing methods,
compute functions, or overloads to handle this (e.g., as vctrs does
with vec_proxy_equal, which often just returns its input) would
instead have performance implications, since those methods might have
to be defined in R or Python.
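
To make the idea concrete, here is a rough, self-contained sketch of how
a function could check the traits from the draft quoted below. The trait
names and default are copied from that draft; ExtensionTypeSketch and
CanUseStorage are hypothetical names made up for illustration and are
not part of Arrow C++.

```cpp
#include <cstdint>
#include <limits>

// Trait bits copied from the draft quoted later in this thread.
namespace traits {
constexpr std::uint32_t kEquality = 1;
constexpr std::uint32_t kOrdering = 2;
constexpr std::uint32_t kSelection = 4;
constexpr std::uint32_t kArithmetic = 8;
constexpr std::uint32_t kCasting = 16;
constexpr std::uint32_t kAny = std::numeric_limits<std::uint32_t>::max();
}  // namespace traits

// Hypothetical stand-in for an extension type (not the Arrow C++ class).
// The default mirrors the draft: equality/hashing and selection only.
struct ExtensionTypeSketch {
  std::uint32_t storage_equivalence = traits::kEquality | traits::kSelection;
};

// Hypothetical helper: a function that requires `required` traits may
// operate on the storage only if the type declares every one of them.
inline bool CanUseStorage(const ExtensionTypeSketch& type,
                          std::uint32_t required) {
  return (type.storage_equivalence & required) == required;
}
```

With the default, a take or unique would pass the check, while a sort
would be refused until the type author opts into kOrdering.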

It may also be worth adding a compute function for "force storage" (a
no-op for anything except an extension array), which is maybe safer
than a cast (which implies, I think, some logical equivalence between
the input and the result). That would let a user work around a
situation where the extension type author didn't handle a case that
the user expected to work.
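
As a toy illustration of the semantics I have in mind (this is not Arrow
C++ code; ArraySketch and ForceStorage are hypothetical names, and a
real version would presumably be a registered compute function):

```cpp
#include <memory>
#include <string>

// Toy model of an array (not the Arrow C++ classes).
struct ArraySketch {
  std::string type_name;  // e.g. "uuid" or "fixed_size_binary(16)"
  // Non-null only when this array is an extension array.
  std::shared_ptr<const ArraySketch> storage;
};

// Hypothetical "force storage": unwraps an extension array to its
// storage and returns any other array unchanged (a no-op).
inline std::shared_ptr<const ArraySketch> ForceStorage(
    const std::shared_ptr<const ArraySketch>& array) {
  if (array->storage != nullptr) {
    return array->storage;
  }
  return array;
}
```

Unlike a cast, this takes no target type, so it makes no claim that the
result is logically equivalent to the input.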

Cheers!

-dewey

On Fri, Dec 15, 2023 at 3:13 AM Jin Shang <shangjin1...@gmail.com> wrote:
>
> I'm in favor of Antoine's proposal of storage equivalence traits[1]. For
> the sake of clarity I'll paste it here:
>
> > I would suggest we perhaps need a more general semantic description of
> > storage type equivalence.
> > Draft:
> > class ExtensionType {
> > public:
> > // Storage equivalence for equality testing and hashing
> > static constexpr uint32_t kEquality = 1;
> > // Storage equivalence for ordered comparisons
> > static constexpr uint32_t kOrdering = 2;
> > // Storage equivalence for selections (filter, take, etc.)
> > static constexpr uint32_t kSelection = 4;
> > // Storage equivalence for arithmetic
> > static constexpr uint32_t kArithmetic = 8;
> > // Storage equivalence for explicit casts
> > static constexpr uint32_t kCasting = 16;
> > // Storage equivalence for all operations
> > static constexpr uint32_t kAny = std::numeric_limits<uint32_t>::max();
> > // By default, an extension type can be implicitly handled as its storage
> > // type for selections, equality testing and hashing.
> > virtual uint32_t storage_equivalence() const {
> >   return kEquality | kSelection;
> > }
> > };
> >
>
> I think this is well balanced between convenience and safety. The default
> option ensures the "normal" operations like take, group-by, unique... just
> work, and extension type authors can easily opt into additional functions.
>
> It also requires minimal engineering effort. Each function only needs to
> specify what traits it requires, rather than the actual types.
>
> BTW I've checked every C++ compute function and I think the only traits
> missing here are one for string operations, and one for generation such as
> `random`.
>
> [1]  https://github.com/apache/arrow/pull/39200#issuecomment-1852307954
>
> Best,
> Jin
>
> On Thu, Dec 14, 2023 at 10:04 PM Weston Pace <weston.p...@gmail.com> wrote:
>
> > I agree engines can use their own strategy.  Requiring explicit casts is
> > probably ok as long as it is well documented but I think I lean slightly
> > towards implicitly falling back to the storage type.  I do think
> > people still shy away from extension types.  Adding the extension type to
> > an implicit cast registry is another hurdle to their use, albeit a small
> > one.
> >
> > Substrait has a similar consideration for extension types.  They can be
> > declared "inherits" (meaning the storage type can be used implicitly in
> > compute functions) or "separate" (meaning the storage type cannot be used
> > implicitly in compute functions).  This would map nicely to an Arrow
> > metadata field.
> >
> > Unfortunately, I think the truth is more nuanced than a simple
> > separate/inherits flag.  Take UUID for example (everyone's favorite fixed
> > size binary extension type).  We would definitely want to implicitly reuse
> > the hash, equality, and sorting functions.
> >
> > However, for other functions it gets trickier.  Imagine you have a
> > `replace_slice` function.  Should it return a new UUID (change some bytes
> > in a UUID and you have a new UUID) or not (once you start changing bytes in
> > a UUID you no longer have a UUID).  Or what if there was a `slice`
> > function.  This function should either be prohibited (you can't slice a
> > UUID) or should return a fixed size binary string (you can still slice it
> > but you no longer have a UUID).
> >
> > Given the complication I think users will always need to carefully consider
> > each use of an extension function no matter how smart a system is.  I'm not
> > convinced any metadata exists that could define the right approach in a
> > consistent number of cases.  This means our choice is whether we force
> > users to explicitly declare each such decision or we just trust that they
> > are doing the proper consideration when they design their plan.  I'm not
> > sure there is a right answer.  One can point to the vast diversity of ways
> > that programming languages have approached implicit vs explicit integer
> > casts.
> >
> > My last concern is that we rely on compute functions in operators other
> > than project/filter.  For example, to use a column as a key for a hash-join
> > we need to be able to compute the hash value and calculate equality.  To
> > use a column as a key for sorting we need an ordering function.  These are
> > places where it might be unexpected for users to insert explicit casts.  An
> > engine would need to make sure the error message in these cases was very
> > clear.
> >
> > On Wed, Dec 13, 2023 at 12:54 PM Antoine Pitrou <anto...@python.org> wrote:
> >
> > >
> > > Hi,
> > >
> > > For now, I would suggest that each implementation decides on their own
> > > strategy, because we don't have a clear idea of which is better (and
> > > extension types are probably not getting a lot of use yet).
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > > On 13/12/2023 at 17:39, Benjamin Kietzman wrote:
> > > > The main problem I see with adding properties to ExtensionType is I'm
> > > > not sure where that information would reside. Allowing type authors to
> > > > declare in which ways the type is equivalent (or not) to its storage is
> > > > appealing, but it seems to need an official extension field like
> > > > ARROW:extension:semantics. Otherwise I think each extension type's
> > > > semantics would need to be maintained within every implementation as
> > > > well as in a central reference (probably in Columnar.rst), which seems
> > > > unreasonable to expect of extension type authors. I'm also skeptical
> > > > that useful information could be packed into an
> > > > ARROW:extension:semantics field; even if the type can declare that
> > > > ordering-as-with-storage is invalid I don't think it'd be feasible to
> > > > specify the correct ordering.
> > > >
> > > > If we cannot attach this information to extension types, the question
> > > > becomes which defaults are most reasonable for engines and how can the
> > > > engine most usefully be configured outside those defaults. My own
> > > > preference would be to refuse operations other than selection or
> > > > casting-to-storage, with a runtime extensible registry of allowed
> > > > implicit casts. This will allow users of the engine to configure their
> > > > extension types as they need, and the error message raised when an
> > > > implicit cast-to-storage is not allowed could include the suggestion to
> > > > register the implicit cast. For applications built against a specific
> > > > engine, this approach would allow recovering much of the advantage of
> > > > attaching properties to an ExtensionType by including registration of
> > > > implicit casts in the ExtensionType's initialization.
> > > >
> > > > On Wed, Dec 13, 2023 at 10:46 AM Benjamin Kietzman <bengil...@gmail.com>
> > > > wrote:
> > > >
> > > >> Hello all,
> > > >>
> > > >> Recently, a PR to arrow c++ [1] was opened to allow implicit casting
> > > >> from any extension type to its storage type in acero. This raises
> > > >> questions about the validity of applying operations to an extension
> > > >> array's storage. For example, some extension type authors may intend
> > > >> different ordering for arrays of their new type than would be applied
> > > >> to the array's storage or may not intend for the type to participate
> > > >> in arithmetic even though its storage could.
> > > >>
> > > >> Suggestions/observations from discussion on that PR included:
> > > >> - Extension types could provide general semantic description of
> > > >>   storage type equivalence [2], so that a flag on the extension type
> > > >>   enables ordering by storage but disables arithmetic on it
> > > >> - Compute functions or kernels could be augmented with a filter
> > > >>   declaring which extension types are supported [3].
> > > >> - Currently arrow-rs considers extension types metadata only [4], so
> > > >>   all kernels treat extension arrays equivalently to their storage.
> > > >> - Currently arrow c++ only supports explicitly casting from an
> > > >>   extension type to its storage (and from storage to ext), so any
> > > >>   operation can be performed on an extension array's storage but it
> > > >>   requires opting in.
> > > >>
> > > >> Sincerely,
> > > >> Ben Kietzman
> > > >>
> > > >> [1] https://github.com/apache/arrow/pull/39200
> > > >> [2] https://github.com/apache/arrow/pull/39200#issuecomment-1852307954
> > > >> [3] https://github.com/apache/arrow/pull/39200#issuecomment-1852676161
> > > >> [4] https://github.com/apache/arrow/pull/39200#issuecomment-1852881651
> > > >>
> > > >
> > >
> >
