Hi,

For now, I would suggest that each implementation decide on its own strategy, because we don't have a clear idea of which is better (and extension types are probably not getting a lot of use yet).

Regards

Antoine.


On 13/12/2023 at 17:39, Benjamin Kietzman wrote:
The main problem I see with adding properties to ExtensionType is I'm not
sure where that information would reside. Allowing type authors to declare
in which ways the type is equivalent (or not) to its storage is appealing,
but it seems to need an official extension field like
ARROW:extension:semantics. Otherwise I think each extension type's
semantics would need to be maintained within every implementation as well
as in a central reference (probably in Columnar.rst), which seems
unreasonable to expect of extension type authors. I'm also skeptical that
useful information could be packed into an ARROW:extension:semantics field;
even if the type can declare that ordering-as-with-storage is invalid I
don't think it'd be feasible to specify the correct ordering.

If we cannot attach this information to extension types, the question
becomes which defaults are most reasonable for engines and how can the
engine most usefully be configured outside those defaults. My own
preference would be to refuse operations other than selection or
casting-to-storage, with a runtime-extensible registry of allowed implicit
casts. This will allow users of the engine to configure their extension
types as they need, and the error message raised when an implicit
cast-to-storage is not allowed could include the suggestion to register the
implicit cast. For applications built against a specific engine, this
approach would allow recovering much of the advantage of attaching
properties to an ExtensionType by including registration of implicit casts
in the ExtensionType's initialization.
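
A minimal sketch of such a registry, in plain Python rather than any
real Arrow API. Every name here (ImplicitCastRegistry, the operation
strings, the "example.uuid" extension name) is hypothetical and only
illustrates the idea: selection and casting-to-storage are always
permitted, anything else must be registered first, and the error message
suggests the registration call.

```python
# Hypothetical sketch: not part of Arrow C++, pyarrow, or Acero.

# Operations assumed to be always permitted on extension arrays.
ALWAYS_ALLOWED = frozenset({"select", "cast_to_storage"})


class ImplicitCastNotAllowed(TypeError):
    """Raised when an operation needs an unregistered implicit cast."""


class ImplicitCastRegistry:
    """Runtime-extensible set of (extension name, operation) pairs."""

    def __init__(self):
        self._allowed = set()

    def register(self, extension_name, operation):
        # Allow `operation` to run on the storage of `extension_name`.
        self._allowed.add((extension_name, operation))

    def is_allowed(self, extension_name, operation):
        return (
            operation in ALWAYS_ALLOWED
            or (extension_name, operation) in self._allowed
        )

    def check(self, extension_name, operation):
        # Refuse by default and tell the user how to opt in.
        if not self.is_allowed(extension_name, operation):
            raise ImplicitCastNotAllowed(
                f"'{operation}' on extension type '{extension_name}' needs "
                f"an implicit cast to storage, which is not registered; "
                f"consider register({extension_name!r}, {operation!r})"
            )


registry = ImplicitCastRegistry()
# An extension type's initialization could register its own casts:
registry.register("example.uuid", "sort")
```

With this sketch, registry.check("example.uuid", "sort") passes, while
registry.check("example.uuid", "add") raises ImplicitCastNotAllowed until
that pair is registered.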

On Wed, Dec 13, 2023 at 10:46 AM Benjamin Kietzman <bengil...@gmail.com>
wrote:

Hello all,

Recently, a PR to Arrow C++ [1] was opened to allow implicit casting from
any extension type to its storage type in Acero. This raises questions
about the validity of applying operations to an extension array's storage.
For example, some extension type authors may intend different ordering for
arrays of their new type than would be applied to the array's storage or
may not intend for the type to participate in arithmetic even though its
storage could.
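
As a hypothetical illustration of the ordering concern (plain Python, no
Arrow API involved): imagine an extension type for version numbers whose
storage is a string. Sorting the storage orders the values
lexicographically, which is probably not what the type's author intends.

```python
# Hypothetical example: version numbers stored as strings.
versions = ["1.9", "1.10", "1.2"]

# Ordering applied to the storage type (string comparison).
storage_order = sorted(versions)


def version_key(value):
    # Ordering the type author presumably intends: compare the
    # dot-separated components numerically.
    return tuple(int(part) for part in value.split("."))


intended_order = sorted(versions, key=version_key)

print(storage_order)   # ['1.10', '1.2', '1.9']
print(intended_order)  # ['1.2', '1.9', '1.10']
```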

Suggestions/observations from discussion on that PR included:
- Extension types could provide a general semantic description of storage
type equivalence [2]; for example, a flag on the extension type could enable
ordering by storage but disable arithmetic on it.
- Compute functions or kernels could be augmented with a filter declaring
which extension types are supported [3].
- Currently arrow-rs considers extension types metadata only [4], so all
kernels treat extension arrays equivalently to their storage.
- Currently Arrow C++ only supports explicitly casting from an extension
type to its storage (and from storage to ext), so any operation can be
performed on an extension array's storage but it requires opting in.

Sincerely,
Ben Kietzman

[1] https://github.com/apache/arrow/pull/39200
[2] https://github.com/apache/arrow/pull/39200#issuecomment-1852307954
[3] https://github.com/apache/arrow/pull/39200#issuecomment-1852676161
[4] https://github.com/apache/arrow/pull/39200#issuecomment-1852881651

