tobixdev opened a new pull request, #15106:
URL: https://github.com/apache/datafusion/pull/15106
Related to: #14828, #12644, #14247
This PR is a *rough* proposal for an implementation of user-defined sorting.
The main goal is to start a discussion on whether this is a direction we
want to go in.
Some design considerations that help understand the PR:
- `PhysicalSortExpr` allows defining a custom sort order, but does not know
anything about extension types.
- "Resolving" logical types happens during the creation of the initial
physical plan.
- `LogicalTypePlanningInformation` allows parameterizing this creation
process
- Obtaining the logical type is currently done by checking whether the
`Field` metadata contains an entry with the key
`EXTENSION_TYPE_NAME_KEY`.
- An extension type registry is required for making use of this
information.
- Currently, this works only for direct access to columns. See #14247
for some discussions on this issue.
- In the future, maybe we can move `DFSchema` towards including a
`LogicalType` and a `DataType` (at least as a first step). The rest of the
approach would still work under that assumption.
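To make the lookup described above concrete, here is a minimal std-only sketch. The `Field`, `PlanningInfo`, and `ExtensionTypeRegistry` types below are simplified, hypothetical stand-ins and not the types from this PR or from arrow-rs; only the metadata key string is meant to match arrow's `EXTENSION_TYPE_NAME_KEY`.

```rust
use std::collections::HashMap;

// Intended to match the value of arrow-schema's EXTENSION_TYPE_NAME_KEY.
pub const EXTENSION_TYPE_NAME_KEY: &str = "ARROW:extension:name";

/// Hypothetical stand-in for arrow's `Field`; only name and metadata are modeled.
#[derive(Debug, Clone)]
pub struct Field {
    pub name: String,
    pub metadata: HashMap<String, String>,
}

/// Hypothetical planning information attached to a logical type.
#[derive(Debug, Clone, PartialEq)]
pub struct PlanningInfo {
    pub ordering_name: String,
}

/// Hypothetical registry mapping extension type names to planning information.
#[derive(Default)]
pub struct ExtensionTypeRegistry {
    types: HashMap<String, PlanningInfo>,
}

impl ExtensionTypeRegistry {
    pub fn register(&mut self, extension_name: &str, info: PlanningInfo) {
        self.types.insert(extension_name.to_string(), info);
    }

    /// Returns planning info only if the field carries a registered extension name.
    pub fn planning_info_for(&self, field: &Field) -> Option<&PlanningInfo> {
        let name = field.metadata.get(EXTENSION_TYPE_NAME_KEY)?;
        self.types.get(name)
    }
}
```

A field without the metadata entry, or with an unregistered name, yields `None`, in which case the planner would presumably fall back to the natural ordering.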
This is similar to "Option 1: User defined operators" from the discussion in
#14247. Here are some thoughts on that:
- From a logical standpoint, the expressions themselves do not change. It's
still a `SortExpr` just on a special type.
- Rewriting all `Expr` to UDFs via analyzers can also be tricky as we would
then, for example, need to parameterize `SortExpr` with a custom ordering to
allow such a rewriting.
- This makes the physical plan creation more involved.
- I think this is a great avenue to making it easy to work with custom types
in DataFusion. Currently, AFAIK, if you want to do custom sorting or use
`AggregateExec` (which assumes your data has a natural order) for your
extension types you basically have to reimplement a custom physical node or
resort to work-arounds that cause problems (e.g., projecting out columns that
are sortable).
There are many changes in tests etc. so here is a list of "highlights" you
should check out when taking a look at this PR.
Highlights:
- `datafusion/common/src/sort.rs`: new sort data structures and adapted
procedures from arrow-rs.
- `datafusion/common/src/types/logical.rs`: Extension to logical type to
provide a `LogicalTypePlanningInformation`.
- `datafusion/core/tests/dataframe/test_types.rs`: A logical type
(`IntOrFloatType`) that implements a custom order.
- `datafusion/core/tests/dataframe/mod.rs`: A test for sorting on a union
type.
- `datafusion/core/src/datasource/mod.rs`: Applying custom order during
physical plan creation.
- `datafusion/core/src/physical_planner.rs`: Applying custom order during
physical plan creation.
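As an illustration of what a custom order on a union-like type might look like, here is a small std-only sketch. It assumes the ordering compares by numeric value; the actual `IntOrFloatType` in the test file may define its order differently, and the `IntOrFloat` enum and function names here are hypothetical.

```rust
use std::cmp::Ordering;

/// Hypothetical value of a union type holding either an integer or a float.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum IntOrFloat {
    Int(i64),
    Float(f64),
}

impl IntOrFloat {
    /// Converts to f64 for comparison; very large integers lose precision here.
    fn as_f64(self) -> f64 {
        match self {
            IntOrFloat::Int(i) => i as f64,
            IntOrFloat::Float(f) => f,
        }
    }
}

/// Assumed custom order: compare by numeric value regardless of variant.
/// `total_cmp` gives a total order over floats (including NaN).
pub fn compare_int_or_float(a: &IntOrFloat, b: &IntOrFloat) -> Ordering {
    a.as_f64().total_cmp(&b.as_f64())
}

/// Sorts values with the custom comparator instead of a natural order.
pub fn sort_values(mut values: Vec<IntOrFloat>) -> Vec<IntOrFloat> {
    values.sort_by(compare_int_or_float);
    values
}
```

The point is that the type has no single natural order across its variants, so the comparator has to be supplied from outside.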
To get this (or a similar) approach upstream, I'd suggest the following steps
(splitting up the PR, adding tests etc.):
1. Add `SortOrdering` to `PhysicalSortExpr` and add support for user-defined
sorting operations (`DynComparator` for now)
2. Add `ExtensionTypeRegistry` to support registering extension types.
3.1 Use said registry to look up whether we have planning information for
`SortExpr` during planning and construct the correct `PhysicalSortExpr`.
3.X Adapt existing code that relies on the fact that types have a natural
ordering (e.g., `AggregateExec`). This step will be done iteratively.
Note that this procedure leads to a state where some parts of DF already
make use of user-defined sorting, while others do not yet support it. However, I
don't think this is a deal breaker.
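Step 1 could look roughly like the following std-only sketch. It is only an illustration: `SortOrdering` and `PhysicalSortExprSketch` are simplified, hypothetical shapes, and the `DynComparator` alias is loosely modeled on the index-based comparator of the same name in arrow-rs rather than copied from it.

```rust
use std::cmp::Ordering;
use std::sync::Arc;

/// Hypothetical comparator over row indices of a column.
pub type DynComparator = Arc<dyn Fn(usize, usize) -> Ordering + Send + Sync>;

/// Hypothetical ordering attached to a sort expression.
#[derive(Clone)]
pub enum SortOrdering {
    /// Use the natural order of the underlying data type.
    Natural,
    /// Use a user-defined comparator.
    Custom(DynComparator),
}

/// Simplified stand-in for `PhysicalSortExpr` over a single i64 column.
pub struct PhysicalSortExprSketch {
    pub descending: bool,
    pub ordering: SortOrdering,
}

/// Returns the row indices of `values` in sorted order.
pub fn sort_to_indices(values: &[i64], expr: &PhysicalSortExprSketch) -> Vec<usize> {
    let mut indices: Vec<usize> = (0..values.len()).collect();
    indices.sort_by(|&i, &j| {
        let ord = match &expr.ordering {
            SortOrdering::Natural => values[i].cmp(&values[j]),
            SortOrdering::Custom(cmp) => cmp(i, j),
        };
        if expr.descending { ord.reverse() } else { ord }
    });
    indices
}

/// Example custom ordering: compare rows by absolute value.
pub fn abs_ordering(values: Vec<i64>) -> SortOrdering {
    SortOrdering::Custom(Arc::new(move |i, j| values[i].abs().cmp(&values[j].abs())))
}
```

Code paths that only understand `SortOrdering::Natural` would keep working unchanged, which is what allows the remaining operators to be adapted one at a time.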
I'm eager to hear your thoughts!
# Further Notes
I think this would also require users to return a `Field` from a UDF instead
of just a data type (#14247) to allow UDFs to return extension types. See TODO
note in `datafusion/core/src/physical_planner.rs`.
I think the row-based sorting from arrow-rs can also be supported in a
similar fashion by extending `CustomOrdering`.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]