kosiew opened a new pull request, #1221:
URL: https://github.com/apache/datafusion-python/pull/1221
## Which issue does this PR close?
* Closes #1214
## Rationale for this change
Users reported inconsistent behavior when using plain string column
identifiers vs. explicit `col()`/`column()` and `lit()`/`literal()`
expressions. Some `DataFrame` methods accepted bare strings (e.g. `select`,
`sort`, `drop`), while expression-returning APIs (e.g. `filter`, `with_column`)
required `Expr` objects. This led to confusion and surprising runtime errors.
This PR clarifies and documents the intended behavior by:
* Accepting plain string column identifiers where passing a column *name* is
natural (e.g. `sort`, `aggregate` grouping keys, many `order_by` parameters,
file sort metadata).
* Requiring explicit `Expr` objects for APIs that expect an *expression*
(e.g. `filter`, `with_column`, `with_columns` elements, `join_on` predicates).
When a non-`Expr` is passed to such APIs, a clear `TypeError` message guides
the user to use `col()`/`column()` or `lit()`/`literal()`.
* Updating documentation and tests to reflect the semantics and show
examples.
This makes usages consistent and explicit while preserving ergonomics where
obvious (accepting column-name strings).
## What changes are included in this PR?
**High-level summary**
* Add robust helpers to validate and convert user-provided values:
* `ensure_expr(value)` — validate and return the internal expression
object or raise `TypeError` with a helpful message.
* `ensure_expr_list(iterable)` — flatten and validate nested iterables of
`Expr` objects.
* `_to_raw_expr(value)` — convert an `Expr` or `str` column name to the
internal raw expression.
* Introduce `SortKey = Expr | SortExpr | str` type alias to represent
items accepted by sort-like APIs.
* Allow column-name strings in many APIs that logically take column
identifiers: `DataFrame.sort`, `DataFrame.aggregate` (group-by keys),
`Window.order_by`, various `functions.*` `order_by` parameters, and
`SessionContext` file sort-order metadata.
* Enforce explicit `Expr` values in expression-only APIs and provide clear
error messages referencing `col()/column()` and `lit()/literal()` helpers.
* Update Python doc (`docs/source/user-guide/dataframe/index.rst`) to
explain which methods accept strings and which require `col()`/`lit()`.
* Add tests covering string acceptance and error cases (extensive additions
to `python/tests/test_dataframe.py` and `python/tests/test_expr.py`).
**Files changed (high level)**
* `python/datafusion/expr.py`
* Added helpers: `ensure_expr`, `ensure_expr_list`, `_to_raw_expr`.
* Made `expr_list_to_raw_expr_list` accept `str` in addition to `Expr` and
convert accordingly.
* Made `sort_list_to_raw_sort_list` accept `str` and `SortKey` and convert
entries to raw `SortExpr`.
* Added `SortKey` alias and `EXPR_TYPE_ERROR` constant for consistent
error messages.
* `python/datafusion/dataframe.py`
* Use `ensure_expr`/`ensure_expr_list` to validate expressions in
`filter`, `with_column(s)`, `join_on`, `aggregate`, and other places.
* Allow column name strings for `sort`, `aggregate` group-by, and related
APIs via conversions.
* Added user-facing docstrings clarifying expected types for parameters
and examples.
* `python/datafusion/context.py`
* Accept `file_sort_order` as nested `SortKey` sequences and added
`_convert_file_sort_order` to create the low-level representation for the Rust
bindings.
* Import the internal `expr` namespace as `expr_internal` where required.
* `python/datafusion/functions.py`
* Updated numerous function signatures to accept `SortKey` for `order_by`
parameters.
* Updated docstrings and added usage examples where helpful.
* `docs/source/user-guide/dataframe/index.rst`
* New section: "String Columns and Expressions" documenting which methods
accept plain strings and which require explicit `col()`/`lit()` expressions,
with examples.
* `python/tests/test_dataframe.py`, `python/tests/test_expr.py`
* Added many tests exercising both permissive string-accepting behavior
and strict expression validation.
**Notable implementation details**
* When a user supplies a plain string for sort/aggregate/window `order_by`
or `group_by`, we convert the string into `Expr.column(name)` prior to calling
the lower-level bindings.
* For APIs where an expression is required (e.g., `filter`, `with_column`,
`join_on`), passing a plain string now raises `TypeError` with the message:
`Use col()/column() or lit()/literal() to construct expressions` (exposed via
`EXPR_TYPE_ERROR`). Tests assert on that message to ensure consistent behavior.
## Are these changes tested?
Yes. This PR adds and updates unit tests to cover:
* Accepting column-name strings for `sort`, `aggregate` group-by,
`window.order_by`, `array_agg(..., order_by=...)`,
`first_value/last_value/nth_value(..., order_by=...)`, `lead/lag`,
`row_number`/`rank`/`dense_rank`/`percent_rank`/`cume_dist`/`ntile`, and
`SessionContext` file sort order conversions.
* Error paths for passing non-`Expr` types to expression-only APIs
(`filter`, `with_column`, `with_columns`, `join_on`, etc.).
* Tests ensure that string-based usages are equivalent to their `col("...")`
counterparts when allowed.
Specifically modified/added tests include (non-exhaustive list):
* `python/tests/test_dataframe.py` — many new tests:
`test_select_unsupported`, `test_sort_string_and_expression_equivalent`,
`test_sort_unsupported`, `test_aggregate_string_and_expression_equivalent`,
`test_aggregate_tuple_group_by`, `test_filter_string_unsupported`,
`test_with_column_invalid_expr`, `test_with_columns_invalid_expr`,
`test_join_on_invalid_expr`, `test_aggregate_invalid_aggs`,
`test_order_by_string_equivalence`, and file sort-order tests for
`register_parquet`, `register_listing_table`, and `read_parquet`.
* `python/tests/test_expr.py` — tests for `ensure_expr` and
`ensure_expr_list` behavior and error messages.
## Are there any user-facing changes?
Yes.
**Behavioral / UX changes**
* Methods that naturally take column identifiers now accept plain `str`
values (for convenience):
* `DataFrame.sort("col")`
* `DataFrame.aggregate(group_by="col", ...)` (and group keys can be
sequences/tuples of strings)
* Many `order_by` parameters in window and aggregate functions now accept
string column names.
* `SessionContext` file sort-order metadata accepts column name strings.
* Methods that inherently require expressions (computation or predicate) now
enforce `Expr` values and raise a clear `TypeError` directing the user to
`col()`/`column()` or `lit()`/`literal()`:
* `DataFrame.filter` — must pass an `Expr` (e.g. `col("x") > lit(1)`).
* `DataFrame.with_column`, `with_columns` — items must be `Expr` objects.
* `DataFrame.join_on` — ON predicates must be `Expr` (equality or other
expressions).
**Documentation**
* The user guide now contains a new section "String Columns and Expressions"
explaining the above and showing usage examples. This should reduce confusion
for users migrating from other libraries like Polars and make the library
semantics explicit.
**Compatibility / API surface**
* This is backwards-compatible for most users: previously-accepted code that
used strings where allowed will continue to work. Code that accidentally passed
strings to expression-only APIs will now be rejected with clearer errors
instead of producing surprising behavior.
* A new type alias `SortKey` was introduced in the codebase to express the
union of accepted types for sort-like parameters. This is an internal typing
convenience and should not affect external users directly.
## Example usage
```py
from datafusion import col, lit, functions as f
# Allowed: passing a column name string to sort or aggregate grouping
df.sort("id")
df.aggregate("id", [f.count(col("value"))])
# Required: expressions for predicate/transform
# Bad: df.filter("age > 21") # raises TypeError
# Good:
df.filter(col("age") > lit(21))
# Window example: order_by accepts string
f.first_value(col("a"), order_by="ts")
```
## Notes for reviewers
* Focus review on the helpers in `expr.py` (`ensure_expr`,
`ensure_expr_list`, `_to_raw_expr`) and the conversions in `dataframe.py` and
`context.py` that call them. These are the core safety/validation changes.
* The bulk of the diff is tests and doc changes which should be reviewed for
correctness and clarity.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]