window and stricter Expr validation [datafusion-python]

via GitHub Tue, 02 Sep 2025 03:01:58 -0700


kosiew opened a new pull request, #1221:
URL: https://github.com/apache/datafusion-python/pull/1221


   ## Which issue does this PR close?
   
   * Closes #1214
   
   ## Rationale for this change
   
   Users reported inconsistent behavior when using plain string column 
identifiers vs. explicit `col()`/`column()` and `lit()`/`literal()` 
expressions. Some `DataFrame` methods accepted bare strings (e.g. `select`, 
`sort`, `drop`), while expression-returning APIs (e.g. `filter`, `with_column`) 
required `Expr` objects. This led to confusion and surprising runtime errors.
   
   This PR clarifies and documents the intended behavior by:
   
   * Accepting plain string column identifiers where passing a column *name* is 
natural (e.g. `sort`, `aggregate` grouping keys, many `order_by` parameters, 
file sort metadata).
   * Requiring explicit `Expr` objects for APIs that expect an *expression* 
(e.g. `filter`, `with_column`, `with_columns` elements, `join_on` predicates). 
When a non-`Expr` is passed to such APIs, a clear `TypeError` message guides 
the user to use `col()`/`column()` or `lit()`/`literal()`.
   * Updating documentation and tests to reflect the semantics and show 
examples.
   
   This makes usages consistent and explicit while preserving ergonomics where 
obvious (accepting column-name strings).
   
   ## What changes are included in this PR?
   
   **High-level summary**
   
   * Add robust helpers to validate and convert user-provided values:
   
     * `ensure_expr(value)` — validate and return the internal expression 
object or raise `TypeError` with a helpful message.
     * `ensure_expr_list(iterable)` — flatten and validate nested iterables of 
`Expr` objects.
     * `_to_raw_expr(value)` — convert an `Expr` or `str` column name to the 
internal raw expression.
     * Introduce `SortKey = Expr | SortExpr | str` type alias to represent 
items accepted by sort-like APIs.
   * Allow column-name strings in many APIs that logically take column 
identifiers: `DataFrame.sort`, `DataFrame.aggregate` (group-by keys), 
`Window.order_by`, various `functions.*` `order_by` parameters, and 
`SessionContext` file sort-order metadata.
   * Enforce explicit `Expr` values in expression-only APIs and provide clear 
error messages referencing `col()/column()` and `lit()/literal()` helpers.
   * Update Python doc (`docs/source/user-guide/dataframe/index.rst`) to 
explain which methods accept strings and which require `col()`/`lit()`.
   * Add tests covering string acceptance and error cases (extensive additions 
to `python/tests/test_dataframe.py` and `python/tests/test_expr.py`).
   
   **Files changed (high level)**
   
   * `python/datafusion/expr.py`
   
     * Added helpers: `ensure_expr`, `ensure_expr_list`, `_to_raw_expr`.
     * Made `expr_list_to_raw_expr_list` accept `str` in addition to `Expr` and 
convert accordingly.
     * Made `sort_list_to_raw_sort_list` accept `str` and `SortKey` and convert 
entries to raw `SortExpr`.
     * Added `SortKey` alias and `EXPR_TYPE_ERROR` constant for consistent 
error messages.
   
   * `python/datafusion/dataframe.py`
   
     * Use `ensure_expr`/`ensure_expr_list` to validate expressions in 
`filter`, `with_column(s)`, `join_on`, `aggregate`, and other places.
     * Allow column name strings for `sort`, `aggregate` group-by, and related 
APIs via conversions.
     * Added user-facing docstrings clarifying expected types for parameters 
and examples.
   
   * `python/datafusion/context.py`
   
     * Accept `file_sort_order` as nested `SortKey` sequences and added 
`_convert_file_sort_order` to create the low-level representation for the Rust 
bindings.
     * Import the internal `expr` namespace as `expr_internal` where required.
   
   * `python/datafusion/functions.py`
   
     * Updated numerous function signatures to accept `SortKey` for `order_by` 
parameters.
     * Updated docstrings and added usage examples where helpful.
   
   * `docs/source/user-guide/dataframe/index.rst`
   
     * New section: "String Columns and Expressions" documenting which methods 
accept plain strings and which require explicit `col()`/`lit()` expressions, 
with examples.
   
   * `python/tests/test_dataframe.py`, `python/tests/test_expr.py`
   
     * Added many tests exercising both permissive string-accepting behavior 
and strict expression validation.
   
   **Notable implementation details**
   
   * When a user supplies a plain string for sort/aggregate/window `order_by` 
or `group_by`, we convert the string into `Expr.column(name)` prior to calling 
the lower-level bindings.
   * For APIs where an expression is required (e.g., `filter`, `with_column`, 
`join_on`), passing a plain string now raises `TypeError` with the message: 
`Use col()/column() or lit()/literal() to construct expressions` (exposed via 
`EXPR_TYPE_ERROR`). Tests assert on that message to ensure consistent behavior.
   
   ## Are these changes tested?
   
   Yes. This PR adds and updates unit tests to cover:
   
   * Accepting column-name strings for `sort`, `aggregate` group-by, 
`window.order_by`, `array_agg(..., order_by=...)`, 
`first_value/last_value/nth_value(..., order_by=...)`, `lead/lag`, 
`row_number`/`rank`/`dense_rank`/`percent_rank`/`cume_dist`/`ntile`, and 
`SessionContext` file sort order conversions.
   * Error paths for passing non-`Expr` types to expression-only APIs 
(`filter`, `with_column`, `with_columns`, `join_on`, etc.).
   * Tests ensure that string-based usages are equivalent to their `col("...")` 
counterparts when allowed.
   
   Specifically modified/added tests include (non-exhaustive list):
   
   * `python/tests/test_dataframe.py` — many new tests: 
`test_select_unsupported`, `test_sort_string_and_expression_equivalent`, 
`test_sort_unsupported`, `test_aggregate_string_and_expression_equivalent`, 
`test_aggregate_tuple_group_by`, `test_filter_string_unsupported`, 
`test_with_column_invalid_expr`, `test_with_columns_invalid_expr`, 
`test_join_on_invalid_expr`, `test_aggregate_invalid_aggs`, 
`test_order_by_string_equivalence`, and file sort-order tests for 
`register_parquet`, `register_listing_table`, and `read_parquet`.
   * `python/tests/test_expr.py` — tests for `ensure_expr` and 
`ensure_expr_list` behavior and error messages.
   
   ## Are there any user-facing changes?
   
   Yes.
   
   **Behavioral / UX changes**
   
   * Methods that naturally take column identifiers now accept plain `str` 
values (for convenience):
   
     * `DataFrame.sort("col")`
     * `DataFrame.aggregate(group_by="col", ...)` (and group keys can be 
sequences/tuples of strings)
     * Many `order_by` parameters in window and aggregate functions now accept 
string column names.
     * `SessionContext` file sort-order metadata accepts column name strings.
   
   * Methods that inherently require expressions (computation or predicate) now 
enforce `Expr` values and raise a clear `TypeError` directing the user to 
`col()`/`column()` or `lit()`/`literal()`:
   
     * `DataFrame.filter` — must pass an `Expr` (e.g. `col("x") > lit(1)`).
     * `DataFrame.with_column`, `with_columns` — items must be `Expr` objects.
     * `DataFrame.join_on` — ON predicates must be `Expr` (equality or other 
expressions).
   
   **Documentation**
   
   * The user guide now contains a new section "String Columns and Expressions" 
explaining the above and showing usage examples. This should reduce confusion 
for users migrating from other libraries like Polars and make the library 
semantics explicit.
   
   **Compatibility / API surface**
   
   * This is backwards-compatible for most users: previously-accepted code that 
used strings where allowed will continue to work. Code that accidentally passed 
strings to expression-only APIs will now be rejected with clearer errors 
instead of producing surprising behavior.
   * A new type alias `SortKey` was introduced in the codebase to express the 
union of accepted types for sort-like parameters. This is an internal typing 
convenience and should not affect external users directly.
   
   ## Example usage
   
   ```py
   from datafusion import col, lit, functions as f
   
   # Allowed: passing a column name string to sort or aggregate grouping
   df.sort("id")
   df.aggregate("id", [f.count(col("value"))])
   
   # Required: expressions for predicate/transform
   # Bad: df.filter("age > 21")  # raises TypeError
   # Good:
   df.filter(col("age") > lit(21))
   
   # Window example: order_by accepts string
   f.first_value(col("a"), order_by="ts")
   ```
   
   ## Notes for reviewers
   
   * Focus review on the helpers in `expr.py` (`ensure_expr`, 
`ensure_expr_list`, `_to_raw_expr`) and the conversions in `dataframe.py` and 
`context.py` that call them. These are the core safety/validation changes.
   * The bulk of the diff is tests and doc changes which should be reviewed for 
correctness and clarity.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[PR] Support string column identifiers for sort/aggregate/window and stricter Expr validation [datafusion-python]

Reply via email to