timsaucer opened a new pull request, #1554: URL: https://github.com/apache/datafusion-python/pull/1554
# Which issue does this PR close?

No single issue. This is wave 1 of follow-up work after the DataFusion 54 upgrade (#1532). Each commit is self-contained and can be reviewed independently.

# Rationale for this change

DataFusion 54 introduced or deprecated several pieces of upstream API surface that the Python bindings had not yet caught up with. This PR closes the highest-value gaps (UDF lookup, `read_batches`, variadic `get_field`), tightens typing on the codec setters added in #1541, migrates the FFI example off the deprecated `TableFunctionImpl::call`, and cleans up two long-standing pytest annoyances (an `xfail` that no longer needs to fail, and a deprecation warning that was leaking through `pytest.raises`).

# What changes are included in this PR?

- `refactor: migrate FFI example table function to call_with_args`: `PyTableFunction` already moved to `call_with_args` in 5a64b0d; this brings the FFI example along so it no longer relies on the deprecated entry point.
- `feat: type SessionContext codec setters with exportable Protocols`: adds `LogicalExtensionCodecExportable` / `PhysicalExtensionCodecExportable` Protocols and tightens `with_logical_extension_codec` / `with_physical_extension_codec` signatures from `codec: Any` to `Protocol | _PyCapsule`. Pure typing change; no runtime behavior diff.
- `feat: accept variadic field path in get_field`: collapses `get_field(expr, name)` and `get_field_path(expr, [names...])` into a single variadic `get_field(expr, *names)` that dispatches through one Rust binding.
- `feat: SessionContext.read_batches / read_batch`: wraps upstream `SessionContext::read_batches` to materialize a DataFrame directly from a sequence of `RecordBatch`es without registering a named table. The single-batch `read_batch` is implemented in pure Python on top of `read_batches([batch])`.
- `feat: SessionContext UDF lookup helpers`: exposes `udf(name)` / `udaf(name)` / `udwf(name)` lookups symmetric with the existing register helpers, plus `udfs()` / `udafs()` / `udwfs()` enumerators that return sorted `Vec<String>` instead of the raw upstream `HashSet`.
- `chore: bump pre-commit so it stops failing CI checks`.
- `test: drop xfail on timestamp[s] parquet roundtrip`: pyarrow.parquet promotes `timestamp[s]` to `timestamp[ms]` on write ([apache/arrow#41382](https://github.com/apache/arrow/issues/41382)); cast the expected array so the test asserts DataFusion reads what Arrow actually stored, instead of relying on `xfail`.
- `test: capture deprecation warning in repr_rows conflict case`: `DataFrameHtmlFormatter(repr_rows=..., max_rows=...)` fires the deprecation warning before raising `ValueError`, but `pytest.raises` does not catch warnings. Wrap the call in both `pytest.raises` and `pytest.warns` so the warning is asserted, not leaked into every pytest run.

# Are there any user-facing changes?

Yes, several new public APIs:

- `SessionContext.read_batches(batches)` / `SessionContext.read_batch(batch)`: materialize a DataFrame directly from `RecordBatch`es.
- `SessionContext.udf(name)` / `udaf(name)` / `udwf(name)` lookup helpers, and `udfs()` / `udafs()` / `udwfs()` enumerators.
- `get_field(expr, *names)` now accepts a variadic field path (single-name calls are unchanged).
- `with_logical_extension_codec` / `with_physical_extension_codec` setters are now typed as `Protocol | _PyCapsule` instead of `Any`; runtime behavior is unchanged.

No breaking changes to existing public APIs.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
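For reviewers unfamiliar with the `repr_rows` test change described in the PR body, here is a minimal sketch of the pattern, not the actual test from this PR. `HtmlFormatter` is a hypothetical stand-in for `DataFrameHtmlFormatter`, and its warning and error messages are invented for illustration. Only the `pytest.warns` / `pytest.raises` nesting is the point.

```python
import warnings

import pytest


class HtmlFormatter:
    """Hypothetical stand-in for DataFrameHtmlFormatter (not the real class)."""

    def __init__(self, repr_rows=None, max_rows=None):
        if repr_rows is not None:
            # Assumed behavior: the deprecated spelling warns first ...
            warnings.warn(
                "repr_rows is deprecated; use max_rows",
                DeprecationWarning,
                stacklevel=2,
            )
        if repr_rows is not None and max_rows is not None:
            # ... and the conflicting combination is rejected afterwards.
            raise ValueError("pass only one of repr_rows and max_rows")
        self.max_rows = max_rows if max_rows is not None else repr_rows


def test_repr_rows_conflict():
    # pytest.raises alone would let the DeprecationWarning escape into the
    # test run's warning summary. Nesting it inside pytest.warns records the
    # warning and asserts that it was emitted.
    with pytest.warns(DeprecationWarning, match="repr_rows"):
        with pytest.raises(ValueError, match="only one"):
            HtmlFormatter(repr_rows=5, max_rows=10)
```

Keeping `pytest.warns` as the outer context means the `ValueError` is already handled by the inner `pytest.raises` before the warning check runs.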
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
