kosiew opened a new pull request, #1243:
URL: https://github.com/apache/datafusion-python/pull/1243
## Which issue does this PR close?
* Closes #1239.
---
## Rationale for this change
This change unifies how table-like objects are represented and registered in
DataFusion's Python bindings. Historically there were multiple ad-hoc ways to
register tables (direct `Table` objects, FFI pycapsules exposed by Rust
providers, `DataFrame` views, and the `register_table_provider` API). That
fragmentation made the code harder to maintain, made FFI integration awkward,
and caused subtle API surface inconsistencies.
This patch introduces a single, high-level `TableProvider` wrapper in Python
(backed by a `PyTableProvider` Rust type) and centralizes the logic that
coerces various supported inputs into a concrete provider. It also:
* Makes `SessionContext.register_table(...)` the single, preferred
entrypoint for table registration.
* Deprecates `SessionContext.register_table_provider(...)` in favor of
`register_table` while preserving backward compatibility (it forwards to
`register_table` and emits a `DeprecationWarning`).
* Adds utilities to normalize/coerce supported inputs (native `Table`, the
new `TableProvider` wrapper, PyCapsule-based foreign providers, and PyArrow
datasets) into the expected Rust `TableProvider` implementation.
Overall this reduces duplication, clarifies documentation and examples, and
provides a clearer path for FFI authors to expose table providers to Python.
---
## What changes are included in this PR?
**High-level summary**
* New Python public API: `datafusion.TableProvider` wrapper
(python/datafusion/table_provider.py)
* New Rust `PyTableProvider` type and module (src/table.rs) exposing
from-capsule and from-dataframe helpers and `__datafusion_table_provider__`
* Centralized coercion helpers on the Rust side: `coerce_table_provider` and
`table_provider_from_pycapsule` (src/utils.rs)
* New Python helper utilities: `datafusion.utils._normalize_table_provider`
(python/datafusion/utils.py)
* Update `SessionContext.register_table(...)` to accept `Table |
TableProvider | objects exporting __datafusion_table_provider__` (Python + Rust)
* Deprecate `register_table_provider(...)` and `TableProvider.from_view()`
(Python + Rust) with warnings, while preserving behavior by delegating to new
API where appropriate.
* Make `DataFrame.into_view()` return a `TableProvider` (Python) and return
`PyTableProvider` from Rust `into_view`.
* Export a helpful error message constant `EXPECTED_PROVIDER_MSG` to give
clearer errors when users pass unsupported objects.
* Update docs and user-guide examples to use `TableProvider` +
`register_table`.
* Add/modify tests to cover the new APIs and coercion rules.
* Changelog entry documenting the deprecation of
`SessionContext.register_table_provider`.
**Files added**
* `python/datafusion/table_provider.py` — high-level Python wrapper around
the internal table provider.
* `python/datafusion/utils.py` — helper `_normalize_table_provider` and
pyarrow dataset handling.
* `src/table.rs` — `PyTableProvider` Rust implementation.
**Files modified (representative, not exhaustive)**
* Python: `__init__.py`, `catalog.py`, `context.py`, `dataframe.py`,
`io/table_provider.rst`, `data-sources.rst`, examples and tests under
`examples/` and `python/tests/`.
* Rust: `src/utils.rs`, `src/catalog.rs`, `src/context.rs`,
`src/dataframe.rs`, `src/udtf.rs`, `src/lib.rs`, and other modules adjusted to
use the new table provider helpers.
**Behavioral changes**
* `SessionContext.register_table(name, table)` now accepts:
* `datafusion.catalog.Table` (existing behavior preserved),
* `datafusion.TableProvider` (new wrapper),
* Objects exporting `__datafusion_table_provider__()` (pycapsule-based FFI
providers),
* `pyarrow.dataset.Dataset` instances.
* `SessionContext.register_table_provider(...)` is deprecated and will warn;
it forwards to `register_table` for backwards compatibility.
* `TableProvider.from_view()` is deprecated in favor of
`DataFrame.into_view()` and `TableProvider.from_dataframe()`; calling the
deprecated method emits a `DeprecationWarning`.
* `DataFrame.into_view()` now returns a `TableProvider` wrapper rather than
the older internal table representation exposed directly to Python.
* A common, clearer error message (`EXPECTED_PROVIDER_MSG`) is provided and
exported for tests and user-facing errors.
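
The acceptance rules above can be sketched in plain Python. This is only an illustration of the coercion order, not the code in this PR: the classes `Table`, `TableProvider`, and `DataFrame` below are hypothetical stand-ins for the real `datafusion` types, `normalize_table_provider` is an invented name, the wording of `EXPECTED_PROVIDER_MSG` is a placeholder, and the `pyarrow.dataset.Dataset` branch is omitted to keep the sketch dependency-free.

```python
# Hypothetical stand-ins; the real types live in the datafusion package.
class Table:
    """Stand-in for datafusion.catalog.Table."""


class TableProvider:
    """Stand-in for the new datafusion.TableProvider wrapper."""

    def __init__(self, inner):
        self.inner = inner


class DataFrame:
    """Stand-in for datafusion.DataFrame."""

    def into_view(self):
        # Per the PR, into_view() returns a TableProvider wrapper.
        return TableProvider(self)


# Placeholder wording; the actual message is defined by the PR.
EXPECTED_PROVIDER_MSG = (
    "Expected a Table, a TableProvider, or an object exporting "
    "__datafusion_table_provider__()."
)


def normalize_table_provider(obj):
    """Coerce a supported input into something registrable, or raise TypeError."""
    # Native tables and wrappers pass through unchanged.
    if isinstance(obj, (Table, TableProvider)):
        return obj
    # DataFrames are rejected with a hint toward the supported conversion.
    if isinstance(obj, DataFrame):
        raise TypeError(
            EXPECTED_PROVIDER_MSG
            + " Call DataFrame.into_view() or TableProvider.from_dataframe() first."
        )
    # Duck-typed FFI providers are wrapped.
    if callable(getattr(obj, "__datafusion_table_provider__", None)):
        return TableProvider(obj)
    raise TypeError(EXPECTED_PROVIDER_MSG)
```

In this sketch the `DataFrame` check comes before the capsule check so that the more specific hint wins; whether the PR orders the checks the same way is an assumption.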
---
## Are these changes tested?
Yes — the PR includes unit and integration test updates and additions in
`python/tests/` to cover:
* Registering a table from a `TableProvider` created via `from_capsule`,
`from_dataframe`, and via `DataFrame.into_view()`.
* Registering PyArrow `Dataset` objects via `Schema.register_table` and
`SessionContext.register_table`.
* Ensuring `DataFrame` objects raise a clear `TypeError` when passed
directly to `register_table` (guiding users to `into_view()` /
`from_dataframe()`).
* Tests asserting proper `DeprecationWarning` behavior for `from_view` and
`register_table_provider`.
If any tests still need to be added, they should exercise cross-language FFI
flows (Rust-provided pycapsule -> Python `TableProvider.from_capsule` ->
`register_table`).
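
As a rough illustration of the deprecation tests described above, the following sketch uses a hypothetical `SessionContextSketch` class in place of the real `SessionContext`. The warning text and the helper `register_with_deprecated_api` are assumptions made for the example; only the forward-and-warn behavior comes from this PR's description.

```python
import warnings


class SessionContextSketch:
    """Hypothetical stand-in for SessionContext, storing tables in a dict."""

    def __init__(self):
        self.tables = {}

    def register_table(self, name, table):
        # Preferred entrypoint.
        self.tables[name] = table

    def register_table_provider(self, name, provider):
        # Deprecated entrypoint: warn, then forward to register_table.
        warnings.warn(
            "register_table_provider is deprecated; use register_table instead",
            DeprecationWarning,
            stacklevel=2,
        )
        self.register_table(name, provider)


def register_with_deprecated_api(ctx, name, provider):
    """Call the deprecated method and return the warnings it emitted."""
    with warnings.catch_warnings(record=True) as caught:
        # Make sure the warning is recorded regardless of global filters.
        warnings.simplefilter("always")
        ctx.register_table_provider(name, provider)
    return caught
```

A test can then assert both halves of the contract: exactly one `DeprecationWarning` was recorded, and the provider still ended up registered under the given name.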
---
## Are there any user-facing changes?
Yes.
**API additions / changes**
* New public API: `datafusion.TableProvider` (Python).
* `DataFrame.into_view()` returns a `TableProvider` (Python).
* `SessionContext.register_table(name, table)` accepts broader inputs and is
the canonical registration API.
* `SessionContext.register_table_provider` is deprecated (will emit
`DeprecationWarning` and forward to `register_table`).
* `TableProvider.from_view()` is deprecated in favor of
`DataFrame.into_view()` and `TableProvider.from_dataframe()`.
* A new exported constant `datafusion._internal.EXPECTED_PROVIDER_MSG`
(re-exported as `datafusion.EXPECTED_PROVIDER_MSG`) provides a stable error
message for consumers and tests.
**Documentation**
* User guide snippets and examples updated to show the new `TableProvider`
and `register_table` usage patterns.
* A changelog deprecation entry has been added.
**Compatibility**
* Backwards compatibility is preserved where feasible: existing code that
calls `register_table_provider()` will continue to work but will receive a
deprecation warning.
* Users passing `DataFrame` objects directly to `register_table` will now
get a clear error directing them to `into_view()`/`from_dataframe()`.
**Breaking changes**
* This PR is designed to be minimally breaking. It intentionally deprecates
rather than removes prior APIs and issues `DeprecationWarning`s. However, code
that relied on internal implementation details of the old table provider
representation (rather than the stable public APIs) may require updates.
---
### Notes for reviewers
* Focus on the coercion logic (`coerce_table_provider` /
`_normalize_table_provider`): does it accept the right set of inputs and
provide clear errors? Are there additional types we should accept?
* Verify deprecation warning messaging and stacklevels to ensure they point
at user code rather than library internals.
* Confirm the documentation examples and user-guide reflect the recommended
patterns (using `TableProvider` + `register_table`).
* Ensure the exported `EXPECTED_PROVIDER_MSG` wording is acceptable and
stable for users and tests.
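
For the stacklevel point above, one possible check is to compare the line number recorded on the warning with the line of the call site. The sketch below uses only the standard library; `deprecated_helper` and `call_and_capture` are hypothetical names and do not appear in this PR.

```python
import inspect
import warnings


def deprecated_helper():
    # stacklevel=2 attributes the warning to the caller of this function
    # rather than to this line inside the library.
    warnings.warn(
        "deprecated_helper is deprecated",
        DeprecationWarning,
        stacklevel=2,
    )


def call_and_capture():
    """Call the deprecated helper and return (warning record, call-site line)."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        # The call happens on the line immediately after this one.
        call_line = inspect.currentframe().f_lineno + 1
        deprecated_helper()
    return caught[0], call_line
```

If the recorded `lineno` equals `call_line`, the warning points at the caller. With the default `stacklevel=1` it would instead point at the `warnings.warn` call inside `deprecated_helper`.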
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]