andygrove opened a new pull request, #1614: URL: https://github.com/apache/datafusion-ballista/pull/1614
# Which issue does this PR close?

Closes #.

# Rationale for this change

Ballista's Python bindings extend `datafusion-python` heavily through subclassing and metaclass introspection (see `python/python/ballista/extension.py`):

- `RedefiningDataFrameMeta` walks the parent `DataFrame.__dict__` and re-wraps every method whose return annotation is the literal string `"DataFrame"` so it returns `DistributedDataFrame` instead.
- `RedefiningSessionContextMeta` does the same for `SessionContext`.
- A hardcoded `EXECUTION_METHODS = ["collect", "collect_partitioned", "show", "count", "to_arrow_table", "to_pandas", "to_polars", "write_json"]` is wrapped to route execution through the Ballista cluster.

If a future `datafusion-python` release changes annotation style (e.g. switches from forward-reference strings to real class objects, or to PEP 604 unions) or renames any of those methods, the wrapping silently stops happening. Queries quietly fall back to local DataFusion execution while every existing test still passes. The failure mode is invisible until users notice their cluster doing nothing.

Today only `collect()` is exercised under Ballista in `test_context.py`. Nothing asserts that wrapping actually occurred, and the other seven `EXECUTION_METHODS` are entirely uncovered.

# What changes are included in this PR?

New file `python/python/tests/test_datafusion_compat.py` with 11 tests in three groups:

**Metaclass smoke tests (3)**: fail loudly if introspection no longer matches:

- `test_distributed_dataframe_wraps_dataframe_returning_methods`: confirms representative `DataFrame` methods (`select`, `filter`, `with_column`, `aggregate`) carry the string `"DataFrame"` return annotation and are re-wrapped on `DistributedDataFrame`.
- `test_ballista_session_context_wraps_dataframe_returning_methods`: same check for `sql` / `read_csv` / `read_parquet` on `BallistaSessionContext`.
- `test_execution_methods_are_present_on_dataframe`: every name in `EXECUTION_METHODS` still exists on `datafusion.DataFrame`.

**Per-method round-trip tests (8)**: one per name in `EXECUTION_METHODS`. Each builds a small `DistributedDataFrame` and calls one of `collect`, `collect_partitioned`, `show`, `count`, `to_arrow_table`, `to_pandas`, `to_polars`, and `write_json`, asserting return shape and content. Catches both renames (loud `AttributeError`) and silent fallback (return type would be wrong).

**Dev dependency additions**: `pandas>=2.0.0` and `polars>=1.0.0` added to `[dependency-groups].dev` in `python/pyproject.toml` so the `to_pandas` / `to_polars` tests run unconditionally in CI rather than skipping when those libraries are absent. `uv.lock` is regenerated accordingly.

Note: `write_json` requires its `write_options` argument to be passed explicitly even though datafusion's signature declares it optional with a `None` default. This is captured in a comment in the test.

# Are there any user-facing changes?

No. New tests and dev dependencies only.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
For additional commands, e-mail: [email protected]
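
The failure mode described in the rationale can be reproduced with a toy sketch. This is not the code in `extension.py`: `RewrappingMeta`, `ChangedDataFrame`, `SilentlyLocal`, the `wrapped_names` attribute, and the stand-in `DataFrame` class are all hypothetical names invented for illustration, and the sketch only walks direct base classes. It assumes the metaclass matches on the literal string `"DataFrame"`, as the description states.

```python
import functools
import inspect


class DataFrame:
    """Toy stand-in for datafusion.DataFrame (hypothetical, not the real class)."""

    def __init__(self, rows):
        self.rows = list(rows)

    # String (forward-reference) annotation: the style the metaclass matches on.
    def filter(self, predicate) -> "DataFrame":
        return DataFrame(r for r in self.rows if predicate(r))

    def count(self) -> int:
        return len(self.rows)


class RewrappingMeta(type):
    """Re-wrap parent methods whose return annotation is the string "DataFrame"."""

    def __new__(mcs, name, bases, namespace):
        cls = super().__new__(mcs, name, bases, namespace)
        cls.wrapped_names = []  # hypothetical bookkeeping, handy for smoke tests
        for base in bases:
            for attr, member in vars(base).items():
                if attr in namespace or not inspect.isfunction(member):
                    continue
                if member.__annotations__.get("return") != "DataFrame":
                    continue  # anything but the exact string is skipped silently
                setattr(cls, attr, mcs._rewrap(cls, member))
                cls.wrapped_names.append(attr)
        return cls

    @staticmethod
    def _rewrap(target_cls, method):
        @functools.wraps(method)
        def wrapper(self, *args, **kwargs):
            result = method(self, *args, **kwargs)
            return target_cls(result.rows)  # convert the parent result to the subclass

        return wrapper


class DistributedDataFrame(DataFrame, metaclass=RewrappingMeta):
    pass


class ChangedDataFrame:
    """Same toy API, standing in for an upstream release with a new annotation style."""

    def __init__(self, rows):
        self.rows = list(rows)

    def filter(self, predicate):
        return ChangedDataFrame(r for r in self.rows if predicate(r))


# Simulate upstream switching from the string "DataFrame" to a real class object.
ChangedDataFrame.filter.__annotations__["return"] = ChangedDataFrame


class SilentlyLocal(ChangedDataFrame, metaclass=RewrappingMeta):
    pass


wrapped = DistributedDataFrame([1, 2, 3, 4]).filter(lambda r: r % 2 == 0)
print(type(wrapped).__name__)    # DistributedDataFrame

unwrapped = SilentlyLocal([1, 2, 3, 4]).filter(lambda r: r % 2 == 0)
print(type(unwrapped).__name__)  # ChangedDataFrame: wrapping never happened
```

In the second case nothing raises and the data is still correct, which is why a test that only checks results would keep passing. A smoke test in the spirit of this PR would assert on the wrapping itself, for example that `"filter"` appears in the (hypothetical) `wrapped_names` list or that the returned object has the subclass type.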
