timsaucer commented on code in PR #1243: URL: https://github.com/apache/datafusion-python/pull/1243#discussion_r2366188433
########## docs/source/user-guide/io/table_provider.rst: ########## @@ -39,20 +39,47 @@ A complete example can be found in the `examples folder <https://github.com/apac ) -> PyResult<Bound<'py, PyCapsule>> { let name = CString::new("datafusion_table_provider").unwrap(); - let provider = Arc::new(self.clone()) - .map_err(|e| PyRuntimeError::new_err(e.to_string()))?; - let provider = FFI_TableProvider::new(Arc::new(provider), false); + let provider = Arc::new(self.clone()); + let provider = FFI_TableProvider::new(provider, false, None); PyCapsule::new_bound(py, provider, Some(name.clone())) } } -Once you have this library available, in python you can register your table provider -to the ``SessionContext``. +Once you have this library available, you can construct a +:py:class:`~datafusion.TableProvider` in Python and register it with the +``SessionContext``. Table providers can be created either from the PyCapsule exposed by +your Rust provider or from an existing :py:class:`~datafusion.dataframe.DataFrame`. +Call the provider's ``__datafusion_table_provider__()`` method to obtain the capsule +before constructing a ``TableProvider``. The ``TableProvider.from_view()`` helper is +deprecated; instead use ``TableProvider.from_dataframe()`` or ``DataFrame.into_view()``. + +.. note:: + + :py:meth:`~datafusion.context.SessionContext.register_table_provider` is + deprecated. Use + :py:meth:`~datafusion.context.SessionContext.register_table` with the + resulting :py:class:`~datafusion.TableProvider` instead. .. code-block:: python + from datafusion import SessionContext, TableProvider + + ctx = SessionContext() provider = MyTableProvider() - ctx.register_table_provider("my_table", provider) - ctx.table("my_table").show() + capsule = provider.__datafusion_table_provider__() + capsule_provider = TableProvider.from_capsule(capsule) + + df = ctx.from_pydict({"a": [1]}) + view_provider = TableProvider.from_dataframe(df) + # or: view_provider = df.into_view() + + ctx.register_table("capsule_table", capsule_provider) + ctx.register_table("view_table", view_provider) + + ctx.table("capsule_table").show() + ctx.table("view_table").show() Review Comment: This example takes a bit of cognitive load to understand what we're demonstrating. First off, similar to my comments above I don't think we want our users to have to think about if they're using something that comes from a PyCapsule interface or not. Suppose I am a library user and I get a delta table object that implements PyCapsule. As a user of that library, I shouldn't have to understand how the interfacing works. I should just be able to use it directly. So I want to be able to just pass those objects directly to `TableProvider` or `register_table` without having to think about or understand these mechanics behind the scene. ########## dev/changelog/49.0.0.md: ########## @@ -25,6 +25,10 @@ This release consists of 16 commits from 7 contributors. See credits at the end - fix(build): Include build.rs in published crates [#1199](https://github.com/apache/datafusion-python/pull/1199) (colinmarc) +**Deprecations:** + +- Document that `SessionContext.register_table_provider` is deprecated in favor of `SessionContext.register_table`. + Review Comment: These changelogs are automatically generated, so I don't think we want to make changes here. Regardless, these would go into the 51.0.0 release. ########## docs/source/user-guide/data-sources.rst: ########## @@ -152,13 +152,22 @@ as Delta Lake. This will require a recent version of .. code-block:: python from deltalake import DeltaTable + from datafusion import TableProvider delta_table = DeltaTable("path_to_table") - ctx.register_table_provider("my_delta_table", delta_table) + provider = TableProvider.from_capsule(delta_table.__datafusion_table_provider__()) + ctx.register_table("my_delta_table", provider) Review Comment: This feels like a worse experience than before. Why can we not just call `register_table("my_delta_table", delta_table)`? ########## python/datafusion/__init__.py: ########## @@ -21,24 +21,26 @@ See https://datafusion.apache.org/python for more information. """ +# isort: skip_file # Prevent import-sorting linter errors (I001) +# ruff: noqa: I001 Review Comment: Is this ruff lint causing a problem? ########## docs/source/conf.py: ########## @@ -94,6 +97,10 @@ def autoapi_skip_member_fn(app, what, name, obj, skip, options) -> bool: # noqa if (what, name) in skip_contents: skip = True + # Skip private members that start with underscore to avoid duplication + if name.split(".")[-1].startswith("_") and what in ("data", "variable"): + skip = True + Review Comment: So I can understand better, why do we need both this rule and the one above in lines 86-88? ########## python/datafusion/dataframe.py: ########## @@ -313,9 +315,29 @@ def __init__(self, df: DataFrameInternal) -> None: """ self.df = df - def into_view(self) -> pa.Table: - """Convert DataFrame as a ViewTable which can be used in register_table.""" - return self.df.into_view() + def into_view(self) -> TableProvider: + """Convert ``DataFrame`` into a ``TableProvider`` view for registration. + + This is the preferred way to obtain a view for + :py:meth:`~datafusion.context.SessionContext.register_table`. Review Comment: I don't understand this statement. ########## python/datafusion/dataframe.py: ########## @@ -313,9 +315,29 @@ def __init__(self, df: DataFrameInternal) -> None: """ self.df = df - def into_view(self) -> pa.Table: - """Convert DataFrame as a ViewTable which can be used in register_table.""" - return self.df.into_view() + def into_view(self) -> TableProvider: + """Convert ``DataFrame`` into a ``TableProvider`` view for registration. + + This is the preferred way to obtain a view for + :py:meth:`~datafusion.context.SessionContext.register_table`. + ``datafusion.TableProvider.from_dataframe`` calls this method under the hood, + and the older ``TableProvider.from_view`` helper is deprecated. + + The ``DataFrame`` remains valid after conversion, so it can still be used for + additional queries alongside the returned view. + + Examples: + >>> from datafusion import SessionContext + >>> ctx = SessionContext() + >>> df = ctx.sql("SELECT 1 AS value") + >>> provider = df.into_view() Review Comment: From an end user's perspective, they turn a dataframe into a view, which they then register so they can use it later. I don't think this end user needs to understand the concept of TableProvider at all. In the example I would change the variable name `provider` to `view` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
