paleolimbot commented on code in PR #964: URL: https://github.com/apache/datafusion-python/pull/964#discussion_r2433521653
########## docs/source/user-guide/sql.rst: ########## @@ -23,17 +23,100 @@ DataFusion also offers a SQL API, read the full reference `here <https://arrow.a .. ipython:: python import datafusion - from datafusion import col - import pyarrow + from datafusion import DataFrame, SessionContext # create a context ctx = datafusion.SessionContext() # register a CSV - ctx.register_csv('pokemon', 'pokemon.csv') + ctx.register_csv("pokemon", "pokemon.csv") # create a new statement via SQL df = ctx.sql('SELECT "Attack"+"Defense", "Attack"-"Defense" FROM pokemon') # collect and convert to pandas DataFrame - df.to_pandas() \ No newline at end of file + df.to_pandas() + +Parameterized queries +--------------------- + +In DataFusion-Python 51.0.0 we introduced the ability to pass parameters +in a SQL query. These are similar in concept to +`prepared statements <https://datafusion.apache.org/user-guide/sql/prepared_statements.html>`_, +but allow passing named parameters into a SQL query. Consider this simple +example. + +.. ipython:: python + + def show_attacks(ctx: SessionContext, threshold: int) -> None: + ctx.sql( + 'SELECT "Name", "Attack" FROM pokemon WHERE "Attack" > $val', val=threshold + ).show(num=5) + show_attacks(ctx, 75) + +When passing parameters like the example above we convert the Python objects +into their string representation. We also have special case handling +for :py:class:`~datafusion.dataframe.DataFrame` objects, since they cannot simply +be turned into string representations for an SQL query. In these cases we +will register a temporary view in the :py:class:`~datafusion.context.SessionContext` +using a generated table name. + +The formatting for passing string replacement objects is to precede the +variable name with a single ``$``. This works for all dialects in +the SQL parser except ``hive`` and ``mysql``. Since these dialects do not +support named placeholders, we are unable to do this type of replacement. +We recommend either switching to another dialect or using Python +f-string style replacement. + +.. warning:: + + To support DataFrame parameterized queries, your session must support + registration of temporary views. The default + :py:class:`~datafusion.catalog.CatalogProvider` and + :py:class:`~datafusion.catalog.SchemaProvider` do have this capability. + If you have implemented custom providers, it is important that temporary + views do not persist across :py:class:`~datafusion.context.SessionContext` + or you may get unintended consequences. + +The following example shows passing in both a :py:class:`~datafusion.dataframe.DataFrame` +object as well as a Python object to be used in parameterized replacement. + +.. ipython:: python + + def show_column( + ctx: SessionContext, column: str, df: DataFrame, threshold: int + ) -> None: + ctx.sql( + 'SELECT "Name", $col FROM $df WHERE $col > $val', Review Comment: Another mechanism you could use instead of a parameter here is what DuckDB would call a replacement scan...in DuckDB you can `select * from foo` where `foo` can be a variable in the Python frame from whence `.sql()` was called. I believe you could implement that as a Catalog where fetching a table of name `foo` would look for a Python variable. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
