timsaucer commented on code in PR #1221:
URL: 
https://github.com/apache/datafusion-python/pull/1221#discussion_r2347296608


##########
docs/source/user-guide/dataframe/index.rst:
##########
@@ -126,6 +126,53 @@ DataFusion's DataFrame API offers a wide range of 
operations:
     # Drop columns
     df = df.drop("temporary_column")
 
+String Columns and Expressions
+------------------------------
+
+Some ``DataFrame`` methods accept plain strings when an argument refers to an

Review Comment:
   recommend "plain strings" -> "column names"



##########
docs/source/user-guide/dataframe/index.rst:
##########
@@ -126,6 +126,53 @@ DataFusion's DataFrame API offers a wide range of 
operations:
     # Drop columns
     df = df.drop("temporary_column")
 
+String Columns and Expressions
+------------------------------
+
+Some ``DataFrame`` methods accept plain strings when an argument refers to an
+existing column. These include:
+
+* :py:meth:`~datafusion.DataFrame.select`
+* :py:meth:`~datafusion.DataFrame.sort`
+* :py:meth:`~datafusion.DataFrame.drop`
+* :py:meth:`~datafusion.DataFrame.join` (``on`` argument)
+* :py:meth:`~datafusion.DataFrame.aggregate` (grouping columns)
+
+Note that :py:meth:`~datafusion.DataFrame.join_on` expects 
``col()``/``column()`` expressions rather than plain strings.
+
+For such methods, you can pass column names directly:
+
+.. code-block:: python
+
+    from datafusion import col, functions as f
+
+    df.sort('id')
+    df.aggregate('id', [f.count(col('value'))])
+
+The same operation can also be written with explicit column expressions, using 
either ``col()`` or ``column()``:
+
+.. code-block:: python
+
+    from datafusion import col, column, functions as f
+
+    df.sort(col('id'))
+    df.aggregate(column('id'), [f.count(col('value'))])
+
+Note that ``column()`` is an alias of ``col()``, so you can use either name; 
the example above shows both in action.
+
+Whenever an argument represents an expression—such as in
+:py:meth:`~datafusion.DataFrame.filter` or
+:py:meth:`~datafusion.DataFrame.with_column`—use ``col()`` to reference columns
+and wrap constant values with ``lit()`` (also available as ``literal()``):
+
+.. code-block:: python
+
+    from datafusion import col, lit
+    df.filter(col('age') > lit(21))
+
+Without ``lit()`` DataFusion would treat ``21`` as a column name rather than a
+constant value.

Review Comment:
   Is this statement true? `df.filter(col('age') > 21)`  would treat `21` as a 
column name? I think that's a change in how the comparison operator works. 



##########
python/datafusion/expr.py:
##########
@@ -212,32 +224,119 @@
     "WindowExpr",
     "WindowFrame",
     "WindowFrameBound",
+    "ensure_expr",
+    "ensure_expr_list",
 ]
 
 
+def ensure_expr(value: _typing.Union[Expr, Any]) -> expr_internal.Expr:
+    """Return the internal expression from ``Expr`` or raise ``TypeError``.
+
+    This helper rejects plain strings and other non-:class:`Expr` values so
+    higher level APIs consistently require explicit :func:`~datafusion.col` or
+    :func:`~datafusion.lit` expressions.
+
+    Args:
+        value: Candidate expression or other object.
+
+    Returns:
+        The internal expression representation.
+
+    Raises:
+        TypeError: If ``value`` is not an instance of :class:`Expr`.
+    """
+    if not isinstance(value, Expr):
+        raise TypeError(EXPR_TYPE_ERROR)
+    return value.expr
+
+
+def ensure_expr_list(
+    exprs: Iterable[_typing.Union[Expr, Iterable[Expr]]],
+) -> list[expr_internal.Expr]:
+    """Flatten an iterable of expressions, validating each via ``ensure_expr``.
+
+    Args:
+        exprs: Possibly nested iterable containing expressions.
+
+    Returns:
+        A flat list of raw expressions.
+
+    Raises:
+        TypeError: If any item is not an instance of :class:`Expr`.
+    """
+
+    def _iter(
+        items: Iterable[_typing.Union[Expr, Iterable[Expr]]],
+    ) -> Iterable[expr_internal.Expr]:
+        for expr in items:
+            if isinstance(expr, Iterable) and not isinstance(
+                expr, (Expr, str, bytes, bytearray)
+            ):
+                # Treat string-like objects as atomic to surface standard 
errors
+                yield from _iter(expr)
+            else:
+                yield ensure_expr(expr)
+
+    return list(_iter(exprs))
+
+
+def _to_raw_expr(value: _typing.Union[Expr, str]) -> expr_internal.Expr:
+    """Convert a Python expression or column name to its raw variant.
+
+    Args:
+        value: Candidate expression or column name.
+
+    Returns:
+        The internal :class:`~datafusion._internal.expr.Expr` representation.
+
+    Raises:
+        TypeError: If ``value`` is neither an :class:`Expr` nor ``str``.
+    """
+    if isinstance(value, str):
+        return Expr.column(value).expr
+    if isinstance(value, Expr):
+        return value.expr
+    error = (
+        "Expected Expr or column name, found:"
+        f" {type(value).__name__}. {EXPR_TYPE_ERROR}."
+    )
+    raise TypeError(error)
+
+
 def expr_list_to_raw_expr_list(
-    expr_list: Optional[list[Expr] | Expr],
+    expr_list: Optional[_typing.Union[Sequence[_typing.Union[Expr, str]], 
Expr, str]],

Review Comment:
   Why the change from `|` to `_typing.Union` ? I'm not completely against it 
but it seems we're inconsistent in our approach



##########
python/datafusion/dataframe.py:
##########
@@ -414,9 +418,17 @@ def filter(self, *predicates: Expr) -> DataFrame:
         """Return a DataFrame for which ``predicate`` evaluates to ``True``.
 
         Rows for which ``predicate`` evaluates to ``False`` or ``None`` are 
filtered
-        out.  If more than one predicate is provided, these predicates will be
-        combined as a logical AND. If more complex logic is required, see the
-        logical operations in :py:mod:`~datafusion.functions`.
+        out. If more than one predicate is provided, these predicates will be
+        combined as a logical AND. Each ``predicate`` must be an
+        :class:`~datafusion.expr.Expr` created using helper functions such as
+        :func:`datafusion.col` or :func:`datafusion.lit`; plain strings are not
+        accepted. If more complex logic is required, see the logical 
operations in
+        :py:mod:`~datafusion.functions`.

Review Comment:
   I would remove this part that says "plain strings are not accepted". When 
you view it from the context of this PR and Issue, the statement makes sense. 
But if you are an external user coming across this function definition for the 
first time, I don't think the statement makes sense. I think you can just 
remove the change here.



##########
docs/source/user-guide/dataframe/index.rst:
##########
@@ -126,6 +126,53 @@ DataFusion's DataFrame API offers a wide range of 
operations:
     # Drop columns
     df = df.drop("temporary_column")
 
+String Columns and Expressions
+------------------------------
+
+Some ``DataFrame`` methods accept plain strings when an argument refers to an
+existing column. These include:

Review Comment:
   We should probably add a note to see the full function documentation for 
details on any specific function.



##########
python/datafusion/dataframe.py:
##########
@@ -426,36 +438,49 @@ def filter(self, *predicates: Expr) -> DataFrame:
         """
         df = self.df
         for p in predicates:
-            df = df.filter(p.expr)
+            df = df.filter(ensure_expr(p))
         return DataFrame(df)
 
     def with_column(self, name: str, expr: Expr) -> DataFrame:
         """Add an additional column to the DataFrame.
 
+        The ``expr`` must be an :class:`~datafusion.expr.Expr` constructed with
+        :func:`datafusion.col` or :func:`datafusion.lit`; plain strings are not
+        accepted.

Review Comment:
   Same comment as before.



##########
docs/source/user-guide/dataframe/index.rst:
##########
@@ -126,6 +126,53 @@ DataFusion's DataFrame API offers a wide range of 
operations:
     # Drop columns
     df = df.drop("temporary_column")
 
+String Columns and Expressions

Review Comment:
   I think the title here is misleading. "String Columns" to me would mean 
columns that contain string values. I think maybe we should call this something 
like "Function arguments taking column names" or "Column names as function 
arguments"



##########
python/datafusion/dataframe.py:
##########
@@ -745,8 +766,15 @@ def join_on(
     ) -> DataFrame:
         """Join two :py:class:`DataFrame` using the specified expressions.
 
-        On expressions are used to support in-equality predicates. Equality
-        predicates are correctly optimized
+        Join predicates must be :class:`~datafusion.expr.Expr` objects, 
typically
+        built with :func:`datafusion.col`; plain strings are not accepted. On

Review Comment:
   Same comment as before



##########
python/datafusion/expr.py:
##########
@@ -212,32 +224,119 @@
     "WindowExpr",
     "WindowFrame",
     "WindowFrameBound",
+    "ensure_expr",
+    "ensure_expr_list",
 ]
 
 
+def ensure_expr(value: _typing.Union[Expr, Any]) -> expr_internal.Expr:
+    """Return the internal expression from ``Expr`` or raise ``TypeError``.
+
+    This helper rejects plain strings and other non-:class:`Expr` values so
+    higher level APIs consistently require explicit :func:`~datafusion.col` or
+    :func:`~datafusion.lit` expressions.
+
+    Args:
+        value: Candidate expression or other object.
+
+    Returns:
+        The internal expression representation.
+
+    Raises:
+        TypeError: If ``value`` is not an instance of :class:`Expr`.
+    """
+    if not isinstance(value, Expr):
+        raise TypeError(EXPR_TYPE_ERROR)
+    return value.expr
+
+
+def ensure_expr_list(
+    exprs: Iterable[_typing.Union[Expr, Iterable[Expr]]],
+) -> list[expr_internal.Expr]:
+    """Flatten an iterable of expressions, validating each via ``ensure_expr``.
+
+    Args:
+        exprs: Possibly nested iterable containing expressions.
+
+    Returns:
+        A flat list of raw expressions.
+
+    Raises:
+        TypeError: If any item is not an instance of :class:`Expr`.
+    """
+
+    def _iter(
+        items: Iterable[_typing.Union[Expr, Iterable[Expr]]],
+    ) -> Iterable[expr_internal.Expr]:
+        for expr in items:
+            if isinstance(expr, Iterable) and not isinstance(
+                expr, (Expr, str, bytes, bytearray)
+            ):
+                # Treat string-like objects as atomic to surface standard 
errors
+                yield from _iter(expr)
+            else:
+                yield ensure_expr(expr)
+
+    return list(_iter(exprs))
+
+
+def _to_raw_expr(value: _typing.Union[Expr, str]) -> expr_internal.Expr:
+    """Convert a Python expression or column name to its raw variant.
+
+    Args:
+        value: Candidate expression or column name.
+
+    Returns:
+        The internal :class:`~datafusion._internal.expr.Expr` representation.
+
+    Raises:
+        TypeError: If ``value`` is neither an :class:`Expr` nor ``str``.
+    """
+    if isinstance(value, str):
+        return Expr.column(value).expr
+    if isinstance(value, Expr):
+        return value.expr
+    error = (
+        "Expected Expr or column name, found:"
+        f" {type(value).__name__}. {EXPR_TYPE_ERROR}."
+    )
+    raise TypeError(error)
+
+
 def expr_list_to_raw_expr_list(
-    expr_list: Optional[list[Expr] | Expr],
+    expr_list: Optional[_typing.Union[Sequence[_typing.Union[Expr, str]], 
Expr, str]],

Review Comment:
   FWIW I find the `_typing.Union` to be an eye sore. Would it be possible to 
just import Union if we're going in this direction? 
   
   Also the `|` is the preferred syntax in Python 3.10 and later and 3.9 
reaches end of life next month. Maybe we just stick with `|` and at the end of 
October update our minimum version to 3.10. What do you think?



##########
python/datafusion/dataframe.py:
##########
@@ -426,36 +438,49 @@ def filter(self, *predicates: Expr) -> DataFrame:
         """
         df = self.df
         for p in predicates:
-            df = df.filter(p.expr)
+            df = df.filter(ensure_expr(p))
         return DataFrame(df)
 
     def with_column(self, name: str, expr: Expr) -> DataFrame:
         """Add an additional column to the DataFrame.
 
+        The ``expr`` must be an :class:`~datafusion.expr.Expr` constructed with
+        :func:`datafusion.col` or :func:`datafusion.lit`; plain strings are not
+        accepted.
+
+        Example::
+
+            from datafusion import col, lit
+            df.with_column("b", col("a") + lit(1))
+
         Args:
             name: Name of the column to add.
             expr: Expression to compute the column.
 
         Returns:
             DataFrame with the new column.
         """
-        return DataFrame(self.df.with_column(name, expr.expr))
+        return DataFrame(self.df.with_column(name, ensure_expr(expr)))
 
     def with_columns(
         self, *exprs: Expr | Iterable[Expr], **named_exprs: Expr
     ) -> DataFrame:
         """Add columns to the DataFrame.
 
-        By passing expressions, iteratables of expressions, or named 
expressions. To
-        pass named expressions use the form name=Expr.
+        By passing expressions, iterables of expressions, or named expressions.
+        All expressions must be :class:`~datafusion.expr.Expr` objects created 
via
+        :func:`datafusion.col` or :func:`datafusion.lit`; plain strings are not

Review Comment:
   Same comment as before



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to