Re: [PR] Complete doc string examples for functions.py [datafusion-python]

via GitHub Fri, 27 Mar 2026 07:18:58 -0700


timsaucer commented on code in PR #1435:
URL: 
https://github.com/apache/datafusion-python/pull/1435#discussion_r3001143614



##########
python/datafusion/functions.py:
##########
@@ -490,6 +502,13 @@ def alias(expr: Expr, name: str, metadata: dict[str, str] 
| None = None) -> Expr
         ...     dfn.functions.alias(dfn.col("a"), "b")
         ... ).collect_column("b")[0].as_py()
         1
+
+        >>> df.select(
+        ...     dfn.functions.alias(
+        ...         dfn.col("a"), "b", metadata={"info": "test"}
+        ...     )
+        ... ).collect_column("b")[0].as_py()
+        1

Review Comment:
   This doesn't really demonstrate the metadata output. Instead of converting 
to py maybe we need to check something else. I'm not 100% sure what right now 
but I'll take a look after finishing this review.



##########
python/datafusion/functions.py:
##########
@@ -3411,12 +3637,20 @@ def count(
         filter: If provided, only compute against rows for which the filter is 
True
 
     Examples:
-    ---------
-    >>> ctx = dfn.SessionContext()
-    >>> df = ctx.from_pydict({"a": [1, 2, 3]})
-    >>> result = df.aggregate([], 
[dfn.functions.count(dfn.col("a")).alias("v")])
-    >>> result.collect_column("v")[0].as_py()
-    3
+        >>> ctx = dfn.SessionContext()
+        >>> df = ctx.from_pydict({"a": [1, 2, 3]})
+        >>> result = df.aggregate([], 
[dfn.functions.count(dfn.col("a")).alias("v")])
+        >>> result.collect_column("v")[0].as_py()
+        3
+
+        >>> df = ctx.from_pydict({"a": [1, 1, 2, 3]})
+        >>> result = df.aggregate(
+        ...     [], [dfn.functions.count(
+        ...         dfn.col("a"), distinct=True,
+        ...         filter=dfn.col("a") > dfn.lit(1),
+        ...     ).alias("v")])
+        >>> result.collect_column("v")[0].as_py()
+        2

Review Comment:
   nit: would be slightly easier for users to see the difference in the 
examples if they had the same formatting but I bet it's the linter making the 
call



##########
python/datafusion/functions.py:
##########
@@ -3548,12 +3802,20 @@ def median(
         filter: If provided, only compute against rows for which the filter is 
True
 
     Examples:
-    ---------
-    >>> ctx = dfn.SessionContext()
-    >>> df = ctx.from_pydict({"a": [1.0, 2.0, 3.0]})
-    >>> result = df.aggregate([], 
[dfn.functions.median(dfn.col("a")).alias("v")])
-    >>> result.collect_column("v")[0].as_py()
-    2.0
+        >>> ctx = dfn.SessionContext()
+        >>> df = ctx.from_pydict({"a": [1.0, 2.0, 3.0]})
+        >>> result = df.aggregate([], 
[dfn.functions.median(dfn.col("a")).alias("v")])
+        >>> result.collect_column("v")[0].as_py()

Review Comment:
   It would be good to use the same values for both examples on the input so 
the user can see the difference that distinct makes.



##########
python/datafusion/functions.py:
##########
@@ -4216,12 +4627,20 @@ def bit_xor(
         filter: If provided, only compute against rows for which the filter is 
True
 
     Examples:
-    ---------
-    >>> ctx = dfn.SessionContext()
-    >>> df = ctx.from_pydict({"a": [5, 3]})
-    >>> result = df.aggregate([], 
[dfn.functions.bit_xor(dfn.col("a")).alias("v")])
-    >>> result.collect_column("v")[0].as_py()
-    6
+        >>> ctx = dfn.SessionContext()
+        >>> df = ctx.from_pydict({"a": [5, 3]})
+        >>> result = df.aggregate([], 
[dfn.functions.bit_xor(dfn.col("a")).alias("v")])
+        >>> result.collect_column("v")[0].as_py()
+        6
+
+        >>> df = ctx.from_pydict({"a": [5, 5, 3]})
+        >>> result = df.aggregate(
+        ...     [], [dfn.functions.bit_xor(
+        ...         dfn.col("a"), distinct=True,
+        ...         filter=dfn.col("a") > dfn.lit(0),
+        ...     ).alias("v")])
+        >>> result.collect_column("v")[0].as_py()
+        6

Review Comment:
   It would be good to use values that generate different results in the 
examples



##########
python/datafusion/functions.py:
##########
@@ -1159,14 +1174,21 @@ def lpad(string: Expr, count: Expr, characters: Expr | 
None = None) -> Expr:
     truncated (on the right).
 
     Examples:
-    ---------
-    >>> ctx = dfn.SessionContext()
-    >>> df = ctx.from_pydict({"a": ["the cat", "a hat"]})
-    >>> lpad_df = df.select(dfn.functions.lpad(dfn.col("a"), 
dfn.lit(6)).alias("lpad"))
-    >>> lpad_df.collect_column("lpad")[0].as_py()
-    'the ca'
-    >>> lpad_df.collect_column("lpad")[1].as_py()
-    ' a hat'
+        >>> ctx = dfn.SessionContext()
+        >>> df = ctx.from_pydict({"a": ["the cat", "a hat"]})
+        >>> lpad_df = df.select(
+        ...     dfn.functions.lpad(dfn.col("a"), dfn.lit(6)).alias("lpad"))
+        >>> lpad_df.collect_column("lpad")[0].as_py()
+        'the ca'
+        >>> lpad_df.collect_column("lpad")[1].as_py()
+        ' a hat'
+
+        >>> result = df.select(
+        ...     dfn.functions.lpad(
+        ...         dfn.col("a"), dfn.lit(10), dfn.lit(".")
+        ...     ).alias("lpad"))

Review Comment:
   I'm wondering if we should use keywords to help make it more explicit, but 
not a blocker.



##########
python/datafusion/functions.py:
##########
@@ -3862,6 +4175,15 @@ def regr_intercept(
         ...     [dfn.functions.regr_intercept(dfn.col("y"), 
dfn.col("x")).alias("v")])
         >>> result.collect_column("v")[0].as_py()
         0.0
+
+        >>> result = df.aggregate(
+        ...     [],
+        ...     [dfn.functions.regr_intercept(
+        ...         dfn.col("y"), dfn.col("x"),
+        ...         filter=dfn.col("y") > dfn.lit(2.0)
+        ...     ).alias("v")])
+        >>> result.collect_column("v")[0].as_py()
+        0.0

Review Comment:
   It would be good to use values that generate different results in the 
examples



##########
python/datafusion/functions.py:
##########
@@ -3893,6 +4215,14 @@ def regr_r2(
         ...     [], [dfn.functions.regr_r2(dfn.col("y"), 
dfn.col("x")).alias("v")])
         >>> result.collect_column("v")[0].as_py()
         1.0
+
+        >>> result = df.aggregate(
+        ...     [], [dfn.functions.regr_r2(
+        ...         dfn.col("y"), dfn.col("x"),
+        ...         filter=dfn.col("y") > dfn.lit(2.0)
+        ...     ).alias("v")])
+        >>> result.collect_column("v")[0].as_py()
+        1.0

Review Comment:
   It would be good to use values that generate different results in the 
examples



##########
python/datafusion/functions.py:
##########
@@ -3700,6 +3980,15 @@ def var_pop(expression: Expr, filter: Expr | None = 
None) -> Expr:
         >>> result = df.aggregate([], 
[dfn.functions.var_pop(dfn.col("a")).alias("v")])
         >>> result.collect_column("v")[0].as_py()
         1.0
+
+        >>> df = ctx.from_pydict({"a": [-1.0, 0.0, 2.0]})
+        >>> result = df.aggregate(
+        ...     [], [dfn.functions.var_pop(
+        ...         dfn.col("a"),
+        ...         filter=dfn.col("a") > dfn.lit(-1.0)
+        ...     ).alias("v")])
+        >>> result.collect_column("v")[0].as_py()
+        1.0

Review Comment:
   It would be good to use consistent values in both examples so that users can 
see the impact of the filter.



##########
python/datafusion/functions.py:
##########
@@ -3924,6 +4254,14 @@ def regr_slope(
         ...     [], [dfn.functions.regr_slope(dfn.col("y"), 
dfn.col("x")).alias("v")])
         >>> result.collect_column("v")[0].as_py()
         2.0
+
+        >>> result = df.aggregate(
+        ...     [], [dfn.functions.regr_slope(
+        ...         dfn.col("y"), dfn.col("x"),
+        ...         filter=dfn.col("y") > dfn.lit(2.0)
+        ...     ).alias("v")])
+        >>> result.collect_column("v")[0].as_py()
+        2.0

Review Comment:
   It would be good to use values that generate different results in the 
examples



##########
python/datafusion/functions.py:
##########
@@ -4137,6 +4521,17 @@ def nth_value(
         ... )
         >>> result.collect_column("v")[0].as_py()
         20
+
+        >>> df = ctx.from_pydict({"a": [None, 20, 10]})
+        >>> result = df.aggregate(
+        ...     [], [dfn.functions.nth_value(
+        ...         dfn.col("a"), 1,
+        ...         filter=dfn.col("a") > dfn.lit(10),
+        ...         order_by="a",
+        ...         null_treatment=dfn.common.NullTreatment.IGNORE_NULLS,
+        ...     ).alias("v")])
+        >>> result.collect_column("v")[0].as_py()
+        20

Review Comment:
   It would be good to use values that generate different results in the 
examples



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Complete doc string examples for functions.py [datafusion-python]

Reply via email to