dtenedor opened a new pull request, #43946: URL: https://github.com/apache/spark/pull/43946
### What changes were proposed in this pull request?

This PR updates the Python user-defined table function (UDTF) API for the `analyze` method to support general expressions for the `partitionBy` and `orderBy` fields of the `AnalyzeResult` class.

For example, the following UDTF specifies to partition by `partition_col / 10` so that all rows with values of this column between 0-9 arrive in the same partition, then all rows with values between 10-19 in the next partition, and so on.

```
@udtf
class TestUDTF:
    def __init__(self):
        self._partition_col = None
        self._count = 0
        self._sum = 0
        self._last = None

    @staticmethod
    def analyze(*args, **kwargs):
        return AnalyzeResult(
            schema=StructType()
            .add("partition_col", IntegerType())
            .add("count", IntegerType())
            .add("total", IntegerType())
            .add("last", IntegerType()),
            partitionBy=[PartitioningExpression("partition_col / 10")],
            orderBy=[
                OrderingExpression(value="input", ascending=True, overrideNullsFirst=False)
            ],
        )

    def eval(self, row: Row):
        self._partition_col = row["partition_col"]
        self._count += 1
        self._last = row["input"]
        if row["input"] is not None:
            self._sum += row["input"]

    def terminate(self):
        yield self._partition_col, self._count, self._sum, self._last
```

### Why are the changes needed?

This lets the UDTF partition by simple references to the columns of the input table just like before, but also gives the option to partition by general expressions based on those columns (just like the explicit `PARTITION BY` and `ORDER BY` clauses in the UDTF call in SQL).

### Does this PR introduce _any_ user-facing change?

Yes, see above.

### How was this patch tested?

This PR includes test coverage.

### Was this patch authored or co-authored using generative AI tooling?

No.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
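For readers who want to see the partitioning behavior described in the PR without a Spark session, here is a minimal plain-Python sketch. It is not Spark's implementation and does not use the PySpark API. The helper names `bucket`, `sort_key`, and `simulate` are hypothetical. It assumes the partitioning expression buckets rows by tens (modeled with Python's floor division `//`, matching the 0-9 / 10-19 grouping in the description) and that `overrideNullsFirst=False` means NULL inputs sort last. Each bucket gets fresh state, mirroring a new UDTF instance per partition.

```python
from itertools import groupby
from typing import Iterable, Iterator, Optional, Tuple

# One input row: (partition_col, input). `input` may be None (SQL NULL).
InputRow = Tuple[int, Optional[int]]
# One output row: (partition_col, count, total, last).
OutputRow = Tuple[Optional[int], int, int, Optional[int]]


def bucket(partition_col: int) -> int:
    # Assumption: the partitioning expression groups values by tens.
    return partition_col // 10


def sort_key(row: InputRow) -> Tuple[bool, int]:
    # Ascending by `input`; None sorts last (assumed meaning of
    # overrideNullsFirst=False).
    value = row[1]
    return (value is None, 0 if value is None else value)


def simulate(rows: Iterable[InputRow]) -> Iterator[OutputRow]:
    # Bring rows of the same bucket together, as the partitioning would.
    by_bucket = sorted(rows, key=lambda r: bucket(r[0]))
    for _, group in groupby(by_bucket, key=lambda r: bucket(r[0])):
        # Fresh state per partition, like __init__ in the UDTF above.
        partition_col, count, total, last = None, 0, 0, None
        # Same bookkeeping as eval(), applied in the requested order.
        for col, value in sorted(group, key=sort_key):
            partition_col = col
            count += 1
            last = value
            if value is not None:
                total += value
        # Same output as terminate(): one row per partition.
        yield partition_col, count, total, last


if __name__ == "__main__":
    rows = [(3, 5), (12, 1), (7, None), (15, 4), (1, 2)]
    for out in simulate(rows):
        print(out)
```

Note that, as in the UDTF from the description, the emitted `partition_col` is the value from the last row consumed in each partition, not the bucket number.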