dtenedor opened a new pull request, #43946:
URL: https://github.com/apache/spark/pull/43946

   ### What changes were proposed in this pull request?
   
   This PR updates the Python user-defined table function (UDTF) API for the 
`analyze` method to support general expressions for the `partitionBy` and 
`orderBy` fields of the `AnalyzeResult` class.
   
   For example, the following UDTF partitions by the expression 
`partition_col / 10`, so that all rows with values of this column from 0 to 9 
arrive in the same partition, all rows with values from 10 to 19 arrive in the 
next partition, and so on. Within each partition, rows are ordered by `input` 
in ascending order, with NULL values last.
   
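   ```
   from pyspark.sql.functions import udtf
   from pyspark.sql.types import IntegerType, Row, StructType
   from pyspark.sql.udtf import AnalyzeResult
   # PartitioningExpression and OrderingExpression are the classes used by
   # this PR; their import path is not shown in the original description.

   @udtf
   class TestUDTF:
       def __init__(self):
           self._partition_col = None
           self._count = 0
           self._sum = 0
           self._last = None

       @staticmethod
       def analyze(*args, **kwargs):
           return AnalyzeResult(
               schema=StructType()
               .add("partition_col", IntegerType())
               .add("count", IntegerType())
               .add("total", IntegerType())
               .add("last", IntegerType()),
               # Partition by a general expression, not just a column name.
               partitionBy=[PartitioningExpression("partition_col / 10")],
               orderBy=[
                   OrderingExpression(
                       value="input", ascending=True, overrideNullsFirst=False
                   )
               ],
           )

       def eval(self, row: Row):
           # Called once per input row within the current partition.
           self._partition_col = row["partition_col"]
           self._count += 1
           self._last = row["input"]
           if row["input"] is not None:
               self._sum += row["input"]

       def terminate(self):
           # Called once after the last row of each partition.
           yield self._partition_col, self._count, self._sum, self._last
   ```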
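
To make the intended behavior concrete, here is a minimal plain-Python sketch (no Spark required) of how rows would reach the UDTF under the partitioning and ordering described above. It is an illustration under stated assumptions, not the engine's implementation: `TestUDTFSketch`, `run_partitioned`, and the sample rows are hypothetical, and the buckets are computed with Python integer division (`//`) to mirror the 0-9, 10-19 grouping in the description. Note that in Spark SQL the `/` operator performs floating-point division while `div` performs integer division, so the expression string needed to obtain exactly these buckets may differ.

```python
from itertools import groupby


class TestUDTFSketch:
    """Plain-Python stand-in for the UDTF above (hypothetical, no Spark)."""

    def __init__(self):
        self._partition_col = None
        self._count = 0
        self._sum = 0
        self._last = None

    def eval(self, row):
        self._partition_col = row["partition_col"]
        self._count += 1
        self._last = row["input"]
        if row["input"] is not None:
            self._sum += row["input"]

    def terminate(self):
        yield self._partition_col, self._count, self._sum, self._last


def run_partitioned(rows):
    """Hypothetical driver: bucket rows, order each bucket, run one UDTF per bucket."""

    def bucket(row):
        # Assumption: integer division, so 0-9 -> 0, 10-19 -> 1, and so on.
        return row["partition_col"] // 10

    def order_key(row):
        # Ascending by "input" with None (NULL) values sorted last.
        value = row["input"]
        return (value is None, 0 if value is None else value)

    results = []
    for _, group in groupby(sorted(rows, key=bucket), key=bucket):
        instance = TestUDTFSketch()  # a fresh instance per partition
        for row in sorted(group, key=order_key):
            instance.eval(row)
        results.extend(instance.terminate())
    return results


rows = [
    {"partition_col": 3, "input": 5},
    {"partition_col": 12, "input": 7},
    {"partition_col": 8, "input": None},
    {"partition_col": 17, "input": 2},
    {"partition_col": 1, "input": 1},
]
results = run_partitioned(rows)
print(results)  # [(8, 3, 6, None), (12, 2, 9, 7)]
```

In this sketch the first output row reports `partition_col` as 8 rather than the bucket number, because `eval` stores the value from the last row it saw; the NULL-valued row arrives last in its partition, so `last` is `None` there.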
   
   ### Why are the changes needed?
   
   This lets the UDTF partition by simple references to the columns of the 
input table just like before, but also gives the option to partition by general 
expressions based on those columns (just like the explicit `PARTITION BY` and 
`ORDER BY` clauses in the UDTF call in SQL).
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, see above.
   
   ### How was this patch tested?
   
   This PR includes test coverage.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

