Hi devs,

I am looking for a PySpark dev who is interested in a 10x to 100x speed-up of df.groupby().applyInPandas() for small groups.
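
For context, here is a minimal sketch of the kind of workload in question: many tiny groups, each handed to a pandas function. The function names, column names and group sizes below are made up for illustration and are not taken from the PoC.

```python
import pandas as pd


def subtract_group_mean(pdf: pd.DataFrame) -> pd.DataFrame:
    # Invoked once per group; with very small groups the per-call
    # overhead, not the pandas work itself, tends to dominate.
    return pdf.assign(v=pdf["v"] - pdf["v"].mean())


def run_on_spark(num_groups: int = 100_000, group_size: int = 2):
    # Hypothetical driver: imported lazily so the pandas-only function
    # above can be tried without a Spark installation.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").getOrCreate()
    # Build num_groups groups of group_size rows each.
    df = spark.range(num_groups * group_size).selectExpr(
        f"id div {group_size} AS id", "CAST(id AS DOUBLE) AS v"
    )
    return df.groupby("id").applyInPandas(
        subtract_group_mean, schema="id long, v double"
    )
```

Calling run_on_spark().count() with a small group_size is the sort of case the speed-up is aimed at.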

A PoC and benchmark can be found at https://github.com/apache/spark/pull/37360#issuecomment-1228293766.

I suppose the same approach could be taken to improve the performance of vectorized UDFs (for small groups): https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.pandas_udf.html

Happy to turn this into a proper pull request if someone volunteers to review it.

Cheers,
Enrico

