Hi devs,
I am looking for a PySpark developer who is interested in a 10x to 100x
speed-up of df.groupby().applyInPandas() for small groups.
A PoC and benchmark can be found at
https://github.com/apache/spark/pull/37360#issuecomment-1228293766.
I suppose the same approach could also improve the performance of
vectorized UDFs (for small groups):
https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.pandas_udf.html
Happy to turn this into a proper pull request if someone volunteers to
review it.
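
For anyone unfamiliar with the workload in question, here is a minimal sketch of the small-group case. It is not the PoC itself, only an illustration of the call pattern. The names subtract_group_mean, make_small_groups and run_with_spark, the column names, and the group size of 2 are all made up for this example; only groupby().applyInPandas() is the actual API under discussion.

```python
import pandas as pd


def subtract_group_mean(pdf: pd.DataFrame) -> pd.DataFrame:
    """Per-group function (hypothetical): center column `v` on the group's mean."""
    return pdf.assign(v=pdf["v"] - pdf["v"].mean())


def make_small_groups(num_groups: int, group_size: int = 2) -> pd.DataFrame:
    """Build a toy dataset (hypothetical) with many groups of `group_size` rows."""
    ids = [g for g in range(num_groups) for _ in range(group_size)]
    values = [float(i) for i in range(num_groups * group_size)]
    return pd.DataFrame({"id": ids, "v": values})


def run_with_spark(num_groups: int = 100_000):
    """Run the same function through applyInPandas (needs a local Spark install)."""
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").getOrCreate()
    df = spark.createDataFrame(make_small_groups(num_groups))
    # The Python function is called once per group, so many tiny groups
    # translate into many calls that each process only a few rows.
    return df.groupby("id").applyInPandas(
        subtract_group_mean, schema="id long, v double"
    )


if __name__ == "__main__":
    run_with_spark().show(5)
```

The per-group function is plain pandas, so it can be checked without Spark, for example with subtract_group_mean(make_small_groups(1)).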
Cheers,
Enrico