10x to 100x faster df.groupby().applyInPandas()

Enrico Minack Fri, 01 Dec 2023 07:24:24 -0800

Hi devs,

I am looking for a PySpark dev who is interested in a 10x to 100x speed-up of df.groupby().applyInPandas() for small groups.

A PoC and benchmark can be found at https://github.com/apache/spark/pull/37360#issuecomment-1228293766.

I suppose the same approach could be taken to improve the performance of vectorized UDFs (for small groups): https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.pandas_udf.html

Happy to turn this into a proper pull request if someone volunteers to review this.


Cheers,
Enrico

