Hi all,

I was wondering whether there are any best practices for using pandas UDFs. Profiling a UDF is not an easy task, and our case requires some drilling down into the logic of the function.
Our use case: we use func(DataFrame) => DataFrame as the interface for our pandas UDF. When we run the function locally on its own, it is fast, but when it executes in the Spark environment the processing time is much higher than expected. One column holds large values (BinaryType, ~600 KB each); could this be slowing down the Arrow serialization? Is there a recommended way to profile or debug the cost incurred by a pandas UDF?

Thanks,
Subash
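P.S. For concreteness, here is a minimal, self-contained sketch of the kind of function we pass to the pandas UDF, and how we profile it locally with cProfile before shipping it to Spark. All names (func, payload, the batch contents) are illustrative, not our actual code, and in Spark the function would be wrapped, e.g. via mapInPandas:

```python
import cProfile
import io
import pstats

import pandas as pd

def func(pdf: pd.DataFrame) -> pd.DataFrame:
    # Placeholder for our real logic; in production each value in the
    # binary column is roughly 600 KB.
    pdf = pdf.copy()
    pdf["size"] = pdf["payload"].map(len)
    return pdf

# Build a local batch that mimics the large-binary-column shape.
batch = pd.DataFrame({"payload": [b"\x00" * 600_000 for _ in range(10)]})

# Profile only the function body, outside of Spark.
profiler = cProfile.Profile()
profiler.enable()
out = func(batch)
profiler.disable()

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
print(out["size"].iloc[0])
```

Locally this runs quickly, which is why we suspect the extra time in the cluster is spent outside the function body (Arrow serialization, batching, or shuffle) rather than inside it.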