Hi all,

I was wondering whether there are any best practices for using pandas UDFs. Profiling a UDF is not an easy task, and our case requires some drilling down into the logic of the function.
Our use case: we use func(DataFrame) => DataFrame as the interface for our pandas UDF. When we run the function locally on its own, it is fast, but when it executes in the Spark environment the processing time is much higher than expected. One column holds large values (BinaryType, ~600 KB each); could this be slowing down the Arrow serialization? Is there a recommended way to profile or debug the cost incurred by a pandas UDF?

Thanks,
Subash
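P.S. For concreteness, here is a minimal, self-contained sketch of the kind of function we pass to the pandas UDF, and how we profile it locally with cProfile before shipping it to Spark. All names (func, payload, the batch contents) are illustrative, not our actual code, and in Spark the function would be wrapped, e.g. via mapInPandas:

```python
import cProfile
import io
import pstats

import pandas as pd

def func(pdf: pd.DataFrame) -> pd.DataFrame:
    # Placeholder for our real logic; in production each value in the
    # binary column is roughly 600 KB.
    pdf = pdf.copy()
    pdf["size"] = pdf["payload"].map(len)
    return pdf

# Build a local batch that mimics the large-binary-column shape.
batch = pd.DataFrame({"payload": [b"\x00" * 600_000 for _ in range(10)]})

# Profile only the function body, outside of Spark.
profiler = cProfile.Profile()
profiler.enable()
out = func(batch)
profiler.disable()

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
print(out["size"].iloc[0])
```

Locally this runs quickly, which is why we suspect the extra time in the cluster is spent outside the function body (Arrow serialization, batching, or shuffle) rather than inside it.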