It's important to realize that while pandas UDFs and pandas on Spark both
involve pandas, they are otherwise unrelated features. The first lets you run
your own pandas code inside a Spark job; the second gives you the pandas API
on top of Spark.
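
A minimal sketch of the difference (assuming Spark 3.2+, where koalas ships
as pyspark.pandas, and an existing SparkSession named spark):

    # pandas UDF: your own pandas code runs inside a Spark job, batch by batch.
    import pandas as pd
    from pyspark.sql.functions import pandas_udf

    @pandas_udf("double")
    def double_it(s: pd.Series) -> pd.Series:
        return s * 2.0

    df = spark.createDataFrame([(1.0,), (2.0,)], ["value"])
    df.select(double_it("value")).show()

    # pandas on Spark (formerly koalas): the pandas API itself, executed by Spark.
    import pyspark.pandas as ps

    psdf = ps.DataFrame({"value": [1.0, 2.0]})
    psdf["value"] = psdf["value"] * 2.0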

Hard to say with this info, but you want to look at whether you are doing
something expensive in each UDF call and, if so, consider amortizing it
across batches with the scalar iterator UDF pattern. Maybe.
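
A minimal sketch of that pattern (load_model() is a hypothetical stand-in for
whatever expensive per-call setup you have): the iterator form pays the setup
cost once per task and reuses it across all the Arrow batches that task sees.

    from typing import Iterator

    import pandas as pd
    from pyspark.sql.functions import pandas_udf

    @pandas_udf("double")
    def score(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
        model = load_model()  # hypothetical expensive setup, done once per task
        for batch in batches:  # reused for every batch the task receives
            yield pd.Series(model.predict(batch))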

A pandas UDF is not Spark code itself, so no, there is no tool in Spark to
profile it. On the other hand, any approach to profiling pandas or Python
code would work here.
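
For example, a minimal sketch with cProfile inside the UDF (my_transform() is
a stand-in for your real per-batch logic; the stats are printed to stderr,
which ends up in the executor logs):

    import cProfile
    import io
    import pstats
    import sys

    import pandas as pd
    from pyspark.sql.functions import pandas_udf

    @pandas_udf("double")
    def profiled(s: pd.Series) -> pd.Series:
        profiler = cProfile.Profile()
        profiler.enable()
        result = my_transform(s)  # stand-in for the real per-batch logic
        profiler.disable()
        buf = io.StringIO()
        pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(10)
        print(buf.getvalue(), file=sys.stderr)  # visible in the executor logs
        return result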

On Thu, Aug 25, 2022, 11:22 AM Gourav Sengupta <gourav.sengu...@gmail.com>
wrote:

> Hi,
>
> Maybe I am jumping to conclusions and making stupid guesses, but have you
> tried koalas, now that it is natively integrated with PySpark?
>
> Regards
> Gourav
>
> On Thu, 25 Aug 2022, 11:07 Subash Prabanantham, <subashpraba...@gmail.com>
> wrote:
>
>> Hi All,
>>
>> I was wondering if we have any best practices for using pandas UDFs?
>> Profiling a UDF is not an easy task, and our case requires some drilling
>> down into the logic of the function.
>>
>>
>> Our use case:
>> We are using func(Dataframe) => Dataframe as the interface for our pandas
>> UDF. Running just the function locally, it is fast, but when executed in
>> the Spark environment the processing time is longer than expected. We have
>> one column whose values are large (BinaryType -> 600KB), and we wonder
>> whether this could make the Arrow computation slower?
>>
>> Is there any profiling tool, or a good way to debug the cost incurred by
>> a pandas UDF?
>>
>>
>> Thanks,
>> Subash
>>
>>
