The Python profiler is pretty cool!
I'll try it out to see what could be taking time within the UDF.
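
From the docs, enabling the profiler looks roughly like this (untested on my
side, with a toy UDF standing in for mine):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf
    import pandas as pd

    # spark.python.profile has to be set before the SparkContext is created
    spark = (SparkSession.builder
             .config("spark.python.profile", "true")
             .getOrCreate())

    @pandas_udf("long")
    def plus_one(s: pd.Series) -> pd.Series:
        return s + 1

    spark.range(10).select(plus_one("id")).collect()

    # print the accumulated cProfile stats for each profiled UDF
    spark.sparkContext.show_profiles()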

I'm also wondering if there is some lightweight profiling (something that
does not slow down my processing) that would give me:

 - how much time the UDF took (like how much time was spent inside the UDF)
 - how many times the UDF was called

I can see the overall time a stage took in the Spark UI; it would be great if
I could see the time a UDF takes too.
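
In the meantime, the lightest-weight thing I can think of is a pair of
accumulators bumped inside the UDF, roughly like this (a rough sketch: it
counts pandas batches rather than rows, assumes the per-batch cost of
time.perf_counter() is negligible, and accumulator updates can be
double-counted if tasks are retried):

    import time
    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf

    spark = SparkSession.builder.getOrCreate()
    call_count = spark.sparkContext.accumulator(0)    # number of UDF batch calls
    total_secs = spark.sparkContext.accumulator(0.0)  # wall time spent inside the UDF

    @pandas_udf("long")
    def my_udf(s: pd.Series) -> pd.Series:
        start = time.perf_counter()
        result = s + 1                                # stand-in for the real logic
        call_count.add(1)
        total_secs.add(time.perf_counter() - start)
        return result

    df = spark.range(1_000_000)
    df.select(my_udf("id")).write.format("noop").mode("overwrite").save()
    print(call_count.value, total_secs.value)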

On Fri, 26 Aug 2022, 00:25 Subash Prabanantham, <subashpraba...@gmail.com>
wrote:

> Wow, lots of good suggestions. I didn’t know about the profiler either.
> Great suggestion @Takuya.
>
>
> Thanks,
> Subash
>
> On Thu, 25 Aug 2022 at 19:30, Russell Jurney <russell.jur...@gmail.com>
> wrote:
>
>> YOU know what you're talking about and aren't hacking a solution. You are
>> my new friend :) Thank you, this is incredibly helpful!
>>
>>
>> Thanks,
>> Russell Jurney @rjurney <http://twitter.com/rjurney>
>> russell.jur...@gmail.com LI <http://linkedin.com/in/russelljurney> FB
>> <http://facebook.com/jurney> datasyndrome.com
>>
>>
>> On Thu, Aug 25, 2022 at 10:52 AM Takuya UESHIN <ues...@happy-camper.st>
>> wrote:
>>
>>> Hi Subash,
>>>
>>> Have you tried the Python/Pandas UDF Profiler introduced in Spark 3.3?
>>> -
>>> https://spark.apache.org/docs/latest/api/python/development/debugging.html#python-pandas-udf
>>>
>>> Hope it can help you.
>>>
>>> Thanks.
>>>
>>> On Thu, Aug 25, 2022 at 10:18 AM Russell Jurney <
>>> russell.jur...@gmail.com> wrote:
>>>
>>>> Subash, I’m here to help :)
>>>>
>>>> I started a test script to demonstrate a solution last night but got a
>>>> cold and haven’t finished it. Give me another day and I’ll get it to you.
>>>> My suggestion is that you run PySpark locally in pytest, with a fixture to
>>>> create and yield your SparkContext and SparkSession, and then write tests
>>>> that load some test data, perform a count operation and checkpoint to
>>>> ensure the data is loaded, start a timer, run your UDF on the DataFrame,
>>>> checkpoint again or write some output to disk to make sure it finishes, and
>>>> then stop the timer and compute how long it took. I’ll show you some code;
>>>> I have to do this for Graphlet AI’s RTL utils and other tools to figure out
>>>> how much overhead there is using Pandera and Spark together to validate
>>>> data: https://github.com/Graphlet-AI/graphlet
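>>>>
>>>> Roughly the shape I have in mind (an untested sketch; the data and the
>>>> UDF below are trivial stand-ins for the real ones):
>>>>
>>>>     import time
>>>>     import pandas as pd
>>>>     import pytest
>>>>     from pyspark.sql import SparkSession
>>>>     from pyspark.sql.functions import pandas_udf
>>>>
>>>>     @pytest.fixture(scope="session")
>>>>     def spark():
>>>>         session = (SparkSession.builder
>>>>                    .master("local[*]")
>>>>                    .appName("udf-timing-test")
>>>>                    .getOrCreate())
>>>>         yield session
>>>>         session.stop()
>>>>
>>>>     def test_udf_timing(spark):
>>>>         @pandas_udf("long")
>>>>         def toy_udf(s: pd.Series) -> pd.Series:
>>>>             return s * 2                 # stand-in for the real UDF
>>>>
>>>>         df = spark.range(1_000_000)      # stand-in for loading test data
>>>>         assert df.count() > 0            # force the load before timing
>>>>         start = time.perf_counter()
>>>>         df.select(toy_udf("id")).write.format("noop").mode("overwrite").save()
>>>>         elapsed = time.perf_counter() - start
>>>>         print(f"UDF pass took {elapsed:.2f}s")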
>>>>
>>>> I’ll respond by tomorrow evening with code in a gist! We’ll see if it
>>>> gets consistent, measurable, and valid results! :)
>>>>
>>>> Russell Jurney
>>>>
>>>> On Thu, Aug 25, 2022 at 10:00 AM Sean Owen <sro...@gmail.com> wrote:
>>>>
>>>>> It's important to realize that, while pandas UDFs and pandas on Spark
>>>>> are both related to pandas, they are not directly related to each other.
>>>>> The first lets you use pandas within Spark; the second lets you use
>>>>> pandas on Spark.
>>>>>
>>>>> Hard to say with this info but you want to look at whether you are
>>>>> doing something expensive in each UDF call and consider amortizing it with
>>>>> the scalar iterator UDF pattern. Maybe.
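>>>>>
>>>>> For reference, the iterator variant looks roughly like this (a sketch;
>>>>> the "expensive setup" stands for whatever you currently pay on every
>>>>> call, e.g. loading a model or building a lookup table):
>>>>>
>>>>>     from typing import Iterator
>>>>>     import pandas as pd
>>>>>     from pyspark.sql.functions import pandas_udf
>>>>>
>>>>>     @pandas_udf("long")
>>>>>     def plus_offset(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
>>>>>         offset = 10                  # expensive setup: runs once per task
>>>>>         for s in batches:
>>>>>             yield s + offset         # applied to each Arrow batch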
>>>>>
>>>>> A pandas UDF is not Spark code itself, so no, there is no tool in Spark
>>>>> to profile it. Conversely, any approach to profiling pandas or Python
>>>>> code would work here.
>>>>>
>>>>> On Thu, Aug 25, 2022, 11:22 AM Gourav Sengupta <
>>>>> gourav.sengu...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Maybe I am jumping to conclusions and making stupid guesses, but have
>>>>>> you tried Koalas, now that it is natively integrated with PySpark?
>>>>>>
>>>>>> Regards
>>>>>> Gourav
>>>>>>
>>>>>> On Thu, 25 Aug 2022, 11:07 Subash Prabanantham, <
>>>>>> subashpraba...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi All,
>>>>>>>
>>>>>>> I was wondering if we have any best practices on using pandas UDFs?
>>>>>>> Profiling a UDF is not an easy task, and our case requires some
>>>>>>> drilling down into the logic of the function.
>>>>>>>
>>>>>>>
>>>>>>> Our use case:
>>>>>>> We are using func(DataFrame) => DataFrame as the interface for the
>>>>>>> pandas UDF. When we run just the function locally it is fast, but when
>>>>>>> it is executed in the Spark environment the processing time is much
>>>>>>> higher than expected. We have one column whose values are large
>>>>>>> (BinaryType -> 600KB), and we are wondering whether this could make
>>>>>>> the Arrow computation slower?
>>>>>>>
>>>>>>> Is there any profiling approach, or a good way to debug the cost
>>>>>>> incurred when using a pandas UDF?
>>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Subash
>>>>>>>
>>>>>>> --
>>>>
>>>> Thanks,
>>>> Russell Jurney @rjurney <http://twitter.com/rjurney>
>>>> russell.jur...@gmail.com LI <http://linkedin.com/in/russelljurney> FB
>>>> <http://facebook.com/jurney> datasyndrome.com
>>>>
>>>
>>>
>>> --
>>> Takuya UESHIN
>>>
>>>
