Re: Profiling PySpark Pandas UDF

2022-08-25 Thread Abdeali Kothari
The Python profiler is pretty cool! I'll try it out to see what could be taking time within the UDF. I'm wondering if there is also some lightweight profiling (which does not slow down my processing) for me to get: - how much time the UDF took (like how much time was spent inside the
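
For the "how much time the UDF took" question, a lightweight option is a wall-clock timing wrapper around the plain Python function. The sketch below is only an illustration, not something proposed in the thread: `timed` and `stats` are hypothetical names, and it only measures calls made in the same Python process (for example, when the function is called directly on sample data). Inside Spark the UDF runs in separate Python worker processes, so the numbers would have to be logged or reported from there.

```python
import functools
import time


def timed(stats):
    """Return a decorator that accumulates call count and wall-clock time.

    `stats` is a plain dict owned by the caller (hypothetical helper,
    not part of PySpark).
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Record the call even if the wrapped function raises.
                stats["calls"] = stats.get("calls", 0) + 1
                stats["seconds"] = (
                    stats.get("seconds", 0.0) + time.perf_counter() - start
                )
        return wrapper
    return decorator


udf_stats = {}


@timed(udf_stats)
def transform(values):
    # Stand-in for the body of a UDF; doubles each value.
    return [v * 2 for v in values]


transform([1, 2, 3])
print(udf_stats["calls"], udf_stats["seconds"])
```

The overhead is two `time.perf_counter()` calls per invocation, which is small compared with a full profiler, at the cost of giving only totals rather than a per-line breakdown.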

Re: Profiling PySpark Pandas UDF

2022-08-25 Thread Subash Prabanantham
Wow, lots of good suggestions. I didn’t know about the profiler either. Great suggestion @Takuya. Thanks, Subash On Thu, 25 Aug 2022 at 19:30, Russell Jurney wrote: > YOU know what you're talking about and aren't hacking a solution. You are > my new friend :) Thank you, this is incredibly

Re: Profiling PySpark Pandas UDF

2022-08-25 Thread Russell Jurney
YOU know what you're talking about and aren't hacking a solution. You are my new friend :) Thank you, this is incredibly helpful! Thanks, Russell Jurney @rjurney russell.jur...@gmail.com

Re: Profiling PySpark Pandas UDF

2022-08-25 Thread Sean Owen
Oh whoa, I didn't realize we had this! I stand corrected. On Thu, Aug 25, 2022, 12:52 PM Takuya UESHIN wrote: > Hi Subash, > > Have you tried the Python/Pandas UDF Profiler introduced in Spark 3.3? > - > https://spark.apache.org/docs/latest/api/python/development/debugging.html#python-pandas-udf

Re: Profiling PySpark Pandas UDF

2022-08-25 Thread Takuya UESHIN
Hi Subash, Have you tried the Python/Pandas UDF Profiler introduced in Spark 3.3? - https://spark.apache.org/docs/latest/api/python/development/debugging.html#python-pandas-udf Hope it can help you. Thanks. On Thu, Aug 25, 2022 at 10:18 AM Russell Jurney wrote: > Subash, I’m here to help :)
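
A minimal sketch of trying the profiler mentioned above, assuming Spark 3.3 or later with PySpark and pandas installed. It relies on the `spark.python.profile` configuration and `SparkContext.show_profiles()` as described on the linked debugging page; the helper names `profiler_conf` and `run_profiled_udf`, the `add_one` UDF, and the `local[1]` master are hypothetical choices made for illustration.

```python
def profiler_conf():
    # Setting that turns on the Python/Pandas UDF profiler.
    return {"spark.python.profile": "true"}


def run_profiled_udf():
    # Imports are local so the module loads even without PySpark installed.
    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf

    builder = SparkSession.builder.master("local[1]")
    for key, value in profiler_conf().items():
        builder = builder.config(key, value)
    # The setting must be in place when the SparkContext is created;
    # an already running context will not pick it up.
    spark = builder.getOrCreate()

    @pandas_udf("long")
    def add_one(s: pd.Series) -> pd.Series:
        return s + 1

    # Run the UDF once so there is something to profile.
    spark.range(10).select(add_one("id")).collect()

    # Print the accumulated profile for each UDF.
    spark.sparkContext.show_profiles()
    spark.stop()


if __name__ == "__main__":
    run_profiled_udf()
```

The same setting can also be passed on the command line, for example `pyspark --conf spark.python.profile=true`.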

Re: Profiling PySpark Pandas UDF

2022-08-25 Thread Russell Jurney
Subash, I’m here to help :) I started a test script to demonstrate a solution last night but got a cold and haven’t finished it. Give me another day and I’ll get it to you. My suggestion is that you run PySpark locally in pytest with a fixture to generate and yield your SparkContext and

Re: Profiling PySpark Pandas UDF

2022-08-25 Thread Sean Owen
It's important to realize that while pandas UDFs and pandas on Spark are both related to pandas, they are not themselves directly related. The first lets you use pandas within Spark, the second lets you use pandas on Spark. Hard to say with this info but you want to look at whether you are doing

Re: Profiling PySpark Pandas UDF

2022-08-25 Thread Andrew Melo
Hi Gourav, Since Koalas needs the same round-trip to/from JVM and Python, I expect that the performance should be nearly the same for UDFs in either API. Cheers, Andrew On Thu, Aug 25, 2022 at 11:22 AM Gourav Sengupta wrote: > > Hi, > > Maybe I am jumping to conclusions and making stupid

Re: Profiling PySpark Pandas UDF

2022-08-25 Thread Gourav Sengupta
Hi, Maybe I am jumping to conclusions and making stupid guesses, but have you tried Koalas now that it is natively integrated with PySpark? Regards, Gourav On Thu, 25 Aug 2022, 11:07 Subash Prabanantham, wrote: > Hi All, > > I was wondering if we have any best practices on using pandas UDF ?

Profiling PySpark Pandas UDF

2022-08-25 Thread Subash Prabanantham
Hi All, I was wondering if we have any best practices on using pandas UDFs? Profiling a UDF is not an easy task, and our case requires some drilling down into the logic of the function. Our use case: We are using func(DataFrame) => DataFrame as the interface to use Pandas UDF, while running locally only