Re: Profiling PySpark Pandas UDF

2022-08-29 Thread Gourav Sengupta
Hi,

I do send back those metrics as columns in the pandas dataframes in case
they are required, but ultimately we need to be able to measure the time
spent on Java object conversion along with the UDF calls, as well as actual
Python memory usage and other details, all of which we can get by tweaking
the UDF.

But I am 100 percent sure that your work will be useful and needed.

Regards,
Gourav
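
As an illustration of the approach described above - sending metrics back as
columns in the pandas dataframes - here is a minimal sketch using a
grouped-map pandas UDF (the sample data, metric column names, and the
doubling transform are all made up for the example):

    import time
    import pandas as pd
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, 2.0), (1, 3.0), (2, 4.0)], ["key", "value"])

    def func(pdf: pd.DataFrame) -> pd.DataFrame:
        start = time.perf_counter()
        pdf["value"] = pdf["value"] * 2.0  # stand-in for the real per-group logic
        # Metrics travel back to the driver as ordinary output columns.
        pdf["udf_seconds"] = time.perf_counter() - start
        pdf["udf_rows"] = len(pdf)
        return pdf

    out = df.groupBy("key").applyInPandas(
        func, schema="key long, value double, udf_seconds double, udf_rows long")
    out.show()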


RE: Profiling PySpark Pandas UDF

2022-08-29 Thread Luca Canali
Hi Abdeali,

Thanks for the support. Indeed, you can go ahead and test and review my latest
PR for SPARK-34265 (Instrument Python UDF execution using SQL Metrics) if you
want to: https://github.com/apache/spark/pull/33559

Currently I have reduced the scope of the instrumentation to just three simple
metrics: "data sent to Python workers", "data returned from Python workers",
and "number of output rows". In a previous attempt I had also instrumented the
time for UDF execution, although there are some subtle points there, and I may
need to go back to testing that at a later stage.

It definitely would be good to know if people using PySpark and Python UDFs
find this proposed improvement useful. I see the proposed additional
instrumentation as complementary to the Python/Pandas UDF Profiler introduced
in Spark 3.3.

Best,

Luca

Re: Profiling PySpark Pandas UDF

2022-08-26 Thread Abdeali Kothari
Hi Luca, I see you pushed some code to the PR 3 hrs ago.
That's awesome. If I can help out in any way, do let me know.
I think that's an amazing feature, and it would be great if it can get into
Spark.


RE: Profiling PySpark Pandas UDF

2022-08-26 Thread Luca Canali
@Abdeali, as for "lightweight profiling": there is some work in progress on
instrumenting Python UDFs with Spark metrics, see
https://issues.apache.org/jira/browse/SPARK-34265

However, it is a bit stuck at the moment and needs to be revived, I believe.

Best,

Luca

Re: Profiling PySpark Pandas UDF

2022-08-25 Thread Abdeali Kothari
The Python profiler is pretty cool!
I'll try it out to see what could be taking time within the UDF.

I'm wondering if there is also some lightweight profiling (which does not
slow down my processing) for me to get:

 - how much time the UDF took (i.e. how much time was spent inside the UDF)
 - how many times the UDF was called

I can see the overall time a stage took in the Spark UI - it would be cool if
I could find the time a UDF takes too.
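
In the meantime, one lightweight option is to collect those two numbers
yourself with accumulators - a sketch, not a built-in Spark feature, with
times_two as a made-up stand-in for the real UDF. Note that a pandas UDF is
invoked once per Arrow batch, so the counter counts batches rather than rows,
and task retries can inflate both values:

    import time
    import pandas as pd
    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.functions import pandas_udf

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext
    udf_calls = sc.accumulator(0)      # number of batches the UDF processed
    udf_seconds = sc.accumulator(0.0)  # cumulative wall time spent inside the UDF

    @pandas_udf("double")
    def times_two(s: pd.Series) -> pd.Series:
        start = time.perf_counter()
        result = s * 2.0
        udf_seconds.add(time.perf_counter() - start)
        udf_calls.add(1)
        return result

    df = spark.range(1_000_000).withColumn("x", times_two("id"))
    df.agg(F.sum("x")).show()  # an action that actually evaluates the UDF column
    print(udf_calls.value, udf_seconds.value)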



Re: Profiling PySpark Pandas UDF

2022-08-25 Thread Subash Prabanantham
Wow, lots of good suggestions. I didn’t know about the profiler either.
Great suggestion @Takuya.


Thanks,
Subash



Re: Profiling PySpark Pandas UDF

2022-08-25 Thread Russell Jurney
YOU know what you're talking about and aren't hacking a solution. You are
my new friend :) Thank you, this is incredibly helpful!

Thanks,
Russell Jurney @rjurney 
russell.jur...@gmail.com LI  FB
 datasyndrome.com


On Thu, Aug 25, 2022 at 10:52 AM Takuya UESHIN 
wrote:

> Hi Subash,
>
> Have you tried the Python/Pandas UDF Profiler introduced in Spark 3.3?
> -
> https://spark.apache.org/docs/latest/api/python/development/debugging.html#python-pandas-udf
>
> Hope it can help you.
>
> Thanks.
>
> On Thu, Aug 25, 2022 at 10:18 AM Russell Jurney 
> wrote:
>
>> Subash, I’m here to help :)
>>
>> I started a test script to demonstrate a solution last night but got a
>> cold and haven’t finished it. Give me another day and I’ll get it to you.
>> My suggestion is that you run PySpark locally in pytest with a fixture to
>> generate and yield your SparckContext and SparkSession and the. Write tests
>> that load some test data, perform some count operation and checkpoint to
>> ensure that data is loaded, start a timer, run your UDF on the DataFrame,
>> checkpoint again or write some output to disk to make sure it finishes and
>> then stop the timer and compute how long it takes. I’ll show you some code,
>> I have to do this for Graphlet AI’s RTL utils and other tools to figure out
>> how much overhead there is using Pandera and Spark together to validate
>> data: https://github.com/Graphlet-AI/graphlet
>>
>> I’ll respond by tomorrow evening with code in a fist! We’ll see if it
>> gets consistent, measurable and valid results! :)
>>
>> Russell Jurney
>>
>> On Thu, Aug 25, 2022 at 10:00 AM Sean Owen  wrote:
>>
>>> It's important to realize that while pandas UDFs and pandas on Spark are
>>> both related to pandas, they are not themselves directly related. The first
>>> lets you use pandas within Spark, the second lets you use pandas on Spark.
>>>
>>> Hard to say with this info but you want to look at whether you are doing
>>> something expensive in each UDF call and consider amortizing it with the
>>> scalar iterator UDF pattern. Maybe.
>>>
>>> A pandas UDF is not spark code itself so no there is no tool in spark to
>>> profile it. Conversely any approach to profiling pandas or python would
>>> work here .
>>>
>>> On Thu, Aug 25, 2022, 11:22 AM Gourav Sengupta <
>>> gourav.sengu...@gmail.com> wrote:
>>>
 Hi,

 May be I am jumping to conclusions and making stupid guesses, but have
 you tried koalas now that it is natively integrated with pyspark??

 Regards
 Gourav

 On Thu, 25 Aug 2022, 11:07 Subash Prabanantham, <
 subashpraba...@gmail.com> wrote:

> Hi All,
>
> I was wondering if we have any best practices on using pandas UDF ?
> Profiling UDF is not an easy task and our case requires some drilling down
> on the logic of the function.
>
>
> Our use case:
> We are using func(Dataframe) => Dataframe as interface to use Pandas
> UDF, while running locally only the function, it runs faster but when
> executed in Spark environment - the processing time is more than expected.
> We have one column where the value is large (BinaryType -> 600KB),
> wondering whether this could make the Arrow computation slower ?
>
> Is there any profiling or best way to debug the cost incurred using
> pandas UDF ?
>
>
> Thanks,
> Subash
>
> --
>>
>> Thanks,
>> Russell Jurney @rjurney 
>> russell.jur...@gmail.com LI  FB
>>  datasyndrome.com
>>
>
>
> --
> Takuya UESHIN
>
>


Re: Profiling PySpark Pandas UDF

2022-08-25 Thread Sean Owen
Oh whoa, I didn't realize we had this! I stand corrected.



Re: Profiling PySpark Pandas UDF

2022-08-25 Thread Takuya UESHIN
Hi Subash,

Have you tried the Python/Pandas UDF Profiler introduced in Spark 3.3?
-
https://spark.apache.org/docs/latest/api/python/development/debugging.html#python-pandas-udf

Hope it can help you.

Thanks.



-- 
Takuya UESHIN
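
For quick reference, enabling that profiler looks roughly like this, following
the debugging guide linked above (add1 is a toy UDF; the config must be set
before the SparkContext is created):

    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf

    spark = (SparkSession.builder
             .config("spark.python.profile", "true")
             .getOrCreate())

    @pandas_udf("long")
    def add1(s: pd.Series) -> pd.Series:
        return s + 1

    spark.range(10).select(add1("id")).collect()
    spark.sparkContext.show_profiles()  # prints cProfile-style stats per UDF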


Re: Profiling PySpark Pandas UDF

2022-08-25 Thread Russell Jurney
Subash, I’m here to help :)

I started a test script to demonstrate a solution last night but got a cold
and haven’t finished it. Give me another day and I’ll get it to you. My
suggestion is that you run PySpark locally in pytest with a fixture to
generate and yield your SparkContext and SparkSession, and then write tests
that load some test data, perform some count operation and checkpoint to
ensure that data is loaded, start a timer, run your UDF on the DataFrame,
checkpoint again or write some output to disk to make sure it finishes, and
then stop the timer and compute how long it takes. I’ll show you some code;
I have to do this for Graphlet AI’s RTL utils and other tools to figure out
how much overhead there is using Pandera and Spark together to validate
data: https://github.com/Graphlet-AI/graphlet

I’ll respond by tomorrow evening with code in a gist! We’ll see if it gets
consistent, measurable and valid results! :)

Russell Jurney
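
A minimal sketch of the shape of that test, ahead of the full script
(load_test_df and my_udf are hypothetical placeholders for your own data
loader and the pandas UDF under test):

    import time
    import pytest
    from pyspark.sql import SparkSession

    @pytest.fixture(scope="session")
    def spark():
        # Local SparkSession for the test run; torn down after the session.
        spark = (SparkSession.builder
                 .master("local[2]")
                 .appName("pandas-udf-timing")
                 .getOrCreate())
        yield spark
        spark.stop()

    def test_udf_wall_time(spark):
        df = load_test_df(spark).cache()  # hypothetical helper that loads test data
        df.count()  # force the load so the timer covers only the UDF stage

        start = time.perf_counter()
        out = df.withColumn("result", my_udf("value"))  # my_udf: hypothetical UDF
        out.write.mode("overwrite").parquet("/tmp/udf-timing")  # force execution
        elapsed = time.perf_counter() - start
        print(f"UDF stage took {elapsed:.2f}s")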


Thanks,
Russell Jurney @rjurney 
russell.jur...@gmail.com LI  FB
 datasyndrome.com


Re: Profiling PySpark Pandas UDF

2022-08-25 Thread Sean Owen
It's important to realize that while pandas UDFs and pandas on Spark are
both related to pandas, they are not themselves directly related. The first
lets you use pandas within Spark, the second lets you use pandas on Spark.

Hard to say with this info, but you want to look at whether you are doing
something expensive in each UDF call and consider amortizing it with the
scalar iterator UDF pattern. Maybe.

A pandas UDF is not Spark code itself, so no, there is no tool in Spark to
profile it. Conversely, any approach to profiling pandas or Python would
work here.
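
For reference, the scalar iterator pattern looks roughly like this: the UDF
receives an iterator of batches, so expensive setup runs once per task rather
than once per batch (load_model is a hypothetical expensive initializer):

    from typing import Iterator
    import pandas as pd
    from pyspark.sql.functions import pandas_udf

    @pandas_udf("double")
    def predict(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
        model = load_model()  # hypothetical setup, amortized over all batches
        for batch in batches:
            # Yield one output Series per input batch, same length and order.
            yield pd.Series(model.predict(batch))

    # Usage: df.withColumn("pred", predict("features"))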



Re: Profiling PySpark Pandas UDF

2022-08-25 Thread Andrew Melo
Hi Gourav,

Since Koalas needs the same round-trip to/from the JVM and Python, I
expect that the performance should be nearly the same for UDFs in
either API.

Cheers
Andrew




Re: Profiling PySpark Pandas UDF

2022-08-25 Thread Gourav Sengupta
Hi,

Maybe I am jumping to conclusions and making stupid guesses, but have you
tried Koalas now that it is natively integrated with PySpark?

Regards
Gourav

>


Profiling PySpark Pandas UDF

2022-08-25 Thread Subash Prabanantham
Hi All,

I was wondering if we have any best practices on using pandas UDFs?
Profiling a UDF is not an easy task, and our case requires some drilling down
on the logic of the function.


Our use case:
We are using func(DataFrame) => DataFrame as the interface for our pandas
UDFs. When running only the function locally, it runs fast, but when executed
in a Spark environment the processing time is higher than expected. We have
one column where the value is large (BinaryType -> 600KB), and we are
wondering whether this could make the Arrow computation slower.

Is there any profiling tool or best way to debug the cost incurred when using
pandas UDFs?


Thanks,
Subash
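
For context, a func(DataFrame) => DataFrame interface typically maps onto
Spark through mapInPandas, roughly as sketched below; the 600KB BinaryType
value is modeled by the payload column, and the size computation is a made-up
stand-in for the real logic. Each large binary value is serialized through
Arrow for every batch, so with values that big it may also be worth lowering
spark.sql.execution.arrow.maxRecordsPerBatch to keep batch sizes manageable.

    import pandas as pd
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(1, bytearray(b"\x00" * 600_000))], "id long, payload binary")

    def func(pdf: pd.DataFrame) -> pd.DataFrame:
        # Stand-in for the real logic; runs on one Arrow batch at a time.
        pdf["size_kb"] = pdf["payload"].map(len) / 1024.0
        return pdf

    out = df.mapInPandas(
        lambda batches: (func(pdf) for pdf in batches),
        schema="id long, payload binary, size_kb double")
    out.show()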