Re: Profiling PySpark Pandas UDF

2022-08-29 Thread Gourav Sengupta

RE: Profiling PySpark Pandas UDF

2022-08-29 Thread Luca Canali
On Friday, August 26, 2022 15:59, Abdeali Kothari wrote: > Hi Luca, I see you pushed some code to the PR 3 hrs ago. That's awesome. If I ca

Re: Profiling PySpark Pandas UDF

2022-08-26 Thread Abdeali Kothari

RE: Profiling PySpark Pandas UDF

2022-08-26 Thread Luca Canali
On Friday, August 26, 2022 06:36, Abdeali Kothari wrote: > The Python profiler is pretty cool! I'll try it out to see what could be taking time within the UDF

Re: Profiling PySpark Pandas UDF

2022-08-25 Thread Abdeali Kothari
The Python profiler is pretty cool! I'll try it out to see what could be taking time within the UDF. I'm wondering if there is also some lightweight profiling (which does not slow down my processing) for me to get: - how much time the UDF took (like how much time was spent inside the
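One lightweight way to get per-call timing without a full profiler is a small timing wrapper around the function body. The following is only a sketch in plain Python: the names `TIMINGS`, `timed`, and `add_one` are hypothetical and not from this thread, and `add_one` stands in for the body of a real UDF. It records how many times the function ran and the total wall-clock time spent inside it.

```python
import functools
import time
from collections import defaultdict

# Hypothetical in-process registry: function name -> [call count, total seconds].
TIMINGS = defaultdict(lambda: [0, 0.0])


def timed(func):
    """Wrap func so every call adds its wall-clock duration to TIMINGS."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            entry = TIMINGS[func.__name__]
            entry[0] += 1
            entry[1] += time.perf_counter() - start
    return wrapper


@timed
def add_one(values):
    # Stand-in for the body of a UDF; a real pandas UDF would receive pandas objects.
    return [v + 1 for v in values]


result = add_one([1, 2, 3])
calls, seconds = TIMINGS["add_one"]
print(f"add_one: {calls} call(s), {seconds:.6f}s total")
```

Note that on a cluster the wrapper runs inside the executors' Python worker processes, so a registry like this one is not visible from the driver. The numbers would have to be shipped back some other way, for example through logs or an accumulator.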

Re: Profiling PySpark Pandas UDF

2022-08-25 Thread Subash Prabanantham
Wow, lots of good suggestions. I didn’t know about the profiler either. Great suggestion @Takuya. Thanks, Subash On Thu, 25 Aug 2022 at 19:30, Russell Jurney wrote: > YOU know what you're talking about and aren't hacking a solution. You are > my new friend :) Thank you, this is incredibly

Re: Profiling PySpark Pandas UDF

2022-08-25 Thread Russell Jurney
YOU know what you're talking about and aren't hacking a solution. You are my new friend :) Thank you, this is incredibly helpful! Thanks, Russell Jurney @rjurney russell.jur...@gmail.com

Re: Profiling PySpark Pandas UDF

2022-08-25 Thread Sean Owen
Oh whoa I didn't realize we had this! I stand corrected. On Thu, Aug 25, 2022, 12:52 PM Takuya UESHIN wrote: > Hi Subash, > > Have you tried the Python/Pandas UDF Profiler introduced in Spark 3.3? > - > https://spark.apache.org/docs/latest/api/python/development/debugging.html#python-pandas-udf

Re: Profiling PySpark Pandas UDF

2022-08-25 Thread Takuya UESHIN
Hi Subash, Have you tried the Python/Pandas UDF Profiler introduced in Spark 3.3? - https://spark.apache.org/docs/latest/api/python/development/debugging.html#python-pandas-udf Hope it can help you. Thanks. On Thu, Aug 25, 2022 at 10:18 AM Russell Jurney wrote: > Subash, I’m here to help :)
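As far as I know, the profiler described on that page is enabled by setting the `spark.python.profile` configuration to `true`, and the collected results are printed with `sc.show_profiles()`. PySpark's default profiler is built on Python's standard `cProfile`, so the same kind of report can be produced locally on the plain function before it is wrapped in a UDF. The sketch below does only that, using the standard library; `add_one` is a hypothetical stand-in for the UDF body, and this is not the Spark profiler itself.

```python
import cProfile
import io
import pstats


def add_one(values):
    # Hypothetical stand-in for the logic that would live inside a pandas UDF.
    return [v + 1 for v in values]


profiler = cProfile.Profile()
# runcall profiles a single call and returns the function's own result.
result = profiler.runcall(add_one, list(range(1000)))

# Render the statistics into a string, sorted by cumulative time.
buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```

Running the function body this way on a sample of the data is often enough to see which lines dominate before going to a cluster.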

Re: Profiling PySpark Pandas UDF

2022-08-25 Thread Russell Jurney
Subash, I’m here to help :) I started a test script to demonstrate a solution last night but got a cold and haven’t finished it. Give me another day and I’ll get it to you. My suggestion is that you run PySpark locally in pytest with a fixture to generate and yield your SparkContext and

Re: Profiling PySpark Pandas UDF

2022-08-25 Thread Sean Owen
It's important to realize that while pandas UDFs and pandas on Spark are both related to pandas, they are not themselves directly related. The first lets you use pandas within Spark, the second lets you use pandas on Spark. Hard to say with this info but you want to look at whether you are doing

Re: Profiling PySpark Pandas UDF

2022-08-25 Thread Andrew Melo
Hi Gourav, Since Koalas needs the same round-trip to/from JVM and Python, I expect that the performance should be nearly the same for UDFs in either API. Cheers, Andrew On Thu, Aug 25, 2022 at 11:22 AM Gourav Sengupta wrote: > > Hi, > > Maybe I am jumping to conclusions and making stupid

Re: Profiling PySpark Pandas UDF

2022-08-25 Thread Gourav Sengupta
Hi, Maybe I am jumping to conclusions and making stupid guesses, but have you tried Koalas now that it is natively integrated with PySpark? Regards, Gourav On Thu, 25 Aug 2022, 11:07 Subash Prabanantham wrote: > Hi All, > > I was wondering if we have any best practices on using pandas UDF ?

Profiling PySpark Pandas UDF

2022-08-25 Thread Subash Prabanantham
Hi All, I was wondering if we have any best practices on using pandas UDFs? Profiling a UDF is not an easy task, and our case requires some drilling down into the logic of the function. Our use case: we are using func(Dataframe) => Dataframe as the interface to use the pandas UDF, while running locally only

Re: PySpark Pandas UDF

2019-11-17 Thread Gourav Sengupta
Hi, sorry, a completely unrelated question: when is the upcoming release of Spark 3.0? There are several parallel distributed deep learning frameworks being developed; do you think that we could use Spark 3.0 for distributed deep learning using PyTorch or TensorFlow? Is there any place

Re: PySpark Pandas UDF

2019-11-17 Thread Bryan Cutler
There was a change in the binary format of Arrow 0.15.1 and there is an environment variable you can set to make pyarrow 0.15.1 compatible with current Spark, which looks to be your problem. Please see the doc below for instructions added in SPARK-2936. Note, this will not be required for the
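To my knowledge, the environment variable in question is `ARROW_PRE_0_15_IPC_FORMAT=1`, which the Spark documentation suggests adding to `conf/spark-env.sh`. The sketch below shows one assumed alternative for setting it from Python: it sets the variable in the current process and builds the `spark.executorEnv.*` configuration entry that would pass it on to executors. The helper name `arrow_compat_settings` is hypothetical, and the snippet does not start Spark.

```python
import os

# Environment variable that asks pyarrow >= 0.15 to use the older Arrow IPC format.
ARROW_COMPAT_VAR = "ARROW_PRE_0_15_IPC_FORMAT"


def arrow_compat_settings():
    """Return Spark conf entries that forward the flag to executor processes.

    Hypothetical helper: spark.executorEnv.<NAME> is the standard way to set
    an environment variable on executors.
    """
    return {"spark.executorEnv." + ARROW_COMPAT_VAR: "1"}


# Set the flag for the driver process and any Python workers it spawns locally.
os.environ[ARROW_COMPAT_VAR] = "1"

settings = arrow_compat_settings()
for key, value in settings.items():
    print(f"--conf {key}={value}")
```

The printed `--conf` line could be passed to `spark-submit`, or the same key and value could be supplied when building the session. Which route works best depends on the cluster manager.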

Re: PySpark Pandas UDF

2019-11-12 Thread Holden Karau
Thanks for sharing that. I think we should maybe add some checks around this so it’s easier to debug. I’m CCing Bryan who might have some thoughts. On Tue, Nov 12, 2019 at 7:42 AM gal.benshlomo wrote: > SOLVED! > thanks for the help - I found the issue. it was the version of pyarrow > (0.15.1)

RE: PySpark Pandas UDF

2019-11-12 Thread gal.benshlomo
SOLVED! Thanks for the help - I found the issue. It was the version of pyarrow (0.15.1), which apparently isn't currently stable. Downgrading it solved the issue for me. -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

Re: PySpark Pandas UDF

2019-11-11 Thread gal.benshlomo
Hi, Thanks for your reply. I tried what you suggested and am still getting the same error. It is also worth mentioning that when I simply write the dataframe to S3, without applying the function, it works.

Re: PySpark Pandas UDF

2019-11-10 Thread Holden Karau
Can you switch the write for a count just so we can isolate if it’s the write or the count? Also what’s the output path you’re using? On Sun, Nov 10, 2019 at 7:31 AM Gal Benshlomo wrote: > > > Hi, > > > > I’m using pandas_udf and not able to run it from cluster mode, even though > the same code

RE: PySpark Pandas UDF

2019-11-10 Thread Gal Benshlomo
Hi, I'm using pandas_udf and not able to run it from cluster mode, even though the same code works on standalone. The code is as follows: schema_test = StructType([ StructField("cluster", LongType()), StructField("name", StringType()) ]) @pandas_udf(schema_test,

Re: pySpark - pandas UDF and binaryType

2019-05-04 Thread Gourav Sengupta
Just try using an apply on a series for a custom function or on any other library. Advertisement and actual delivery are two different skills altogether. Not everyone wants to add a one to their column using the pandas udf as one of their links shows :) Most of the actual use cases are more

Re: pySpark - pandas UDF and binaryType

2019-05-04 Thread Nicolas Paris
Hi Gourav, > And also be aware that pandas UDF does not always lead to better performance > and sometimes even massively slow performance. This information is not widely spread. This is good to know. In which circumstances is it worse than a regular UDF? > With Grouped Map dont you run into the

Re: pySpark - pandas UDF and binaryType

2019-05-03 Thread Gourav Sengupta
Also be aware that pandas UDFs do not always lead to better performance, and sometimes they are massively slower. With Grouped Map, don't you run into the risk of random memory errors as well? On Thu, May 2, 2019 at 9:32 PM Bryan Cutler wrote: > Hi, > > BinaryType support was not added

Re: pySpark - pandas UDF and binaryType

2019-05-02 Thread Bryan Cutler
Hi, BinaryType support was not added until Spark 2.4.0, see https://issues.apache.org/jira/browse/SPARK-23555. Also, pyarrow 0.10.0 or greater is required, as you saw in the docs. Bryan On Thu, May 2, 2019 at 4:26 AM Nicolas Paris wrote: > Hi all > > I am using pySpark 2.3.0 and pyArrow 0.10.0
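The two version requirements above can be checked up front before relying on BinaryType in a pandas UDF. This is a minimal sketch in plain Python; the helper names `parse_version` and `supports_binary_type` are hypothetical, and the version strings are passed in by hand rather than read from an installed Spark or pyarrow.

```python
import re


def parse_version(text):
    """Turn a version string such as '2.4.0' into a tuple of ints, e.g. (2, 4, 0)."""
    parts = []
    for piece in text.split(".")[:3]:
        match = re.match(r"\d+", piece)
        # Ignore suffixes such as '-SNAPSHOT'; treat a non-numeric piece as 0.
        parts.append(int(match.group()) if match else 0)
    return tuple(parts)


def supports_binary_type(spark_version, pyarrow_version):
    """Apply the thresholds from the message above: Spark >= 2.4.0, pyarrow >= 0.10.0."""
    return (
        parse_version(spark_version) >= (2, 4, 0)
        and parse_version(pyarrow_version) >= (0, 10, 0)
    )


# The combination reported in the original question: Spark 2.3.0 with pyarrow 0.10.0.
reported = supports_binary_type("2.3.0", "0.10.0")
upgraded = supports_binary_type("2.4.0", "0.10.0")
print(reported, upgraded)
```

In a real job the inputs would typically come from `pyspark.__version__` and `pyarrow.__version__`.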

pySpark - pandas UDF and binaryType

2019-05-02 Thread Nicolas Paris
Hi all, I am using pySpark 2.3.0 and pyArrow 0.10.0. I want to apply a pandas UDF on a dataframe, and I get the error below: > Invalid returnType with grouped map Pandas UDFs: > StructType(List(StructField(filename,StringType,true),StructField(contents,BinaryType,true))) > is not supported