Thanks, Jules! I really appreciate it!
On Mon, Mar 19, 2018 at 7:16 PM, Jules Damji <dmat...@comcast.net> wrote:

> What’s your PySpark function? Is it a UDF? If so, consider using a pandas
> UDF, introduced in Spark 2.3.
>
> More info here:
> https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html
>
> Sent from my iPhone
> Pardon the dumb thumb typos :)
>
> On Mar 18, 2018, at 10:54 PM, Debabrata Ghosh <mailford...@gmail.com> wrote:
>
> Hi,
> My dataframe has 2000 rows, and processing each row takes about 3
> seconds, so running sequentially takes 2000 * 3 = 6000 seconds, which is
> a very long time.
>
> I am therefore considering running the function in parallel. For
> example, I would like to divide the rows into 4 sets of 500 and call my
> PySpark function on each set in parallel. I wanted to know whether there
> is any library or PySpark function I can leverage to do this.
>
> I would really appreciate your feedback at your earliest convenience.
> Thanks,
>
> Debu
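For the archive, here is a minimal sketch of the pandas UDF approach Jules suggests, combined with the 4-way split via repartition(4). The DataFrame, the column name "value", and the transform inside the UDF are illustrative placeholders rather than the actual workload, and it assumes Spark 2.3+ with PyArrow installed:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, pandas_udf, PandasUDFType

spark = SparkSession.builder.appName("pandas-udf-sketch").getOrCreate()

# Illustrative stand-in for the real 2000-row dataframe.
df = spark.createDataFrame([(float(i),) for i in range(2000)], ["value"])

# A scalar pandas UDF is invoked once per batch with a pandas Series,
# not once per row, so the per-row Python call overhead is amortized.
@pandas_udf("double", PandasUDFType.SCALAR)
def process(values):
    # Placeholder for the real per-row logic, vectorized over the batch.
    return values * 2.0

# repartition(4) spreads the rows across 4 partitions, so the UDF runs
# as 4 parallel tasks (given at least 4 executor cores), with no manual
# slicing into sets of 500 needed.
result = df.repartition(4).withColumn("processed", process(col("value")))
result.show(5)

The idea is that the pandas UDF cuts the per-row overhead within each batch, while repartition controls how many of those batches Spark can execute concurrently.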