What’s your PySpark function? Is it a UDF? If so, consider using a pandas UDF, introduced in Spark 2.3.
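Roughly like this — a minimal sketch, not your actual workload: the column name "value" and the doubling logic are placeholders for your real per-row function, and pandas UDFs need pyarrow installed on the cluster.

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType

spark = SparkSession.builder.getOrCreate()
# toy 2000-row DataFrame standing in for yours
df = spark.range(0, 2000).withColumnRenamed("id", "value")

# A scalar pandas UDF receives a whole batch of rows as a pandas
# Series, so the per-row Python and serialization overhead is
# amortized across the batch instead of paid 2000 times.
@pandas_udf("long", PandasUDFType.SCALAR)
def times_two(v):
    return v * 2

df.withColumn("doubled", times_two(df["value"])).show(5)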
More info here: https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html

Sent from my iPhone
Pardon the dumb thumb typos :)

> On Mar 18, 2018, at 10:54 PM, Debabrata Ghosh <mailford...@gmail.com> wrote:
>
> Hi,
> My dataframe has 2000 rows. Processing each row takes 3 seconds, so
> sequentially it takes 2000 * 3 = 6000 seconds, which is a very long time.
>
> I am therefore contemplating running the function in parallel. For
> example, I would like to divide the rows of my dataframe into 4 sets of
> 500 rows each and call my PySpark function on each set in parallel. I
> wanted to know if there is any library or PySpark function I can
> leverage to do this execution in parallel.
>
> I would really appreciate your feedback at your earliest convenience.
>
> Thanks,
> Debu
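As for the split-into-4 plan above: Spark already runs a UDF in parallel across partitions, so you mostly just control how many partitions there are. A hedged sketch of that idea (my_row_function and the "value" column are placeholders, and it assumes at least 4 cores available):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()
df = spark.range(0, 2000).withColumnRenamed("id", "value")

def my_row_function(value):
    # placeholder for the real 3-second-per-row work
    return str(value)

my_udf = udf(my_row_function, StringType())

# 4 partitions of ~500 rows each; Spark executes the UDF on the
# partitions concurrently, given at least 4 cores/executors.
result = df.repartition(4).withColumn("out", my_udf(col("value")))
result.show(5)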