Hi,

In PySpark, why do RDDs need to be serialised/deserialised, but DataFrames don't?
Thanks

On Mon, Jan 31, 2022 at 4:46 PM Khalid Mammadov <khalidmammad...@gmail.com> wrote:

> Your Scala program does not use any Spark API, hence it is faster than the
> others. If you write the same code in pure Python, I think it will be even
> faster than the Scala program, especially taking into account that these 2
> programs run on a single VM.
>
> Regarding DataFrame and RDD, I would suggest using DataFrames anyway, since
> that has been the recommended approach since Spark 2.0. The RDD API in
> PySpark is slow because, as others said, records need to be
> serialised/deserialised.
>
> One general note: Spark is written in Scala and the core runs on the JVM;
> Python is a wrapper around the Scala API, and most PySpark API calls are
> delegated to Scala/JVM for execution. Hence most big data transformation
> tasks will complete in almost the same time, since Scala and Python use the
> same API under the hood. You can also observe that the APIs are very
> similar and code is written in the same fashion.
>
>
> On Sun, 30 Jan 2022, 10:10 Bitfox, <bit...@bitfox.top> wrote:
>
>> Hello list,
>>
>> I did a comparison of PySpark RDD, Scala RDD, PySpark DataFrame, and a
>> pure Scala program. The result shows the PySpark RDD is too slow.
>>
>> For the operations and dataset please see:
>>
>> https://blog.cloudcache.net/computing-performance-comparison-for-words-statistics/
>>
>> The result table is below.
>> Can you give suggestions on how to optimize the RDD operation?
>>
>> Thanks a lot.
>>
>> program             time
>> scala program       49s
>> pyspark dataframe   56s
>> scala RDD           1m31s
>> pyspark RDD         7m15s
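For context on the overhead being discussed: with PySpark's RDD API, each record a Python lambda touches crosses the JVM/Python boundary and is serialised and deserialised on every pass, whereas DataFrame operations are compiled by Catalyst and executed inside the JVM, so Python objects never materialise per record. A minimal stdlib-only sketch of that round trip (plain `pickle` standing in for PySpark's pickle-based serialiser; the records and transform are made up for illustration):

```python
import pickle

# Toy records, standing in for an RDD partition's contents.
records = [("spark", 3), ("rdd", 1), ("dataframe", 2)]

def python_side_transform(record):
    # The user's Python lambda, e.g. what you'd pass to rdd.map(...).
    word, count = record
    return (word.upper(), count * 2)

# Simulated per-record round trip: serialise on the JVM side, ship to the
# Python worker, deserialise, transform, serialise again on the way back.
wire = [pickle.dumps(r) for r in records]             # JVM -> Python worker
results = [python_side_transform(pickle.loads(b)) for b in wire]
wire_back = [pickle.dumps(r) for r in results]        # Python worker -> JVM

print([pickle.loads(b) for b in wire_back])
# -> [('SPARK', 6), ('RDD', 2), ('DATAFRAME', 4)]
```

With the DataFrame API this loop never leaves the JVM: the query plan runs on Spark's internal binary row format, which is a large part of why the DataFrame timings in the table below sit close to the Scala numbers while the PySpark RDD run does not.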