Your Scala program does not use any Spark API, hence it is faster than the
others. If you wrote the same code in pure Python, I think it would be even
faster than the Scala program, especially considering that both of these
programs run on a single VM.

Regarding DataFrame vs. RDD, I would suggest using DataFrames anyway, since
they have been the recommended approach since Spark 2.0.
As others said, the PySpark RDD API is slow because every record has to be
serialised/deserialised between the JVM and the Python worker processes.
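
To make that concrete, here is the same word count written both ways in
PySpark (a minimal sketch; the input path words.txt and the exact column
handling are my assumptions, not the code from the benchmark). The
DataFrame pipeline is planned and executed entirely in the JVM, while
every lambda in the RDD version runs in a Python worker:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, explode, split

    spark = SparkSession.builder.appName("wordcount-demo").getOrCreate()

    # DataFrame version: the whole pipeline stays inside the JVM;
    # no per-record Python serialisation.
    lines = spark.read.text("words.txt")   # assumed input path
    counts = (lines
              .select(explode(split(col("value"), r"\s+")).alias("word"))
              .groupBy("word")
              .count()
              .orderBy(col("count").desc()))
    counts.show(10)

    # RDD version: each lambda runs in a Python worker, so every record
    # is pickled out to Python and the result pickled back to the JVM.
    rdd_counts = (spark.sparkContext.textFile("words.txt")
                  .flatMap(lambda line: line.split())
                  .map(lambda w: (w, 1))
                  .reduceByKey(lambda a, b: a + b))
    print(rdd_counts.takeOrdered(10, key=lambda kv: -kv[1]))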

One general note: Spark is written in Scala and its core runs on the JVM.
Python is a wrapper around the Scala API, and most PySpark calls are
delegated to Scala/JVM code for execution. Hence most big-data
transformation tasks complete in almost the same time in Scala and Python,
because they use the same API under the hood. That is also why the two
APIs look very similar and code is written in the same fashion.
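
You can see that delegation yourself (a sketch with assumed names, not
taken from the benchmark) by asking PySpark for a query's physical plan;
the plan is produced by the same JVM query planner a Scala program hits:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("plan-demo").getOrCreate()

    # filter/select below are thin Python wrappers: each call is
    # forwarded over Py4J to the same JVM Dataset methods that the
    # equivalent Scala code would invoke directly.
    df = spark.range(1000).withColumn("doubled", col("id") * 2)
    df.filter(col("doubled") > 100).select("id").explain()
    # The printed physical plan comes from the JVM planner, which is
    # why DataFrame code runs at nearly the same speed in both languages.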


On Sun, 30 Jan 2022, 10:10 Bitfox, <bit...@bitfox.top> wrote:

> Hello list,
>
> I did a comparison for pyspark RDD, scala RDD, pyspark dataframe and a
> pure scala program. The result shows the pyspark RDD is too slow.
>
> For the operations and dataset please see:
>
> https://blog.cloudcache.net/computing-performance-comparison-for-words-statistics/
>
> The result table is below.
> Can you give suggestions on how to optimize the RDD operation?
>
> Thanks a lot.
>
>
> program            time
> -----------------  -----
> scala program      49s
> pyspark dataframe  56s
> scala RDD          1m31s
> pyspark RDD        7m15s
>
