Hello list, I did a performance comparison of a PySpark RDD job, a Scala RDD job, a PySpark DataFrame job, and a pure Scala program. The results show that the PySpark RDD version is by far the slowest.
For the operations and dataset, please see: https://blog.cloudcache.net/computing-performance-comparison-for-words-statistics/

The results table is below. Can you give suggestions on how to optimize the RDD operations? Thanks a lot.

program             time
scala program       49s
pyspark dataframe   56s
scala RDD           1m31s
pyspark RDD         7m15s