Hello list,

I compared the running time of a PySpark RDD, a Scala RDD, a PySpark
DataFrame and a pure Scala program. The results show that the PySpark RDD
version is far slower than the others.

For the operations and dataset please see:
https://blog.cloudcache.net/computing-performance-comparison-for-words-statistics/

The timings are in the table below.
Can you suggest how to optimize the RDD operations?

Thanks a lot.


*program*            *time*
scala program        49s
pyspark dataframe    56s
scala RDD            1m31s
pyspark RDD          7m15s
