Hello list, I did a performance comparison of a PySpark RDD job, a Scala RDD job, a PySpark DataFrame job, and a pure Scala program. The results show that the PySpark RDD version is by far the slowest.
For the operations and dataset, please see: https://blog.cloudcache.net/computing-performance-comparison-for-words-statistics/

The results table is below. Can you give suggestions on how to optimize the RDD operations? Thanks a lot.

program             time
scala program       49s
pyspark dataframe   56s
scala RDD           1m31s
pyspark RDD         7m15s