It's because all data needs to be pickled back and forth between the JVM and the spawned Python workers, so there is more overhead than if you stay fully in Scala.
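To make that concrete, here is a minimal sketch of the same word count done three ways (assuming Spark 3.x; "words.txt" and the function names are placeholders for illustration, not the actual code from the blog post):

import pandas as pd
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.appName("wordcount-comparison").getOrCreate()

# RDD version: every record is pickled out to a Python worker so the
# lambdas can run, then pickled back to the JVM, at each stage.
rdd_counts = (
    spark.sparkContext.textFile("words.txt")
    .flatMap(lambda line: line.split())
    .map(lambda w: (w, 1))
    .reduceByKey(lambda a, b: a + b)
)

# DataFrame version: split/explode/groupBy are built-in expressions,
# so the whole plan executes inside the JVM and no rows cross into Python.
df_counts = (
    spark.read.text("words.txt")
    .select(F.explode(F.split(F.col("value"), r"\s+")).alias("word"))
    .groupBy("word")
    .count()
)

# If custom Python logic is unavoidable, a pandas UDF moves data in Arrow
# batches instead of pickling row by row (hypothetical lower-casing step).
@pandas_udf("string")
def to_lower(words: pd.Series) -> pd.Series:
    return words.str.lower()

df_lower = df_counts.withColumn("word", to_lower("word"))

The RDD version pays serialisation costs on every lambda call, while the other two either keep the data in the JVM entirely or transfer it in columnar Arrow batches.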
Your Python code might make this worse too, for example if your operations return whole collections instead of yielding results. You can look at using pandas UDFs with Arrow, or try to stay as much as possible on DataFrame operations only.

On Sun, 30 Jan 2022, 10:11 Bitfox, <bit...@bitfox.top> wrote:

> Hello list,
>
> I did a comparison for pyspark RDD, scala RDD, pyspark dataframe and a
> pure scala program. The result shows the pyspark RDD is too slow.
>
> For the operations and dataset please see:
>
> https://blog.cloudcache.net/computing-performance-comparison-for-words-statistics/
>
> The result table is below.
> Can you give suggestions on how to optimize the RDD operation?
>
> Thanks a lot.
>
> *program*            *time*
> scala program        49s
> pyspark dataframe    56s
> scala RDD            1m31s
> pyspark RDD          7m15s