Hi,

In PySpark, why do RDDs need to be serialised/deserialised, but DataFrames don't?
Thanks

On Mon, Jan 31, 2022 at 4:46 PM Khalid Mammadov <khalidmammad...@gmail.com> wrote:

> Your Scala program does not use any Spark API, hence it is faster than the
> others. If you write the same code in pure Python, I think it will be even
> faster than the Scala program, especially taking into account that these 2
> programs run on a single VM.
>
> Regarding DataFrame and RDD, I would suggest using DataFrames anyway, since
> that has been the recommended approach since Spark 2.0. The RDD API in
> PySpark is slow because, as others said, records need to be
> serialised/deserialised.
>
> One general note: Spark is written in Scala and the core runs on the JVM;
> Python is a wrapper around the Scala API, and most PySpark API calls are
> delegated to Scala/JVM for execution. Hence most big data transformation
> tasks will complete in almost the same time, since Scala and Python use the
> same API under the hood. You can also observe that the APIs are very
> similar and code is written in the same fashion.
>
>
> On Sun, 30 Jan 2022, 10:10 Bitfox, <bit...@bitfox.top> wrote:
>
>> Hello list,
>>
>> I did a comparison of PySpark RDD, Scala RDD, PySpark DataFrame, and a
>> pure Scala program. The result shows the PySpark RDD is too slow.
>>
>> For the operations and dataset please see:
>>
>> https://blog.cloudcache.net/computing-performance-comparison-for-words-statistics/
>>
>> The result table is below.
>> Can you give suggestions on how to optimize the RDD operation?
>>
>> Thanks a lot.
>>
>> program             time
>> scala program       49s
>> pyspark dataframe   56s
>> scala RDD           1m31s
>> pyspark RDD         7m15s
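For context on the overhead being discussed: with PySpark's RDD API, each record a Python lambda touches crosses the JVM/Python boundary and is serialised and deserialised on every pass, whereas DataFrame operations are compiled by Catalyst and executed inside the JVM, so Python objects never materialise per record. A minimal stdlib-only sketch of that round trip (plain `pickle` standing in for PySpark's pickle-based serialiser; the records and transform are made up for illustration):

```python
import pickle

# Toy records, standing in for an RDD partition's contents.
records = [("spark", 3), ("rdd", 1), ("dataframe", 2)]

def python_side_transform(record):
    # The user's Python lambda, e.g. what you'd pass to rdd.map(...).
    word, count = record
    return (word.upper(), count * 2)

# Simulated per-record round trip: serialise on the JVM side, ship to the
# Python worker, deserialise, transform, serialise again on the way back.
wire = [pickle.dumps(r) for r in records]             # JVM -> Python worker
results = [python_side_transform(pickle.loads(b)) for b in wire]
wire_back = [pickle.dumps(r) for r in results]        # Python worker -> JVM

print([pickle.loads(b) for b in wire_back])
# -> [('SPARK', 6), ('RDD', 2), ('DATAFRAME', 4)]
```

With the DataFrame API this loop never leaves the JVM: the query plan runs on Spark's internal binary row format, which is a large part of why the DataFrame timings in the table below sit close to the Scala numbers while the PySpark RDD run does not.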