Re: why the pyspark RDD API is so slow?
When you operate on a DataFrame from the Python side, you are just invoking methods in the JVM via a proxy (Py4J), so it is almost the same as coding in Java itself. That holds as long as you don't define any UDFs or other code that needs to invoke Python for processing.

Check the High Performance Spark book, the PySpark chapter, for a good explanation of what's going on.

On Mon, 31 Jan 2022 at 09:10, Bitfox wrote:
> Hi
>
> In PySpark, RDDs need to be serialised/deserialised, but DataFrames don't? Why?
>
> Thanks
>
> On Mon, Jan 31, 2022 at 4:46 PM Khalid Mammadov wrote:
> [snip]
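As a rough sketch of the point above (assuming pyspark is installed; the input path and function names are placeholders, not the benchmark's actual code): built-in DataFrame functions are dispatched to the JVM over Py4J, while registering a Python UDF forces every row through a Python worker.

```python
def jvm_only_count(spark, path):
    """Word count using only built-in functions: the Python side just
    issues Py4J calls; all row processing happens inside the JVM."""
    from pyspark.sql import functions as F
    lines = spark.read.text(path)
    words = lines.select(F.explode(F.split(F.col("value"), r"\s+")).alias("word"))
    return words.groupBy("word").count()


def python_udf_count(spark, path):
    """Same job, but the lower-casing runs as a Python UDF, so every row
    is serialised to a Python worker and back -- the slow path."""
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType
    lower = F.udf(lambda w: w.lower(), StringType())  # forces Python workers
    lines = spark.read.text(path)
    words = lines.select(F.explode(F.split(F.col("value"), r"\s+")).alias("word"))
    return words.select(lower("word").alias("word")).groupBy("word").count()
```

With `F.lower` instead of the UDF, the second version would stay in the JVM too; the UDF is there only to show where the boundary crossing starts.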
Re: why the pyspark RDD API is so slow?
Hi

In PySpark, RDDs need to be serialised/deserialised, but DataFrames don't? Why?

Thanks

On Mon, Jan 31, 2022 at 4:46 PM Khalid Mammadov wrote:
> Your Scala program does not use any Spark API, hence it is faster than the others.
> [snip]
Re: why the pyspark RDD API is so slow?
Your Scala program does not use any Spark API, hence it is faster than the others. If you write the same code in pure Python, I think it will be even faster than the Scala program, especially taking into account that these two programs run on a single VM.

Regarding DataFrame vs RDD, I would suggest using DataFrames anyway, since that has been the recommended approach since Spark 2.0. The RDD API in PySpark is slow because, as others said, the data needs to be serialised/deserialised.

One general note: Spark is written in Scala and its core runs on the JVM; Python is a wrapper around the Scala API, and most PySpark API calls are delegated to Scala/the JVM for execution. Hence most big-data transformation tasks will complete in almost the same time, as both (Scala and Python) use the same API under the hood. You can therefore also observe that the APIs are very similar and code is written in the same fashion.

On Sun, 30 Jan 2022, 10:10 Bitfox, wrote:
> Hello list,
>
> I did a comparison for pyspark RDD, scala RDD, pyspark dataframe and a
> pure scala program. The result shows the pyspark RDD is too slow.
> [snip]
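The two styles being compared can be sketched as follows (assuming pyspark is installed; "words.txt" is a placeholder, not the benchmark's actual dataset):

```python
def rdd_word_count(sc, path):
    """RDD version: every lambda runs in a Python worker, so each record
    is serialised/deserialised across the JVM <-> Python boundary."""
    return (sc.textFile(path)
              .flatMap(lambda line: line.split())
              .map(lambda w: (w, 1))
              .reduceByKey(lambda a, b: a + b))


def dataframe_word_count(spark, path):
    """DataFrame version: the same logic expressed with built-in
    functions, which execute inside the JVM -- the recommended
    approach since Spark 2.0."""
    from pyspark.sql import functions as F
    return (spark.read.text(path)
                 .select(F.explode(F.split("value", r"\s+")).alias("word"))
                 .groupBy("word")
                 .count())
```

Both produce the same (word, count) result; only the second keeps all per-row work out of Python.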
RE: why the pyspark RDD API is so slow?
Is there any particular code sample you can suggest to review, illustrating your tips?

> On Jan 30, 2022, at 06:16, Sebastian Piu wrote:
>
> It's because all data needs to be pickled back and forth between Java and a
> spawned Python worker, so there is additional overhead compared to staying
> fully in Scala.
> [snip]
Re: why the pyspark RDD API is so slow?
It's because all data needs to be pickled back and forth between Java and a spawned Python worker, so there is additional overhead compared to staying fully in Scala.

Your Python code might make this worse too, for example if it is not yielding from operations.

You can look at using UDFs with Arrow, or try to stay as much as possible on DataFrame operations only.

On Sun, 30 Jan 2022, 10:11 Bitfox, wrote:
> Hello list,
>
> I did a comparison for pyspark RDD, scala RDD, pyspark dataframe and a
> pure scala program. The result shows the pyspark RDD is too slow.
>
> For the operations and dataset please see:
> https://blog.cloudcache.net/computing-performance-comparison-for-words-statistics/
>
> The result table is below.
> Can you give suggestions on how to optimize the RDD operation?
>
> Thanks a lot.
>
> program             time
> scala program       49s
> pyspark dataframe   56s
> scala RDD           1m31s
> pyspark RDD         7m15s
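The per-record cost being described can be seen even without Spark. A minimal, Spark-free sketch: a pickle round trip, which is roughly what each element of a PySpark RDD pays when crossing the JVM/Python boundary (the record shape and count here are illustrative, not the benchmark's data):

```python
import pickle
import timeit

# 100k small records, roughly like the (word, count) pairs in a word count
records = [("word-%d" % i, i) for i in range(100_000)]

def roundtrip(rs):
    # serialise and immediately deserialise each record: computationally
    # a no-op, but not free -- this is pure boundary-crossing overhead
    return [pickle.loads(pickle.dumps(r)) for r in rs]

assert roundtrip(records[:10]) == records[:10]  # the data survives unchanged

cost = timeit.timeit(lambda: roundtrip(records), number=1)
print("pickle round trip of %d records: %.3fs" % (len(records), cost))
```

Arrow-based pandas UDFs cut this down by moving data in columnar batches instead of pickling one record at a time.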