When you operate on a DataFrame from the Python side you are just invoking
methods in the JVM via a proxy (py4j), so it is almost the same as coding in
Java itself. That holds as long as you don't define any UDFs or other code
that needs to invoke Python for processing.
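As a toy model of that proxy layer (these classes are hypothetical — the real py4j forwards calls over a socket to a JVM gateway, not to another Python object), the idea is that every method call on the Python wrapper is delegated, so the wrapper itself does almost no work:

```python
# Minimal sketch of the delegation idea behind py4j. JvmDataFrame stands in
# for the object that really lives in the JVM; PyDataFrame is the thin
# Python-side wrapper. Both names are made up for illustration.

class JvmDataFrame:
    """Stands in for the real DataFrame living in the JVM."""
    def __init__(self, rows):
        self._rows = rows

    def count(self):
        return len(self._rows)

    def filter_gt(self, column, value):
        return JvmDataFrame([r for r in self._rows if r[column] > value])


class PyDataFrame:
    """Thin wrapper: every method call is forwarded to the 'JVM' object."""
    def __init__(self, jvm_df):
        self._jdf = jvm_df

    def __getattr__(self, name):
        # Forward the lookup to the JVM-side object; re-wrap any DataFrame
        # result so that chained calls stay on the Python side.
        attr = getattr(self._jdf, name)
        def call(*args, **kwargs):
            result = attr(*args, **kwargs)
            if isinstance(result, JvmDataFrame):
                return PyDataFrame(result)
            return result
        return call


df = PyDataFrame(JvmDataFrame([{"x": 1}, {"x": 5}, {"x": 9}]))
print(df.filter_gt("x", 4).count())  # → 2
```

The point of the sketch: the filtering and counting happen entirely on the "JVM" side, which is why DataFrame code driven from Python runs at near-JVM speed — until a Python UDF forces rows across the boundary.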

Check the High Performance Spark book, the PySpark chapter, for a good
explanation of what's going on.
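On the serialisation point raised in the replies below: with RDDs, PySpark has to pickle every record on its way from the JVM to the Python worker and unpickle the results on the way back. A rough stdlib-only illustration of that round trip (the real exchange happens over a socket between the JVM and Python worker processes; `python_side` is a made-up name for this sketch):

```python
import pickle

# Each record an RDD lambda touches is serialised to bytes (JVM -> Python
# worker), processed by the Python function, then serialised again for the
# trip back. This loop is the overhead that DataFrame operations avoid.
records = [("spark", 3), ("python", 2), ("scala", 1)]

def python_side(func, rows):
    # Simulate the JVM shipping pickled rows to the Python worker...
    shipped = [pickle.dumps(r) for r in rows]
    # ...the worker unpickling each row and applying the Python function...
    results = [func(pickle.loads(b)) for b in shipped]
    # ...and the results being pickled for the return trip.
    return [pickle.loads(pickle.dumps(r)) for r in results]

doubled = python_side(lambda kv: (kv[0], kv[1] * 2), records)
print(doubled)  # → [('spark', 6), ('python', 4), ('scala', 2)]
```

DataFrame operations built only from Spark's own functions skip this entirely, because the data never leaves the JVM.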

On Mon, 31 Jan 2022 at 09:10, Bitfox <bit...@bitfox.top> wrote:

> Hi
>
> In PySpark, RDDs need to be serialised/deserialised, but DataFrames don't? Why?
>
> Thanks
>
> On Mon, Jan 31, 2022 at 4:46 PM Khalid Mammadov <khalidmammad...@gmail.com>
> wrote:
>
>> Your Scala program does not use any Spark API, hence it is faster than the
>> others. If you wrote the same code in pure Python I think it would be even
>> faster than the Scala program, especially taking into account that these
>> two programs run on a single VM.
>>
>> Regarding DataFrames and RDDs, I would suggest using DataFrames anyway,
>> since that has been the recommended approach since Spark 2.0.
>> The RDD API in PySpark is slow because, as others said, the data needs to
>> be serialised/deserialised.
>>
>> One general note is that Spark is written in Scala and its core runs on
>> the JVM; Python is a wrapper around the Scala API, and most PySpark API
>> calls are delegated to Scala/JVM for execution. Hence most big data
>> transformation tasks will complete in almost the same time, as they (Scala
>> and Python) use the same API under the hood. That is also why the APIs are
>> very similar and the code is written in the same fashion.
>>
>>
>> On Sun, 30 Jan 2022, 10:10 Bitfox, <bit...@bitfox.top> wrote:
>>
>>> Hello list,
>>>
>>> I did a comparison for pyspark RDD, scala RDD, pyspark dataframe and a
>>> pure scala program. The result shows the pyspark RDD is too slow.
>>>
>>> For the operations and dataset please see:
>>>
>>> https://blog.cloudcache.net/computing-performance-comparison-for-words-statistics/
>>>
>>> The result table is below.
>>> Can you give suggestions on how to optimize the RDD operation?
>>>
>>> Thanks a lot.
>>>
>>>
>>> *program*            *time*
>>> scala program        49s
>>> pyspark dataframe    56s
>>> scala RDD            1m31s
>>> pyspark RDD          7m15s
>>>
>>
