Re: why the pyspark RDD API is so slow?

2022-01-31 Thread Sebastian Piu
When you operate on a DataFrame from the Python side you are just invoking
methods in the JVM via a proxy (py4j), so it is almost as fast as coding in
Java/Scala itself. This holds as long as you don't define any UDFs or any
other code that needs to invoke Python for processing.

Check the PySpark chapter of the High Performance Spark book for a good
explanation of what's going on.
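
To make that concrete, here is a minimal sketch of the boundary, assuming
Spark 3.x and a local session (the sample data and names are invented for
illustration):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("jvm-vs-python-udf").getOrCreate()
df = spark.createDataFrame([("spark",), ("pyspark",)], ["word"])

# Stays in the JVM: F.upper is just a py4j call that builds a JVM
# expression, so no row ever crosses into a Python worker.
jvm_only = df.select(F.upper(F.col("word")).alias("upper_word"))

# Forces Python processing: every row is pickled to a Python worker,
# run through the lambda, and pickled back to the JVM.
py_upper = udf(lambda s: s.upper(), StringType())
via_python = df.select(py_upper(F.col("word")).alias("upper_word"))

Both queries return the same result, but only the second one pays the
Python serialisation cost.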

On Mon, 31 Jan 2022 at 09:10, Bitfox  wrote:


Re: why the pyspark RDD API is so slow?

2022-01-31 Thread Bitfox
Hi

In PySpark, why do RDDs need to be serialised/deserialised while DataFrames don't?

Thanks

On Mon, Jan 31, 2022 at 4:46 PM Khalid Mammadov wrote:


Re: why the pyspark RDD API is so slow?

2022-01-31 Thread Khalid Mammadov
Your Scala program does not use any Spark API, hence it is faster than the
others. If you wrote the same code in pure Python I think it would be even
faster than the Scala program, especially taking into account that these
two programs run on a single VM.

Regarding DataFrame vs RDD, I would suggest using DataFrames anyway, since
that has been the recommended approach since Spark 2.0. The RDD API in
PySpark is slow because, as others said, the data needs to be
serialised/deserialised between the JVM and the Python workers.

One general note: Spark is written in Scala and its core runs on the JVM;
PySpark is a wrapper around the Scala API, and most PySpark calls are
delegated to the Scala/JVM side for execution. Hence most big data
transformation tasks will complete in almost the same time in Scala and
Python, as they use the same API under the hood. You can also observe that
the two APIs are very similar and code is written in the same fashion.
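
As a rough illustration of the difference (this is not the code from the
blog post, and "words.txt" is a placeholder path), the same word count can
be written against both APIs:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()

# RDD version: every lambda runs in a Python worker, so each record is
# serialised between the JVM and Python on the way in and out.
rdd_counts = (
    spark.sparkContext.textFile("words.txt")
    .flatMap(lambda line: line.split())
    .map(lambda w: (w, 1))
    .reduceByKey(lambda a, b: a + b)
)

# DataFrame version: split/explode/groupBy compile to JVM expressions,
# so the data never leaves the JVM even though the driver is Python.
df_counts = (
    spark.read.text("words.txt")
    .select(F.explode(F.split(F.col("value"), r"\s+")).alias("word"))
    .groupBy("word")
    .count()
)

Both produce (word, count) pairs, but only the RDD version pays the
per-record round trip through Python workers.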


On Sun, 30 Jan 2022, 10:10 Bitfox,  wrote:


RE: why the pyspark RDD API is so slow?

2022-01-30 Thread Theodore J Griesenbrock
Is there any particular code sample you can suggest that illustrates your tips?

On Jan 30, 2022, at 06:16, Sebastian Piu wrote:



Re: why the pyspark RDD API is so slow?

2022-01-30 Thread Sebastian Piu
It's because all data needs to be pickled back and forth between the JVM
and a spawned Python worker, so there is additional overhead compared to
staying fully in Scala.

Your Python code might make this worse too, for example if it does not
yield from operations.

You can look at using UDFs with Arrow, or try to stay on DataFrame
operations only, as much as possible.
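
As a sketch of the Arrow route, assuming Spark 3.x with pandas and pyarrow
installed (the column and function names here are invented for the example):

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, pandas_udf

spark = SparkSession.builder.appName("arrow-udf-sketch").getOrCreate()

# pandas UDFs always move data over Arrow; this flag additionally turns
# on Arrow for toPandas()/createDataFrame() conversions.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

df = spark.range(1_000_000).withColumn("x", col("id").cast("double"))

# Vectorised UDF: rows cross to Python in Arrow batches and are handled
# a pandas.Series at a time, instead of being pickled one by one.
@pandas_udf("double")
def plus_one(x: pd.Series) -> pd.Series:
    return x + 1.0

result = df.select(plus_one(col("x")).alias("x_plus_one"))

The vectorised UDF still crosses the JVM/Python boundary, but in large
Arrow batches rather than one pickled row at a time.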

On Sun, 30 Jan 2022, 10:11 Bitfox,  wrote:

> Hello list,
>
> I did a comparison of PySpark RDD, Scala RDD, PySpark DataFrame and a
> pure Scala program. The results show that the PySpark RDD is much too slow.
>
> For the operations and dataset please see:
>
> https://blog.cloudcache.net/computing-performance-comparison-for-words-statistics/
>
> The result table is below.
> Can you give suggestions on how to optimize the RDD operation?
>
> Thanks a lot.
>
>
> program            time
> scala program      49s
> pyspark dataframe  56s
> scala RDD          1m31s
> pyspark RDD        7m15s
>