It's because all data needs to be pickled back and forth between the JVM and
a spawned Python worker, so there is additional overhead compared to staying
fully in Scala.

Your Python code might make this worse too, for example if it builds up whole
result lists instead of yielding from operations; see the sketch below.
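
As a rough illustration (a hypothetical sketch, not taken from your benchmark;
the rdd variable and the helper names are mine), a generator inside
mapPartitions streams elements lazily, while a list materializes the whole
partition in memory first:

    # Hypothetical sketch: list vs. generator inside mapPartitions.
    def split_to_list(lines):
        out = []                 # holds the entire partition's words at once
        for line in lines:
            out.extend(line.split())
        return out

    def split_lazily(lines):
        for line in lines:       # yields one word at a time, memory stays flat
            for word in line.split():
                yield word

    words = rdd.mapPartitions(split_lazily)   # prefer this over split_to_list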

You can look at using Arrow-backed pandas UDFs, or try to stay as much as
possible on DataFrame operations only.
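
A minimal sketch of both options, assuming an existing SparkSession named
spark and an input file words.txt (both names are mine, and the lower-casing
UDF is only an illustration, not your benchmark's logic):

    from pyspark.sql import functions as F
    from pyspark.sql.functions import pandas_udf
    import pandas as pd

    # Option 1: DataFrame operations only -- the word count runs in the JVM.
    df = spark.read.text("words.txt")
    counts = (df.select(F.explode(F.split(F.col("value"), r"\s+")).alias("word"))
                .groupBy("word")
                .count())

    # Option 2: if Python logic is unavoidable, a pandas UDF ships data through
    # Arrow in vectorized batches instead of pickling rows one by one.
    @pandas_udf("string")
    def lower_words(s: pd.Series) -> pd.Series:
        return s.str.lower()

    counts = counts.withColumn("word", lower_words("word"))

Either way, the fewer round trips through the Python worker, the closer you
should get to the Scala numbers in your table.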

On Sun, 30 Jan 2022, 10:11 Bitfox, <bit...@bitfox.top> wrote:

> Hello list,
>
> I did a comparison of pyspark RDD, scala RDD, pyspark dataframe and a
> pure scala program. The result shows the pyspark RDD is by far the slowest.
>
> For the operations and dataset please see:
>
> https://blog.cloudcache.net/computing-performance-comparison-for-words-statistics/
>
> The result table is below.
> Can you give suggestions on how to optimize the RDD operation?
>
> Thanks a lot.
>
>
> program              time
> scala program        49s
> pyspark dataframe    56s
> scala RDD            1m31s
> pyspark RDD          7m15s
>
