Re: why the pyspark RDD API is so slow?

2022-01-31 Thread Sebastian Piu
When you operate on a DataFrame from the Python side you are just invoking methods in the JVM via a proxy (py4j), so it is almost like coding in Java itself. This holds as long as you don't define any UDFs or any other code that needs to invoke Python for processing. Check the High Performance Spark
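To make the distinction concrete, here is a minimal sketch (not code from the thread) that builds the same column twice: once with a built-in column expression, which stays in the JVM, and once with a Python UDF, which forces rows through a Python worker. The names `add_one` and `compare_plans`, the local master setting, and the row count are all assumptions chosen for illustration.

```python
def add_one(x):
    # Plain Python function. Wrapping it in a UDF means Spark has to ship
    # each value to a Python worker process to evaluate it.
    return x + 1


def compare_plans():
    # Imported lazily so this file still loads on a machine without pyspark.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import LongType

    spark = (
        SparkSession.builder.master("local[*]")  # assumption: local test run
        .appName("df-vs-udf")
        .getOrCreate()
    )
    df = spark.range(1_000_000)  # single column "id" of type long

    # Built-in expression: evaluated entirely inside the JVM.
    builtin = df.select((F.col("id") + 1).alias("id_plus_one"))

    # Python UDF: same result, but computed by calling add_one in Python.
    add_one_udf = F.udf(add_one, LongType())
    with_udf = df.select(add_one_udf(F.col("id")).alias("id_plus_one"))

    # Print both physical plans so they can be compared side by side.
    builtin.explain()
    with_udf.explain()

    spark.stop()
```

Calling `compare_plans()` where pyspark is installed prints both physical plans; the UDF version should include an extra Python evaluation step that the built-in version lacks.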

Re: why the pyspark RDD API is so slow?

2022-01-31 Thread Bitfox
Hi

In PySpark, RDDs need to be serialised/deserialised, but DataFrames don’t? Why?

Thanks

On Mon, Jan 31, 2022 at 4:46 PM Khalid Mammadov wrote:
> Your scala program does not use any Spark API hence faster that others. If
> you write the same code in pure Python I think it will be even faster than

Re: why the pyspark RDD API is so slow?

2022-01-31 Thread Khalid Mammadov
Your Scala program does not use any Spark API, hence it is faster than the others. If you write the same code in pure Python I think it will be even faster than the Scala program, especially taking into account that these two programs run on a single VM. Regarding DataFrames and RDDs, I would suggest using DataFrames

RE: why the pyspark RDD API is so slow?

2022-01-30 Thread Theodore J Griesenbrock
Any particular code sample you can suggest to review on your tips?

> On Jan 30, 2022, at 06:16, Sebastian Piu wrote:
>
> It's because all data needs to be pickled back and forth between java and a

Re: why the pyspark RDD API is so slow?

2022-01-30 Thread Sebastian Piu
It's because all data needs to be pickled back and forth between Java and a spun-up Python worker, so there is more overhead than if you stay fully in Scala. Your Python code might make this worse too, for example if it is not yielding from operations. You can look at using UDFs and Arrow or trying
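The pickling cost described above can be felt even without Spark. The following is a rough stand-in, not how PySpark is implemented: it uses the standard `pickle` module to mimic one inbound and one outbound serialization hop around a trivial computation. Real PySpark batches records and moves them over a socket, so the absolute numbers here mean little; the helper names and the record count are assumptions.

```python
import pickle
import time


def roundtrip(records):
    # Serialize then deserialize, standing in for one JVM <-> Python hop.
    payload = pickle.dumps(records, protocol=pickle.HIGHEST_PROTOCOL)
    return pickle.loads(payload)


def add_one_direct(records):
    # The computation alone, with no serialization.
    return [x + 1 for x in records]


def add_one_with_pickling(records):
    # Inbound hop, the same computation, then an outbound hop.
    incoming = roundtrip(records)
    result = [x + 1 for x in incoming]
    return roundtrip(result)


def timed(fn, records):
    # Return the function's output together with its wall-clock duration.
    start = time.perf_counter()
    out = fn(records)
    return out, time.perf_counter() - start


data = list(range(200_000))  # assumption: arbitrary size for the demo
direct_out, direct_s = timed(add_one_direct, data)
pickled_out, pickled_s = timed(add_one_with_pickling, data)
print(f"direct: {direct_s:.4f}s, with pickling: {pickled_s:.4f}s")
```

Both paths produce identical results; the only difference is the time spent serializing, which is the kind of overhead the RDD API pays on every stage that runs Python code.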