Thanks for your input, guys!   //hinko

On 4 Feb 2022, at 14:58, Sean Owen <sro...@gmail.com> wrote:


Yes, in the sense that any transformation that can be expressed in the SQL-like 
DataFrame API will push down to the JVM and take advantage of other 
optimizations, avoiding data movement to/from Python and more.
But you can't do this if you're expressing custom logic, operations that are 
not in the DataFrame API. The two are not always interchangeable.

In that case, pandas UDFs are a better choice in Python, as you can take 
advantage of Arrow for data movement, and that is also a reason to use 
DataFrames in a case like this. It still has to execute code in Python, though.
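To make the pandas UDF point concrete, here is a minimal sketch. The function body is plain pandas (Spark hands it Arrow-backed Series batches); the `normalize` name, the column "x", and the surrounding Spark registration are all hypothetical, and the Spark part assumes an active SparkSession, so it is shown only as comments:

```python
import pandas as pd

# A pandas UDF body is just a function from pd.Series to pd.Series of the
# same length. It still runs in a Python worker, but operates on whole
# Arrow batches instead of one row at a time.
def normalize(s: pd.Series) -> pd.Series:
    # Scale values into [0, 1] (hypothetical example logic).
    return (s - s.min()) / (s.max() - s.min())

# Registering and applying it in Spark would look roughly like this
# (assumes a SparkSession `spark` and a DataFrame `df` with column "x"):
# from pyspark.sql.functions import pandas_udf
# from pyspark.sql.types import DoubleType
# normalize_udf = pandas_udf(normalize, returnType=DoubleType())
# df.select(normalize_udf("x").alias("x_norm"))
```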

On Fri, Feb 4, 2022 at 3:20 AM Bitfox <bit...@bitfox.top> wrote:
Please see this test of mine:
https://blog.cloudcache.net/computing-performance-comparison-for-words-statistics/

Don’t use the Python RDD API; use DataFrames instead.
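For a word-count job like the one benchmarked above, the two approaches can be sketched as below. The pure-Python `word_counts` helper only pins down the intended result; the Spark variants are untested comments that assume pyspark, a SparkSession named `spark`, and a hypothetical input path "words.txt":

```python
from collections import Counter

# Pure-Python reference for the word-count logic.
def word_counts(lines):
    return Counter(w for line in lines for w in line.split())

# RDD style -- every lambda below runs in a Python worker, so each record
# is serialized across the JVM/Python boundary:
# counts = (spark.sparkContext.textFile("words.txt")
#           .flatMap(lambda line: line.split())
#           .map(lambda w: (w, 1))
#           .reduceByKey(lambda a, b: a + b))

# DataFrame style -- split/explode/groupBy compile to JVM expressions via
# Catalyst, so there is no per-record Python round trip:
# from pyspark.sql.functions import explode, split, col
# counts = (spark.read.text("words.txt")
#           .select(explode(split(col("value"), r"\s+")).alias("word"))
#           .groupBy("word").count())
```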

Regards

On Fri, Feb 4, 2022 at 5:02 PM Hinko Kocevar <hinko.koce...@ess.eu.invalid> 
wrote:
I'm looking into using the Python interface with Spark and came across this [1] 
chart showing some performance hit when going with Python RDDs. The data is 
about 7 years old and for an older version of Spark. Is this still the case 
with more recent Spark releases?

I'm trying to understand what to expect from Python and Spark and under what 
conditions.

[1] 
https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html

Thanks,
//hinko
---------------------------------------------------------------------
To unsubscribe e-mail: 
user-unsubscr...@spark.apache.org<mailto:user-unsubscr...@spark.apache.org>
