Thanks for your input, guys!   //hinko

On 4 Feb 2022, at 14:58, Sean Owen <sro...@gmail.com> wrote:


Yes, in the sense that any transformation that can be expressed in the SQL-like 
DataFrame API will push down to the JVM and take advantage of other 
optimizations, avoiding data movement to/from Python and more.
But you can't do this if you're expressing custom logic, operations that are 
not in the DataFrame API. The two are not always interchangeable.

In that case, pandas UDFs are a better choice in Python, as you can take 
advantage of Arrow for data movement, and that is also a reason to use 
DataFrames in a case like this. It still has to execute code in Python, though.
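To make the pandas UDF point concrete, here is a minimal sketch. The function body is plain pandas (Spark hands it Arrow-backed Series batches); the `normalize` name, the column "x", and the surrounding Spark registration are all hypothetical, and the Spark part assumes an active SparkSession, so it is shown only as comments:

```python
import pandas as pd

# A pandas UDF body is just a function from pd.Series to pd.Series of the
# same length. It still runs in a Python worker, but operates on whole
# Arrow batches instead of one row at a time.
def normalize(s: pd.Series) -> pd.Series:
    # Scale values into [0, 1] (hypothetical example logic).
    return (s - s.min()) / (s.max() - s.min())

# Registering and applying it in Spark would look roughly like this
# (assumes a SparkSession `spark` and a DataFrame `df` with column "x"):
# from pyspark.sql.functions import pandas_udf
# from pyspark.sql.types import DoubleType
# normalize_udf = pandas_udf(normalize, returnType=DoubleType())
# df.select(normalize_udf("x").alias("x_norm"))
```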

On Fri, Feb 4, 2022 at 3:20 AM Bitfox <bit...@bitfox.top> wrote:
Please see this test of mine:
https://blog.cloudcache.net/computing-performance-comparison-for-words-statistics/

Don’t use the Python RDD API; use DataFrames instead.
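For a word-count job like the one benchmarked above, the two approaches can be sketched as below. The pure-Python `word_counts` helper only pins down the intended result; the Spark variants are untested comments that assume pyspark, a SparkSession named `spark`, and a hypothetical input path "words.txt":

```python
from collections import Counter

# Pure-Python reference for the word-count logic.
def word_counts(lines):
    return Counter(w for line in lines for w in line.split())

# RDD style -- every lambda below runs in a Python worker, so each record
# is serialized across the JVM/Python boundary:
# counts = (spark.sparkContext.textFile("words.txt")
#           .flatMap(lambda line: line.split())
#           .map(lambda w: (w, 1))
#           .reduceByKey(lambda a, b: a + b))

# DataFrame style -- split/explode/groupBy compile to JVM expressions via
# Catalyst, so there is no per-record Python round trip:
# from pyspark.sql.functions import explode, split, col
# counts = (spark.read.text("words.txt")
#           .select(explode(split(col("value"), r"\s+")).alias("word"))
#           .groupBy("word").count())
```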

Regards

On Fri, Feb 4, 2022 at 5:02 PM Hinko Kocevar <hinko.koce...@ess.eu.invalid> 
wrote:
I'm looking into using the Python interface with Spark and came across this [1] 
chart showing some performance hit when going with Python RDDs. The data is 
about 7 years old and for an older version of Spark. Is this still the case 
with more recent Spark releases?

I'm trying to understand what to expect from Python and Spark and under what 
conditions.

[1] 
https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html

Thanks,
//hinko
---------------------------------------------------------------------
To unsubscribe e-mail: 
user-unsubscr...@spark.apache.org<mailto:user-unsubscr...@spark.apache.org>
