Re: Python performance

2022-02-06 Thread Hinko Kocevar
Thanks for your input, guys!

//hinko




Re: Python performance

2022-02-04 Thread Sean Owen
Yes, in the sense that any transformation that can be expressed in the
SQL-like DataFrame API is pushed down to the JVM and benefits from its
other optimizations, avoiding the data movement to and from Python, among
other costs. But you can't do this for operations that the DataFrame API
does not cover, i.e. custom logic. The two are not always interchangeable.

In that case, pandas UDFs are a better choice in Python, because they use
Arrow for data movement, and that is also a reason to use DataFrames in a
case like this. The code still has to execute in Python, though.
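
To make the distinction concrete, here is a rough sketch, not a benchmark.
It assumes Spark 3.x with pandas and pyarrow installed. The names
to_celsius, compare_approaches and the temp_f column are made up for
illustration, and the conversion is deliberately trivial: it only stands in
for custom logic that the DataFrame API could not express.

```python
def to_celsius(fahrenheit):
    """Stand-in for custom logic; works on a float or, element-wise,
    on a pandas Series."""
    return (fahrenheit - 32.0) * 5.0 / 9.0


def compare_approaches(spark):
    """Hypothetical demo; `spark` is an existing SparkSession."""
    # Imported lazily so to_celsius stays usable without Spark installed.
    import pandas as pd
    from pyspark.sql import functions as F
    from pyspark.sql.functions import pandas_udf

    df = spark.createDataFrame([(32.0,), (212.0,)], ["temp_f"])

    # 1) Built-in column expressions: planned and executed in the JVM,
    #    so no rows are shipped to Python workers.
    native = df.withColumn("temp_c", (F.col("temp_f") - 32.0) * 5.0 / 9.0)

    # 2) pandas UDF: still runs Python code, but receives whole batches
    #    as pandas Series moved via Arrow instead of one row at a time.
    @pandas_udf("double")
    def to_celsius_udf(temps: pd.Series) -> pd.Series:
        return to_celsius(temps)

    vectorized = df.withColumn("temp_c", to_celsius_udf("temp_f"))
    return native, vectorized
```

Calling explain() on each returned DataFrame should show the difference:
the second plan contains an extra Python evaluation step, the first does
not.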



Re: Python performance

2022-02-04 Thread Bitfox
Please see this test of mine:
https://blog.cloudcache.net/computing-performance-comparison-for-words-statistics/

Don't use Python RDDs; use DataFrames instead.
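
As a rough illustration of the two styles for a word count (a sketch, not
the code from the linked test): the function names and the path argument
are made up, and spark is assumed to be an existing SparkSession.

```python
def split_words(line):
    """Lower-case a line and split it on whitespace."""
    return line.lower().split()


def word_count_rdd(spark, path):
    """RDD version: split_words and both lambdas run in Python workers."""
    lines = spark.sparkContext.textFile(path)
    return (lines.flatMap(split_words)
                 .map(lambda word: (word, 1))
                 .reduceByKey(lambda a, b: a + b))


def word_count_df(spark, path):
    """DataFrame version: built-in functions only, so the work stays in
    the JVM."""
    from pyspark.sql import functions as F  # lazy import; needs pyspark

    lines = spark.read.text(path)  # one string column named "value"
    words = lines.select(
        F.explode(F.split(F.lower(F.col("value")), r"\s+")).alias("word"))
    # Splitting on a regex can yield empty strings; drop them.
    return words.where(F.col("word") != "").groupBy("word").count()
```

The RDD version has to serialize every record to a Python process and
back, while the DataFrame version never leaves the JVM.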

Regards

On Fri, Feb 4, 2022 at 5:02 PM Hinko Kocevar 
wrote:

> I'm looking into using the Python interface to Spark and came across this
> chart [1] showing a performance hit when using Python RDDs. The data is
> about 7 years old and is for an older version of Spark. Is this still the
> case with more recent Spark releases?
>
> I'm trying to understand what to expect from Python and Spark and under
> what conditions.
>
> [1]
> https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
>
> Thanks,
> //hinko
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>