I watched a talk by Brian Clapper at NE Scala 2016: "RDDs, DataFrames and 
Datasets @ Apache Spark - NE Scala 2016".

At 15:00 there is a slide comparing the time taken to aggregate 10 million 
integer pairs using RDDs and DataFrames with different language bindings 
(Scala, Python, R).

As per this slide:
- DataFrame APIs outperform RDDs, and performance is the same across all the 
  language bindings.
- RDD with Python is way slower than the Scala version.
So I guess there is some truth to the Scala bindings being faster in some 
cases.
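
To get a rough feel for where the Python RDD gap can come from without a cluster, here is a minimal sketch. It is not Spark code and not the benchmark from the talk. PySpark RDD operations move records between the JVM and Python worker processes in pickled form, so the sketch mimics only that hop with the standard `pickle` module around a sum-by-key aggregation. The function names, the batch size of 1024 and the key range are all assumptions made for illustration.

```python
import pickle
import random
import time


def make_pairs(n, num_keys=100, seed=0):
    """Generate n (key, value) integer pairs; all parameters are illustrative."""
    rng = random.Random(seed)
    return [(rng.randrange(num_keys), rng.randrange(1000)) for _ in range(n)]


def sum_by_key(pairs):
    """Aggregate values per key directly, with no serialization involved."""
    totals = {}
    for key, value in pairs:
        totals[key] = totals.get(key, 0) + value
    return totals


def sum_by_key_with_roundtrip(pairs, batch_size=1024):
    """Same aggregation, but each batch is pickled and unpickled first,
    standing in for the hop between the JVM and a Python worker.
    The batch size is an assumption, not a Spark setting."""
    totals = {}
    for start in range(0, len(pairs), batch_size):
        payload = pickle.dumps(pairs[start:start + batch_size])
        for key, value in pickle.loads(payload):
            totals[key] = totals.get(key, 0) + value
    return totals


def timed(fn, *args):
    """Return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    pairs = make_pairs(1_000_000)  # smaller than the 10 million in the talk
    direct, t_direct = timed(sum_by_key, pairs)
    hopped, t_hopped = timed(sum_by_key_with_roundtrip, pairs)
    assert direct == hopped  # both paths must agree on the totals
    print(f"direct: {t_direct:.3f}s, with pickle round-trip: {t_hopped:.3f}s")
```

The round-trip version should take longer, since it does strictly more work per batch. How much longer depends on the machine, and the numbers say nothing about absolute Spark performance. DataFrame operations built from built-in expressions are planned and executed inside the JVM whichever binding is used, which fits with the slide showing the bindings level with each other there.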

At 30:23 he presents a slide on serialization performance, showing that 
Dataset encoders are way faster than Java and Kryo serialization.

But as always, the proof of the pudding is in the eating, so why don’t you try 
some samples and see for yourself?
I personally have found that my app runs a bit faster with the Scala version 
than the Java one, but I have not yet been able to figure out the reason.


From: ayan guha [mailto:guha.a...@gmail.com]
Sent: 02 September 2016 15:25
To: Tal Grynbaum
Cc: darren; Mich Talebzadeh; Jakob Odersky; kant kodali; AssafMendelson; user
Subject: Re: Scala Vs Python

Tal: I think by the nature of the project itself, the Python APIs are developed 
after Scala and Java, and that is a fair trade-off for the speed of getting 
stuff to market. And the more this discussion progresses, the less of an issue 
I see in terms of feature parity.

Coming back to performance, Darren raised a good point: if I can scale out, 
individual VM performance should not matter much. But performance is often 
stated as a definitive downside of using Python over Scala/Java. I am trying to 
understand the truth and myth behind this claim. Any pointers would be great.

best
Ayan

On Fri, Sep 2, 2016 at 4:10 PM, Tal Grynbaum 
<tal.grynb...@gmail.com> wrote:

On Fri, Sep 2, 2016 at 1:15 AM, darren 
<dar...@ontrenet.com> wrote:
This topic is a concern for us as well. In the data science world no one uses 
native Scala or Java by choice. It's R and Python. And Python is growing. Yet 
in Spark, Python is 3rd in line for feature support, if at all.

This is why we have decoupled from Spark in our project. It's really 
unfortunate the Spark team has invested so heavily in Scala.

As for speed, it comes from horizontal scaling and throughput. When you can 
scale outward, individual VM performance is less of an issue. Basic HPC 
principles.

Darren,

My guess is that data scientists who decouple themselves from Spark will 
eventually be left with more or less nothing (single-process capabilities, or 
purely performing HPC), unless some good Spark competitor emerges, which is 
unlikely simply because there is no need for one.
But putting guessing aside: the reason Python is 3rd in line for feature 
support is not that the Spark developers were busy with Scala. It is that the 
missing features are those that support strong typing, which is not relevant 
to Python. In other words, even if Spark were rewritten in Python and focused 
on Python only, you would still not get those features.



--
Tal Grynbaum / CTO & co-founder

m# +972-54-7875797
        mobile retention done right



--
Best Regards,
Ayan Guha
