I have seen a talk by Brian Clapper at NE Scala 2016: "RDDs, DataFrames and Datasets @ Apache Spark - NE Scala 2016".
At 15:00 there is a slide comparing the time to aggregate 10 million integer pairs using RDDs and DataFrames with the different language bindings (Scala, Python, R). According to this slide, the DataFrame APIs outperform RDDs, and DataFrame performance is the same across all the language bindings. RDDs with Python are way slower than the Scala version. So I guess there is some truth to the Scala bindings being faster in some cases.

At 30:23 he presents a slide on serialization performance, showing that Dataset encoders are way faster than Java and Kryo serialization.

But as always, the proof of the pudding is in the eating, so why not try some samples and see for yourself?

I personally have found that my app runs a bit faster with the Scala version than with Java, but I have not yet been able to figure out the reason.

From: ayan guha [mailto:guha.a...@gmail.com]
Sent: 02 September 2016 15:25
To: Tal Grynbaum
Cc: darren; Mich Talebzadeh; Jakob Odersky; kant kodali; AssafMendelson; user
Subject: Re: Scala Vs Python

Tal: I think by the nature of the project itself, the Python APIs are developed after Scala and Java, and it is a fair trade-off for the speed of getting stuff to market. And the more this discussion progresses, the less of an issue I see in terms of feature parity.

Coming back to performance, Darren raised a good point: if I can scale out, individual VM performance should not matter much. But performance is often stated as a definitive downside of using Python over Scala/Java. I am trying to understand the truth and myth behind this claim. Any pointer would be great.

best
Ayan

On Fri, Sep 2, 2016 at 4:10 PM, Tal Grynbaum <tal.grynb...@gmail.com> wrote:

On Fri, Sep 2, 2016 at 1:15 AM, darren <dar...@ontrenet.com> wrote:

This topic is a concern for us as well. In the data science world no one uses native Scala or Java by choice. It's R and Python. And Python is growing. Yet in Spark, Python is 3rd in line for feature support, if at all.
This is why we have decoupled from Spark in our project. It's really unfortunate that the Spark team has invested so heavily in Scala. As for speed, it comes from horizontal scaling and throughput. When you can scale outward, individual VM performance is less of an issue. Basic HPC principles.

Darren,

My guess is that data scientists who decouple themselves from Spark will eventually be left with more or less nothing (single-process capabilities, or purely performing HPC), unless some good Spark competitor emerges. That is unlikely, simply because there is no need for one.

But putting guessing aside: the reason Python is 3rd in line for feature support is not that the Spark developers were busy with Scala. It's that the missing features are those that support strong typing, which is not relevant to Python. In other words, even if Spark were rewritten in Python and focused on Python only, you would still not get those features.

--
Tal Grynbaum / CTO & co-founder
m# +972-54-7875797
mobile retention done right

--
Best Regards,
Ayan Guha
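
Following up on the suggestion at the top of the thread to try some samples: below is a minimal PySpark sketch of one way such a comparison could look. It sums integer values per key once with the RDD API and once with the DataFrame API, and times each. This is not the benchmark from the talk. The helper names (`make_pairs`, `timed`, `run_spark_comparison`), the data sizes, the local master setting, and the choice to cache the inputs before timing are all assumptions made for illustration, and a single timed run like this is only a rough indication.

```python
import random
import time


def make_pairs(n, num_keys=100, seed=42):
    """Generate n (key, value) integer pairs. Sizes and seed are arbitrary."""
    rng = random.Random(seed)
    return [(rng.randrange(num_keys), rng.randrange(1000)) for _ in range(n)]


def timed(fn):
    """Run fn() once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def run_spark_comparison(pairs):
    """Sum values per key with the RDD and DataFrame APIs; return both timings."""
    # Imported lazily so the helpers above can be used without Spark installed.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (SparkSession.builder
             .master("local[*]")          # assumption: local run, not a cluster
             .appName("rdd-vs-dataframe")
             .getOrCreate())
    try:
        # Cache and materialize both inputs so only the aggregation is timed.
        rdd = spark.sparkContext.parallelize(pairs).cache()
        rdd.count()
        df = spark.createDataFrame(pairs, ["k", "v"]).cache()
        df.count()

        def rdd_sum():
            # The Python lambda runs in Python worker processes.
            return dict(rdd.reduceByKey(lambda a, b: a + b).collect())

        def df_sum():
            # The aggregation is expressed declaratively and runs in the JVM.
            rows = df.groupBy("k").agg(F.sum("v").alias("total")).collect()
            return {row["k"]: row["total"] for row in rows}

        rdd_result, rdd_secs = timed(rdd_sum)
        df_result, df_secs = timed(df_sum)
        assert rdd_result == df_result  # both APIs must agree on the sums
        return rdd_secs, df_secs
    finally:
        spark.stop()
```

With PySpark installed, calling `run_spark_comparison(make_pairs(1_000_000))` returns the two elapsed times in seconds. Building the input from a Python list is itself slow for large sizes, so starting well below the 10 million pairs mentioned in the talk is probably sensible.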