As you point out, often the reason that Python support lags behind is that
functionality is implemented in Scala, so the API in that language comes
"for free" whereas Python support needs to be added explicitly. Nevertheless,
the Python bindings are an important part of Spark and are used by many
people (this info could be outdated, but Python used to be the second most
popular language after Scala). I expect Python support to only get better in
the future, so I think it is fair to say that Python is a first-class citizen
in Spark.

Regarding performance, the issue is more complicated. This is mostly due to
the fact that the actual execution of actions happens in JVM-land, and any
communication between Python and the JVM is expensive. So the question
basically boils down to "how often does Python need to communicate with the
JVM?" The answer depends on the Spark APIs you're using:

1. Plain old RDDs: for every function you pass to a transformation (filter,
map, etc.), the intermediate result is shipped to a Python interpreter, the
function is applied, and the result is shipped back to the JVM.
2. DataFrames with RDD-like transformations or user-defined functions
(UDFs): the same as point 1, the functions are applied in a Python
environment, so data needs to be transferred.
3. DataFrames with only SQL expressions: Spark's query optimizer takes care
of computing and executing an internal representation of your
transformations, and no data needs to move between Python and the JVM
(apart from the final results, if you ask for them, e.g. by calling
collect()). The sketch after this list illustrates all three cases.
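
To make this concrete, here is a minimal PySpark sketch of all three cases
(assuming a Spark 2.x-style SparkSession; the app and variable names are
just for illustration):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.functions import udf
    from pyspark.sql.types import LongType

    spark = SparkSession.builder.appName("py-jvm-roundtrips").getOrCreate()
    sc = spark.sparkContext

    # Case 1: plain RDDs. Every lambda runs in a Python worker, so each
    # partition is serialized from the JVM to Python and back.
    rdd_result = (sc.parallelize(range(1000))
                  .map(lambda x: x * 2)
                  .filter(lambda x: x % 3 == 0)
                  .collect())

    df = spark.range(1000)  # DataFrame with a single LongType column "id"

    # Case 2: DataFrame with a Python UDF. Same round trip as case 1,
    # because the UDF body can only run in a Python interpreter.
    double_py = udf(lambda x: x * 2, LongType())
    with_udf = df.select(double_py(F.col("id")).alias("doubled"))

    # Case 3: built-in SQL expressions only. The whole pipeline is compiled
    # and executed inside the JVM; Python just builds the plan.
    pure_expr = (df.select((F.col("id") * 2).alias("doubled"))
                 .filter(F.col("doubled") % 3 == 0))
    pure_expr.collect()  # data crosses into Python only here, at the action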

In cases 1 and 2 you will see a performance hit compared to the equivalent
Scala or Java versions. The difference in case 3 is negligible, as all
language APIs share the same backend. See this blog post from Databricks
for more detailed information:
https://databricks.com/blog/2015/04/24/recent-performance-improvements-in-apache-spark-sql-python-dataframes-and-more.html
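
One way to see the difference yourself is to compare the physical plans
(continuing the sketch above; the exact operator names vary between Spark
versions):

    # The pure-expression pipeline shows only JVM-side operators
    # (e.g. a Project/Filter over a Range scan).
    pure_expr.explain()

    # The UDF pipeline additionally contains a Python evaluation step
    # (called BatchEvalPython in some versions), which is where the
    # serialization cost lives.
    with_udf.explain()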

I hope this was the kind of information you were looking for. Please note,
however, that performance in Spark is a complex topic; the scenarios I
mentioned above should nevertheless give you a rule of thumb.

best,
--Jakob

On Thu, Sep 1, 2016 at 11:25 PM, ayan guha <guha.a...@gmail.com> wrote:

> Tal: I think, by the nature of the project itself, Python APIs are
> developed after Scala and Java, as a fair trade-off for speed of getting
> stuff to market. And as this discussion progresses, I see not much of an
> issue in terms of feature parity.
>
> Coming back to performance, Darren raised a good point: if I can scale
> out, individual VM performance should not matter much. But performance is
> often stated as a definitive downside of using Python over Scala/Java. I am
> trying to understand the truth and myth behind this claim. Any pointers
> would be great.
>
> best
> Ayan
>
> On Fri, Sep 2, 2016 at 4:10 PM, Tal Grynbaum <tal.grynb...@gmail.com>
> wrote:
>
>>
>> On Fri, Sep 2, 2016 at 1:15 AM, darren <dar...@ontrenet.com> wrote:
>>
>>> This topic is a concern for us as well. In the data science world no one
>>> uses native Scala or Java by choice. It's R and Python, and Python is
>>> growing. Yet in Spark, Python is 3rd in line for feature support, if at all.
>>>
>>> This is why we have decoupled from Spark in our project. It's really
>>> unfortunate the Spark team has invested so heavily in Scala.
>>>
>>> As for speed, it comes from horizontal scaling and throughput. When you
>>> can scale outward, individual VM performance is less of an issue. Basic
>>> HPC principles.
>>>
>>
>> Darren,
>>
>> My guess is that data scientists who decouple themselves from Spark will
>> eventually be left with more or less nothing (single-process capabilities,
>> or purely performing HPC), unless some good Spark competitor emerges
>> (unlikely, simply because there is no need for one).
>> But putting guessing aside: the reason Python is 3rd in line for feature
>> support is not because the Spark developers were busy with Scala; it's
>> because the features that are missing are those that support strong
>> typing, which is not relevant to Python. In other words, even if Spark
>> were rewritten in Python and were to focus on Python only, you would
>> still not get those features.
>>
>>
>>
>> --
>> *Tal Grynbaum* / *CTO & co-founder*
>>
>> m# +972-54-7875797
>>
>> mobile retention done right
>>
>
>
>
> --
> Best Regards,
> Ayan Guha
>
