Hello Joshua,

comments are inline...

> On Mar 1, 2016, at 5:03 AM, Joshua Sorrell <jsor...@gmail.com> wrote:
> 
> I haven't used Spark in the last year and a half. I am about to start a 
> project with a new team, and we need to decide whether to use pyspark or 
> Scala.

Indeed, good questions, and they come up a lot in trainings I have 
attended, where this inevitable question is raised.
I believe it depends on your comfort zone and your appetite for newer 
things.

True, for the most part the Apache Spark committers have been committed to 
keeping the APIs at parity across all the language offerings, even though in 
some cases, in particular Python, they have lagged by a minor release. The 
extent to which they’re committed to that parity is a good sign. It may not be 
the case with some experimental APIs, where Python lags behind, but for the 
most part they have been admirably consistent. 

With Python there’s a minor performance hit, since there’s an extra level of 
indirection in the architecture and an additional Python process that each 
executor launches to execute your pickled Python lambdas. Other than that it 
boils down to your comfort zone. I recommend looking at Sameer’s slides from 
the Advanced Spark for DevOps Training, where he walks through the PySpark 
architecture. 
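To make the pickling point concrete, here is a minimal sketch of the idea using 
only the standard library. (Spark itself uses cloudpickle, which can also 
serialize lambdas and closures; the function name below is just for 
illustration, not a Spark API.)

```python
import pickle

# A module-level function, standing in for the kind of callable you
# would pass to rdd.map(...).
def add_one(x):
    return x + 1

# "Driver" side: serialize the function into bytes for shipping.
payload = pickle.dumps(add_one)

# "Executor" side: the launched Python worker process deserializes
# the function and applies it to each element of its partition.
func = pickle.loads(payload)
result = [func(x) for x in [1, 2, 3]]
print(result)  # [2, 3, 4]
```

That serialize/ship/deserialize round trip, plus the extra Python process per 
executor, is where the performance hit comes from; Scala closures run directly 
on the JVM.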
> 
> We are NOT a java shop. So some of the build tools/procedures will require 
> some learning overhead if we go the Scala route. What I want to know is: is 
> the Scala version of Spark still far enough ahead of pyspark to be well worth 
> any initial training overhead?  

If you are a very advanced Python shop, have in-house libraries written in 
Python that don’t exist in Scala, or rely on ML libs with no Scala equivalent, 
and porting them would require a fair amount of work because the gap is too 
large, then perhaps it makes sense to stay put with Python.

However, I believe investing (or having some members of your group invest) in 
learning Scala is worthwhile for a few reasons. One, you will get a 
performance gain, especially now with Tungsten (I’m not sure how it relates to 
Python, but other knowledgeable people on the list, please chime in). Two, 
since Spark is written in Scala, it gives you an enormous advantage to read 
the sources (which are well documented and highly readable) should you have to 
consult or learn the nuances of a certain API method or action not covered 
comprehensively in the docs. And finally, there’s a long-term benefit in 
learning Scala for reasons other than Spark, for example, writing other 
scalable and distributed applications.
> 
> Particularly, we will be using Spark Streaming. I know a couple of years ago 
> that practically forced the decision to use Scala.  Is this still the case?

You’ll notice that certain API calls are not available, at least for now, in 
Python: http://spark.apache.org/docs/latest/streaming-programming-guide.html


Cheers
Jules

--
The Best Ideas Are Simple
Jules S. Damji
e-mail:dmat...@comcast.net
e-mail:jules.da...@gmail.com
