To add another note on the benefits of using Scala to build Spark, here is
a very interesting and well-written post
<http://databricks.com/blog/2014/06/02/exciting-performance-improvements-on-the-horizon-for-spark-sql.html>
on
the Databricks blog about how Scala 2.10's runtime reflection enables some
significant performance optimizations in Spark SQL.
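
For anyone curious what that looks like mechanically, here is a tiny
sketch (not the actual Spark SQL code path, just the underlying Scala
2.10 feature) of building an expression at runtime and compiling it to
bytecode with the reflection toolbox, instead of interpreting it:

    // Needs scala-compiler on the classpath for the toolbox.
    import scala.reflect.runtime.currentMirror
    import scala.tools.reflect.ToolBox

    object RuntimeCodegenSketch {
      def main(args: Array[String]): Unit = {
        val toolBox = currentMirror.mkToolBox()

        // Parse an expression into a syntax tree at runtime...
        val tree = toolBox.parse("(x: Int) => x * 2 + 1")

        // ...then compile it to real JVM bytecode and invoke it, rather
        // than interpreting the expression one evaluation at a time.
        val fn = toolBox.compile(tree)().asInstanceOf[Int => Int]
        println(fn(20)) // prints 41
      }
    }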


On Wed, Jun 4, 2014 at 10:15 PM, Jeremy Lee <unorthodox.engine...@gmail.com>
wrote:

> I'm still a Spark newbie, but I have a heavy background in languages and
> compilers... so take this with a barrel of salt...
>
> Scala, to me, is the heart and soul of Spark. Couldn't work without it.
> Procedural languages like Python, Java, and all the rest are lovely when
> you have a couple of processors, but they don't scale. (pun intended) It's
> the same reason they had to invent a slew of 'Shader' languages for GPU
> programming. In fact, that's how I see Scala, as the "CUDA" or "GLSL" of
> cluster computing.
>
> Now, Scala isn't perfect. It could learn a thing or two from OCCAM about
> interprocess communication. (And from node.js about package management.)
> But functional programming becomes essential for highly-parallel code
> because the primary difference is that functional declares _what_ you want
> to do, and procedural declares _how_ you want to do it.
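>
> To make that what-vs-how point concrete, here's a toy sketch in plain
> Scala (collections only, not Spark; the names are made up) where the
> functional version states the result and leaves the execution strategy
> to the library:
>
>     // Procedural: you spell out the iteration order yourself.
>     def sumOfSquaresLoop(xs: Array[Int]): Long = {
>       var total = 0L
>       var i = 0
>       while (i < xs.length) {
>         total += xs(i).toLong * xs(i)
>         i += 1
>       }
>       total
>     }
>
>     // Functional: declare what you want; a parallel or distributed
>     // runtime is free to partition and schedule the work as it likes.
>     def sumOfSquares(xs: Seq[Int]): Long =
>       xs.map(x => x.toLong * x).sum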
>
> Since you rarely know the shape of the cluster/graph ahead of time,
> functional programming becomes the superior paradigm, especially for the
> "outermost" parts of the program that interface with the scheduler. Python
> might be fine for the granular fragments, but you would have to export all
> those independent functions somehow, and define the scheduling and
> connective structure (the DAG) elsewhere, in yet another language or
> library.
>
> To fit neatly into GraphX, Python would probably have to be warped in the
> same way that GLSL is a stricter sub-set of C. You'd probably lose
> everything you like about the language, in order to make it seamless.
>
> I'm pretty agnostic about the whole Spark stack and its components (e.g.
> every time I run sbt/sbt assembly, Stuart Feldman dies a little inside and
> I get time to write another long email), but Scala is the one thing that
> gives it legs. I wish the rest of Spark was more like it. (i.e. 'no
> ceremony')
>
> Scala might seem 'weird', but that's because it directly exposes
> parallelism, and the ways to cope with it. I've done enough distributed
> programming that the advantages are obvious, for that domain. You're not
> being asked to re-wire your thinking for Scala's benefit, but to solve the
> underlying problem. (But you are still being asked to turn your thinking
> sideways, I will admit.)
>
> People love Python because it 'fits' its intended domain perfectly. That
> doesn't mean you'll love it just as much for embedded hardware, or GPU
> shader development, or Telecoms, or Spark.
>
> Then again, give me another week with the language, and see what I'm
> screaming about then ;-)
>
>
>
> On Thu, Jun 5, 2014 at 10:21 AM, John Omernik <j...@omernik.com> wrote:
>
>> Thank you for the response. If it helps at all: I demoed the Spark
>> platform for our data science team today. The idea of moving code from
>> batch testing to Machine Learning systems, GraphX, and then to
>> near-real-time models with streaming was cheered by the team as an
>> efficiency they would love. That said, most folks on our team are Python
>> junkies, and they love that Spark seems to be committing to Python, and
>> would REALLY love to see Python in Streaming; it would feel complete for
>> them from a
>> platform standpoint. It is still awesome using Scala, and many will learn
>> that, but that full Python integration/support, if possible, would be a
>> home run.
>>
>>
>>
>>
>> On Wed, Jun 4, 2014 at 7:06 PM, Matei Zaharia <matei.zaha...@gmail.com>
>> wrote:
>>
>>> We are definitely investigating a Python API for Streaming, but no
>>> announced deadline at this point.
>>>
>>> Matei
>>>
>>> On Jun 4, 2014, at 5:02 PM, John Omernik <j...@omernik.com> wrote:
>>>
>>> So Python is used in many of the Spark Ecosystem products, but not
>>> Streaming at this point. Is there a roadmap to include Python APIs in Spark
>>> Streaming? Any time frame on this?
>>>
>>> Thanks!
>>>
>>> John
>>>
>>>
>>> On Thu, May 29, 2014 at 4:19 PM, Matei Zaharia <matei.zaha...@gmail.com>
>>> wrote:
>>>
>>>> Quite a few people ask this question and the answer is pretty simple.
>>>> When we started Spark, we had two goals — we wanted to work with the Hadoop
>>>> ecosystem, which is JVM-based, and we wanted a concise programming
>>>> interface similar to Microsoft’s DryadLINQ (the first language-integrated
>>>> big data framework I know of, that begat things like FlumeJava and Crunch).
>>>> On the JVM, the only language that would offer that kind of API was Scala,
>>>> due to its ability to capture functions and ship them across the network.
>>>> Scala’s static typing also made it much easier to control performance
>>>> compared to, say, Jython or Groovy.
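>>>>
>>>> As a rough illustration of the "ship them across the network" point
>>>> (plain JVM serialization, not Spark's actual task-shipping code): a
>>>> Scala closure compiles down to an ordinary serializable object, so it
>>>> can be sent to another machine and applied there.
>>>>
>>>>     import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
>>>>     import java.io.{ObjectInputStream, ObjectOutputStream}
>>>>
>>>>     object ShipAFunctionSketch {
>>>>       def main(args: Array[String]): Unit = {
>>>>         val threshold = 10
>>>>         // The function literal captures `threshold` and compiles to a
>>>>         // plain, serializable JVM object.
>>>>         val predicate: Int => Boolean = x => x > threshold
>>>>
>>>>         // Round-trip it through serialization, as a cluster framework
>>>>         // would when sending a task to a remote worker.
>>>>         val buf = new ByteArrayOutputStream()
>>>>         new ObjectOutputStream(buf).writeObject(predicate)
>>>>         val in = new ObjectInputStream(new ByteArrayInputStream(buf.toByteArray))
>>>>         val shipped = in.readObject().asInstanceOf[Int => Boolean]
>>>>         println(shipped(42)) // prints true
>>>>       }
>>>>     }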
>>>>
>>>> In terms of usage, however, we see substantial use of our other
>>>> languages (Java and Python), and we’re continuing to invest in both. In a
>>>> user survey we did last fall, about 25% of users used Java and 30% used
>>>> Python, and I imagine these numbers are growing. With lambda expressions
>>>> now added to Java 8 (
>>>> http://databricks.com/blog/2014/04/14/Spark-with-Java-8.html), I think
>>>> we’ll see a lot more Java. And at Databricks I’ve seen a lot of interest in
>>>> Python, which is very exciting to us in terms of ease of use.
>>>>
>>>> Matei
>>>>
>>>> On May 29, 2014, at 1:57 PM, Benjamin Black <b...@b3k.us> wrote:
>>>>
>>>> HN is a cesspool safely ignored.
>>>>
>>>>
>>>> On Thu, May 29, 2014 at 1:55 PM, Nick Chammas <
>>>> nicholas.cham...@gmail.com> wrote:
>>>>
>>>>> I recently discovered Hacker News and started reading through older
>>>>> posts about Scala
>>>>> <https://hn.algolia.com/?q=scala#!/story/forever/0/scala>. It looks
>>>>> like the language is fairly controversial on there, and it got me 
>>>>> thinking.
>>>>>
>>>>> Scala appears to be the preferred language to work with in Spark, and
>>>>> Spark itself is written in Scala, right?
>>>>>
>>>>> I know that oftentimes a successful project evolves gradually out of
>>>>> something small, and that the choice of programming language may not 
>>>>> always
>>>>> have been made consciously at the outset.
>>>>>
>>>>> But pretending that it was, why is Scala the preferred language of
>>>>> Spark?
>>>>>
>>>>> Nick
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>
>
> --
> Jeremy Lee  BCompSci(Hons)
>   The Unorthodox Engineers
>
