To add another note on the benefits of using Scala to build Spark, here is a very interesting and well-written post <http://databricks.com/blog/2014/06/02/exciting-performance-improvements-on-the-horizon-for-spark-sql.html> on the Databricks blog about how Scala 2.10's runtime reflection enables some significant performance optimizations in Spark SQL.
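For readers curious what that reflection actually enables: Spark SQL can infer a table schema from an ordinary case class by inspecting its fields at runtime. The sketch below is a simplified illustration of that idea using `scala.reflect.runtime.universe` on a hypothetical `Person` record — it is not Spark SQL's actual implementation, just the flavor of what runtime reflection makes possible.

```scala
import scala.reflect.runtime.universe._

// Hypothetical record type; Spark SQL infers schemas from case classes like this.
case class Person(name: String, age: Int)

// Simplified sketch of reflection-based schema inference: list each
// case-class field with its type, in declaration order.
def schemaOf[T: TypeTag]: List[(String, String)] =
  typeOf[T].members.sorted.collect {
    case m: MethodSymbol if m.isCaseAccessor =>
      (m.name.toString, m.returnType.toString)
  }

println(schemaOf[Person])  // List((name,String), (age,Int))
```

The linked post describes how Spark SQL combines this kind of TypeTag-driven inspection with runtime code generation for its performance gains; the above only shows the inspection half.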
On Wed, Jun 4, 2014 at 10:15 PM, Jeremy Lee <unorthodox.engine...@gmail.com> wrote:

> I'm still a Spark newbie, but I have a heavy background in languages and
> compilers... so take this with a barrel of salt...
>
> Scala, to me, is the heart and soul of Spark. Couldn't work without it.
> Procedural languages like Python, Java, and all the rest are lovely when
> you have a couple of processors, but they don't scale. (pun intended) It's
> the same reason they had to invent a slew of 'shader' languages for GPU
> programming. In fact, that's how I see Scala: as the "CUDA" or "GLSL" of
> cluster computing.
>
> Now, Scala isn't perfect. It could learn a thing or two from occam about
> interprocess communication. (And from node.js about package management.)
> But functional programming becomes essential for highly parallel code,
> because the primary difference is that functional declares _what_ you want
> to do, and procedural declares _how_ you want to do it.
>
> Since you rarely know the shape of the cluster/graph ahead of time,
> functional programming becomes the superior paradigm, especially for the
> "outermost" parts of the program that interface with the scheduler. Python
> might be fine for the granular fragments, but you would have to export all
> those independent functions somehow, and define the scheduling and
> connective structure (the DAG) elsewhere, in yet another language or
> library.
>
> To fit neatly into GraphX, Python would probably have to be warped in the
> same way that GLSL is a stricter subset of C. You'd probably lose
> everything you like about the language in order to make it seamless.
>
> I'm pretty agnostic about the whole Spark stack and its components (e.g.
> every time I run sbt/sbt assembly, Stuart Feldman dies a little inside and
> I get time to write another long email), but Scala is the one thing that
> gives it legs. I wish the rest of Spark was more like it.
> (i.e. 'no ceremony')
>
> Scala might seem 'weird', but that's because it directly exposes
> parallelism and the ways to cope with it. I've done enough distributed
> programming that the advantages are obvious for that domain. You're not
> being asked to re-wire your thinking for Scala's benefit, but to solve the
> underlying problem. (But you are still being asked to turn your thinking
> sideways, I will admit.)
>
> People love Python because it fit its intended domain perfectly. That
> doesn't mean you'll love it just as much for embedded hardware, or GPU
> shader development, or telecoms, or Spark.
>
> Then again, give me another week with the language, and see what I'm
> screaming about then ;-)
>
> On Thu, Jun 5, 2014 at 10:21 AM, John Omernik <j...@omernik.com> wrote:
>
>> Thank you for the response. If it helps at all: I demoed the Spark
>> platform for our data science team today. The idea of moving code from
>> batch testing to machine-learning systems to GraphX, and then to
>> near-real-time models with streaming, was cheered by the team as an
>> efficiency they would love. That said, most folks on our team are Python
>> junkies, and they love that Spark seems to be committing to Python; they
>> would REALLY love to see Python in Streaming, as it would feel complete
>> for them from a platform standpoint. It is still awesome using Scala, and
>> many will learn it, but full Python integration/support, if possible,
>> would be a home run.
>>
>> On Wed, Jun 4, 2014 at 7:06 PM, Matei Zaharia <matei.zaha...@gmail.com>
>> wrote:
>>
>>> We are definitely investigating a Python API for Streaming, but no
>>> announced deadline at this point.
>>>
>>> Matei
>>>
>>> On Jun 4, 2014, at 5:02 PM, John Omernik <j...@omernik.com> wrote:
>>>
>>> So Python is used in many of the Spark ecosystem products, but not
>>> Streaming at this point. Is there a roadmap to include Python APIs in
>>> Spark Streaming? Any time frame on this?
>>>
>>> Thanks!
>>> John
>>>
>>> On Thu, May 29, 2014 at 4:19 PM, Matei Zaharia <matei.zaha...@gmail.com>
>>> wrote:
>>>
>>>> Quite a few people ask this question, and the answer is pretty simple.
>>>> When we started Spark, we had two goals: we wanted to work with the
>>>> Hadoop ecosystem, which is JVM-based, and we wanted a concise
>>>> programming interface similar to Microsoft's DryadLINQ (the first
>>>> language-integrated big data framework I know of, that begat things
>>>> like FlumeJava and Crunch). On the JVM, the only language that would
>>>> offer that kind of API was Scala, due to its ability to capture
>>>> functions and ship them across the network. Scala's static typing also
>>>> made it much easier to control performance compared to, say, Jython or
>>>> Groovy.
>>>>
>>>> In terms of usage, however, we see substantial usage of our other
>>>> languages (Java and Python), and we're continuing to invest in both. In
>>>> a user survey we did last fall, about 25% of users used Java and 30%
>>>> used Python, and I imagine these numbers are growing. With lambda
>>>> expressions now added to Java 8
>>>> (http://databricks.com/blog/2014/04/14/Spark-with-Java-8.html), I think
>>>> we'll see a lot more Java. And at Databricks I've seen a lot of
>>>> interest in Python, which is very exciting to us in terms of ease of
>>>> use.
>>>>
>>>> Matei
>>>>
>>>> On May 29, 2014, at 1:57 PM, Benjamin Black <b...@b3k.us> wrote:
>>>>
>>>> HN is a cesspool safely ignored.
>>>>
>>>> On Thu, May 29, 2014 at 1:55 PM, Nick Chammas
>>>> <nicholas.cham...@gmail.com> wrote:
>>>>
>>>>> I recently discovered Hacker News and started reading through older
>>>>> posts about Scala
>>>>> <https://hn.algolia.com/?q=scala#!/story/forever/0/scala>. It looks
>>>>> like the language is fairly controversial on there, and it got me
>>>>> thinking.
>>>>>
>>>>> Scala appears to be the preferred language to work with in Spark, and
>>>>> Spark itself is written in Scala, right?
>>>>> I know that oftentimes a successful project evolves gradually out of
>>>>> something small, and that the choice of programming language may not
>>>>> always have been made consciously at the outset.
>>>>>
>>>>> But pretending that it was: why is Scala the preferred language of
>>>>> Spark?
>>>>>
>>>>> Nick
>>>>>
>>>>> ------------------------------
>>>>> View this message in context: Why Scala?
>>>>> <http://apache-spark-user-list.1001560.n3.nabble.com/Why-Scala-tp6536.html>
>>>>> Sent from the Apache Spark User List mailing list archive
>>>>> <http://apache-spark-user-list.1001560.n3.nabble.com/> at Nabble.com.
>
> --
> Jeremy Lee  BCompSci(Hons)
> The Unorthodox Engineers
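A postscript on Matei's point above about Scala's "ability to capture functions and ship them across the network": a Scala function literal compiles to a serializable object, so a closure and the values it captures can be written to bytes and reconstituted elsewhere. The following is a minimal single-process sketch of that round trip using plain JDK serialization; Spark's real task shipping adds closure cleaning and a pluggable serializer on top of this basic mechanism.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// A closure capturing a local value, like a function passed to RDD.map.
val factor = 3
val times: Int => Int = x => x * factor

// Serialize the function object to bytes, roughly as the driver does
// before shipping a task...
val bytes = new ByteArrayOutputStream()
new ObjectOutputStream(bytes).writeObject(times)

// ...then deserialize and apply it, as an executor would.
val restored = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
  .readObject().asInstanceOf[Int => Int]
println(restored(7))  // 21
```

The classic failure mode hinted at by Jeremy's "interprocess communication" remark shows up here too: if the closure accidentally captures a non-serializable enclosing object, `writeObject` throws `NotSerializableException` — the source of many a Spark error message.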