Re: Why Scala?
To add another note on the benefits of using Scala to build Spark: here is a very interesting and well-written post on the Databricks blog about how Scala 2.10's runtime reflection enables some significant performance optimizations in Spark SQL: http://databricks.com/blog/2014/06/02/exciting-performance-improvements-on-the-horizon-for-spark-sql.html
Re: Why Scala?
So Python is used in many of the Spark ecosystem products, but not Streaming at this point. Is there a roadmap for including Python APIs in Spark Streaming? Any time frame on this? Thanks! John
Re: Why Scala?
We are definitely investigating a Python API for Streaming, but no announced deadline at this point. Matei
Re: Why Scala?
Thank you for the response. If it helps at all: I demoed the Spark platform for our data science team today. The idea of moving code from batch testing to machine learning systems, GraphX, and then to near-real-time models with streaming was cheered by the team as an efficiency they would love. That said, most folks on our team are Python junkies; they love that Spark seems to be committing to Python, and they would REALLY love to see Python in Streaming. It would feel complete for them from a platform standpoint. It is still awesome using Scala, and many will learn it, but full Python integration/support, if possible, would be a home run.
Re: Why Scala?
I'm still a Spark newbie, but I have a heavy background in languages and compilers... so take this with a barrel of salt... Scala, to me, is the heart and soul of Spark. Couldn't work without it.

Procedural languages like Python, Java, and all the rest are lovely when you have a couple of processors, but they don't scale. (Pun intended.) It's the same reason they had to invent a slew of 'shader' languages for GPU programming. In fact, that's how I see Scala: as the CUDA or GLSL of cluster computing.

Now, Scala isn't perfect. It could learn a thing or two from OCCAM about interprocess communication. (And from node.js about package management.) But functional programming becomes essential for highly parallel code, because the primary difference is that functional declares _what_ you want to do, while procedural declares _how_ you want to do it. Since you rarely know the shape of the cluster/graph ahead of time, functional programming becomes the superior paradigm, especially for the outermost parts of the program that interface with the scheduler.

Python might be fine for the granular fragments, but you would have to export all those independent functions somehow, and define the scheduling and connective structure (the DAG) elsewhere, in yet another language or library. To fit neatly into GraphX, Python would probably have to be warped in the same way that GLSL is a stricter subset of C. You'd probably lose everything you like about the language in order to make it seamless.

I'm pretty agnostic about the whole Spark stack and its components (e.g.: every time I run sbt/sbt assembly, Stuart Feldman dies a little inside and I get time to write another long email), but Scala is the one thing that gives it legs. I wish the rest of Spark was more like it. (i.e.: 'no ceremony')

Scala might seem 'weird', but that's because it directly exposes parallelism, and the ways to cope with it. I've done enough distributed programming that the advantages are obvious, for that domain.
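Jeremy's what-vs-how distinction can be sketched even on a single JVM with Java 8 streams (a hypothetical stand-alone example, not Spark code): the loop pins down an execution order, while the stream pipeline only declares the computation, leaving the runtime free to partition it.

```java
import java.util.stream.IntStream;

public class DeclarativeDemo {

    // Procedural: spells out *how* -- an explicit loop over mutable state,
    // whose fixed iteration order the runtime cannot redistribute.
    static long sumOfSquaresLoop(int n) {
        long sum = 0;
        for (int i = 1; i <= n; i++) {
            sum += (long) i * i;
        }
        return sum;
    }

    // Declarative: states *what* -- square each number and sum them.
    // The runtime may split the range across cores; in Spark an
    // equivalent pipeline is split across a cluster.
    static long sumOfSquaresStream(int n) {
        return IntStream.rangeClosed(1, n)
                        .parallel()
                        .mapToLong(i -> (long) i * i)
                        .sum();
    }

    public static void main(String[] args) {
        System.out.println(sumOfSquaresLoop(1000));   // 333833500
        System.out.println(sumOfSquaresStream(1000)); // 333833500
    }
}
```

Both give the same answer, but only the second version leaves the scheduling decision to the framework.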
You're not being asked to re-wire your thinking for Scala's benefit, but to solve the underlying problem. (But you are still being asked to turn your thinking sideways, I will admit.) People love Python because it 'fit' its intended domain perfectly. That doesn't mean you'll love it just as much for embedded hardware, or GPU shader development, or telecoms, or Spark. Then again, give me another week with the language, and see what I'm screaming about then ;-)
Re: Why Scala?
Quite a few people ask this question and the answer is pretty simple. When we started Spark, we had two goals — we wanted to work with the Hadoop ecosystem, which is JVM-based, and we wanted a concise programming interface similar to Microsoft’s DryadLINQ (the first language-integrated big data framework I know of, that begat things like FlumeJava and Crunch). On the JVM, the only language that would offer that kind of API was Scala, due to its ability to capture functions and ship them across the network. Scala’s static typing also made it much easier to control performance compared to, say, Jython or Groovy. In terms of usage, however, we see substantial usage of our other languages (Java and Python), and we’re continuing to invest in both. In a user survey we did last fall, about 25% of users used Java and 30% used Python, and I imagine these numbers are growing. With lambda expressions now added to Java 8 (http://databricks.com/blog/2014/04/14/Spark-with-Java-8.html), I think we’ll see a lot more Java. And at Databricks I’ve seen a lot of interest in Python, which is very exciting to us in terms of ease of use. Matei On May 29, 2014, at 1:57 PM, Benjamin Black b...@b3k.us wrote: HN is a cesspool safely ignored. On Thu, May 29, 2014 at 1:55 PM, Nick Chammas nicholas.cham...@gmail.com wrote: I recently discovered Hacker News and started reading through older posts about Scala. It looks like the language is fairly controversial on there, and it got me thinking. Scala appears to be the preferred language to work with in Spark, and Spark itself is written in Scala, right? I know that often times a successful project evolves gradually out of something small, and that the choice of programming language may not always have been made consciously at the outset. But pretending that it was, why is Scala the preferred language of Spark? Nick View this message in context: Why Scala? Sent from the Apache Spark User List mailing list archive at Nabble.com.
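Matei's point about "capturing functions and shipping them across the network" rests on JVM serialization of a closure together with its captured environment. A minimal single-process sketch (the class and method names are made up for illustration; Spark's real machinery is more involved): a lambda cast to a serializable type survives a byte-array round trip with its captured variable intact, which is essentially what happens when a driver ships a task to an executor.

```java
import java.io.*;
import java.util.function.Function;

public class ClosureShip {

    // Serialize an object to bytes and read it back, standing in for a
    // network hop from driver to executor.
    @SuppressWarnings("unchecked")
    static <T> T roundTrip(T obj) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(buf)) {
                out.writeObject(obj);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(buf.toByteArray()))) {
                return (T) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        int factor = 3; // captured by the closure below
        // The intersection cast makes the lambda implement Serializable,
        // so its captured environment travels with it.
        Function<Integer, Integer> f =
                (Function<Integer, Integer> & Serializable) x -> x * factor;

        Function<Integer, Integer> shipped = roundTrip(f);
        System.out.println(shipped.apply(14)); // 42
    }
}
```

In Java 8 this requires the explicit `& Serializable` cast; Scala function literals are serializable out of the box, which is one reason the Spark API felt so natural there first.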
Re: Why Scala?
There were a few known concerns about Scala, and some still remain, but having been doing Scala professionally for over two years now, I have learned to master and appreciate the advantages.

The major concern IMO is Scala in a less-than-scrupulous corporate environment. First, Scala requires significantly more discipline in commenting and style to stay painlessly readable than Java does. People with less-than-stellar code hygiene can easily turn a project into an unmaintainable mess. Second, from a corporate management perspective, it is (still?) much harder to staff with Scala coders than with Java ones. All these things are a headache for corporate bosses, but for public and academic projects, with thorough peer review and an increased desire for contributors to look clean in public, it works out quite well, and the strong sides really shine.

Spark specifically builds around FP patterns -- such as monads and functors -- which were absent in Java prior to 8 (I am not sure they are as well worked out even now in Java 8 collections as they are in Scala collections). So Java 8 simply comes a little late to the show in that department. Also, FP is not the only thing used by Spark: Spark also uses stuff like implicits and the Akka/agent framework for IPC. Let's not forget that FP, albeit important, is only one of many stories in Scala in the grand scale of things.

On Thu, May 29, 2014 at 1:55 PM, Nick Chammas nicholas.cham...@gmail.com wrote: I recently discovered Hacker News and started reading through older posts about Scala (https://hn.algolia.com/?q=scala#!/story/forever/0/scala). It looks like the language is fairly controversial on there, and it got me thinking. Scala appears to be the preferred language to work with in Spark, and Spark itself is written in Scala, right? I know that oftentimes a successful project evolves gradually out of something small, and that the choice of programming language may not always have been made consciously at the outset. But pretending that it was, why is Scala the preferred language of Spark? Nick

--
View this message in context: Why Scala? http://apache-spark-user-list.1001560.n3.nabble.com/Why-Scala-tp6536.html
Sent from the Apache Spark User List mailing list archive http://apache-spark-user-list.1001560.n3.nabble.com/ at Nabble.com.
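The "monads and functors" mentioned above did in fact land in Java 8's standard library, e.g. `Optional.map` (functor) and `Optional.flatMap` (monadic bind). A minimal sketch, assuming nothing beyond the JDK (the `parse` helper is a made-up name for illustration):

```java
import java.util.Optional;

public class MonadDemo {

    // A computation that may fail, returned as an Optional rather than
    // thrown as an exception.
    static Optional<Integer> parse(String s) {
        try {
            return Optional.of(Integer.parseInt(s));
        } catch (NumberFormatException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // Functor: map applies a function inside the container.
        Optional<Integer> doubled = Optional.of(21).map(x -> x * 2);

        // Monad: flatMap chains computations that may each come up empty.
        Optional<Integer> parsed = Optional.of("42").flatMap(MonadDemo::parse);

        System.out.println(doubled.get());  // 42
        System.out.println(parsed.get());   // 42
        System.out.println(Optional.of("nope").flatMap(MonadDemo::parse).isPresent()); // false
    }
}
```

Scala's `Option`, collections, and for-comprehensions offered this style years earlier, which is the "comes a little late to the show" point.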
Re: Why Scala?
Nicholas, Good question. A couple of thoughts from my practical experience:

- Coming from R, Scala feels more natural than other languages. The functional succinctness of Scala is better suited for data science than other languages. In short, Scala-Spark makes sense for data science, ML, data exploration, et al.
- Having said that, occasionally practicality does trump the choice of a language - last time I really wanted to use Scala but ended up writing in Python! Hope to get a better result this time.
- Language evolution is more of a long-term granularity - we do underestimate the velocity impact. I have seen evolutions through languages starting from Cobol, CCP/M Basic, Turbo Pascal, ... I think Scala will find its equilibrium sooner than we think ...

Cheers
k/

On Thu, May 29, 2014 at 5:54 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Thank you for the specific points about the advantages Scala provides over other languages. Looking at several code samples, the reduction of boilerplate code over Java is one of the biggest plusses, to me.

On Thu, May 29, 2014 at 8:10 PM, Marek Kolodziej mkolod@gmail.com wrote: I would advise others to form their opinions based on experiencing it for themselves, rather than reading what random people say on Hacker News. :)

Just a nitpick here: what I said was "It looks like the language is fairly controversial on [Hacker News]." That was just an observation of what I saw on HN, not a statement of my opinion. I know very little about Scala (or Java, for that matter) and definitely don't have a well-formed opinion on the matter. Nick