Re: Why Scala?

2014-06-06 Thread Nicholas Chammas
To add another note on the benefits of using Scala to build Spark, here is
a very interesting and well-written post on the Databricks blog about how
Scala 2.10's runtime reflection enables some significant performance
optimizations in Spark SQL:
http://databricks.com/blog/2014/06/02/exciting-performance-improvements-on-the-horizon-for-spark-sql.html



Re: Why Scala?

2014-06-04 Thread John Omernik
So Python is used in many of the Spark Ecosystem products, but not
Streaming at this point. Is there a roadmap to include Python APIs in Spark
Streaming? Any time frame on this?

Thanks!

John








Re: Why Scala?

2014-06-04 Thread Matei Zaharia
We are definitely investigating a Python API for Streaming, but no announced 
deadline at this point.

Matei




Re: Why Scala?

2014-06-04 Thread John Omernik
Thank you for the response. If it helps at all: I demoed the Spark platform
for our data science team today. The idea of moving code from batch
testing, to Machine Learning systems, GraphX, and then to near-real-time
models with streaming was cheered by the team as an efficiency they would
love. That said, most folks on our team are Python junkies, and they love
that Spark seems to be committing to Python. They would REALLY love to see
Python in Streaming; it would feel complete for them from a platform
standpoint. It is still awesome using Scala, and many will learn it, but
full Python integration/support, if possible, would be a home run.












Re: Why Scala?

2014-06-04 Thread Jeremy Lee
I'm still a Spark newbie, but I have a heavy background in languages and
compilers... so take this with a barrel of salt...

Scala, to me, is the heart and soul of Spark. Couldn't work without it.
Procedural languages like Python, Java, and all the rest are lovely when
you have a couple of processors, but they don't scale. (Pun intended.) It's
the same reason they had to invent a slew of 'Shader' languages for GPU
programming. In fact, that's how I see Scala, as the CUDA or GLSL of
cluster computing.

Now, Scala isn't perfect. It could learn a thing or two from OCCAM about
interprocess communication. (And from node.js about package management.)
But functional programming becomes essential for highly-parallel code
because the primary difference is that functional declares _what_ you want
to do, and procedural declares _how_ you want to do it.
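
As a rough illustration of the what-versus-how distinction, here is a minimal sketch in plain Python (chosen only because the thread also discusses Spark's Python API; no Spark is involved). The list `readings` and both variable names are made up for the example. The loop fixes the iteration order and the running total, while the second version only states which elements to keep, how to transform them, and how to combine them, which in principle leaves a runtime free to split the work.

```python
from functools import reduce

# Made-up sample data for the illustration.
readings = [3, 8, 1, 12, 7, 10]

# Procedural: spells out *how*. The loop order and the running total
# are part of the program text.
total_how = 0
for value in readings:
    if value % 2 == 0:
        total_how += value * value

# Functional: states *what*. Keep the even values, square them, and
# combine the squares with addition.
evens = filter(lambda v: v % 2 == 0, readings)
squares = map(lambda v: v * v, evens)
total_what = reduce(lambda a, b: a + b, squares, 0)

print(total_how, total_what)  # prints "308 308"
```

Because addition is associative, the functional version could be evaluated on separate chunks of the data and the partial sums merged afterwards; the loop as written makes no such promise.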

Since you rarely know the shape of the cluster/graph ahead of time,
functional programming becomes the superior paradigm, especially for the
outermost parts of the program that interface with the scheduler. Python
might be fine for the granular fragments, but you would have to export all
those independent functions somehow, and define the scheduling and
connective structure (the DAG) elsewhere, in yet another language or
library.
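
To make the point about the connective structure concrete, here is a small sketch, again in plain Python, of a pipeline that is recorded first and executed later, one partition at a time. The `Plan` class and its methods are hypothetical names invented for this example; this is not how Spark or GraphX is implemented, only an illustration of describing the work separately from deciding where it runs.

```python
class Plan:
    """Records transformations instead of running them (hypothetical helper)."""

    def __init__(self, steps=()):
        self.steps = tuple(steps)

    def map(self, fn):
        # Return a new plan with one more recorded step.
        return Plan(self.steps + (("map", fn),))

    def filter(self, fn):
        return Plan(self.steps + (("filter", fn),))

    def run_on(self, chunk):
        # A scheduler could call this once per partition, on any worker.
        data = iter(chunk)
        for kind, fn in self.steps:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return list(data)


# Describe the computation once, without knowing how the data is split.
plan = Plan().filter(lambda v: v % 2 == 0).map(lambda v: v * v)

# Pretend these two lists live on different machines.
partitions = [[3, 8, 1], [12, 7, 10]]
partial = [plan.run_on(p) for p in partitions]
result = [x for part in partial for x in part]

print(result)  # prints [64, 144, 100]
```

The plan itself never mentions how many partitions exist, which is the property the paragraph above attributes to the functional style.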

To fit neatly into GraphX, Python would probably have to be warped in the
same way that GLSL is a stricter subset of C. You'd probably lose
everything you like about the language in order to make it seamless.

I'm pretty agnostic about the whole Spark stack and its components (e.g.,
every time I run sbt/sbt assembly, Stuart Feldman dies a little inside and
I get time to write another long email), but Scala is the one thing that
gives it legs. I wish the rest of Spark was more like it (i.e., 'no
ceremony').

Scala might seem 'weird', but that's because it directly exposes
parallelism, and the ways to cope with it. I've done enough distributed
programming that the advantages are obvious, for that domain. You're not
being asked to re-wire your thinking for Scala's benefit, but to solve the
underlying problem. (But you are still being asked to turn your thinking
sideways, I will admit.)

People love Python because it 'fits' its intended domain perfectly. That
doesn't mean you'll love it just as much for embedded hardware, or GPU
shader development, or Telecoms, or Spark.

Then again, give me another week with the language, and see what I'm
screaming about then ;-)




Re: Why Scala?

2014-05-29 Thread Matei Zaharia
Quite a few people ask this question and the answer is pretty simple. When we 
started Spark, we had two goals — we wanted to work with the Hadoop ecosystem, 
which is JVM-based, and we wanted a concise programming interface similar to 
Microsoft’s DryadLINQ (the first language-integrated big data framework I know 
of, that begat things like FlumeJava and Crunch). On the JVM, the only language 
that would offer that kind of API was Scala, due to its ability to capture 
functions and ship them across the network. Scala’s static typing also made it 
much easier to control performance compared to, say, Jython or Groovy.
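
For readers unsure what "capture functions and ship them" involves, here is a minimal sketch in plain Python (used only because the thread also covers Spark's Python API). It is an illustration of the idea, not Spark's mechanism: the helper names `make_scaler`, `pack`, and `unpack` are invented for the example, and it assumes the captured values are simple ones that `marshal` can encode and that both sides run the same Python version.

```python
import marshal
import types


def make_scaler(factor):
    # The returned function closes over `factor`.
    def scale(x):
        return x * factor
    return scale


def pack(func):
    """Turn a closure into bytes: its code object plus the captured values."""
    captured = [cell.cell_contents for cell in (func.__closure__ or ())]
    return marshal.dumps((func.__code__, captured))


def unpack(payload):
    """Rebuild the function on the 'receiving' side from the bytes."""
    code, captured = marshal.loads(payload)
    cells = tuple(types.CellType(value) for value in captured)
    return types.FunctionType(code, globals(), code.co_name, None, cells)


payload = pack(make_scaler(3))   # bytes that could be written to a socket
remote_scale = unpack(payload)   # pretend this runs on another machine

print(remote_scale(14))  # prints 42
```

The point of the sketch is that both the code and the captured environment have to travel together; a language or runtime that handles this for arbitrary user functions makes a concise, closure-based API practical.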

In terms of usage, however, we see substantial usage of our other languages 
(Java and Python), and we’re continuing to invest in both. In a user survey we 
did last fall, about 25% of users used Java and 30% used Python, and I imagine 
these numbers are growing. With lambda expressions now added to Java 8 
(http://databricks.com/blog/2014/04/14/Spark-with-Java-8.html), I think we’ll 
see a lot more Java. And at Databricks I’ve seen a lot of interest in Python, 
which is very exciting to us in terms of ease of use.

Matei

On May 29, 2014, at 1:57 PM, Benjamin Black b...@b3k.us wrote:

 HN is a cesspool safely ignored.
 
 
 On Thu, May 29, 2014 at 1:55 PM, Nick Chammas nicholas.cham...@gmail.com 
 wrote:
 I recently discovered Hacker News and started reading through older posts 
 about Scala. It looks like the language is fairly controversial on there, and 
 it got me thinking.
 
 Scala appears to be the preferred language to work with in Spark, and Spark 
 itself is written in Scala, right?
 
 I know that often times a successful project evolves gradually out of 
 something small, and that the choice of programming language may not always 
 have been made consciously at the outset.
 
 But pretending that it was, why is Scala the preferred language of Spark?
 
 Nick
 
 
 View this message in context: Why Scala?
 Sent from the Apache Spark User List mailing list archive at Nabble.com.
 



Re: Why Scala?

2014-05-29 Thread Dmitriy Lyubimov
There were a few known concerns about Scala, and some remain, but having
done Scala professionally for over two years now, I have learned to master
and appreciate its advantages.

The major concern, IMO, is Scala in a less-than-scrupulous corporate
environment.

First, Scala requires significantly more discipline in commenting and style
than Java to stay painlessly readable. People with less than stellar
code hygiene can easily turn a project into an unmaintainable mess.

Second, from a corporate management perspective, it is (still?) much harder
to staff with Scala coders than with Java ones.

All these things are a headache for corporate bosses, but for public and
academic projects, with thorough peer review and contributors' increased
desire to look clean in public, it works out quite well, and Scala's
strengths really shine.

Spark specifically builds around FP patterns -- such as monads and functors
-- which were absent in Java prior to 8 (I am not sure they are as well
worked out in Java 8 collections even now as they are in Scala
collections). So Java 8 simply comes a little late to the show in that
department.

Also, FP is not the only thing that Spark uses. Spark also uses things
like implicits and the Akka actor framework for IPC. Let's not forget that
FP, albeit important, is only one of many stories in Scala in the grand
scheme of things.


On Thu, May 29, 2014 at 1:55 PM, Nick Chammas nicholas.cham...@gmail.com wrote:

 I recently discovered Hacker News and started reading through older posts
 about Scala https://hn.algolia.com/?q=scala#!/story/forever/0/scala. It
 looks like the language is fairly controversial on there, and it got me
 thinking.

 Scala appears to be the preferred language to work with in Spark, and
 Spark itself is written in Scala, right?

 I know that often times a successful project evolves gradually out of
 something small, and that the choice of programming language may not always
 have been made consciously at the outset.

 But pretending that it was, why is Scala the preferred language of Spark?

 Nick


 --
 View this message in context: Why Scala?
 http://apache-spark-user-list.1001560.n3.nabble.com/Why-Scala-tp6536.html
 Sent from the Apache Spark User List mailing list archive
 http://apache-spark-user-list.1001560.n3.nabble.com/ at Nabble.com.



Re: Why Scala?

2014-05-29 Thread Krishna Sankar
Nicholas,
   Good question. A couple of thoughts from my practical experience:

   - Coming from R, Scala feels more natural than other languages. The
   functional succinctness of Scala is better suited to Data Science than
   other languages. In short, Scala-Spark makes sense for Data Science, ML,
   Data Exploration, et al.
   - Having said that, occasionally practicality does trump the choice of
   language: last time I really wanted to use Scala but ended up writing
   in Python! Hope to get a better result this time.
   - Language evolution is more of a long-term granularity; we do
   underestimate the velocity impact. Have seen evolutions through languages
   starting from Cobol, CCP/M Basic, Turbo Pascal, ... I think Scala will find
   its equilibrium sooner than we think ...

Cheers
k/


On Thu, May 29, 2014 at 5:54 PM, Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 Thank you for the specific points about the advantages Scala provides over
 other languages. Looking at several code samples, the reduction of
 boilerplate code over Java is one of the biggest plusses, to me.

 On Thu, May 29, 2014 at 8:10 PM, Marek Kolodziej mkolod@gmail.com
 wrote:

 I would advise others to form their opinions based on experiencing it for
 themselves, rather than reading what random people say on Hacker News. :)


 Just a nitpick here: What I said was "It looks like the language is fairly
 controversial on [Hacker News]." That was just an observation of what I saw
 on HN, not a statement of my opinion. I know very little about Scala (or
 Java, for that matter) and definitely don't have a well-formed opinion on
 the matter.

 Nick