Re: why is spark + scala code so slow, compared to python?

2014-12-12 Thread rzykov
Try this
https://github.com/RetailRocket/SparkMultiTool

This loader fixed slow reads of a large data set stored as many small files in HDFS.
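
For context: with the usual text input formats, each small file becomes at least one input split and therefore at least one task, so per-task overhead can dominate the read. A general remedy is to pack many small files into fewer, larger splits. The sketch below only illustrates that idea in plain Python. It is not SparkMultiTool's code, and the helper name `pack_into_splits` and the 128 MB target are assumptions chosen for illustration.

```python
def pack_into_splits(file_sizes, target_bytes=128 * 1024 * 1024):
    """Greedily group (name, size) pairs into splits of roughly target_bytes.

    Hypothetical helper for illustration; files are taken in the order given.
    """
    splits, current, current_size = [], [], 0
    for name, size in file_sizes:
        # Start a new split when adding this file would exceed the target.
        if current and current_size + size > target_bytes:
            splits.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        splits.append(current)
    return splits


# 1000 files of 1 MB each: one task per file would mean 1000 tasks.
files = [(f"part-{i:04d}", 1024 * 1024) for i in range(1000)]
splits = pack_into_splits(files)
print(len(files), "files ->", len(splits), "splits")
```

Fewer, larger splits mean fewer tasks to schedule, which is where the time usually goes when the files are tiny.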



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/why-is-spark-scala-code-so-slow-compared-to-python-tp20636p20657.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



why is spark + scala code so slow, compared to python?

2014-12-11 Thread ll
hi.. i'm converting some of my machine learning python code into scala +
spark.  i haven't been able to run it on a large dataset yet, but on small
datasets (like http://yann.lecun.com/exdb/mnist/), my spark + scala code is
much slower than my python code (5 to 10 times slower than python)

i already tried everything to improve my spark + scala code like
broadcasting variables, caching the RDD, replacing all my matrix/vector
operations with breeze/blas, etc.  i saw some improvements, but it's still a
lot slower than my python code.

why is that?  

how do you improve your spark + scala performance today?  

or is spark + scala just not the right tool for small to medium datasets?

when would you use spark + scala vs. python?

thanks!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/why-is-spark-scala-code-so-slow-compared-to-python-tp20636.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: why is spark + scala code so slow, compared to python?

2014-12-11 Thread Natu Lauchande
Are you using Scala in a distributed environment or in standalone mode?

Natu





Re: why is spark + scala code so slow, compared to python?

2014-12-11 Thread Duy Huynh
both.

first, the distributed version is so much slower than python.  i tried a
few things like broadcasting variables, replacing Seq with Array, and a few
other little things.  it helps to improve the performance, but still slower
than the python code.

so, i wrote a local version that's pretty much just running a bunch of
breeze/blas operations.  i guess that's purely scala (no spark).  this
local version is faster than the distributed version but still much slower
than the python code.












Re: why is spark + scala code so slow, compared to python?

2014-12-11 Thread Duy Huynh
just to give some reference point.  with the same algorithm running on
mnist dataset.

1.  python implementation:  ~10 milliseconds per iteration (can be faster if
i switch to gpu)

2.  local version (scala + breeze):  ~2 seconds per iteration

3.  distributed version (spark + scala + breeze):  15 seconds per iteration

i love spark and really enjoy writing scala code.  but this huge difference
in performance makes it really hard to do any kind of machine learning work.










Re: why is spark + scala code so slow, compared to python?

2014-12-11 Thread Sean Owen
In general, you would not expect a distributed computation framework
to perform nearly as fast as a non-distributed one, when both are run
on one machine. Spark has so much more overhead that doesn't go away
just because it's on one machine. Of course, that's the very reason it
scales past one machine too.
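
A rough way to picture this trade-off is a toy cost model: per-iteration time is a fixed framework overhead plus the compute time divided across workers. The model and every number in it are assumptions made up for illustration, not measurements of Spark.

```python
def iteration_time(work_seconds, workers=1, overhead_seconds=0.0):
    """Toy model: fixed overhead plus perfectly parallel work."""
    return overhead_seconds + work_seconds / workers


# Hypothetical figures: 1 s of scheduling/serialization overhead per iteration
# for the distributed run, none for the local run.
OVERHEAD = 1.0
for work in (0.01, 1.0, 100.0):  # seconds of compute per iteration
    local = iteration_time(work)
    distributed = iteration_time(work, workers=10, overhead_seconds=OVERHEAD)
    winner = "local" if local < distributed else "distributed"
    print(f"work={work:>6}s  local={local:.2f}s  distributed={distributed:.2f}s  -> {winner}")
```

In this toy model the distributed run only wins once the work per iteration is large compared with the overhead.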

That said, you may also not be using Spark optimally, whereas you
probably use the tool you know optimally. You may not be comparing
algorithms apples-to-apples either.
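
One way to get closer to an apples-to-apples comparison is to time the same step, for the same number of iterations, in each implementation, after a few warm-up runs, and compare medians. A minimal Python sketch follows. `time_per_iteration` and `step` are hypothetical names, and the warm-up and iteration counts are arbitrary assumptions. The same pattern would have to be repeated on the Scala side, where warm-up matters more because of JIT compilation.

```python
import statistics
import time


def time_per_iteration(step, iterations=20, warmup=3):
    """Return the median wall-clock seconds of one call to step()."""
    for _ in range(warmup):
        step()  # discard warm-up runs (caches, lazy initialization)
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        step()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def step():
    # Stand-in workload for one training iteration.
    return sum(i * i for i in range(10_000))


print(f"median: {time_per_iteration(step) * 1000:.3f} ms per iteration")
```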

Don't use Spark for its own sake. Use it because you need it for the
things it does, like scaling up or integrating with other components.
If you really have a small, isolated ML problem, you probably want to
use your familiar local tools.




Re: why is spark + scala code so slow, compared to python?

2014-12-11 Thread Andy Wagner
This is showing a factor of 200 between python and scala and 1500 when
distributed.

Is this really accurate?
If not, what is the real performance difference expected on average between
the 3 cases?
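
For reference, the ratios implied by the per-iteration timings quoted in this thread (~10 ms, ~2 s, 15 s) can be checked with a few lines of Python. The variable names are just for illustration.

```python
# Per-iteration timings reported earlier in the thread, in seconds.
python_s = 0.010      # ~10 milliseconds
local_scala_s = 2.0   # ~2 seconds
distributed_s = 15.0  # 15 seconds

local_factor = round(local_scala_s / python_s)
distributed_factor = round(distributed_s / python_s)
print(local_factor, distributed_factor)
```

Taking the reported numbers at face value, local Scala is about 200x slower than the Python figure and the distributed run about 1500x.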

