In general, you would not expect a distributed computation framework
to be nearly as fast as a non-distributed one when both run on a
single machine. Spark carries overhead (scheduling tasks, serializing
closures and data, moving results around) that doesn't go away just
because it all happens on one machine. Of course, that same machinery
is the very reason it scales past one machine too.

That said, you may also not be using Spark optimally, whereas you
probably use the tool you know optimally. You may not be comparing
algorithms apples-to-apples either.
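For example, here is a rough sketch (the names, paths, and numbers are made up, not from your code) of the two things that usually matter most in iterative ML jobs: caching the RDD you reuse every iteration, and broadcasting the model parameters instead of capturing them in every task closure:

import org.apache.spark.{SparkConf, SparkContext}

object IterationSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sketch").setMaster("local[*]"))

    // The training set is reused every iteration, so cache it once;
    // otherwise the whole lineage is recomputed on each pass.
    val data = sc.textFile("features.csv")
      .map(_.split(',').map(_.toDouble))
      .cache()

    var weights = Array.fill(784)(0.0)
    for (_ <- 1 to 10) {
      // Ship the read-only weights to executors via a broadcast
      // instead of serializing them into every task closure.
      val bw = sc.broadcast(weights)
      val grad = data
        .map(row => row.zip(bw.value).map { case (x, w) => x * w })  // placeholder per-record work
        .reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
      weights = weights.zip(grad).map { case (w, g) => w - 0.01 * g }
    }
    sc.stop()
  }
}

Even with all of that in place, the per-iteration job-scheduling cost alone can dominate on a dataset the size of MNIST.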

Don't use Spark for its own sake. Use it because you need it for the
things it does, like scaling up or integrating with other components.
If you really have a small, isolated ML problem, you probably want to
use your familiar local tools.
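For comparison, a purely local version of that kind of work is just Breeze against in-memory arrays (again a toy sketch, not your code); with a native BLAS behind Breeze, this is the same sort of code path NumPy uses:

import breeze.linalg.{DenseMatrix, DenseVector}

object LocalSketch {
  def main(args: Array[String]): Unit = {
    // Toy stand-in for an MNIST-sized batch: 1000 examples x 784 features.
    val x = DenseMatrix.rand(1000, 784)
    val w = DenseVector.rand(784)

    // One dense matrix-vector product: no scheduling, no serialization,
    // no shipping closures anywhere.
    val scores = x * w
    println(scores(0 until 5))
  }
}

No cluster, no job setup, and nothing to tune beyond making sure the native BLAS is actually being picked up.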

On Thu, Dec 11, 2014 at 7:18 PM, Duy Huynh <duy.huynh....@gmail.com> wrote:
> Both.
>
> First, the distributed version is so much slower than Python. I tried a few
> things like broadcasting variables, replacing Seq with Array, and a few
> other little things. That helps improve the performance, but it's still slower
> than the Python code.
>
> So I wrote a local version that's pretty much just running a bunch of
> Breeze/BLAS operations. I guess that's purely Scala (no Spark). This local
> version is faster than the distributed version but still much slower than
> the Python code.
>
> On Thu, Dec 11, 2014 at 2:09 PM, Natu Lauchande <nlaucha...@gmail.com>
> wrote:
>>
>> Are you using Scala in a distributed environment or in standalone mode?
>>
>> Natu
>>
>> On Thu, Dec 11, 2014 at 8:23 PM, ll <duy.huynh....@gmail.com> wrote:
>>>
>>> Hi, I'm converting some of my machine learning Python code into Scala +
>>> Spark. I haven't been able to run it on a large dataset yet, but on small
>>> datasets (like http://yann.lecun.com/exdb/mnist/), my Spark + Scala code is
>>> much slower than my Python code (5 to 10 times slower).
>>>
>>> I already tried everything I could to improve my Spark + Scala code, like
>>> broadcasting variables, caching the RDD, and replacing all my matrix/vector
>>> operations with Breeze/BLAS. I saw some improvements, but it's still a
>>> lot slower than my Python code.
>>>
>>> Why is that?
>>>
>>> How do you improve your Spark + Scala performance today?
>>>
>>> Or is Spark + Scala just not the right tool for small to medium datasets?
>>>
>>> When would you use Spark + Scala vs. Python?
>>>
>>> Thanks!
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/why-is-spark-scala-code-so-slow-compared-to-python-tp20636.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
