both. first, the distributed version is much slower than python. i tried a few things like broadcasting variables, replacing Seq with Array, and a few other small changes. they improved the performance, but it's still slower than the python code.
so, i wrote a local version that's pretty much just running a bunch of breeze/blas operations. i guess that's pure scala (no spark). this local version is faster than the distributed version, but still much slower than the python code.

On Thu, Dec 11, 2014 at 2:09 PM, Natu Lauchande <nlaucha...@gmail.com> wrote:

> Are you using Scala in a distributed environment or in standalone mode?
>
> Natu
>
> On Thu, Dec 11, 2014 at 8:23 PM, ll <duy.huynh....@gmail.com> wrote:
>
>> hi.. i'm converting some of my machine learning python code into scala +
>> spark. i haven't been able to run it on a large dataset yet, but on small
>> datasets (like http://yann.lecun.com/exdb/mnist/), my spark + scala code is
>> much slower than my python code (5 to 10 times slower than python)
>>
>> i already tried everything to improve my spark + scala code like
>> broadcasting variables, caching the RDD, replacing all my matrix/vector
>> operations with breeze/blas, etc. i saw some improvements, but it's still a
>> lot slower than my python code.
>>
>> why is that?
>>
>> how do you improve your spark + scala performance today?
>>
>> or is spark + scala just not the right tool for small to medium datasets?
>>
>> when would you use spark + scala vs. python?
>>
>> thanks!
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/why-is-spark-scala-code-so-slow-compared-to-python-tp20636.html