This shows a factor of 200 between Python and Scala, and 1500 when distributed (10 ms vs. 2 s vs. 15 s per iteration).
Is this really accurate? If not, what is the real performance difference expected on average between the 3 cases?

On Thu, Dec 11, 2014 at 11:33 AM, Duy Huynh <duy.huynh....@gmail.com> wrote:

> just to give some reference point. with the same algorithm running on
> the mnist dataset.
>
> 1. python implementation: ~10 milliseconds per iteration (can be faster
> if i switch to gpu)
>
> 2. local version (scala + breeze): ~2 seconds per iteration
>
> 3. distributed version (spark + scala + breeze): 15 seconds per iteration
>
> i love spark and really enjoy writing scala code. but this huge
> difference in performance makes it really hard to do any kind of machine
> learning work.
>
> On Thu, Dec 11, 2014 at 2:18 PM, Duy Huynh <duy.huynh....@gmail.com>
> wrote:
>
>> both.
>>
>> first, the distributed version is so much slower than python. i tried a
>> few things like broadcasting variables, replacing Seq with Array, and a few
>> other little things. it helps to improve the performance, but it's still
>> slower than the python code.
>>
>> so, i wrote a local version that's pretty much just running a bunch of
>> breeze/blas operations. i guess that's purely scala (no spark). this
>> local version is faster than the distributed version but still much slower
>> than the python code.
>>
>> On Thu, Dec 11, 2014 at 2:09 PM, Natu Lauchande <nlaucha...@gmail.com>
>> wrote:
>>
>>> Are you using Scala in a distributed environment or in standalone mode?
>>>
>>> Natu
>>>
>>> On Thu, Dec 11, 2014 at 8:23 PM, ll <duy.huynh....@gmail.com> wrote:
>>>
>>>> hi.. i'm converting some of my machine learning python code into scala +
>>>> spark. i haven't been able to run it on a large dataset yet, but on small
>>>> datasets (like http://yann.lecun.com/exdb/mnist/), my spark + scala
>>>> code is much slower than my python code (5 to 10 times slower than
>>>> python).
>>>>
>>>> i already tried everything to improve my spark + scala code like
>>>> broadcasting variables, caching the RDD, replacing all my matrix/vector
>>>> operations with breeze/blas, etc. i saw some improvements, but it's
>>>> still a lot slower than my python code.
>>>>
>>>> why is that?
>>>>
>>>> how do you improve your spark + scala performance today?
>>>>
>>>> or is spark + scala just not the right tool for small to medium
>>>> datasets?
>>>>
>>>> when would you use spark + scala vs. python?
>>>>
>>>> thanks!
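
For reference, the slowdown factors implied by the three timings quoted in this thread can be checked with a few lines of Python. This is only arithmetic on the reported per-iteration numbers (~10 ms, ~2 s, 15 s); the dictionary keys and variable names are illustrative and not taken from anyone's actual code.

```python
# Per-iteration timings as reported in the thread, in milliseconds.
# Integers are used so the ratios below come out exact.
timings_ms = {
    "python": 10,                     # ~10 milliseconds
    "scala + breeze (local)": 2000,   # ~2 seconds
    "spark + scala + breeze": 15000,  # 15 seconds
}

# Slowdown of each version relative to the python implementation.
baseline = timings_ms["python"]
slowdown = {name: t / baseline for name, t in timings_ms.items()}

for name, factor in slowdown.items():
    print(f"{name}: {factor:.0f}x")
```

With these inputs the local version comes out 200 times slower than python and the distributed version 1500 times slower. Both figures rest on single rough measurements, so they say little about what to expect on average.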