Interesting thread, Marius. By the way, I'm curious about your cluster size. How small is it in terms of RAM and cores?
Arian

2014-10-22 13:17 GMT+01:00 Nicholas Chammas <nicholas.cham...@gmail.com>:

> Total guess without knowing anything about your code: Do either of these
> two notes from the 1.1.0 release notes
> <http://spark.apache.org/releases/spark-release-1-1-0.html> affect things
> at all?
>
>    - PySpark now performs external spilling during aggregations. Old
>    behavior can be restored by setting spark.shuffle.spill to false.
>    - PySpark uses a new heuristic for determining the parallelism of
>    shuffle operations. Old behavior can be restored by setting
>    spark.default.parallelism to the number of cores in the cluster.
>
> Nick
>
> On Wed, Oct 22, 2014 at 7:29 AM, Marius Soutier <mps....@gmail.com> wrote:
>
>> We’re using 1.1.0. Yes, I expected Scala to be maybe twice as fast, but
>> not that...
>>
>> On 22.10.2014, at 13:02, Nicholas Chammas <nicholas.cham...@gmail.com>
>> wrote:
>>
>> What version of Spark are you running? Some recent changes
>> <https://spark.apache.org/releases/spark-release-1-1-0.html> to how
>> PySpark works relative to Scala Spark may explain things.
>>
>> PySpark should not be that much slower, not by a stretch.
>>
>> On Wed, Oct 22, 2014 at 6:11 AM, Ashic Mahtab <as...@live.com> wrote:
>>
>>> I'm no expert, but I looked into how the Python bits work a while back
>>> (I was trying to assess what it would take to add F# support). It seems
>>> the Python process launches a JVM alongside it and talks to "Scala
>>> Spark" in that JVM. The Python server bit "translates" the Python calls
>>> into calls on the JVM. The Python SparkContext is like an adapter to
>>> the JVM SparkContext. If you're seeing performance discrepancies, this
>>> might be the reason why. If the code can be organised to require fewer
>>> interactions with the adapter, that may improve things. Take this with
>>> a pinch of salt... I might be way off on this :)
>>>
>>> Cheers,
>>> Ashic.
>>>
>>> > From: mps....@gmail.com
>>> > Subject: Python vs Scala performance
>>> > Date: Wed, 22 Oct 2014 12:00:41 +0200
>>> > To: user@spark.apache.org
>>> >
>>> > Hi there,
>>> >
>>> > we have a small Spark cluster running and are processing around 40 GB
>>> > of Gzip-compressed JSON data per day. I have written a couple of
>>> > word-count-like Scala jobs that essentially pull in all the data, do
>>> > some joins, group-bys, and aggregations. A job takes around 40
>>> > minutes to complete.
>>> >
>>> > Now one of the data scientists on the team wants to write some jobs
>>> > using Python. To learn Spark, he rewrote one of my Scala jobs in
>>> > Python. From the API side, everything looks more or less identical.
>>> > However, his jobs take between 5 and 8 hours to complete! We can also
>>> > see that the execution plan is quite different; I'm seeing writes to
>>> > the output much later than in Scala.
>>> >
>>> > Is Python I/O really that slow?
>>> >
>>> > Thanks
>>> > - Marius
>>> >
>>> > ---------------------------------------------------------------------
>>> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> > For additional commands, e-mail: user-h...@spark.apache.org
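
For reference, the two settings Nick quotes from the release notes are
ordinary Spark configuration keys. A minimal PySpark sketch of restoring
the pre-1.1.0 behavior might look like the following; the app name and
the core count of 16 are placeholder assumptions, since the thread never
states the cluster size.

from pyspark import SparkConf, SparkContext

# Restore the pre-1.1.0 PySpark shuffle behavior described in the 1.1.0
# release notes. The core count is a placeholder; set
# spark.default.parallelism to the actual number of cores in the cluster.
conf = (SparkConf()
        .setAppName("aggregation-job")            # placeholder name
        .set("spark.shuffle.spill", "false")      # no external spilling during aggregations
        .set("spark.default.parallelism", "16"))  # old heuristic: cores in the cluster
sc = SparkContext(conf=conf)

Whether these two settings explain a 40-minute vs. 5-to-8-hour gap is
exactly what the thread leaves open; toggling them one at a time would
isolate their effect.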
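Ashic's adapter description can also be made visible from a Python shell.
The sketch below uses PySpark's underscore-prefixed internals, sc._jsc and
sc._jvm, which exist in the 1.x line but are private API; it is shown
purely to illustrate that the Python SparkContext is a Py4J proxy onto a
JVM-side context, not as something to rely on in jobs.

from pyspark import SparkContext

sc = SparkContext(appName="py4j-peek")  # hypothetical app name

# The Python SparkContext wraps a Py4J proxy for the JVM-side
# JavaSparkContext; every method call on it crosses the gateway.
print(type(sc._jsc))

# Arbitrary JVM classes are reachable over the same gateway, which is
# the "adapter" Ashic describes.
print(sc._jvm.java.lang.System.getProperty("java.version"))

For what it's worth, most of PySpark's per-record cost comes from
serializing data between the JVM executors and the Python worker
processes, not from this driver-side gateway.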
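For concreteness, a job of the shape Marius describes (gzip-compressed
JSON in; joins, group-bys, and aggregations out) might look roughly like
this in PySpark. The paths, field names, and one-JSON-object-per-line
layout are invented for illustration; only the overall shape follows the
thread.

import json

from pyspark import SparkContext

sc = SparkContext(appName="wordcount-like-job")  # hypothetical

# .gz files are decompressed transparently by textFile, but gzip is not
# splittable, so each file is read by a single task.
events = (sc.textFile("hdfs:///data/events/*.json.gz")
            .map(json.loads)                        # assumes one JSON object per line
            .map(lambda e: (e["user_id"], e)))      # invented field name
users = (sc.textFile("hdfs:///data/users/*.json.gz")
           .map(json.loads)
           .map(lambda u: (u["id"], u["country"]))) # invented field names

counts = (events.join(users)                        # (user_id, (event, country))
                .map(lambda kv: (kv[1][1], 1))      # re-key by country
                .reduceByKey(lambda a, b: a + b))   # grouped aggregation

counts.saveAsTextFile("hdfs:///out/daily-counts")

Every lambda above runs in a Python worker process, so records cross the
JVM/Python boundary in each stage, while the equivalent Scala job stays
inside the JVM; that is one plausible source of the gap.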