What version of Spark are you running? Some recent changes <https://spark.apache.org/releases/spark-release-1-1-0.html> to how PySpark works relative to Scala Spark may explain things.
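If it helps to compare against that release, here is a rough sketch for checking whether a version string is at or past 1.1.0. The helper names are made up for illustration and are not part of Spark. It assumes the version looks like "1.1.0" or "1.1.0-SNAPSHOT"; in PySpark the string can typically be read from `sc.version`, if your release exposes it.

```python
import re


def parse_spark_version(version):
    """Turn a string such as '1.1.0' or '1.1.0-SNAPSHOT' into a tuple of ints.

    Hypothetical helper for illustration; a missing patch number is read as 0.
    """
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version.strip())
    if match is None:
        raise ValueError("unrecognised version string: %r" % version)
    return tuple(int(part) if part is not None else 0 for part in match.groups())


def includes_1_1_changes(version):
    """Return True if the version is 1.1.0 or later (hypothetical helper)."""
    return parse_spark_version(version) >= (1, 1, 0)


# Assumed usage inside a PySpark shell, where `sc` is the SparkContext:
#     includes_1_1_changes(sc.version)
```

Comparing tuples of integers avoids the trap of comparing version strings lexically, where "1.10.0" would sort before "1.2.0".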
PySpark should not be that much slower, not by a stretch.

On Wed, Oct 22, 2014 at 6:11 AM, Ashic Mahtab <as...@live.com> wrote:

> I'm no expert, but looked into how the python bits work a while back (was
> trying to assess what it would take to add F# support). It seems python
> hosts a jvm inside of it, and talks to "scala spark" in that jvm. The
> python server bit "translates" the python calls to those in the jvm. The
> python spark context is like an adapter to the jvm spark context. If you're
> seeing performance discrepancies, this might be the reason why. If the code
> can be organised to require fewer interactions with the adapter, that may
> improve things. Take this with a pinch of salt...I might be way off on this
> :)
>
> Cheers,
> Ashic.
>
> > From: mps....@gmail.com
> > Subject: Python vs Scala performance
> > Date: Wed, 22 Oct 2014 12:00:41 +0200
> > To: user@spark.apache.org
> >
> > Hi there,
> >
> > we have a small Spark cluster running and are processing around 40 GB of
> > Gzip-compressed JSON data per day. I have written a couple of word
> > count-like Scala jobs that essentially pull in all the data, do some joins,
> > group bys and aggregations. A job takes around 40 minutes to complete.
> >
> > Now one of the data scientists on the team wants to write some jobs
> > using Python. To learn Spark, he rewrote one of my Scala jobs in Python.
> > From the API side, everything looks more or less identical. However, his
> > jobs take between 5 and 8 hours to complete! We can also see that the
> > execution plan is quite different: I’m seeing writes to the output much
> > later than in Scala.
> >
> > Is Python I/O really that slow?
> >
> > Thanks
> > - Marius