We’re using Spark 1.1.0. Yes, I expected Scala to be maybe twice as fast, but 
not that...

On 22.10.2014, at 13:02, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:

> What version of Spark are you running? Some recent changes to how PySpark 
> works relative to Scala Spark may explain things.
> 
> PySpark should not be that much slower, not by a stretch.
> 
> On Wed, Oct 22, 2014 at 6:11 AM, Ashic Mahtab <as...@live.com> wrote:
> I'm no expert, but I looked into how the Python bits work a while back (I was 
> trying to assess what it would take to add F# support). It seems Python hosts 
> a JVM inside of it, and talks to "Scala Spark" in that JVM. The Python server 
> bit "translates" the Python calls to those in the JVM. The Python Spark 
> context is like an adapter to the JVM Spark context. If you're seeing 
> performance discrepancies, this might be the reason why. If the code can be 
> organised to require fewer interactions with the adapter, that may improve 
> things. Take this with a pinch of salt... I might be way off on this :)
> 
> Cheers,
> Ashic.
> 
> > From: mps....@gmail.com
> > Subject: Python vs Scala performance
> > Date: Wed, 22 Oct 2014 12:00:41 +0200
> > To: user@spark.apache.org
> 
> > 
> > Hi there,
> > 
> > We have a small Spark cluster running and are processing around 40 GB of 
> > gzip-compressed JSON data per day. I have written a couple of word 
> > count-like Scala jobs that essentially pull in all the data and do some 
> > joins, group-bys and aggregations. A job takes around 40 minutes to complete.
> > 
> > Now one of the data scientists on the team wants to write some jobs 
> > using Python. To learn Spark, he rewrote one of my Scala jobs in Python. 
> > From the API side, everything looks more or less identical. However, his 
> > jobs take between 5 and 8 hours to complete! We can also see that the 
> > execution plan is quite different: I’m seeing writes to the output much 
> > later than in Scala.
> > 
> > Is Python I/O really that slow?
> > 
> > 
> > Thanks
> > - Marius
> 
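
For what it's worth, Ashic's point about organising the code to require fewer interactions with the adapter can be illustrated with a toy sketch. This is plain Python, not PySpark: `CountingAdapter`, `send_per_record` and `send_in_batches` are hypothetical names made up for the illustration, and pickling is only a stand-in for whatever serialization a real boundary crossing involves. It counts crossings and bytes, nothing more, and it says nothing about where the 5 to 8 hours in the original job actually go.

```python
import pickle


class CountingAdapter:
    """Hypothetical stand-in for a cross-process bridge (not a PySpark API).

    Each send() pickles and unpickles its payload as a rough proxy for the
    serialization work of one boundary crossing.
    """

    def __init__(self):
        self.crossings = 0
        self.bytes_sent = 0

    def send(self, payload):
        data = pickle.dumps(payload)
        self.crossings += 1
        self.bytes_sent += len(data)
        return pickle.loads(data)


def send_per_record(adapter, records):
    # One crossing for every single record.
    return [adapter.send(record) for record in records]


def send_in_batches(adapter, records, batch_size=1000):
    # One crossing per batch of up to batch_size records.
    out = []
    for start in range(0, len(records), batch_size):
        out.extend(adapter.send(records[start:start + batch_size]))
    return out


# Made-up word-count-like records, purely for illustration.
records = [{"word": "w%d" % (i % 50), "count": 1} for i in range(10000)]

per_record = CountingAdapter()
batched = CountingAdapter()

result_a = send_per_record(per_record, records)
result_b = send_in_batches(batched, records)

print("per-record crossings:", per_record.crossings)
print("batched crossings:   ", batched.crossings)
```

Both paths return the same records; only the number of crossings differs (one per record versus one per batch of 1000). Whether that is what dominates in a real job is a separate question that only profiling the job itself would answer.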
