I'm no expert, but I looked into how the Python bits work a while back (I was 
trying to assess what it would take to add F# support). It seems the Python 
process launches a JVM alongside itself and talks to "Scala Spark" inside that 
JVM; a bridge (Py4J, if I remember right) translates the Python calls into 
calls on JVM objects. The Python SparkContext is essentially an adapter around 
the JVM SparkContext. If you're seeing performance discrepancies, this might 
be the reason why: if the code can be organised to require fewer round trips 
through the adapter, that may improve things. Take this with a pinch of 
salt... I might be way off on this :)
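
To make that concrete, here's a minimal sketch of what I mean. The underscore 
attributes (_jsc, _jvm) are PySpark internals rather than supported API, so 
treat this as illustration only:

    from pyspark import SparkContext

    sc = SparkContext(appName="bridge-demo")

    # The Python SparkContext wraps a Py4J proxy to the JVM's
    # JavaSparkContext; every method call on it crosses a local socket.
    print(type(sc._jsc))

    # Arbitrary JVM calls go through the same gateway:
    print(sc._jvm.java.lang.System.currentTimeMillis())

Driver-side calls like these are cheap one at a time, but the workers do a 
similar JVM <-> Python dance for the actual data (serialising records out to 
Python worker processes and back), which is where I'd guess most of the 
overhead in a job like yours ends up.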

Cheers,
Ashic.

> From: mps....@gmail.com
> Subject: Python vs Scala performance
> Date: Wed, 22 Oct 2014 12:00:41 +0200
> To: user@spark.apache.org
> 
> Hi there,
> 
> we have a small Spark cluster running and are processing around 40 GB of 
> gzip-compressed JSON data per day. I have written a couple of word-count-like 
> Scala jobs that essentially pull in all the data, do some joins, group-bys, 
> and aggregations. A job takes around 40 minutes to complete.
> 
> Now one of the data scientists on the team wants to write some jobs using 
> Python. To learn Spark, he rewrote one of my Scala jobs in Python. From the 
> API side, everything looks more or less identical. However, his job takes 
> between 5 and 8 hours to complete! We can also see that the execution plan is 
> quite different; I’m seeing writes to the output much later than in Scala.
> 
> Is Python I/O really that slow?
> 
> 
> Thanks
> - Marius
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 