Total guess without knowing anything about your code: does either of these
two notes from the 1.1.0 release notes
<http://spark.apache.org/releases/spark-release-1-1-0.html> affect things
at all?


   - PySpark now performs external spilling during aggregations. Old
   behavior can be restored by setting spark.shuffle.spill to false.
   - PySpark uses a new heuristic for determining the parallelism of
   shuffle operations. Old behavior can be restored by setting
   spark.default.parallelism to the number of cores in the cluster.
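
If you want to test that, both settings can go on the SparkConf when the
Python job builds its context. A minimal sketch (the core count of 8 is
just a placeholder for your cluster's):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("restore-pre-1.1-behavior")
        # Restore the pre-1.1.0 behavior: no external spilling
        # during aggregations.
        .set("spark.shuffle.spill", "false")
        # Restore the pre-1.1.0 parallelism: pin it to the cluster's
        # core count (8 here is only a placeholder) instead of the
        # new heuristic.
        .set("spark.default.parallelism", "8"))

sc = SparkContext(conf=conf)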

Nick

On Wed, Oct 22, 2014 at 7:29 AM, Marius Soutier <mps....@gmail.com> wrote:

> We’re using 1.1.0. Yes, I expected Scala to be maybe twice as fast, but
> not that much faster...
>
> On 22.10.2014, at 13:02, Nicholas Chammas <nicholas.cham...@gmail.com>
> wrote:
>
> What version of Spark are you running? Some recent changes
> <https://spark.apache.org/releases/spark-release-1-1-0.html> to how
> PySpark works relative to Scala Spark may explain things.
>
> PySpark should not be that much slower, not by a long shot.
>
> On Wed, Oct 22, 2014 at 6:11 AM, Ashic Mahtab <as...@live.com> wrote:
>
>> I'm no expert, but I looked into how the Python bits work a while back (I
>> was trying to assess what it would take to add F# support). It seems
>> Python hosts a JVM inside of it and talks to "Scala Spark" in that JVM.
>> The Python server bit "translates" the Python calls into calls in the
>> JVM; the Python SparkContext is like an adapter to the JVM SparkContext.
>> If you're seeing performance discrepancies, this might be the reason why.
>> If the code can be organised to require fewer interactions with the
>> adapter, that may improve things. Take this with a pinch of salt... I
>> might be way off on this :)
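>>
>> As a rough illustration (the _jsc and _gateway attributes are internal
>> Py4J plumbing, so this is just a peek at the adapter, not a public API):
>>
>> from pyspark import SparkContext
>>
>> sc = SparkContext(appName="adapter-peek")
>>
>> # The Python SparkContext holds a Py4J proxy to a JVM-side
>> # JavaSparkContext; RDD operations are relayed across that gateway.
>> print(type(sc._jsc))      # Py4J proxy for the JVM JavaSparkContext
>> print(type(sc._gateway))  # the Py4J JavaGateway managing the connection
>>
>> sc.stop()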
>>
>> Cheers,
>> Ashic.
>>
>> > From: mps....@gmail.com
>> > Subject: Python vs Scala performance
>> > Date: Wed, 22 Oct 2014 12:00:41 +0200
>> > To: user@spark.apache.org
>>
>> >
>> > Hi there,
>> >
>> > we have a small Spark cluster running and are processing around 40 GB
>> > of Gzip-compressed JSON data per day. I have written a couple of
>> > word-count-like Scala jobs that essentially pull in all the data, do
>> > some joins, group-bys, and aggregations. A job takes around 40 minutes
>> > to complete.
>> >
>> > Now one of the data scientists on the team wants to write some jobs
>> > using Python. To learn Spark, he rewrote one of my Scala jobs in
>> > Python. From the API side, everything looks more or less identical.
>> > However, his jobs take between 5 and 8 hours to complete! We can also
>> > see that the execution plan is quite different; I'm seeing writes to
>> > the output much later than in Scala.
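>> >
>> > For a sense of the job's shape, the Python version looks roughly like
>> > this (paths and field names are invented for illustration):
>> >
>> > import json
>> > from pyspark import SparkContext
>> >
>> > sc = SparkContext(appName="daily-aggregation")
>> >
>> > # One JSON object per line, gzip-compressed, as in our data.
>> > events = sc.textFile("hdfs:///data/events/*.json.gz").map(json.loads)
>> > users = sc.textFile("hdfs:///data/users/*.json.gz").map(json.loads)
>> >
>> > # Key both sides, join, then group by country and count.
>> > visits = events.map(lambda e: (e["userId"], 1))
>> > countries = users.map(lambda u: (u["id"], u["country"]))
>> > counts = (visits.join(countries)
>> >                 .map(lambda kv: (kv[1][1], kv[1][0]))
>> >                 .reduceByKey(lambda a, b: a + b))
>> >
>> > counts.saveAsTextFile("hdfs:///out/visits-by-country")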
>> >
>> > Is Python I/O really that slow?
>> >
>> >
>> > Thanks
>> > - Marius
>> >
>> >
>>
>
>
>
