It’s an AWS cluster that is rather small at the moment: 4 worker nodes with 28 GB 
RAM and 4 cores each, but fast enough for the current 40 GB a day. The data lives 
on HDFS on EBS volumes. Each file is a Gzip-compressed collection of JSON objects, 
sized between 115 and 120 MB to stay near the HDFS block size.
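
For reference, a minimal sketch of how a day's worth of such files might be 
read in PySpark (the path is made up, and this assumes one JSON object per line):

import json
from pyspark import SparkContext

sc = SparkContext(appName="daily-json")
# Gzip files are not splittable, so each ~115-120 MB file becomes a
# single input task; sizing files near the 128 MB HDFS block size
# keeps each file within one block.
records = sc.textFile("hdfs:///data/2014-10-22/*.json.gz").map(json.loads)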

One core per worker is permanently used by a job that allows SQL queries over 
Parquet files.
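
(For the curious, a sketch of what such a query looks like with the 1.1 SQL 
API; the table and path names here are hypothetical:

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)
# Load a Parquet file as a SchemaRDD and register it for SQL queries.
events = sqlContext.parquetFile("hdfs:///warehouse/events.parquet")
events.registerTempTable("events")
daily = sqlContext.sql("SELECT day, COUNT(*) FROM events GROUP BY day"))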

On 22.10.2014, at 16:18, Arian Pasquali <ar...@arianpasquali.com> wrote:

> Interesting thread Marius,
> Btw, I'm curious about your cluster size. 
> How small is it in terms of RAM and cores?
> 
> Arian
> 
> 2014-10-22 13:17 GMT+01:00 Nicholas Chammas <nicholas.cham...@gmail.com>:
> Total guess without knowing anything about your code: Do either of these two 
> notes from the 1.1.0 release notes affect things at all?
> 
> - PySpark now performs external spilling during aggregations. Old behavior can 
> be restored by setting spark.shuffle.spill to false.
> - PySpark uses a new heuristic for determining the parallelism of shuffle 
> operations. Old behavior can be restored by setting spark.default.parallelism 
> to the number of cores in the cluster.
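> 
> For illustration, both settings can be restored via SparkConf before the 
> context is created (the value of 16 assumes your 4 workers x 4 cores):
> 
> from pyspark import SparkConf, SparkContext
> 
> conf = (SparkConf()
>         .set("spark.shuffle.spill", "false")       # pre-1.1 aggregation behavior
>         .set("spark.default.parallelism", "16"))   # total cores in the cluster
> sc = SparkContext(conf=conf)
> 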
> Nick
> 
> 
> On Wed, Oct 22, 2014 at 7:29 AM, Marius Soutier <mps....@gmail.com> wrote:
> We’re using 1.1.0. Yes, I expected Scala to be maybe twice as fast, but not 
> by this much...
> 
> On 22.10.2014, at 13:02, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
> 
>> What version of Spark are you running? Some recent changes to how PySpark 
>> works relative to Scala Spark may explain things.
>> 
>> PySpark should not be that much slower, not by a stretch.
>> 
>> On Wed, Oct 22, 2014 at 6:11 AM, Ashic Mahtab <as...@live.com> wrote:
>> I'm no expert, but I looked into how the Python bits work a while back (I was 
>> trying to assess what it would take to add F# support). It seems Python hosts 
>> a JVM alongside it and talks to "Scala Spark" in that JVM. The Python server 
>> bit "translates" the Python calls into calls on the JVM. The Python Spark 
>> context is like an adapter to the JVM Spark context. If you're seeing 
>> performance discrepancies, this might be the reason why. If the code can be 
>> organised to require fewer interactions with the adapter, that may improve 
>> things. Take this with a pinch of salt... I might be way off on this :)
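>> 
>> (Purely illustrative, assuming an RDD of dicts with a "key" field: one common 
>> way to cut per-record overhead on the Python side is to process whole 
>> partitions at once.
>> 
>> # one Python function call per record:
>> counts = rdd.map(lambda r: (r["key"], 1)).reduceByKey(lambda a, b: a + b)
>> 
>> # one Python function call per partition, same result:
>> def count_partition(records):
>>     from collections import Counter
>>     return Counter(r["key"] for r in records).items()
>> 
>> counts = rdd.mapPartitions(count_partition).reduceByKey(lambda a, b: a + b)
>> 
>> Whether that helps depends on where the time is actually going.)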
>> 
>> Cheers,
>> Ashic.
>> 
>> > From: mps....@gmail.com
>> > Subject: Python vs Scala performance
>> > Date: Wed, 22 Oct 2014 12:00:41 +0200
>> > To: user@spark.apache.org
>> 
>> > 
>> > Hi there,
>> > 
>> > we have a small Spark cluster running and are processing around 40 GB of 
>> > Gzip-compressed JSON data per day. I have written a couple of word-count-like 
>> > Scala jobs that essentially pull in all the data, do some joins, group-bys, 
>> > and aggregations. A job takes around 40 minutes to complete.
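>> > 
>> > (The jobs are roughly this shape, here sketched in PySpark with made-up 
>> > field names, where userInfo is another (userId, value) pair RDD:
>> > 
>> > import json
>> > logs = sc.textFile("hdfs:///data/*.json.gz").map(json.loads)
>> > pairs = logs.map(lambda r: (r["userId"], 1))
>> > counts = pairs.reduceByKey(lambda a, b: a + b)  # the word-count-like core
>> > joined = counts.join(userInfo)                  # plus joins/aggregations
>> > joined.saveAsTextFile("hdfs:///out/daily"))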
>> > 
>> > Now one of the data scientists on the team wants to write some jobs using 
>> > Python. To learn Spark, he rewrote one of my Scala jobs in Python. From the 
>> > API side, everything looks more or less identical. However, his jobs take 
>> > between 5 and 8 hours to complete! We can also see that the execution plan 
>> > is quite different; I’m seeing writes to the output much later than in Scala.
>> > 
>> > Is Python I/O really that slow?
>> > 
>> > 
>> > Thanks
>> > - Marius
>> > 
>> 
> 
> 
> 
