In the master, you can easily profile your job and find the bottlenecks; see https://github.com/apache/spark/pull/2556

Could you try it and show the stats?
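A minimal sketch of how to enable it (assuming the spark.python.profile config key and sc.show_profiles() from that PR; the input path and app name here are just placeholders):

    from pyspark import SparkConf, SparkContext

    # Turn on the Python worker profiler (off by default).
    conf = SparkConf().set("spark.python.profile", "true")
    sc = SparkContext(appName="pyspark-profiling", conf=conf)

    # Placeholder job, just enough to generate some stats.
    words = (sc.textFile("hdfs:///path/to/one-day.json.gz")
               .flatMap(lambda line: line.split())
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))
    words.count()

    # Print the accumulated cProfile output for each RDD's Python workers.
    sc.show_profiles()

There should also be a dump_profiles(path) variant in that patch for writing the stats out to a directory, if that is easier to share.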
Davies

On Wed, Oct 22, 2014 at 7:51 AM, Marius Soutier <mps....@gmail.com> wrote:
> It’s an AWS cluster that is rather small at the moment, 4 worker nodes @ 28 GB RAM and 4 cores each, but fast enough for the current 40 gigs a day. Data is on HDFS in EBS volumes. Each file is a Gzip-compressed collection of JSON objects, sized between 115 and 120 MB to be near the HDFS block size.
>
> One core per worker is permanently used by a job that allows SQL queries over Parquet files.
>
> On 22.10.2014, at 16:18, Arian Pasquali <ar...@arianpasquali.com> wrote:
>
> Interesting thread, Marius.
> Btw, I’m curious about your cluster size.
> How small is it in terms of RAM and cores?
>
> Arian
>
> 2014-10-22 13:17 GMT+01:00 Nicholas Chammas <nicholas.cham...@gmail.com>:
>>
>> Total guess without knowing anything about your code: do either of these two notes from the 1.1.0 release notes affect things at all?
>>
>> PySpark now performs external spilling during aggregations. Old behavior can be restored by setting spark.shuffle.spill to false.
>> PySpark uses a new heuristic for determining the parallelism of shuffle operations. Old behavior can be restored by setting spark.default.parallelism to the number of cores in the cluster.
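>>
>> For example, something like this (untested) should restore both of the old behaviors; the parallelism value is just a placeholder to be replaced with your cluster's total core count:
>>
>>     from pyspark import SparkConf, SparkContext
>>
>>     conf = (SparkConf()
>>             .set("spark.shuffle.spill", "false")       # pre-1.1.0 aggregation behavior
>>             .set("spark.default.parallelism", "16"))   # e.g. total number of cores in the cluster
>>     sc = SparkContext(appName="one-one-zero-check", conf=conf)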
>>
>> Nick
>>
>> On Wed, Oct 22, 2014 at 7:29 AM, Marius Soutier <mps....@gmail.com> wrote:
>>>
>>> We’re using 1.1.0. Yes, I expected Scala to be maybe twice as fast, but not that...
>>>
>>> On 22.10.2014, at 13:02, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>>>
>>> What version of Spark are you running? Some recent changes to how PySpark works relative to Scala Spark may explain things.
>>>
>>> PySpark should not be that much slower, not by a stretch.
>>>
>>> On Wed, Oct 22, 2014 at 6:11 AM, Ashic Mahtab <as...@live.com> wrote:
>>>>
>>>> I’m no expert, but I looked into how the Python bits work a while back (I was trying to assess what it would take to add F# support). It seems Python hosts a JVM inside of it and talks to "Scala Spark" in that JVM. The Python server bit "translates" the Python calls to those in the JVM. The Python Spark context is like an adapter to the JVM Spark context. If you’re seeing performance discrepancies, this might be the reason why. If the code can be organised to require fewer interactions with the adapter, that may improve things. Take this with a pinch of salt... I might be way off on this :)
>>>>
>>>> Cheers,
>>>> Ashic.
>>>>
>>>> > From: mps....@gmail.com
>>>> > Subject: Python vs Scala performance
>>>> > Date: Wed, 22 Oct 2014 12:00:41 +0200
>>>> > To: user@spark.apache.org
>>>> >
>>>> > Hi there,
>>>> >
>>>> > we have a small Spark cluster running and are processing around 40 GB of Gzip-compressed JSON data per day. I have written a couple of word-count-like Scala jobs that essentially pull in all the data, do some joins, group bys, and aggregations. A job takes around 40 minutes to complete.
>>>> >
>>>> > Now one of the data scientists on the team wants to write some jobs using Python. To learn Spark, he rewrote one of my Scala jobs in Python. From the API side, everything looks more or less identical. However, his jobs take between 5 and 8 hours to complete! We can also see that the execution plan is quite different; I’m seeing writes to the output much later than in Scala.
>>>> >
>>>> > Is Python I/O really that slow?
>>>> >
>>>> > Thanks
>>>> > - Marius

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org