Python vs Scala performance

2014-10-22 Thread Marius Soutier
Hi there, we have a small Spark cluster running and are processing around 40 GB of Gzip-compressed JSON data per day. I have written a couple of word count-like Scala jobs that essentially pull in all the data, do some joins, group bys and aggregations. A job takes around 40 minutes to

RE: Python vs Scala performance

2014-10-22 Thread Ashic Mahtab
on this :) Cheers, Ashic. From: mps@gmail.com Subject: Python vs Scala performance Date: Wed, 22 Oct 2014 12:00:41 +0200 To: user@spark.apache.org Hi there, we have a small Spark cluster running and are processing around 40 GB of Gzip-compressed JSON data per day. I have written a couple

Re: Python vs Scala performance

2014-10-22 Thread Nicholas Chammas
on this :) Cheers, Ashic. From: mps@gmail.com Subject: Python vs Scala performance Date: Wed, 22 Oct 2014 12:00:41 +0200 To: user@spark.apache.org Hi there, we have a small Spark cluster running and are processing around 40 GB of Gzip-compressed JSON data per day. I have

Re: Python vs Scala performance

2014-10-22 Thread Marius Soutier
fewer interactions with the adapter, that may improve things. Take this with a pinch of salt...I might be way off on this :) Cheers, Ashic. From: mps@gmail.com Subject: Python vs Scala performance Date: Wed, 22 Oct 2014 12:00:41 +0200 To: user@spark.apache.org Hi

Re: Python vs Scala performance

2014-10-22 Thread Nicholas Chammas
, this might be the reason why. If the code can be organised to require fewer interactions with the adapter, that may improve things. Take this with a pinch of salt...I might be way off on this :) Cheers, Ashic. From: mps@gmail.com Subject: Python vs Scala performance Date: Wed, 22

Re: Python vs Scala performance

2014-10-22 Thread Arian Pasquali
with the adapter, that may improve things. Take this with a pinch of salt...I might be way off on this :) Cheers, Ashic. From: mps@gmail.com Subject: Python vs Scala performance Date: Wed, 22 Oct 2014 12:00:41 +0200 To: user@spark.apache.org Hi there, we have a small Spark

Re: Python vs Scala performance

2014-10-22 Thread Marius Soutier
Didn’t seem to help: conf = SparkConf().set("spark.shuffle.spill", "false").set("spark.default.parallelism", "12"); sc = SparkContext(appName='app_name', conf=conf). It is still taking as much time. On 22.10.2014, at 14:17, Nicholas Chammas nicholas.cham...@gmail.com wrote: Total guess without knowing

Re: Python vs Scala performance

2014-10-22 Thread Eustache DIEMERT
on this :) Cheers, Ashic. From: mps@gmail.com Subject: Python vs Scala performance Date: Wed, 22 Oct 2014 12:00:41 +0200 To: user@spark.apache.org Hi there, we have a small Spark cluster running and are processing around 40 GB of Gzip-compressed JSON data per day. I have written a couple

Re: Python vs Scala performance

2014-10-22 Thread Davies Liu
things. Take this with a pinch of salt...I might be way off on this :) Cheers, Ashic. From: mps@gmail.com Subject: Python vs Scala performance Date: Wed, 22 Oct 2014 12:00:41 +0200 To: user@spark.apache.org Hi there, we have a small Spark cluster running and are processing

Re: Python vs Scala performance

2014-10-22 Thread Nicholas Chammas
On Wed, Oct 22, 2014 at 11:34 AM, Eustache DIEMERT eusta...@diemert.fr wrote: Wild guess maybe, but do you decode the JSON records in Python? It could be much slower, as the default lib is quite slow. Oh yeah, this is a good place to look. Also, just upgrading to Python 2.7 may be enough
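One way to check the JSON-decoding guess above is to time the decoder on a sample of records outside Spark. The following is only a rough sketch: the sample record, the function names and the record count are made up for illustration (the thread does not show the real data), and it is written for a current Python 3 interpreter rather than the Python 2.7 used on the cluster.

```python
import json
import timeit

# Hypothetical sample record; the real records are not shown in the thread.
SAMPLE_LINE = json.dumps({"user": "u1", "words": ["spark", "python", "scala"], "count": 3})


def decode_lines(lines):
    """Decode each JSON line with the standard-library decoder."""
    return [json.loads(line) for line in lines]


def time_decoding(n_records=10000, repeats=3):
    """Return the best wall-clock time in seconds for decoding n_records lines."""
    lines = [SAMPLE_LINE] * n_records
    return min(timeit.repeat(lambda: decode_lines(lines), number=1, repeat=repeats))


if __name__ == "__main__":
    seconds = time_decoding()
    print("decoded 10000 records in %.4f s" % seconds)
```

Scaling the measured time up to the daily record count gives a rough idea of whether decoding alone could explain the gap to the Scala job.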

Re: Python vs Scala performance

2014-10-22 Thread Marius Soutier
Yeah, we’re using Python 2.7.3. On 22.10.2014, at 20:06, Nicholas Chammas nicholas.cham...@gmail.com wrote: On Wed, Oct 22, 2014 at 11:34 AM, Eustache DIEMERT eusta...@diemert.fr wrote: Wild guess maybe, but do you decode the JSON records in Python? It could be much slower, as the

Re: Python vs Scala performance

2014-10-22 Thread Marius Soutier
Can’t install that on our cluster, but I can try locally. Is there a pre-built binary available? On 22.10.2014, at 19:01, Davies Liu dav...@databricks.com wrote: In the master, you can easily profile your job and find the bottlenecks, see https://github.com/apache/spark/pull/2556 Could you try
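Where a build from master is not an option, a local stand-in is to run the per-record Python code under the standard-library profiler on a sample of the data. This is not the Spark profiler from the pull request above, only a sketch of the same idea; `parse_and_count` and `profile_sample` are hypothetical names, and the word-count logic is assumed from the job description in the first message.

```python
import cProfile
import io
import json
import pstats


def parse_and_count(lines):
    """Hypothetical per-partition work: decode JSON lines and count words."""
    counts = {}
    for line in lines:
        record = json.loads(line)
        for word in record.get("words", []):
            counts[word] = counts.get(word, 0) + 1
    return counts


def profile_sample(lines, top=5):
    """Run parse_and_count under cProfile; return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(parse_and_count, lines)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the most expensive call chains come first.
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


if __name__ == "__main__":
    sample = [json.dumps({"words": ["spark", "python", "spark"]})] * 1000
    counts, report = profile_sample(sample)
    print(report)
```

If most of the cumulative time sits in the JSON decoder rather than in the counting loop, that would support the decoding guess made earlier in the thread.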

Re: Python vs Scala performance

2014-10-22 Thread Davies Liu
Sorry, there is not. You can try cloning it from GitHub and building it from scratch, see [1]. [1] https://github.com/apache/spark Davies On Wed, Oct 22, 2014 at 2:31 PM, Marius Soutier mps@gmail.com wrote: Can’t install that on our cluster, but I can try locally. Is there a pre-built binary