From: mps@gmail.com
Subject: Python vs Scala performance
Date: Wed, 22 Oct 2014 12:00:41 +0200
To: user@spark.apache.org

Hi there,
we have a small Spark cluster running and are processing around 40 GB of
Gzip-compressed JSON data per day. I have written a couple of word count-like
Scala jobs that essentially pull in all the data, do some joins, group bys and
aggregations. A job takes around 40 minutes to complete. […]

Hi,
[…] this might be the reason why. If the code can be organised to require
fewer interactions with the adapter, that may improve things. Take this with
a pinch of salt...I might be way off on this :)
Cheers,
Ashic.
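To make Ashic's point concrete: one common way to cut per-record overhead on
the Python side is to process a whole partition per function call with
mapPartitions, instead of invoking a lambda once per record. This is only a
sketch of the idea; the input path and the "user" field are invented for
illustration, not taken from this thread.

from collections import Counter
import json
from pyspark import SparkContext

sc = SparkContext(appName="fewer_roundtrips")

# hypothetical input path, standing in for the daily JSON dump
lines = sc.textFile("hdfs:///data/events/*.json.gz")

def count_users(partition):
    # decode and pre-aggregate an entire partition in one call,
    # rather than calling into Python once per record
    counts = Counter()
    for line in partition:
        record = json.loads(line)
        counts[record.get("user")] += 1
    return counts.items()

user_counts = lines.mapPartitions(count_users) \
                   .reduceByKey(lambda a, b: a + b)

print(user_counts.take(10))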
Didn’t seem to help:

conf = SparkConf().set("spark.shuffle.spill", "false") \
                  .set("spark.default.parallelism", "12")
sc = SparkContext(appName="app_name", conf=conf)

but still taking as much time.

On 22.10.2014, at 14:17, Nicholas Chammas nicholas.cham...@gmail.com wrote:
Total guess without knowing […]
On Wed, Oct 22, 2014 at 11:34 AM, Eustache DIEMERT eusta...@diemert.fr wrote:
Wild guess maybe, but do you decode the json records in Python? It could be
much slower, as the default lib is quite slow.

Oh yeah, this is a good place to look. Also, just upgrading to Python 2.7
may be enough […]
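If JSON decoding is indeed the bottleneck, a cheap experiment is to swap the
stdlib parser for a faster C-extension one such as ujson behind a single
import. A sketch, assuming ujson is installed on the workers; the path is
again invented:

# fall back to the stdlib if the faster parser is absent
try:
    import ujson as json
except ImportError:
    import json

from pyspark import SparkContext

sc = SparkContext(appName="json_decode_test")

# same job, different decoder: compare wall-clock times
records = sc.textFile("hdfs:///data/events/*.json.gz") \
            .map(json.loads)

print(records.count())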
Yeah we’re using Python 2.7.3.

On 22.10.2014, at 20:06, Nicholas Chammas nicholas.cham...@gmail.com wrote:
On Wed, Oct 22, 2014 at 11:34 AM, Eustache DIEMERT eusta...@diemert.fr wrote:
Wild guess maybe, but do you decode the json records in Python? It could be
much slower, as the default lib is quite slow.
Can’t install that on our cluster, but I can try locally. Is there a pre-built
binary available?

On 22.10.2014, at 19:01, Davies Liu dav...@databricks.com wrote:
In the master, you can easily profile your job and find the bottlenecks,
see https://github.com/apache/spark/pull/2556
Could you try […]
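For reference, the profiler from that pull request was in master at the time
and shipped with Spark 1.2: with spark.python.profile enabled, PySpark
accumulates cProfile stats per RDD that can be printed or dumped after an
action. A minimal sketch, assuming a build that includes the feature:

from pyspark import SparkConf, SparkContext

conf = SparkConf().set("spark.python.profile", "true")
sc = SparkContext(appName="profile_test", conf=conf)

rdd = sc.parallelize(range(100)).map(str)
rdd.count()

sc.show_profiles()                 # print accumulated cProfile stats
sc.dump_profiles("/tmp/profiles")  # or write them out, one file per RDD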
Sorry, there is not; you can try cloning it from GitHub and building it from
scratch, see [1]

[1] https://github.com/apache/spark

Davies

On Wed, Oct 22, 2014 at 2:31 PM, Marius Soutier mps@gmail.com wrote:
Can’t install that on our cluster, but I can try locally. Is there a
pre-built binary available?