I'm no expert, but I looked into how the Python bits work a while back (I was
trying to assess what it would take to add F# support). It seems Python hosts a
JVM and talks to Scala Spark in that JVM, with the Python server bit
translating the Python calls into calls in the JVM.
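As a toy illustration of that boundary (this is not PySpark's actual code, just a sketch of the idea): data crossing between the Python side and the JVM side has to be serialized and deserialized, e.g. as pickled batches, which is overhead a pure Scala job never pays. The records below are made up.

```python
import pickle

# Hypothetical batch of records, standing in for an RDD partition.
records = [{"id": i, "value": i * 2} for i in range(1000)]

# Crossing the Python/JVM boundary means serializing the batch...
blob = pickle.dumps(records, protocol=2)

# ...and deserializing it again on the other side.
roundtripped = pickle.loads(blob)
assert roundtripped == records
```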
What version of Spark are you running? Some recent changes
https://spark.apache.org/releases/spark-release-1-1-0.html to how PySpark
works relative to Scala Spark may explain things.
PySpark should not be that much slower, not by a long stretch.
On Wed, Oct 22, 2014 at 6:11 AM, Ashic Mahtab
We’re using 1.1.0. Yes, I expected Scala to be maybe twice as fast, but not
that...
On 22.10.2014, at 13:02, Nicholas Chammas nicholas.cham...@gmail.com wrote:
What version of Spark are you running? Some recent changes to how PySpark
works relative to Scala Spark may explain things.
Total guess without knowing anything about your code: Do either of these
two notes from the 1.1.0 release notes
http://spark.apache.org/releases/spark-release-1-1-0.html affect things
at all?
- PySpark now performs external spilling during aggregations. Old
behavior can be restored by setting spark.shuffle.spill to false.
Interesting thread, Marius.
Btw, I'm curious about your cluster size.
How small is it in terms of RAM and cores?
Arian
2014-10-22 13:17 GMT+01:00 Nicholas Chammas nicholas.cham...@gmail.com:
Total guess without knowing anything about your code: Do either of these
two notes from the 1.1.0
Didn’t seem to help:
conf = SparkConf().set("spark.shuffle.spill", "false") \
                  .set("spark.default.parallelism", "12")
sc = SparkContext(appName="app_name", conf=conf)
but it's still taking as much time
On 22.10.2014, at 14:17, Nicholas Chammas nicholas.cham...@gmail.com wrote:
Total guess without knowing
Wild guess maybe, but do you decode the JSON records in Python? It could
be much slower, as the default lib is quite slow.
If so, try ujson [1], a C implementation that is at least an order of
magnitude faster.
HTH
[1] https://pypi.python.org/pypi/ujson
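For what it's worth, ujson is close to a drop-in replacement for the common loads/dumps calls, so a guarded import keeps the code working even where it isn't installed. The record below is just a made-up example.

```python
try:
    import ujson as json  # C implementation, much faster for plain records
except ImportError:
    import json  # stdlib fallback: slower, but always available

line = '{"user": "marius", "count": 3}'
record = json.loads(line)
assert record["count"] == 3
```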
2014-10-22 16:51 GMT+02:00 Marius
In master, you can easily profile your job and find the bottlenecks;
see https://github.com/apache/spark/pull/2556
Could you try it and show the stats?
Davies
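Until you can run a build with that PR, a rough local stand-in is to profile the suspect Python functions directly with the stdlib's cProfile. The parse_records function below is a made-up example of the kind of per-record work a PySpark map might do.

```python
import cProfile
import io
import json
import pstats

def parse_records(lines):
    # Stand-in for per-record work inside a PySpark map.
    return [json.loads(line) for line in lines]

profiler = cProfile.Profile()
profiler.enable()
parse_records(['{"a": 1, "b": 2}'] * 10000)
profiler.disable()

# Print the ten most expensive calls by cumulative time.
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(10)
report = out.getvalue()
print(report)
```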
On Wed, Oct 22, 2014 at 7:51 AM, Marius Soutier mps@gmail.com wrote:
It’s an AWS cluster that is rather small at the moment, 4
On Wed, Oct 22, 2014 at 11:34 AM, Eustache DIEMERT eusta...@diemert.fr
wrote:
Wild guess maybe, but do you decode the JSON records in Python? It could
be much slower, as the default lib is quite slow.
Oh yeah, this is a good place to look. Also, just upgrading to Python 2.7
may be enough
Yeah we’re using Python 2.7.3.
On 22.10.2014, at 20:06, Nicholas Chammas nicholas.cham...@gmail.com wrote:
On Wed, Oct 22, 2014 at 11:34 AM, Eustache DIEMERT eusta...@diemert.fr
wrote:
Wild guess maybe, but do you decode the JSON records in Python? It could be
much slower as the
Can’t install that on our cluster, but I can try locally. Is there a pre-built
binary available?
On 22.10.2014, at 19:01, Davies Liu dav...@databricks.com wrote:
In master, you can easily profile your job and find the bottlenecks;
see https://github.com/apache/spark/pull/2556
Could you try
Sorry, there is not. You can try cloning it from GitHub and building it from
scratch, see [1]
[1] https://github.com/apache/spark
Davies
On Wed, Oct 22, 2014 at 2:31 PM, Marius Soutier mps@gmail.com wrote:
Can’t install that on our cluster, but I can try locally. Is there a
pre-built binary