Hi Oleg,
Those parameters control the number and size of Spark's daemons on the
cluster. If you're interested in how these daemons relate to each other
and interact with YARN, I wrote a post on this a little while ago -
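In short, they size the app's processes. A minimal sketch (flag names are the standard spark-submit options; the values are the ones from your command, and `your_job.py` is just a placeholder):

```shell
# spark-submit sizing flags (YARN client mode assumed):
#   --num-executors    how many executor containers YARN launches for the app
#   --driver-memory    heap given to the single driver process
#   --executor-memory  heap given to each executor
spark-submit \
  --master yarn-client \
  --num-executors 12 \
  --driver-memory 4g \
  --executor-memory 2g \
  your_job.py
```

So that command asks YARN for 12 executors of 2g each, plus a 4g driver.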
Great.
Upgrade helped.
Still need some input:
1) Are there any log files for Spark job execution?
2) Where can I read about tuning / parameter configuration?
For example:
--num-executors 12
--driver-memory 4g
--executor-memory 2g
What is the meaning of those parameters?
Thanks
Oleg.
On Thu,
Hi ,
I am executing PySpark on YARN.
I successfully processed the initial dataset, but now I have grown it 10
times larger.
During execution I keep getting this error:
14/09/17 19:28:50 ERROR cluster.YarnClientClusterScheduler: Lost executor
68 on UCS-NODE1.sms1.local: remote Akka client
How many partitions do you have in your input rdd? Are you specifying
numPartitions in subsequent calls to groupByKey/reduceByKey?
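To make the partition count concrete, here is a plain-Python toy model of Spark's default hash partitioning (not Spark's actual implementation; integer keys are used so `hash()` is deterministic). It shows why every value for a given key must fit in a single partition's memory during groupByKey:

```python
from collections import defaultdict

def hash_partition(key, num_partitions):
    # Toy model of Spark's default hash partitioner: a record with this
    # key lands in partition hash(key) % numPartitions.
    return hash(key) % num_partitions

def group_by_key(pairs, num_partitions):
    # Toy groupByKey: all values for one key end up in one partition,
    # and that partition must hold them in memory at once -- which is
    # why too few partitions can kill a worker on a large dataset.
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[hash_partition(key, num_partitions)][key].append(value)
    return partitions

pairs = [(1, "a"), (2, "b"), (1, "c"), (5, "d")]
parts = group_by_key(pairs, num_partitions=4)
# keys 1 and 5 both land in partition 1, since 1 % 4 == 5 % 4 == 1
```

Raising numPartitions spreads the keys over more, smaller buckets, so each task has less to hold at once.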
On Sep 17, 2014, at 4:38 AM, Oleg Ruchovets oruchov...@gmail.com wrote:
Sure, I'll post to the mailing list.
groupByKey(self, numPartitions=None) (source code:
http://spark.apache.org/docs/1.0.2/api/python/pyspark.rdd-pysrc.html#RDD.groupByKey)
Group the values for each key in the RDD into a single sequence.
Hash-partitions the resulting RDD with numPartitions partitions.
Maybe the Python worker uses too much memory during groupByKey();
calling groupByKey() with a larger numPartitions can help.
Also, can you upgrade your cluster to 1.1? It can spill the data
to disk if memory cannot hold all the data during groupByKey().
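When the grouping is really an aggregation, reduceByKey (mentioned earlier in the thread) is much lighter, because it keeps one running accumulator per key instead of materializing every value. A plain-Python sketch of that difference (a toy model, not Spark's implementation):

```python
def reduce_by_key(pairs, func):
    # Keeps a single running accumulator per key, instead of first
    # collecting every value for the key into a list the way
    # groupByKey has to.
    acc = {}
    for key, value in pairs:
        acc[key] = func(acc[key], value) if key in acc else value
    return acc

# For a sum or count, memory per key stays constant no matter how
# many records share that key.
totals = reduce_by_key([("x", 2), ("y", 1), ("x", 3)], lambda a, b: a + b)
# totals == {"x": 5, "y": 1}
```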
Also, if there is a hot key with dozens of