I'm running a regular Spark job on a cluster of 50 core instances and 100 executors. The work appears to be fairly evenly distributed, yet I often see one or two executors doing no work. They are listed as having active tasks, and those often become very long-running tasks that are sometimes speculatively re-executed on other executors. The timeline graph shows them spending all of that time 'computing'. I checked both the stdout and stderr logs of the affected executors and saw nothing unusual, other than that they simply stopped logging after a certain point (they were working fine up to then). A thread dump didn't enlighten me either; the threads appeared to be doing normal work.
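For reference, the speculation-related settings in play look roughly like the Scala sketch below. The exact values are illustrative assumptions rather than a copy of my job's configuration:

    val conf = new org.apache.spark.SparkConf()
      .set("spark.speculation", "true")            // re-launch suspiciously slow tasks on other executors
      .set("spark.speculation.multiplier", "1.5")  // "slow" means 1.5x the median task duration
      .set("spark.speculation.quantile", "0.75")   // only after 75% of a stage's tasks have finished
      .set("spark.speculation.interval", "100ms")  // how often the driver checks for slow tasks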
Interestingly, if I change neither the configuration nor the source data, the same executors get stuck in the same way, so the behavior is completely reproducible.

Question 1: Is this a known behavior of Spark?
Question 2: Is there an easy way to have the driver detect and restart these stuck executors? (A sketch of one idea I'm considering is below.)

Extra information: the job runs on an EMR cluster with Spark 1.6, with an r3.2xlarge as the driver and 50 i2.2xlarge core instances, 2 executors per instance and 3 cores per executor, reading input data directly from S3.
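Regarding question 2, the best I have come up with so far is a driver-side watchdog built on SparkListener that remembers when each task launched and kills the executor of any task that exceeds a threshold. This is only a sketch under assumptions: the 30-minute threshold is arbitrary, StuckExecutorWatchdog is my own hypothetical class, and SparkContext.killExecutor is a @DeveloperApi whose replace-the-executor behavior depends on the cluster manager and dynamic allocation settings:

    import java.util.concurrent.{ConcurrentHashMap, Executors, TimeUnit}
    import scala.collection.JavaConverters._
    import org.apache.spark.SparkContext
    import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd, SparkListenerTaskStart}

    // Driver-side watchdog: track which executor each running task is on and when it
    // launched, and periodically kill executors whose tasks have been running too long.
    class StuckExecutorWatchdog(sc: SparkContext, maxTaskMillis: Long) extends SparkListener {
      // taskId -> (executorId, launchTime)
      private val running = new ConcurrentHashMap[Long, (String, Long)]()

      override def onTaskStart(e: SparkListenerTaskStart): Unit =
        running.put(e.taskInfo.taskId, (e.taskInfo.executorId, e.taskInfo.launchTime))

      override def onTaskEnd(e: SparkListenerTaskEnd): Unit =
        running.remove(e.taskInfo.taskId)

      // Scan once a minute for tasks that have exceeded the threshold.
      Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(new Runnable {
        def run(): Unit = {
          val now = System.currentTimeMillis()
          running.asScala.foreach { case (taskId, (executorId, launchTime)) =>
            if (now - launchTime > maxTaskMillis) {
              // @DeveloperApi: asks the cluster manager to kill this executor; its tasks
              // should be rescheduled elsewhere, but whether a replacement executor is
              // started depends on the deployment (e.g. dynamic allocation on YARN).
              sc.killExecutor(executorId)
            }
          }
        }
      }, 1, 1, TimeUnit.MINUTES)
    }

    // Usage: sc.addSparkListener(new StuckExecutorWatchdog(sc, 30 * 60 * 1000L))

I would appreciate confirmation that something like this is a sane approach, or whether there is a built-in mechanism I am missing.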