I'm running a regular Spark job on a cluster of 50 core instances and 100 executors. The work appears to be fairly evenly distributed, yet I often see one or two executors doing no work. They are listed as having active tasks, and those often become very long-running tasks that are sometimes speculatively re-executed on other executors. The timeline graph shows them spending all of that time 'computing'. I checked both the stdout and stderr logs of the affected executors and saw nothing unusual, other than that they simply stopped logging after a certain point (they were working fine up to then). A thread dump didn't enlighten me either; the threads appeared to be doing normal work.
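For reference, the speculation-related settings in play look roughly like the Scala sketch below. The exact values are illustrative assumptions rather than a copy of my job's configuration:

    val conf = new org.apache.spark.SparkConf()
      .set("spark.speculation", "true")            // re-launch suspiciously slow tasks on other executors
      .set("spark.speculation.multiplier", "1.5")  // "slow" means 1.5x the median task duration
      .set("spark.speculation.quantile", "0.75")   // only after 75% of a stage's tasks have finished
      .set("spark.speculation.interval", "100ms")  // how often the driver checks for slow tasks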
Interestingly, if I change neither the configuration nor the source data, the same executors get stuck in the same way, so the behavior is completely reproducible.

Question 1: Is this a known behavior of Spark?
Question 2: Is there an easy way to have the driver detect and restart these stuck executors? (A sketch of one idea I'm considering is below.)

Extra information: the job runs on an EMR cluster with Spark 1.6, with an r3.2xlarge as the driver and 50 i2.2xlarge core instances, 2 executors per instance and 3 cores per executor, reading input data directly from S3.
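Regarding question 2, the best I have come up with so far is a driver-side watchdog built on SparkListener that remembers when each task launched and kills the executor of any task that exceeds a threshold. This is only a sketch under assumptions: the 30-minute threshold is arbitrary, StuckExecutorWatchdog is my own hypothetical class, and SparkContext.killExecutor is a @DeveloperApi whose replace-the-executor behavior depends on the cluster manager and dynamic allocation settings:

    import java.util.concurrent.{ConcurrentHashMap, Executors, TimeUnit}
    import scala.collection.JavaConverters._
    import org.apache.spark.SparkContext
    import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd, SparkListenerTaskStart}

    // Driver-side watchdog: track which executor each running task is on and when it
    // launched, and periodically kill executors whose tasks have been running too long.
    class StuckExecutorWatchdog(sc: SparkContext, maxTaskMillis: Long) extends SparkListener {
      // taskId -> (executorId, launchTime)
      private val running = new ConcurrentHashMap[Long, (String, Long)]()

      override def onTaskStart(e: SparkListenerTaskStart): Unit =
        running.put(e.taskInfo.taskId, (e.taskInfo.executorId, e.taskInfo.launchTime))

      override def onTaskEnd(e: SparkListenerTaskEnd): Unit =
        running.remove(e.taskInfo.taskId)

      // Scan once a minute for tasks that have exceeded the threshold.
      Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(new Runnable {
        def run(): Unit = {
          val now = System.currentTimeMillis()
          running.asScala.foreach { case (taskId, (executorId, launchTime)) =>
            if (now - launchTime > maxTaskMillis) {
              // @DeveloperApi: asks the cluster manager to kill this executor; its tasks
              // should be rescheduled elsewhere, but whether a replacement executor is
              // started depends on the deployment (e.g. dynamic allocation on YARN).
              sc.killExecutor(executorId)
            }
          }
        }
      }, 1, 1, TimeUnit.MINUTES)
    }

    // Usage: sc.addSparkListener(new StuckExecutorWatchdog(sc, 30 * 60 * 1000L))

I would appreciate confirmation that something like this is a sane approach, or whether there is a built-in mechanism I am missing.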