Hi Richard,
I have actually applied the following fix to our 1.4.0 version, and it
seems to resolve the zombies :)
https://github.com/apache/spark/pull/7077/files
Sjoerd
2015-06-26 20:08 GMT+02:00 Richard Marscher rmarsc...@localytics.com:
Ah, I see; glad that simple patch works for your problem. That seems to be a
different underlying problem than the one we have been experiencing. In our
case, the executors are failing properly; it's just that none of the new ones
will ever escape the exact same issue. So we start a death
We've seen this issue as well in production. We also aren't sure what
causes it, but we have just recently shaded some of the Spark code in
TaskSchedulerImpl that we use to effectively bubble up an exception from
Spark instead of zombieing in this situation. If you are interested, I can go
into more detail.
Hi,
we are on 1.3.1 right now, so in case there are differences in the Spark
files, I'll walk through the logic of what we did and post a couple of gists
at the end. We haven't committed to forking Spark for our own deployments
yet, so right now we shadow some Spark classes in our application code
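For readers following along: the gists aren't reproduced here, but the core idea described above can be sketched in isolation. The following is a minimal, hypothetical Java example (not Spark's actual TaskSchedulerImpl code; the class and method names are invented for illustration) of the pattern being described: after a bounded number of repeated failures, surface the last error to the caller instead of retrying indefinitely and leaving the job a zombie.

```java
import java.util.function.Supplier;

// Hedged sketch: illustrates "bubble up an exception instead of zombie".
// FailFastRetry and runWithAbort are hypothetical names, not Spark APIs.
public class FailFastRetry {

    // Run the task, retrying up to maxFailures attempts. If every attempt
    // fails, rethrow the last error rather than looping forever.
    static <T> T runWithAbort(int maxFailures, Supplier<T> task) {
        RuntimeException lastError = null;
        for (int attempt = 1; attempt <= maxFailures; attempt++) {
            try {
                return task.get();
            } catch (RuntimeException e) {
                lastError = e;
                System.out.println("attempt " + attempt + " failed: " + e.getMessage());
            }
        }
        // All attempts failed: surface the failure to the caller instead
        // of leaving the computation stuck in a retry loop.
        throw new RuntimeException("aborting after " + maxFailures + " failures", lastError);
    }

    public static void main(String[] args) {
        final int[] calls = {0};
        // Simulated task: fails twice ("executor lost"), then succeeds.
        String result = runWithAbort(3, () -> {
            calls[0]++;
            if (calls[0] < 3) throw new RuntimeException("executor lost");
            return "ok";
        });
        System.out.println(result);
    }
}
```

In real Spark, the analogous knob is bounded by settings such as `spark.task.maxFailures`; the point of the shading described above is to make the terminal failure propagate as an exception the application can catch.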
Hi,
I have a really annoying issue that I cannot replicate consistently; still,
it happens roughly every 100 submissions (it's a job that runs every 3
minutes).
I already reported an issue for this:
https://issues.apache.org/jira/browse/SPARK-8592
Here are the thread dumps of the driver and the