Just testing Spark v1.5.0 (on Mesos v0.23) and we saw something
unexpected (according to the event timeline): when a Spark task failed
(an intermittent S3 connection failure), the whole executor was removed
and never recovered, so the job proceeded more slowly than normal.
Looking at the code, I saw something that seemed a little odd in
core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerBackend.scala:
override def statusUpdate(d: SchedulerDriver, status: TaskStatus) {
  ...
  if (TaskState.isFailed(TaskState.fromMesos(status.getState))
      && taskIdToSlaveId.contains(tid)) {
    // We lost the executor on this slave, so remember that it's gone
    removeExecutor(taskIdToSlaveId(tid), "Lost executor")
  }
  if (TaskState.isFinished(state)) {
    taskIdToSlaveId.remove(tid)
  }
}
I don't know either codebase well at all; however, it seems odd to kill
the executor for a merely failed task rather than only for a lost task.
I did a quick test (with v1.5.1) where I replaced this line with:
if ((TaskState.fromMesos(status.getState) == TaskState.LOST)
and all seemed well: I reproduced the problem (using iptables to briefly
block access to the S3 endpoint), and the task failed but was retried on
the same executor, succeeded, and continued on its merry way.
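To make the distinction concrete, here is a minimal, self-contained sketch (not the actual Spark source) of the predicate change I'm suggesting. The TaskState enumeration and the shouldRemoveExecutor helper below are my own illustrative names; in Spark, isFailed(LOST) and isFailed(FAILED) are both true, which is why a transient task failure currently tears down the executor.

```scala
// Sketch only: models the proposed condition, removing the executor solely
// when Mesos reports the task as LOST, not on any failed state.
object StatusUpdateSketch {

  // Mirrors (approximately) Spark's org.apache.spark.TaskState values.
  object TaskState extends Enumeration {
    val LAUNCHING, RUNNING, FINISHED, FAILED, KILLED, LOST = Value

    // In Spark, isFailed is true for both LOST and FAILED --
    // that is the behaviour under discussion.
    def isFailed(s: Value): Boolean = s == LOST || s == FAILED
  }

  // Current behaviour: any failed state removes the executor.
  def removesExecutorToday(state: TaskState.Value, slaveKnown: Boolean): Boolean =
    TaskState.isFailed(state) && slaveKnown

  // Proposed behaviour: only a LOST task removes the executor.
  def removesExecutorProposed(state: TaskState.Value, slaveKnown: Boolean): Boolean =
    state == TaskState.LOST && slaveKnown

  def main(args: Array[String]): Unit = {
    // A FAILED task (e.g. a transient S3 error) currently removes the
    // executor, but would not under the proposed check.
    assert(removesExecutorToday(TaskState.FAILED, slaveKnown = true))
    assert(!removesExecutorProposed(TaskState.FAILED, slaveKnown = true))

    // A genuinely LOST task removes the executor under both checks.
    assert(removesExecutorToday(TaskState.LOST, slaveKnown = true))
    assert(removesExecutorProposed(TaskState.LOST, slaveKnown = true))

    println("ok")
  }
}
```

Under this reading, a task failure is retried by the scheduler as usual, while executor removal is reserved for the case where the slave itself is gone.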
Adrian
--
*Adrian Bridgett* | Sysadmin Engineer, OpenSignal
<http://www.opensignal.com>
_____________________________________________________
Office: First Floor, Scriptor Court, 155-157 Farringdon Road,
Clerkenwell, London, EC1R 3AD
Phone #: +44 777-377-8251
Skype: abridgett | Twitter: @adrianbridgett <http://twitter.com/adrianbridgett>
LinkedIn link <https://uk.linkedin.com/in/abridgett>
_____________________________________________________