We were just testing Spark v1.5.0 (on Mesos v0.23) and saw something unexpected (according to the event timeline): when a Spark task failed (an intermittent S3 connection failure), the whole executor was removed and never recovered, so the job proceeded more slowly than normal.

Looking at the code, I saw something that seemed a little odd in core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerBackend.scala:

  override def statusUpdate(d: SchedulerDriver, status: TaskStatus) {
...
        if (TaskState.isFailed(TaskState.fromMesos(status.getState))
          && taskIdToSlaveId.contains(tid)) {
          // We lost the executor on this slave, so remember that it's gone
          removeExecutor(taskIdToSlaveId(tid), "Lost executor")
        }
        if (TaskState.isFinished(state)) {
          taskIdToSlaveId.remove(tid)
        }
      }
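
For context, the TaskState helpers used above behave roughly like this - paraphrased from memory rather than copied from org.apache.spark.TaskState, so do check the real source - which is why an ordinary task failure trips the removeExecutor branch, not just a lost executor:

  // Paraphrase of org.apache.spark.TaskState (an Enumeration) - a sketch from
  // memory, not a verbatim copy; the name TaskStateSketch is mine.
  object TaskStateSketch extends Enumeration {
    val LAUNCHING, RUNNING, FINISHED, FAILED, KILLED, LOST = Value

    // true for FAILED as well as LOST - so a plain task failure (e.g. a
    // transient S3 error) takes the whole executor with it above
    def isFailed(state: Value): Boolean =
      state == LOST || state == FAILED

    // any terminal state - used above to clean up the taskId -> slaveId map
    def isFinished(state: Value): Boolean =
      Set(FINISHED, FAILED, KILLED, LOST).contains(state)
  }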

I don't know either codebase at all, but it seems odd to remove the whole executor for a merely failed task rather than only for a lost one. I did a quick test (with v1.5.1) where I replaced the first line of that condition with:

   if ((TaskState.fromMesos(status.getState) == TaskState.LOST)

and all seemed well: I faked the problem (using iptables to briefly block access to the S3 endpoint), and the task failed but was retried (on the same executor), succeeded, and the job continued on its merry way.
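
For completeness, the block ended up looking like this with that one line swapped in (reconstructed from the snippet above rather than copied from a diff, so treat the surrounding lines as illustrative):

        // only treat the executor as gone when Mesos reports the task itself
        // as LOST, rather than for an ordinary task failure (e.g. a transient
        // S3 error)
        if ((TaskState.fromMesos(status.getState) == TaskState.LOST)
          && taskIdToSlaveId.contains(tid)) {
          // We lost the executor on this slave, so remember that it's gone
          removeExecutor(taskIdToSlaveId(tid), "Lost executor")
        }
        if (TaskState.isFinished(state)) {
          taskIdToSlaveId.remove(tid)
        }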

Adrian
--
*Adrian Bridgett* | Sysadmin Engineer, OpenSignal <http://www.opensignal.com>
