We were just testing Spark v1.5.0 (on Mesos v0.23) and saw something unexpected (according to the event timeline): when a Spark task failed (an intermittent S3 connection failure), the whole executor was removed and never recovered, so the job proceeded more slowly than normal.

Looking at the code, I saw something that seemed a little odd in core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerBackend.scala:

  override def statusUpdate(d: SchedulerDriver, status: TaskStatus) {
...
        if (TaskState.isFailed(TaskState.fromMesos(status.getState))
          && taskIdToSlaveId.contains(tid)) {
          // We lost the executor on this slave, so remember that it's gone
          removeExecutor(taskIdToSlaveId(tid), "Lost executor")
        }
        if (TaskState.isFinished(state)) {
          taskIdToSlaveId.remove(tid)
        }
      }
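
For context, the TaskState helpers used above behave roughly like this - paraphrased from memory rather than copied from org.apache.spark.TaskState, so do check the real source - which is why an ordinary task failure trips the removeExecutor branch, not just a lost executor:

  // Paraphrase of org.apache.spark.TaskState (an Enumeration) - a sketch from
  // memory, not a verbatim copy; the name TaskStateSketch is mine.
  object TaskStateSketch extends Enumeration {
    val LAUNCHING, RUNNING, FINISHED, FAILED, KILLED, LOST = Value

    // true for FAILED as well as LOST - so a plain task failure (e.g. a
    // transient S3 error) takes the whole executor with it above
    def isFailed(state: Value): Boolean =
      state == LOST || state == FAILED

    // any terminal state - used above to clean up the taskId -> slaveId map
    def isFinished(state: Value): Boolean =
      Set(FINISHED, FAILED, KILLED, LOST).contains(state)
  }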

I don't know either codebase at all, but it seems odd to remove the whole executor for a merely failed task rather than only for a lost one. I did a quick test (with v1.5.1) where I replaced the first line of that condition with:

   if ((TaskState.fromMesos(status.getState) == TaskState.LOST)

and all seemed well: I faked the problem (using iptables to briefly block access to the S3 endpoint), and the task failed but was retried (on the same executor), succeeded, and the job continued on its merry way.
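
For completeness, the block ended up looking like this with that one line swapped in (reconstructed from the snippet above rather than copied from a diff, so treat the surrounding lines as illustrative):

        // only treat the executor as gone when Mesos reports the task itself
        // as LOST, rather than for an ordinary task failure (e.g. a transient
        // S3 error)
        if ((TaskState.fromMesos(status.getState) == TaskState.LOST)
          && taskIdToSlaveId.contains(tid)) {
          // We lost the executor on this slave, so remember that it's gone
          removeExecutor(taskIdToSlaveId(tid), "Lost executor")
        }
        if (TaskState.isFinished(state)) {
          taskIdToSlaveId.remove(tid)
        }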

Adrian
--
*Adrian Bridgett* | Sysadmin Engineer, OpenSignal <http://www.opensignal.com>
