Re: Review Request 23912: Fix MESOS-947: Slave should properly handle a killTask() that arrives between runTask() and _runTask()

Vinod Kone Thu, 07 Aug 2014 18:23:10 -0700


> On Aug. 5, 2014, 9:29 p.m., Vinod Kone wrote:
> > src/slave/slave.cpp, line 1337
> > <https://reviews.apache.org/r/23912/diff/3/?file=643812#file643812line1337>
> >
> >     s/executor is running/it was launched/


doesn't look like this was addressed.


> On Aug. 5, 2014, 9:29 p.m., Vinod Kone wrote:
> > src/slave/slave.cpp, line 1339
> > <https://reviews.apache.org/r/23912/diff/3/?file=643812#file643812line1339>
> >
> >     s/Task killed early/Task killed before it was launched/

doesn't look like this was addressed.


> On Aug. 5, 2014, 9:29 p.m., Vinod Kone wrote:
> > src/slave/slave.cpp, lines 1330-1341
> > <https://reviews.apache.org/r/23912/diff/3/?file=643812#file643812line1330>
> >
> >     I would recommend pulling this logic outside 'if (executor == NULL)' to 
> > #1326 to make it easy to reason about.
> >     
> >     also, task ids are globally unique. there shouldn't be multiple 
> > executors with the same task id. you could actually do a 
> > CHECK(executorIds.size() == 1) to confirm that.
> 
> Vinod Kone wrote:
>     this wasn't addressed either. is there another diff coming?
> 
> Bernd Mathiske wrote:
>     Moving the code outside 'if (executor == NULL)' looks like more control 
> flow to me, not less. There should not be any pending executor if there is an 
> assigned one, which 'if (executor == NULL)' tests. We would then have to 
> remember the result of the check for pending executors and take the same 
> action in case there isn't an assigned one. What am I missing? I am dropping 
> this part of the issue, I suppose.
>     
>     Adding CHECK(executorIds.size() == 1).

framework->getExecutor(taskId) returns NULL if the task is not assigned to the 
executor (i.e., _runTask() is not called yet) __or__ the executor is not 
running. That's what confused me the most. Also, the fact that your log line 
says "..before executor is running" threw me off, because thats not necessarily 
true.

For me, the following flow makes it easy to understand whats happening.

// If the task hasn't been launched yet send TASK_KIILED.
hashset<ExecutorID> executorIds = framework->pending.getKeys(taskId);
if (!executorIds.empty()) {
  CHECK_EQ(1, executorIds.size());
  framework->pending.remove(executorIds.front(), taskId);

  LOG(WARNING) << "Killing task " << taskId
               << " of framework " << frameworkId
               << " before executor is running";

  const StatusUpdate& update = protobuf::createStatusUpdate(
      frameworkId, info.id(), taskId, TASK_KILLED, "Task killed before it was 
launched");

  statusUpdate(update, UPID());
  return;
}

// If the executor doesn't exist, send TASK_LOST.
Executor* executor = framework->getExecutor(taskId);
.....same code as before.....


Let me know if I missed/misunderstood something.


- Vinod


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/23912/#review49552
-----------------------------------------------------------


On Aug. 6, 2014, 11:51 p.m., Bernd Mathiske wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/23912/
> -----------------------------------------------------------
> 
> (Updated Aug. 6, 2014, 11:51 p.m.)
> 
> 
> Review request for mesos.
> 
> 
> Bugs: MESOS-947
>     https://issues.apache.org/jira/browse/MESOS-947
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> Fixes MESOS-947 "Slave should properly handle a killTask() that arrives 
> between runTask() and _runTask()".
> 
> Slave::killTask() did not check for task in question combination to be 
> "pending" (i.e. Slave::runTask had happened, but Slave::_runTask had not yet) 
> and then erroneously assumed that Slave::runTask() had not been executed. The 
> task was then marked "LOST" instead of "KILLED". But Slave::runTask had 
> already scheduled Slave::_runTask to follow. Now the entry for being 
> "pending" is removed, and the task is marked "KILLED", and _runTask gets 
> informed about this. It checks whether the task in question is currently 
> "pending" and if it is not, then it infers that the task has been killed and 
> does not erroneously try to complete launching it.
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.hpp a896bb66db5d8cd27ef02b6498c9db93cb0d525f 
>   src/slave/slave.cpp 1d5691836822c8587e1aa8ed24860a8012c67a6e 
>   src/tests/mesos.hpp 75c66fda2485afa0d4541e710780d90b3411839a 
>   src/tests/mesos.cpp 35c94fa908ad728ea92a7d1bfcbe90d57b1b83d9 
>   src/tests/slave_tests.cpp e45255a6f699e51bf09397da95a5a11edbabe591 
> 
> Diff: https://reviews.apache.org/r/23912/diff/
> 
> 
> Testing
> -------
> 
> Wrote a unit test that reliably created the situation described in the 
> ticket. Observed that TASK_LOST and the listed log output occurred. This 
> pointed directly to the lines in killTask() where the problem is rooted. Ran 
> the test after fixing, it succeeded. Checked the log. It looks like a "clean 
> kill" now :-)
> 
> 
> Thanks,
> 
> Bernd Mathiske
> 
>

Re: Review Request 23912: Fix MESOS-947: Slave should properly handle a killTask() that arrives between runTask() and _runTask()

Reply via email to