Re: Review Request 23912: Fix MESOS-947: Slave should properly handle a killTask() that arrives between runTask() and _runTask()

Bernd Mathiske Fri, 17 Oct 2014 09:24:36 -0700


> On Oct. 10, 2014, 11:24 a.m., Vinod Kone wrote:
> > src/slave/slave.cpp, line 1413
> > <https://reviews.apache.org/r/23912/diff/8/?file=716689#file716689line1413>
> >
> >     It is weird to me that you remove the task here but (potentially) 
> > remove the executor up in _runTask(). It's not obvious to me why you made 
> > that choice. If there is a good reason, please add a comment here.
> 
> Bernd Mathiske wrote:
>     The task is removed in killTask(), because later on the information that 
> it should be removed is no longer easily at hand.
>     
>     I looked at lines 1194, 1195 in _runTask():
>       Framework* framework = getFramework(frameworkId);
>       CHECK_NOTNULL(framework);
>       
>     If I removed the framework earlier than this, this check would fire. If I 
> want to avoid that, I need to write extra code (e.g. prevent _runTask() from 
> happening). Assembling all framework removals in the second half of 
> _runTask() looks like good defensive practice to me.
>     
>     There is no harm in removing the framework in _runTask() vs. killTask() 
> since _runTask() must eventually happen. But by waiting with removing the 
> framework until _runTask(), we do not need to think about what implications 
> concurrent starting and killing of multiple tasks with the same executor 
> might have.
> 
> Vinod Kone wrote:
>     I don't think I follow what you mean by
>     
>     "The task is removed in killTask(), because later on the information that 
> it should be removed is no longer easily at hand"
>     
>     and 
>     
>     "...we do not need to think about what implications concurrent starting 
> and killing of multiple tasks with the same executor might have"
>     
>     
>     Anyway, here's why I think we should remove the framework in killTask().
>     
>     It doesn't require non-local reasoning. Right now, someone reading the 
> killTask() method should know that a pending task means a "_runTask()" 
> dispatch is in the queue and that the executor and framework will be removed 
> in '_runTask()'. Instead if killTask() can remove the framework right away, 
> it's easy to understand.
>     
>     It is much simple for methods that take "FrameworkID" (instead of 
> Framework*) to check if the framework exists. If the framework doesn't exist, 
> for whatever reason, it could just bail.
>     
>     I think these semantics would make future contributors' life easy when 
> making changes to these methods.
>     
>     Changing the CHECK in '_runTask()' to an if condition doesn't seem too 
> bad to me.
>     
>     ```
>     _runTask()
>     {
>        if (framework == NULL) {
>          LOG(WARNING) << "Ignoring run task " << task........
>                       << " because the framework does not exist";
>                       
>          return;
>        }
>     }
>     ```
>     
>     Makes sense?
> 
> Bernd Mathiske wrote:
>     This is more confusing for the user. He will then ask himself "why does 
> the framework not exist"? And then have to track back in the log to find that 
> the task has been killed. Making it simpler for users trumps making it 
> simpler for us Mesos developers IMHO.


I tried your approach and it worked after moving the statusUpdate in killTask 
ahead of removing the framework.


- Bernd


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/23912/#review56016
-----------------------------------------------------------


On Oct. 15, 2014, 1:06 a.m., Bernd Mathiske wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/23912/
> -----------------------------------------------------------
> 
> (Updated Oct. 15, 2014, 1:06 a.m.)
> 
> 
> Review request for mesos and Vinod Kone.
> 
> 
> Bugs: MESOS-947
>     https://issues.apache.org/jira/browse/MESOS-947
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> Fixes MESOS-947 "Slave should properly handle a killTask() that arrives 
> between runTask() and _runTask()".
> 
> Slave::killTask() did not check for task in question combination to be 
> "pending" (i.e. Slave::runTask had happened, but Slave::_runTask had not yet) 
> and then erroneously assumed that Slave::runTask() had not been executed. The 
> task was then marked "LOST" instead of "KILLED". But Slave::runTask had 
> already scheduled Slave::_runTask to follow. Now the entry for being 
> "pending" is removed, and the task is marked "KILLED", and _runTask gets 
> informed about this. It checks whether the task in question is currently 
> "pending" and if it is not, then it infers that the task has been killed and 
> does not erroneously try to complete launching it.
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.hpp 342b09fc084c20d98d096bb129830440179c092c 
>   src/slave/slave.cpp 0e342ed35e3db3b68f9f32b6cf4ace23e4a4db38 
>   src/tests/mesos.hpp 957e2233cc11c438fd80d3b6d1907a1223093104 
>   src/tests/mesos.cpp 3dcb2acd5ad4ab5e3a7b4fe524ee077558112773 
>   src/tests/slave_tests.cpp f585bdd20ae1af466f2c1b4d85331ac67451552f 
> 
> Diff: https://reviews.apache.org/r/23912/diff/
> 
> 
> Testing
> -------
> 
> Wrote a unit test that reliably created the situation described in the 
> ticket. Observed that TASK_LOST and the listed log output occurred. This 
> pointed directly to the lines in killTask() where the problem is rooted. Ran 
> the test after fixing, it succeeded. Checked the log. It looks like a "clean 
> kill" now :-)
> 
> 
> Thanks,
> 
> Bernd Mathiske
> 
>

Re: Review Request 23912: Fix MESOS-947: Slave should properly handle a killTask() that arrives between runTask() and _runTask()

Reply via email to