Re: Review Request 53202: Avoided CHECK failure with pre-1.0 agents.

Jiang Yan Xu Thu, 27 Oct 2016 11:25:40 -0700


> On Oct. 26, 2016, 2:43 p.m., Jiang Yan Xu wrote:
> > src/master/master.cpp, lines 6036-6037
> > <https://reviews.apache.org/r/53202/diff/1/?file=1546421#file1546421line6036>
> >
> >     Assuming frameworks are not partition-aware based on the agent verison 
> > doesn't feel right.
> >     
> >     Ultimately it doesn't seem to make a difference in terms of messages 
> > Mesos sends: if the framework is not connected, no update is sent in this 
> > method. Later when it reconnects and reconciles, Master checks its 
> > capability and decides: `(5) Task is unknown, slave is unreachable: 
> > TASK_UNREACHABLE` or `TASK_LOST` if the frameworks is not partition-aware.
> >     
> >     Would it make sense to set the state to `TASK_UNREACHABLE` in this 
> > case? Looks like the only differences it makes are:
> >     
> >     - - metrics: Regardless of framework capabilities, the agent is indeed 
> > unreachable: `TASK_UNREACHABLE` is more in line with the (1.1) master's 
> > logic and the metrics don't reflect 100% of what the master sends out 
> > anyways.
> >     - documentation: we set the state because it makes sense to the master 
> > and not by guessing the framework's capabilities. also worth-mentioning is 
> > the fact that this doesn't violate the API semantics: partition-awareness 
> > is checked at reconciliaton time.
> 
> Neil Conway wrote:
>     I don't think either option is ideal. I opted for `TASK_LOST` because:
>     
>     (a) sending `TASK_LOST` is still the default behavior; 
> partition-awareness is an "experimental" feature in 1.1, and in any case is 
> guarded behind a capability.
>     (b) if you have (very) old agents, you're more likely to have old 
> frameworks that haven't enabled partition-awareness.
> 
> Jiang Yan Xu wrote:
>     OK I am not adamant about `TASK_LOST` vs `TASK_UNREACHABLE` as long as 
> it's not sent out but w.r.t the rationale:
>     
>     a) I don't feel this has to do with the feature being "experimental", 
> imagine we are at 1.2 and we've promoted this feature as stable and we have 
> marked the following
>     
>     ```
>       // In Mesos 1.2, this will only be sent when the framework does NOT
>       // opt-in to the PARTITION_AWARE capability.
>       TASK_LOST = 5;
>     ```
>     we still have this problem, would we solve it differently? If not, we 
> might as well don't take "experimental" into consideration here.
>     
>     b) this is more about documentation but even if we use `TASK_LOST`, can 
> we avoid saying we are basing our decision on the likelyhood of old 
> frameworks because of old agents? i.e., my original comment: *we set the 
> state because it makes sense to the master and not by guessing the 
> framework's capabilities. also worth-mentioning is the fact that this doesn't 
> violate the API semantics: partition-awareness is checked at reconciliaton 
> time*.
> 
> Neil Conway wrote:
>     Re: (a), we can only guarantee that `TASK_LOST` will not be sent to 
> partition-aware frameworks if all the agents in the cluster have been 
> upgraded to Mesos >= 1.2 -- otherwise, an old agent might generate a 
> `TASK_LOST` status update directly (e.g., due to an agent-local task launch 
> failure).
>     
>     Re: (b) sure, that was only a minor point -- but I think we should err on 
> the side of not breaking old frameworks if we can't know the right thing to 
> do, at least for 1.1.


a) Ok I only just realized this now. Thanks for the clarification!
b) True, we should err on the side of not breaking old frameworks but here we 
don't break them either way. Can we update the comment to not say "assuming the 
framework isn't partition-aware" but rather something like "because TASK_LOST 
is still the default state (as of Mesos 1.1) in this situation, even if the 
disconnected framework is partition-aware"?

I just think the decision here is really made but by assuming the framework is 
not partition-aware.


- Jiang Yan


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/53202/#review153939
-----------------------------------------------------------


On Oct. 27, 2016, 10:16 a.m., Neil Conway wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/53202/
> -----------------------------------------------------------
> 
> (Updated Oct. 27, 2016, 10:16 a.m.)
> 
> 
> Review request for mesos, Vinod Kone and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-6483
>     https://issues.apache.org/jira/browse/MESOS-6483
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> We don't guarantee compatibility with pre-1.0 agents. However, since it
> is easy to avoid a CHECK failure in the master when an old agent
> re-registers, it seems worth doing so.
> 
> 
> Diffs
> -----
> 
>   src/master/master.cpp 23ddb995b4ad0fcdb589974308a2e81ececdad31 
> 
> Diff: https://reviews.apache.org/r/53202/diff/
> 
> 
> Testing
> -------
> 
> `make check`
> 
> Disabled the code that fills-in `frameworks.recovered`; verified that 
> `PartitionTest.DisconnectedFramework` dies with a `CHECK` failure if this RR 
> is not applied but passes this with RR applied.
> 
> 
> Thanks,
> 
> Neil Conway
> 
>

Re: Review Request 53202: Avoided CHECK failure with pre-1.0 agents.

Reply via email to