> On Oct. 26, 2016, 9:43 p.m., Jiang Yan Xu wrote: > > src/master/master.cpp, lines 6036-6037 > > <https://reviews.apache.org/r/53202/diff/1/?file=1546421#file1546421line6036> > > > > Assuming frameworks are not partition-aware based on the agent verison > > doesn't feel right. > > > > Ultimately it doesn't seem to make a difference in terms of messages > > Mesos sends: if the framework is not connected, no update is sent in this > > method. Later when it reconnects and reconciles, Master checks its > > capability and decides: `(5) Task is unknown, slave is unreachable: > > TASK_UNREACHABLE` or `TASK_LOST` if the frameworks is not partition-aware. > > > > Would it make sense to set the state to `TASK_UNREACHABLE` in this > > case? Looks like the only differences it makes are: > > > > - - metrics: Regardless of framework capabilities, the agent is indeed > > unreachable: `TASK_UNREACHABLE` is more in line with the (1.1) master's > > logic and the metrics don't reflect 100% of what the master sends out > > anyways. > > - documentation: we set the state because it makes sense to the master > > and not by guessing the framework's capabilities. also worth-mentioning is > > the fact that this doesn't violate the API semantics: partition-awareness > > is checked at reconciliaton time. > > Neil Conway wrote: > I don't think either option is ideal. I opted for `TASK_LOST` because: > > (a) sending `TASK_LOST` is still the default behavior; > partition-awareness is an "experimental" feature in 1.1, and in any case is > guarded behind a capability. > (b) if you have (very) old agents, you're more likely to have old > frameworks that haven't enabled partition-awareness. > > Jiang Yan Xu wrote: > OK I am not adamant about `TASK_LOST` vs `TASK_UNREACHABLE` as long as > it's not sent out but w.r.t the rationale: > > a) I don't feel this has to do with the feature being "experimental", > imagine we are at 1.2 and we've promoted this feature as stable and we have > marked the following > > ``` > // In Mesos 1.2, this will only be sent when the framework does NOT > // opt-in to the PARTITION_AWARE capability. > TASK_LOST = 5; > ``` > we still have this problem, would we solve it differently? If not, we > might as well don't take "experimental" into consideration here. > > b) this is more about documentation but even if we use `TASK_LOST`, can > we avoid saying we are basing our decision on the likelyhood of old > frameworks because of old agents? i.e., my original comment: *we set the > state because it makes sense to the master and not by guessing the > framework's capabilities. also worth-mentioning is the fact that this doesn't > violate the API semantics: partition-awareness is checked at reconciliaton > time*.
Re: (a), we can only guarantee that `TASK_LOST` will not be sent to partition-aware frameworks if all the agents in the cluster have been upgraded to Mesos >= 1.2 -- otherwise, an old agent might generate a `TASK_LOST` status update directly (e.g., due to an agent-local task launch failure). Re: (b) sure, that was only a minor point -- but I think we should err on the side of not breaking old frameworks if we can't know the right thing to do, at least for 1.1. - Neil ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/53202/#review153939 ----------------------------------------------------------- On Oct. 26, 2016, 7:51 p.m., Neil Conway wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/53202/ > ----------------------------------------------------------- > > (Updated Oct. 26, 2016, 7:51 p.m.) > > > Review request for mesos, Vinod Kone and Jiang Yan Xu. > > > Bugs: MESOS-6483 > https://issues.apache.org/jira/browse/MESOS-6483 > > > Repository: mesos > > > Description > ------- > > We don't guarantee compatibility with pre-1.0 agents. However, since it > is easy to avoid a CHECK failure in the master when an old agent > re-registers, it seems worth doing so. > > > Diffs > ----- > > src/master/master.cpp 23ddb995b4ad0fcdb589974308a2e81ececdad31 > > Diff: https://reviews.apache.org/r/53202/diff/ > > > Testing > ------- > > `make check` > > Disabled the code that fills-in `frameworks.recovered`; verified that > `PartitionTest.DisconnectedFramework` dies with a `CHECK` failure if this RR > is not applied but passes this with RR applied. > > > Thanks, > > Neil Conway > >