> On Feb. 15, 2018, 11:55 p.m., Greg Mann wrote: > > src/master/master.cpp > > Lines 7647-7654 (patched) > > <https://reviews.apache.org/r/65482/diff/3/?file=1961521#file1961521line7647> > > > > I'm sitting here trying to think of ways we might avoid crashing if the > > framework subscribes before the operation becomes terminal... > > > > Would it be reasonable to add an `if (framework == nullptr)` check to > > `updateOperation()` so that we only recover resources if the framework is > > known to the master? > > Greg Mann wrote: > Er... wait that doesn't make sense :) I guess when we receive the > operation update, we have no way of knowing whether or not the framework had > subscribed when the master learned about the pending operation. As a > workaround for now, we could store in a set the operation UUIDs of operations > for which we do not track allocated resources (i.e., operations which hit > this block of code). Then, in `updateOperation` we could avoid recovering > resources if the operation's UUID is in the set? > > Benjamin Bannier wrote: > No matter what we do here, we will already have entered dangerous > territory with `CHECK` failures looming as soon as we added such an operation > to master state, since we cannot update the allocator on these used resources > (update the allocation to reflect operation results, or recover). We might > also end up unknowingly oversubscribing the resources used by the operation. > > I am unsure whether working around this by e.g., no updating the > allocator when the operation becomes terminal is a good way forward since it > seems to delay the needed master failover for even longer. With the approach > I proposed we would failover when this operation gets terminal (i.e., when we > can safely reconcile this particular operation). > > I added the framework ID to the log output since it might be useful for > operators debugging such scenarios.
I think I agree - let's merge this improvement for now and try to resolve MESOS-8582 for Mesos 1.6. - Greg ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/65482/#review197641 ----------------------------------------------------------- On Feb. 16, 2018, 2:12 p.m., Benjamin Bannier wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/65482/ > ----------------------------------------------------------- > > (Updated Feb. 16, 2018, 2:12 p.m.) > > > Review request for mesos, Greg Mann, Jie Yu, and Jan Schlicht. > > > Bugs: MESOS-8536 > https://issues.apache.org/jira/browse/MESOS-8536 > > > Repository: mesos > > > Description > ------- > > This patch fixes the handling of non-terminal operations learned by a > newly elected master after a master failover, so that only these > operations are counted as using resources. Previously we did not count > any operations as using resources which by accident produced expected > behavior if the operation was already terminal when the master learned > about them. > > We do not address the issue of being unable to properly account for > operations triggered by frameworks unknown to the master, see > MESOS-8582. Instead we emit a warning for now since the master might > continue to abort due to assertion failures due to incomplete resource > accounting. > > > Diffs > ----- > > src/master/master.cpp b06d7a6e2fbbb81b97eaf537d5b6745c73dc867d > > > Diff: https://reviews.apache.org/r/65482/diff/4/ > > > Testing > ------- > > `make check`, also tested with a version of the test added in r/65045 which > triggered this issue. > > > Thanks, > > Benjamin Bannier > >