> On Feb. 15, 2018, 11:55 p.m., Greg Mann wrote:
> > src/master/master.cpp
> > Lines 7647-7654 (patched)
> > <https://reviews.apache.org/r/65482/diff/3/?file=1961521#file1961521line7647>
> >
> >     I'm sitting here trying to think of ways we might avoid crashing if the 
> > framework subscribes before the operation becomes terminal...
> >     
> >     Would it be reasonable to add an `if (framework == nullptr)` check to 
> > `updateOperation()` so that we only recover resources if the framework is 
> > known to the master?
> 
> Greg Mann wrote:
>     Er... wait that doesn't make sense :) I guess when we receive the 
> operation update, we have no way of knowing whether or not the framework had 
> subscribed when the master learned about the pending operation. As a 
> workaround for now, we could store in a set the operation UUIDs of operations 
> for which we do not track allocated resources (i.e., operations which hit 
> this block of code). Then, in `updateOperation` we could avoid recovering 
> resources if the operation's UUID is in the set?
> 
> Benjamin Bannier wrote:
>     No matter what we do here, we will already have entered dangerous 
> territory with `CHECK` failures looming as soon as we added such an operation 
> to master state, since we cannot update the allocator on these used resources 
> (update the allocation to reflect operation results, or recover). We might 
> also end up unknowingly oversubscribing the resources used by the operation.
>     
>     I am unsure whether working around this by e.g., no updating the 
> allocator when the operation becomes terminal is a good way forward since it 
> seems to delay the needed master failover for even longer. With the approach 
> I proposed we would failover when this operation gets terminal (i.e., when we 
> can safely reconcile this particular operation).
>     
>     I added the framework ID to the log output since it might be useful for 
> operators debugging such scenarios.

I think I agree - let's merge this improvement for now and try to resolve 
MESOS-8582 for Mesos 1.6.


- Greg


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65482/#review197641
-----------------------------------------------------------


On Feb. 16, 2018, 2:12 p.m., Benjamin Bannier wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/65482/
> -----------------------------------------------------------
> 
> (Updated Feb. 16, 2018, 2:12 p.m.)
> 
> 
> Review request for mesos, Greg Mann, Jie Yu, and Jan Schlicht.
> 
> 
> Bugs: MESOS-8536
>     https://issues.apache.org/jira/browse/MESOS-8536
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> This patch fixes the handling of non-terminal operations learned by a
> newly elected master after a master failover, so that only these
> operations are counted as using resources. Previously we did not count
> any operations as using resources which by accident produced expected
> behavior if the operation was already terminal when the master learned
> about them.
> 
> We do not address the issue of being unable to properly account for
> operations triggered by frameworks unknown to the master, see
> MESOS-8582. Instead we emit a warning for now since the master might
> continue to abort due to assertion failures due to incomplete resource
> accounting.
> 
> 
> Diffs
> -----
> 
>   src/master/master.cpp b06d7a6e2fbbb81b97eaf537d5b6745c73dc867d 
> 
> 
> Diff: https://reviews.apache.org/r/65482/diff/4/
> 
> 
> Testing
> -------
> 
> `make check`, also tested with a version of the test added in r/65045 which 
> triggered this issue.
> 
> 
> Thanks,
> 
> Benjamin Bannier
> 
>

Reply via email to