> On March 12, 2019, 3:56 p.m., Joseph Wu wrote:
> > src/master/master.cpp
> > Lines 11361-11385 (original), 11378-11402 (patched)
> > <https://reviews.apache.org/r/70116/diff/4/?file=2130965#file2130965line11380>
> >
> >     I'm curious what would be the proper way to handle operation 
> > cleanup/removal.
> >     
> >     When an operation is transitioned into a terminal state, the master 
> > will usually `removeOperation(...)` shortly afterwards.  Since we don't 
> > decrement the metrics in this case, the number of terminal operations will 
> > continue to grow.  This seems like the proper behavior.
> >     
> >     However, in this code, it is possible to remove an agent with 
> > non-terminal operations.  This means the non-terminal metrics will never be 
> > decremented.  So you can have a cluster with 0 operations, but the metric 
> > for pending operations might be non-zero.
> 
> Benno Evers wrote:
>     Hm, good question. I think the only ways a slave gets removed while it 
> still has operations pending are by either being marked gone or becoming 
> unreachable.
>     
>     In both cases we already transition the counters to the correct 
> `OPERATION_GONE`/`OPERATION_UNREACHABLE` states. (although unfortunately in a 
> somewhat non-local manner, that's what https://reviews.apache.org/r/70185/ is 
> all about)
>     
>     For gone operations, this should be fine. However, the problem is that 
> when an unreachable slave reregisters, we re-add all operations as new 
> operations without decrementing the `operations_unreachable` metric, since at 
> the time the `UpdateSlaveMessage` arrives the master already forgot that the 
> slave was previously unreachable.
>     
>     So as far as I can see, the metrics for pending operations should always 
> be correct, but it is possible to overcount unreachable operations.
>     
>     It's not clear if this can be fixed without quite far-reaching 
> refactoring in the master. So I think the best course might be to either 
> document this behaviour, or remove the `operations_unreachable` metric 
> altogether.
>     
>     What do you think?

It may be worthwhile to add a CHECK to make sure we only ever remove terminal 
(including GONE) or UNREACHABLE operations.
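
Roughly what I have in mind, as a standalone sketch rather than actual master code. The `OpState` enum and `isRemovable` helper are hypothetical stand-ins (the real operation states have more values); in `master.cpp` the predicate would be wrapped in a glog `CHECK` at the point where the operation is removed.

```cpp
#include <cassert>

// Hypothetical, simplified stand-in for the operation states discussed in
// this review. The real state enum has more values than this.
enum class OpState { PENDING, FINISHED, FAILED, GONE, UNREACHABLE };

// Returns true if an operation in state `s` may be removed from the master,
// i.e. it is terminal (including GONE) or UNREACHABLE.
// Treating FINISHED/FAILED/GONE as the terminal set is an assumption made
// for this sketch only.
bool isRemovable(OpState s)
{
  switch (s) {
    case OpState::FINISHED:
    case OpState::FAILED:
    case OpState::GONE:
    case OpState::UNREACHABLE:
      return true;
    case OpState::PENDING:
      return false;
  }
  return false; // Unknown value: be conservative.
}
```

With something like this, removing an operation that is still pending would trip the check instead of silently leaving a stale non-terminal metric behind.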

---

Regarding the (potentially endless) recounting of unreachable operations, I'd 
lean towards accepting the overcount and documenting how and when it can 
happen, probably in the operator documentation that lists and describes all 
the metrics.
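
For the documentation, the scenario could be illustrated with a toy model like the one below. This is only a sketch of the counting behaviour described above, not the master's implementation; `OperationMetrics`, `markUnreachable`, and `reregister` are hypothetical names.

```cpp
#include <map>

// Hypothetical, simplified operation states for this illustration.
enum class OpState { PENDING, UNREACHABLE };

// Toy per-state counters standing in for the per-state operation metrics.
struct OperationMetrics
{
  std::map<OpState, int> count;

  void add(OpState s) { ++count[s]; }

  void transition(OpState from, OpState to)
  {
    --count[from];
    ++count[to];
  }
};

// The agent becomes unreachable: each of its pending operations is
// transitioned to UNREACHABLE, so the pending counter stays correct.
void markUnreachable(OperationMetrics& m, int pendingOps)
{
  for (int i = 0; i < pendingOps; ++i) {
    m.transition(OpState::PENDING, OpState::UNREACHABLE);
  }
}

// The agent reregisters: its operations are re-added as new operations.
// Nothing decrements the UNREACHABLE counter here, because at this point
// the previous unreachable state is no longer known.
void reregister(OperationMetrics& m, int ops)
{
  for (int i = 0; i < ops; ++i) {
    m.add(OpState::PENDING);
  }
}
```

In this model, every unreachable/reregister cycle of an agent with N pending operations leaves the pending count at N but adds another N to the unreachable count.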


- Joseph


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70116/#review213632
-----------------------------------------------------------


On March 11, 2019, 12:14 p.m., Benno Evers wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70116/
> -----------------------------------------------------------
> 
> (Updated March 11, 2019, 12:14 p.m.)
> 
> 
> Review request for mesos, Gastón Kleiman, Greg Mann, and Joseph Wu.
> 
> 
> Bugs: MESOS-8241
>     https://issues.apache.org/jira/browse/MESOS-8241
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> This commit adds additional metrics counting the
> number of operations in each state.
> 
> Unit tests are added in the subsequent commit.
> 
> 
> Diffs
> -----
> 
>   src/master/master.cpp dc68fc324de7242737123015fbac19a2129778ce 
>   src/master/metrics.hpp 4495e65b6bb11f7236335a702c4f61e7c3f9b0aa 
>   src/master/metrics.cpp 4dd73fb18a06ce8f75c4c1435dba84ade123bee9 
> 
> 
> Diff: https://reviews.apache.org/r/70116/diff/4/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Benno Evers
> 
>
