> On March 12, 2019, 3:56 p.m., Joseph Wu wrote:
> > src/master/master.cpp
> > Lines 11361-11385 (original), 11378-11402 (patched)
> > <https://reviews.apache.org/r/70116/diff/4/?file=2130965#file2130965line11380>
> >
> > I'm curious what would be the proper way to handle operation
> > cleanup/removal.
> >
> > When an operation is transitioned into a terminal state, the master
> > will usually `removeOperation(...)` shortly afterwards. Since we don't
> > decrement the metrics in this case, the number of terminal operations will
> > continue to grow. This seems like the proper behavior.
> >
> > However, in this code, it is possible to remove an agent with
> > non-terminal operations. This means the non-terminal metrics will never be
> > decremented. So you can have a cluster with 0 operations, but the metric
> > for pending operations might be non-zero.
>
> Benno Evers wrote:
> Hm, good question. I think the only ways a slave gets removed while it
> still has operations pending is by either being marked gone, or becoming
> unreachable.
>
> In both cases we already transition the counters to the correct
> `OPERATION_GONE`/`OPERATION_UNREACHABLE` states. (although unfortunately in a
> somewhat non-local manner, that's what https://reviews.apache.org/r/70185/ is
> all about)
>
> For gone operations, this should be fine. However, the problem is that
> when an unreachable slave reregisters, we re-add all operations as new
> operations without decrementing the `operations_unreachable` metric, since at
> the time the `UpdateSlaveMessage` arrives the master already forgot that the
> slave was previously unreachable.
>
> So as far as I can see, the metrics for pending operations should always
> be correct, but it is possible to overcount unreachable operations.
>
> It's not clear if this can be fixed without quite far-reaching
> refactoring in the master. So I think the best course might be to either
> document this behaviour, or remove the `operations_unreachable` metric
> altogether.
>
> What do you think?
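
To make the re-registration scenario quoted above concrete, here is a minimal toy model of it. It is only a sketch under stated assumptions: `OpState`, `OperationMetrics`, `ToyMaster`, and `runScenario` are hypothetical names invented for illustration and are not the actual Mesos master code or its metrics API. The model assumes the master drops all per-agent bookkeeping once an agent becomes unreachable, which is what makes the later decrement impossible.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-ins for operation states; not the real Mesos enum.
enum class OpState { PENDING, UNREACHABLE, GONE, FINISHED };

// Toy per-state counters, standing in for the master's operation metrics.
struct OperationMetrics {
  std::map<OpState, int> count;

  void add(OpState state) { ++count[state]; }

  // Move one operation from one state's counter to another's.
  void transition(OpState from, OpState to) {
    assert(count[from] > 0);
    --count[from];
    ++count[to];
  }
};

// Toy master that only tracks how many pending operations each agent has.
struct ToyMaster {
  OperationMetrics metrics;
  std::map<std::string, int> pendingByAgent;

  void addOperation(const std::string& agent) {
    ++pendingByAgent[agent];
    metrics.add(OpState::PENDING);
  }

  // The agent becomes unreachable: counters move to UNREACHABLE and the
  // master forgets the agent (assumption made for this sketch).
  void markUnreachable(const std::string& agent) {
    const int n = pendingByAgent[agent];
    for (int i = 0; i < n; ++i) {
      metrics.transition(OpState::PENDING, OpState::UNREACHABLE);
    }
    pendingByAgent.erase(agent);
  }

  // The agent re-registers: its operations are re-added as new ones.
  // Nothing here knows to decrement the UNREACHABLE counter.
  void reregister(const std::string& agent, int operations) {
    for (int i = 0; i < operations; ++i) {
      addOperation(agent);
    }
  }
};

// Two operations on one agent, which goes unreachable and then comes back.
OperationMetrics runScenario() {
  ToyMaster master;
  master.addOperation("agent-1");
  master.addOperation("agent-1");
  master.markUnreachable("agent-1");
  master.reregister("agent-1", 2);
  return master.metrics;
}
```

In this model the cluster ends up with two live operations, yet the counters report two pending and two unreachable. The pending count stays correct while the unreachable count is never brought back down, which matches the overcounting described in the quoted comment.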

It may be worthwhile to add a CHECK to make sure we only ever remove terminal (including GONE) or UNREACHABLE operations.

---

Regarding the (possibly endless) recounting of unreachable operations, I'd lean towards overcounting and documenting how and when this is possible, probably in the operator documentation that lists and describes all the metrics.


- Joseph


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70116/#review213632
-----------------------------------------------------------


On March 11, 2019, 12:14 p.m., Benno Evers wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70116/
> -----------------------------------------------------------
>
> (Updated March 11, 2019, 12:14 p.m.)
>
>
> Review request for mesos, Gastón Kleiman, Greg Mann, and Joseph Wu.
>
>
> Bugs: MESOS-8241
>     https://issues.apache.org/jira/browse/MESOS-8241
>
>
> Repository: mesos
>
>
> Description
> -------
>
> This commit adds additional metrics counting the
> number of operations in each state.
>
> Unit tests are added in the subsequent commit.
>
>
> Diffs
> -----
>
>   src/master/master.cpp dc68fc324de7242737123015fbac19a2129778ce
>   src/master/metrics.hpp 4495e65b6bb11f7236335a702c4f61e7c3f9b0aa
>   src/master/metrics.cpp 4dd73fb18a06ce8f75c4c1435dba84ade123bee9
>
>
> Diff: https://reviews.apache.org/r/70116/diff/4/
>
>
> Testing
> -------
>
>
> Thanks,
>
> Benno Evers
>
>