Review Request 70116: Added metrics for offer operation feedback.

2019-03-04 Thread Benno Evers

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70116/
---

Review request for mesos, Gastón Kleiman, Greg Mann, and Joseph Wu.


Bugs: MESOS-8241
https://issues.apache.org/jira/browse/MESOS-8241


Repository: mesos


Description
---

This commit adds the following additional metrics
to the master:

- master/operations_pending
- master/operations_recovering
- master/operations_finished
- master/operations_failed
- master/operations_error
- master/operations_dropped
- master/operations_unreachable
- master/operations_gone_by_operator

Unit tests are added in the subsequent commit.


Diffs
-

  src/master/master.hpp aceab3455adfdf7ff6e168e033316997a4d36fb5 
  src/master/master.cpp 665c1c7979a5ae4ecec0d5a68e59c5419049a4d5 
  src/master/metrics.hpp 4495e65b6bb11f7236335a702c4f61e7c3f9b0aa 
  src/master/metrics.cpp 4dd73fb18a06ce8f75c4c1435dba84ade123bee9 


Diff: https://reviews.apache.org/r/70116/diff/1/


Testing
---


Thanks,

Benno Evers



Re: Review Request 70116: Added metrics for offer operation feedback.

2019-03-04 Thread Greg Mann

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70116/#review213404
---




src/master/master.cpp
Lines 2261 (patched)


Why not use `updateOperationMetrics(OPERATION_ERROR, 1);` here?



src/master/master.cpp
Lines 9223-9224 (patched)


s/agent/master/

Yea I think we can crash here.



src/master/master.cpp
Lines 11672-11673 (patched)


In the case of operation status update retries, this will be unnecessary 
work (i.e. we will be decrementing and then incrementing the same metric). 
Perhaps we should enclose this in a conditional which checks for that equality? 
It will look a bit strange next to the below conditional, but as the comment 
indicates we need to rethink this deduplication:
```
  // TODO(gkleiman): Revisit the de-duplication logic (MESOS-8441) - if two
  // different terminal statuses arrive, we could end up with different 
states
  // in `latest_status` and the front of statuses list.
  if (operation->statuses().empty() ||
  *(operation->statuses().rbegin()) != status) {
operation->add_statuses()->CopyFrom(status);
  }
```



src/master/metrics.hpp
Lines 63-67 (patched)


Is this comment accurate? Looks like there is indeed an 
`operations_unreachable` metric below.

I think we could probably skip the metrics for such states as the comment 
suggests.


- Greg Mann


On March 4, 2019, 5:01 p.m., Benno Evers wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70116/
> ---
> 
> (Updated March 4, 2019, 5:01 p.m.)
> 
> 
> Review request for mesos, Gastón Kleiman, Greg Mann, and Joseph Wu.
> 
> 
> Bugs: MESOS-8241
> https://issues.apache.org/jira/browse/MESOS-8241
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> This commit adds the following additional metrics
> to the master:
> 
> - master/operations_pending
> - master/operations_recovering
> - master/operations_finished
> - master/operations_failed
> - master/operations_error
> - master/operations_dropped
> - master/operations_unreachable
> - master/operations_gone_by_operator
> 
> Unit tests are added in the subsequent commit.
> 
> 
> Diffs
> -
> 
>   src/master/master.hpp aceab3455adfdf7ff6e168e033316997a4d36fb5 
>   src/master/master.cpp 665c1c7979a5ae4ecec0d5a68e59c5419049a4d5 
>   src/master/metrics.hpp 4495e65b6bb11f7236335a702c4f61e7c3f9b0aa 
>   src/master/metrics.cpp 4dd73fb18a06ce8f75c4c1435dba84ade123bee9 
> 
> 
> Diff: https://reviews.apache.org/r/70116/diff/1/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Benno Evers
> 
>



Re: Review Request 70116: Added metrics for offer operation feedback.

2019-03-04 Thread Greg Mann


> On March 4, 2019, 10:20 p.m., Greg Mann wrote:
> > src/master/metrics.hpp
> > Lines 63-67 (patched)
> > 
> >
> > Is this comment accurate? Looks like there is indeed an 
> > `operations_unreachable` metric below.
> > 
> > I think we could probably skip the metrics for such states as the 
> > comment suggests.

I think that `operations_recovering`, `operations_unreachable`, and 
`operations_gone_by_operator` could be removed, since those metrics would not 
be accurate after a master failover (i.e., master won't know how many 
operations were on an agent which was marked GONE after master failover). I 
think the other metrics still make sense.


- Greg


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70116/#review213404
---


On March 4, 2019, 5:01 p.m., Benno Evers wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70116/
> ---
> 
> (Updated March 4, 2019, 5:01 p.m.)
> 
> 
> Review request for mesos, Gastón Kleiman, Greg Mann, and Joseph Wu.
> 
> 
> Bugs: MESOS-8241
> https://issues.apache.org/jira/browse/MESOS-8241
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> This commit adds the following additional metrics
> to the master:
> 
> - master/operations_pending
> - master/operations_recovering
> - master/operations_finished
> - master/operations_failed
> - master/operations_error
> - master/operations_dropped
> - master/operations_unreachable
> - master/operations_gone_by_operator
> 
> Unit tests are added in the subsequent commit.
> 
> 
> Diffs
> -
> 
>   src/master/master.hpp aceab3455adfdf7ff6e168e033316997a4d36fb5 
>   src/master/master.cpp 665c1c7979a5ae4ecec0d5a68e59c5419049a4d5 
>   src/master/metrics.hpp 4495e65b6bb11f7236335a702c4f61e7c3f9b0aa 
>   src/master/metrics.cpp 4dd73fb18a06ce8f75c4c1435dba84ade123bee9 
> 
> 
> Diff: https://reviews.apache.org/r/70116/diff/1/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Benno Evers
> 
>



Re: Review Request 70116: Added metrics for offer operation feedback.

2019-03-07 Thread Benno Evers

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70116/
---

(Updated March 7, 2019, 5:01 p.m.)


Review request for mesos, Gastón Kleiman, Greg Mann, and Joseph Wu.


Changes
---

Added per-operation metrics; addressed comments.


Bugs: MESOS-8241
https://issues.apache.org/jira/browse/MESOS-8241


Repository: mesos


Description (updated)
---

This commit adds additional metrics counting the
number of operations in each state.

Unit tests are added in the subsequent commit.


Diffs (updated)
-

  src/master/master.hpp aceab3455adfdf7ff6e168e033316997a4d36fb5 
  src/master/master.cpp 665c1c7979a5ae4ecec0d5a68e59c5419049a4d5 
  src/master/metrics.hpp 4495e65b6bb11f7236335a702c4f61e7c3f9b0aa 
  src/master/metrics.cpp 4dd73fb18a06ce8f75c4c1435dba84ade123bee9 


Diff: https://reviews.apache.org/r/70116/diff/2/

Changes: https://reviews.apache.org/r/70116/diff/1-2/


Testing
---


Thanks,

Benno Evers



Re: Review Request 70116: Added metrics for offer operation feedback.

2019-03-07 Thread Gastón Kleiman

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70116/#review213541
---




src/master/master.hpp
Lines 941-944 (patched)


Is this used anywhere?



src/master/metrics.hpp
Lines 83 (patched)


s/per operation/per operation type/



src/master/metrics.cpp
Lines 158-159 (original), 160-161 (patched)


Should we add a similar metric for operation status updates?



src/master/metrics.cpp
Lines 174-177 (original), 176-179 (patched)


Should we add a similar metrics for operation status updates?



src/master/metrics.cpp
Lines 386-388 (patched)


This is weirdly formatted, I think the following follows the style guide 
and is more readable:

```
if (type == Offer::Operation::UNKNOWN ||
type == Offer::Operation::LAUNCH ||
type == Offer::Operation::LAUNCH_GROUP) {
```



src/master/metrics.cpp
Lines 637-651 (patched)


Can these counters be decremented? If not, should we add a CHECK to ensure 
that `delta` is always positive?



src/master/metrics.cpp
Lines 684 (patched)


Can counters for terminal states be decremented? I guess not? If I'm 
correct, then we could add a `CHECK` here.


- Gastón Kleiman


On March 7, 2019, 9:01 a.m., Benno Evers wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70116/
> ---
> 
> (Updated March 7, 2019, 9:01 a.m.)
> 
> 
> Review request for mesos, Gastón Kleiman, Greg Mann, and Joseph Wu.
> 
> 
> Bugs: MESOS-8241
> https://issues.apache.org/jira/browse/MESOS-8241
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> This commit adds additional metrics counting the
> number of operations in each state.
> 
> Unit tests are added in the subsequent commit.
> 
> 
> Diffs
> -
> 
>   src/master/master.hpp aceab3455adfdf7ff6e168e033316997a4d36fb5 
>   src/master/master.cpp 665c1c7979a5ae4ecec0d5a68e59c5419049a4d5 
>   src/master/metrics.hpp 4495e65b6bb11f7236335a702c4f61e7c3f9b0aa 
>   src/master/metrics.cpp 4dd73fb18a06ce8f75c4c1435dba84ade123bee9 
> 
> 
> Diff: https://reviews.apache.org/r/70116/diff/2/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Benno Evers
> 
>



Re: Review Request 70116: Added metrics for offer operation feedback.

2019-03-08 Thread Benno Evers

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70116/
---

(Updated March 8, 2019, 11:48 p.m.)


Review request for mesos, Gastón Kleiman, Greg Mann, and Joseph Wu.


Bugs: MESOS-8241
https://issues.apache.org/jira/browse/MESOS-8241


Repository: mesos


Description
---

This commit adds additional metrics counting the
number of operations in each state.

Unit tests are added in the subsequent commit.


Diffs (updated)
-

  src/master/master.cpp dc68fc324de7242737123015fbac19a2129778ce 
  src/master/metrics.hpp 4495e65b6bb11f7236335a702c4f61e7c3f9b0aa 
  src/master/metrics.cpp 4dd73fb18a06ce8f75c4c1435dba84ade123bee9 


Diff: https://reviews.apache.org/r/70116/diff/3/

Changes: https://reviews.apache.org/r/70116/diff/2-3/


Testing
---


Thanks,

Benno Evers



Re: Review Request 70116: Added metrics for offer operation feedback.

2019-03-08 Thread Benno Evers


> On March 8, 2019, 12:30 a.m., Gastón Kleiman wrote:
> > src/master/master.hpp
> > Lines 941-944 (patched)
> > 
> >
> > Is this used anywhere?

It is not. Removed!


> On March 8, 2019, 12:30 a.m., Gastón Kleiman wrote:
> > src/master/metrics.cpp
> > Lines 158-159 (original), 160-161 (patched)
> > 
> >
> > Should we add a similar metric for operation status updates?

Sounds reasonable.


> On March 8, 2019, 12:30 a.m., Gastón Kleiman wrote:
> > src/master/metrics.cpp
> > Lines 174-177 (original), 176-179 (patched)
> > 
> >
> > Should we add a similar metrics for operation status updates?

Sounds equally reasonable.


> On March 8, 2019, 12:30 a.m., Gastón Kleiman wrote:
> > src/master/metrics.cpp
> > Lines 386-388 (patched)
> > 
> >
> > This is weirdly formatted, I think the following follows the style 
> > guide and is more readable:
> > 
> > ```
> > if (type == Offer::Operation::UNKNOWN ||
> > type == Offer::Operation::LAUNCH ||
> > type == Offer::Operation::LAUNCH_GROUP) {
> > ```

Both styles are explicitly permitted by the style guide, as long as the same 
style is used consistently within a given boolean expression.

But since you feel this one looks better I will change it.


> On March 8, 2019, 12:30 a.m., Gastón Kleiman wrote:
> > src/master/metrics.cpp
> > Lines 637-651 (patched)
> > 
> >
> > Can these counters be decremented? If not, should we add a CHECK to 
> > ensure that `delta` is always positive?

They can be decremented, and we expect delta to be sometimes negative in this 
function. I think the only functional difference between `Counter` and 
`PushGauge` is the existence of `operator=`.


> On March 8, 2019, 12:30 a.m., Gastón Kleiman wrote:
> > src/master/metrics.cpp
> > Lines 684 (patched)
> > 
> >
> > Can counters for terminal states be decremented? I guess not? If I'm 
> > correct, then we could add a `CHECK` here.

They can be decremented, but they should not be unless we have a bug in the 
master code. I'm still a bit hesitant to `CHECK` for this, please see my 
comment to a similar suggestion by Greg above.


- Benno


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70116/#review213541
---


On March 8, 2019, 11:48 p.m., Benno Evers wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70116/
> ---
> 
> (Updated March 8, 2019, 11:48 p.m.)
> 
> 
> Review request for mesos, Gastón Kleiman, Greg Mann, and Joseph Wu.
> 
> 
> Bugs: MESOS-8241
> https://issues.apache.org/jira/browse/MESOS-8241
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> This commit adds additional metrics counting the
> number of operations in each state.
> 
> Unit tests are added in the subsequent commit.
> 
> 
> Diffs
> -
> 
>   src/master/master.cpp dc68fc324de7242737123015fbac19a2129778ce 
>   src/master/metrics.hpp 4495e65b6bb11f7236335a702c4f61e7c3f9b0aa 
>   src/master/metrics.cpp 4dd73fb18a06ce8f75c4c1435dba84ade123bee9 
> 
> 
> Diff: https://reviews.apache.org/r/70116/diff/3/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Benno Evers
> 
>



Re: Review Request 70116: Added metrics for offer operation feedback.

2019-03-08 Thread Benno Evers


> On March 4, 2019, 10:20 p.m., Greg Mann wrote:
> > src/master/master.cpp
> > Lines 2261 (patched)
> > 
> >
> > Why not use `updateOperationMetrics(OPERATION_ERROR, 1);` here?

Why not, indeed.


> On March 4, 2019, 10:20 p.m., Greg Mann wrote:
> > src/master/master.cpp
> > Lines 9223-9224 (patched)
> > 
> >
> > s/agent/master/
> > 
> > Yea I think we can crash here.

I though about this some more, and I'm actually leaning towards the opposite 
conclusion: Left like this, the worst case outcome is that we'll miss a bug 
that result in incorrect counter values being displayed. When we crash, the 
worst case is that a bug in the metrics accounting code might lead to a 
unrecoverable, crash-looping master.

So I think avoiding the latter scenario might be worth departing from the 
precedence here, especially since we already have precedence for masters 
actually failing to start for "non-critical" reasons.

I'm not insisting, though, if you still think having a `CHECK()` would be 
better I'm happy to add it. (the same comment applies almost verbatim to the 
`CHECK()` suggested by Gaston below as well)


- Benno


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70116/#review213404
---


On March 8, 2019, 11:48 p.m., Benno Evers wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70116/
> ---
> 
> (Updated March 8, 2019, 11:48 p.m.)
> 
> 
> Review request for mesos, Gastón Kleiman, Greg Mann, and Joseph Wu.
> 
> 
> Bugs: MESOS-8241
> https://issues.apache.org/jira/browse/MESOS-8241
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> This commit adds additional metrics counting the
> number of operations in each state.
> 
> Unit tests are added in the subsequent commit.
> 
> 
> Diffs
> -
> 
>   src/master/master.cpp dc68fc324de7242737123015fbac19a2129778ce 
>   src/master/metrics.hpp 4495e65b6bb11f7236335a702c4f61e7c3f9b0aa 
>   src/master/metrics.cpp 4dd73fb18a06ce8f75c4c1435dba84ade123bee9 
> 
> 
> Diff: https://reviews.apache.org/r/70116/diff/3/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Benno Evers
> 
>



Re: Review Request 70116: Added metrics for offer operation feedback.

2019-03-11 Thread Benno Evers

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70116/
---

(Updated March 11, 2019, 7:14 p.m.)


Review request for mesos, Gastón Kleiman, Greg Mann, and Joseph Wu.


Changes
---

Correctly adjust counter for gone/unreachable operations; add API for decrement 
followed by increment


Bugs: MESOS-8241
https://issues.apache.org/jira/browse/MESOS-8241


Repository: mesos


Description
---

This commit adds additional metrics counting the
number of operations in each state.

Unit tests are added in the subsequent commit.


Diffs (updated)
-

  src/master/master.cpp dc68fc324de7242737123015fbac19a2129778ce 
  src/master/metrics.hpp 4495e65b6bb11f7236335a702c4f61e7c3f9b0aa 
  src/master/metrics.cpp 4dd73fb18a06ce8f75c4c1435dba84ade123bee9 


Diff: https://reviews.apache.org/r/70116/diff/4/

Changes: https://reviews.apache.org/r/70116/diff/3-4/


Testing
---


Thanks,

Benno Evers



Re: Review Request 70116: Added metrics for offer operation feedback.

2019-03-12 Thread Joseph Wu

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70116/#review213632
---



One discussion point, but otherwise LGTM.


src/master/master.cpp
Lines 11361-11385 (original), 11378-11402 (patched)


I'm curious what would be the proper way to handle operation 
cleanup/removal.

When an operation is transitioned into a terminal state, the master will 
usually `removeOperation(...)` shortly afterwards.  Since we don't decrement 
the metrics in this case, the number of terminal operations will continue to 
grow.  This seems like the proper behavior.

However, in this code, it is possible to remove an agent with non-terminal 
operations.  This means the non-terminal metrics will never be decremented.  So 
you can have a cluster with 0 operations, but the metric for pending operations 
might be non-zero.


- Joseph Wu


On March 11, 2019, 12:14 p.m., Benno Evers wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70116/
> ---
> 
> (Updated March 11, 2019, 12:14 p.m.)
> 
> 
> Review request for mesos, Gastón Kleiman, Greg Mann, and Joseph Wu.
> 
> 
> Bugs: MESOS-8241
> https://issues.apache.org/jira/browse/MESOS-8241
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> This commit adds additional metrics counting the
> number of operations in each state.
> 
> Unit tests are added in the subsequent commit.
> 
> 
> Diffs
> -
> 
>   src/master/master.cpp dc68fc324de7242737123015fbac19a2129778ce 
>   src/master/metrics.hpp 4495e65b6bb11f7236335a702c4f61e7c3f9b0aa 
>   src/master/metrics.cpp 4dd73fb18a06ce8f75c4c1435dba84ade123bee9 
> 
> 
> Diff: https://reviews.apache.org/r/70116/diff/4/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Benno Evers
> 
>



Re: Review Request 70116: Added metrics for offer operation feedback.

2019-03-14 Thread Benno Evers


> On March 12, 2019, 10:56 p.m., Joseph Wu wrote:
> > src/master/master.cpp
> > Lines 11361-11385 (original), 11378-11402 (patched)
> > 
> >
> > I'm curious what would be the proper way to handle operation 
> > cleanup/removal.
> > 
> > When an operation is transitioned into a terminal state, the master 
> > will usually `removeOperation(...)` shortly afterwards.  Since we don't 
> > decrement the metrics in this case, the number of terminal operations will 
> > continue to grow.  This seems like the proper behavior.
> > 
> > However, in this code, it is possible to remove an agent with 
> > non-terminal operations.  This means the non-terminal metrics will never be 
> > decremented.  So you can have a cluster with 0 operations, but the metric 
> > for pending operations might be non-zero.

Hm, good question. I think the only ways a slave gets removed while it still 
has operations pending is by either being marked gone, or becoming unreachable.

In both cases we already transition the counters to the correct 
`OPERATION_GONE`/`OPERATION_UNREACHABLE` states. (although unfortunately in a 
somewhat non-local manner, that's what https://reviews.apache.org/r/70185/ is 
all about)

For gone operations, this should be fine. However, the problem is that when an 
unreachable slave reregisters, we re-add all operations as new operations 
without decrementing the `operations_unreachable` metric, since at the time the 
`UpdateSlaveMessage` arrives the master already forgot that the slave was 
previously unreachable.

So as far as I can see, the metrics for pending operations should always be 
correct, but it is possible to overcount unreachable operations.

It's not clear if this can be fixed without quite far-reaching refactoring in 
the master. So I think the best course might be to either document this 
behaviour, or remove the `operations_unreachable` metric altogether.

What do you think?


- Benno


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70116/#review213632
---


On March 11, 2019, 7:14 p.m., Benno Evers wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70116/
> ---
> 
> (Updated March 11, 2019, 7:14 p.m.)
> 
> 
> Review request for mesos, Gastón Kleiman, Greg Mann, and Joseph Wu.
> 
> 
> Bugs: MESOS-8241
> https://issues.apache.org/jira/browse/MESOS-8241
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> This commit adds additional metrics counting the
> number of operations in each state.
> 
> Unit tests are added in the subsequent commit.
> 
> 
> Diffs
> -
> 
>   src/master/master.cpp dc68fc324de7242737123015fbac19a2129778ce 
>   src/master/metrics.hpp 4495e65b6bb11f7236335a702c4f61e7c3f9b0aa 
>   src/master/metrics.cpp 4dd73fb18a06ce8f75c4c1435dba84ade123bee9 
> 
> 
> Diff: https://reviews.apache.org/r/70116/diff/4/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Benno Evers
> 
>



Re: Review Request 70116: Added metrics for offer operation feedback.

2019-03-14 Thread Joseph Wu


> On March 12, 2019, 3:56 p.m., Joseph Wu wrote:
> > src/master/master.cpp
> > Lines 11361-11385 (original), 11378-11402 (patched)
> > 
> >
> > I'm curious what would be the proper way to handle operation 
> > cleanup/removal.
> > 
> > When an operation is transitioned into a terminal state, the master 
> > will usually `removeOperation(...)` shortly afterwards.  Since we don't 
> > decrement the metrics in this case, the number of terminal operations will 
> > continue to grow.  This seems like the proper behavior.
> > 
> > However, in this code, it is possible to remove an agent with 
> > non-terminal operations.  This means the non-terminal metrics will never be 
> > decremented.  So you can have a cluster with 0 operations, but the metric 
> > for pending operations might be non-zero.
> 
> Benno Evers wrote:
> Hm, good question. I think the only ways a slave gets removed while it 
> still has operations pending is by either being marked gone, or becoming 
> unreachable.
> 
> In both cases we already transition the counters to the correct 
> `OPERATION_GONE`/`OPERATION_UNREACHABLE` states. (although unfortunately in a 
> somewhat non-local manner, that's what https://reviews.apache.org/r/70185/ is 
> all about)
> 
> For gone operations, this should be fine. However, the problem is that 
> when an unreachable slave reregisters, we re-add all operations as new 
> operations without decrementing the `operations_unreachable` metric, since at 
> the time the `UpdateSlaveMessage` arrives the master already forgot that the 
> slave was previously unreachable.
> 
> So as far as I can see, the metrics for pending operations should always 
> be correct, but it is possible to overcount unreachable operations.
> 
> It's not clear if this can be fixed without quite far-reaching 
> refactoring in the master. So I think the best course might be to either 
> document this behaviour, or remove the `operations_unreachable` metric 
> altogether.
> 
> What do you think?

It may be worthwhile to add a CHECK to make sure we only ever remove terminal 
(including GONE) or UNREACHABLE operations.

---

Per endless (possible) recounting of unreachable operations, I'd lean towards 
overcounting & documenting how/when this is possible.  Probably inside the 
operator document that lists/describes all the metrics.


- Joseph


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70116/#review213632
---


On March 11, 2019, 12:14 p.m., Benno Evers wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70116/
> ---
> 
> (Updated March 11, 2019, 12:14 p.m.)
> 
> 
> Review request for mesos, Gastón Kleiman, Greg Mann, and Joseph Wu.
> 
> 
> Bugs: MESOS-8241
> https://issues.apache.org/jira/browse/MESOS-8241
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> This commit adds additional metrics counting the
> number of operations in each state.
> 
> Unit tests are added in the subsequent commit.
> 
> 
> Diffs
> -
> 
>   src/master/master.cpp dc68fc324de7242737123015fbac19a2129778ce 
>   src/master/metrics.hpp 4495e65b6bb11f7236335a702c4f61e7c3f9b0aa 
>   src/master/metrics.cpp 4dd73fb18a06ce8f75c4c1435dba84ade123bee9 
> 
> 
> Diff: https://reviews.apache.org/r/70116/diff/4/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Benno Evers
> 
>



Re: Review Request 70116: Added metrics for offer operation feedback.

2019-03-14 Thread Greg Mann


> On March 12, 2019, 10:56 p.m., Joseph Wu wrote:
> > src/master/master.cpp
> > Lines 11361-11385 (original), 11378-11402 (patched)
> > 
> >
> > I'm curious what would be the proper way to handle operation 
> > cleanup/removal.
> > 
> > When an operation is transitioned into a terminal state, the master 
> > will usually `removeOperation(...)` shortly afterwards.  Since we don't 
> > decrement the metrics in this case, the number of terminal operations will 
> > continue to grow.  This seems like the proper behavior.
> > 
> > However, in this code, it is possible to remove an agent with 
> > non-terminal operations.  This means the non-terminal metrics will never be 
> > decremented.  So you can have a cluster with 0 operations, but the metric 
> > for pending operations might be non-zero.
> 
> Benno Evers wrote:
> Hm, good question. I think the only ways a slave gets removed while it 
> still has operations pending is by either being marked gone, or becoming 
> unreachable.
> 
> In both cases we already transition the counters to the correct 
> `OPERATION_GONE`/`OPERATION_UNREACHABLE` states. (although unfortunately in a 
> somewhat non-local manner, that's what https://reviews.apache.org/r/70185/ is 
> all about)
> 
> For gone operations, this should be fine. However, the problem is that 
> when an unreachable slave reregisters, we re-add all operations as new 
> operations without decrementing the `operations_unreachable` metric, since at 
> the time the `UpdateSlaveMessage` arrives the master already forgot that the 
> slave was previously unreachable.
> 
> So as far as I can see, the metrics for pending operations should always 
> be correct, but it is possible to overcount unreachable operations.
> 
> It's not clear if this can be fixed without quite far-reaching 
> refactoring in the master. So I think the best course might be to either 
> document this behaviour, or remove the `operations_unreachable` metric 
> altogether.
> 
> What do you think?
> 
> Joseph Wu wrote:
> It may be worthwhile to add a CHECK to make sure we only ever remove 
> terminal (including GONE) or UNREACHABLE operations.
> 
> ---
> 
> Per endless (possible) recounting of unreachable operations, I'd lean 
> towards overcounting & documenting how/when this is possible.  Probably 
> inside the operator document that lists/describes all the metrics.

"the only ways a slave gets removed while it still has operations pending is by 
either being marked gone, or becoming unreachable"

This isn't true - for example, an agent could send an `UnregisterSlaveMessage` 
when it has pending operations and it will be removed via 
`Master::_removeSlave()`. Unfortunately, agent removal in that method currently 
does not correspond to any well-defined agent states; see 
https://issues.apache.org/jira/browse/MESOS-9556.

I think that we need to decrement non-terminal operation states when operations 
are removed in `_removeSlave()`.


- Greg


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70116/#review213632
---


On March 11, 2019, 7:14 p.m., Benno Evers wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70116/
> ---
> 
> (Updated March 11, 2019, 7:14 p.m.)
> 
> 
> Review request for mesos, Gastón Kleiman, Greg Mann, and Joseph Wu.
> 
> 
> Bugs: MESOS-8241
> https://issues.apache.org/jira/browse/MESOS-8241
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> This commit adds additional metrics counting the
> number of operations in each state.
> 
> Unit tests are added in the subsequent commit.
> 
> 
> Diffs
> -
> 
>   src/master/master.cpp dc68fc324de7242737123015fbac19a2129778ce 
>   src/master/metrics.hpp 4495e65b6bb11f7236335a702c4f61e7c3f9b0aa 
>   src/master/metrics.cpp 4dd73fb18a06ce8f75c4c1435dba84ade123bee9 
> 
> 
> Diff: https://reviews.apache.org/r/70116/diff/4/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Benno Evers
> 
>



Re: Review Request 70116: Added metrics for offer operation feedback.

2019-03-14 Thread Greg Mann

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70116/#review213734
---




src/master/metrics.hpp
Lines 82-84 (patched)


Hmm am I missing something, or are there no per-framework metrics added? 
I'm fine with not adding them, but this comment seems to be incorrect?


- Greg Mann


On March 11, 2019, 7:14 p.m., Benno Evers wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70116/
> ---
> 
> (Updated March 11, 2019, 7:14 p.m.)
> 
> 
> Review request for mesos, Gastón Kleiman, Greg Mann, and Joseph Wu.
> 
> 
> Bugs: MESOS-8241
> https://issues.apache.org/jira/browse/MESOS-8241
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> This commit adds additional metrics counting the
> number of operations in each state.
> 
> Unit tests are added in the subsequent commit.
> 
> 
> Diffs
> -
> 
>   src/master/master.cpp dc68fc324de7242737123015fbac19a2129778ce 
>   src/master/metrics.hpp 4495e65b6bb11f7236335a702c4f61e7c3f9b0aa 
>   src/master/metrics.cpp 4dd73fb18a06ce8f75c4c1435dba84ade123bee9 
> 
> 
> Diff: https://reviews.apache.org/r/70116/diff/4/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Benno Evers
> 
>



Re: Review Request 70116: Added metrics for offer operation feedback.

2019-03-26 Thread Benno Evers

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70116/
---

(Updated March 26, 2019, 5:56 p.m.)


Review request for mesos, Gastón Kleiman, Greg Mann, and Joseph Wu.


Changes
---

Changed accounting logic for gone/unreachable agents; documented over-counting 
behaviour


Bugs: MESOS-8241
https://issues.apache.org/jira/browse/MESOS-8241


Repository: mesos


Description
---

This commit adds additional metrics counting the
number of operations in each state.

Unit tests are added in the subsequent commit.


Diffs (updated)
-

  docs/monitoring.md 54b872f579dbc68ca5f67f4cc1ba34065a09aee2 
  src/master/master.cpp 9c4a9e83da94535873d72c902835f229c4f96320 
  src/master/metrics.hpp 4495e65b6bb11f7236335a702c4f61e7c3f9b0aa 
  src/master/metrics.cpp 4dd73fb18a06ce8f75c4c1435dba84ade123bee9 


Diff: https://reviews.apache.org/r/70116/diff/5/

Changes: https://reviews.apache.org/r/70116/diff/4-5/


Testing
---


Thanks,

Benno Evers



Re: Review Request 70116: Added metrics for offer operation feedback.

2019-03-26 Thread Benno Evers


> On March 12, 2019, 10:56 p.m., Joseph Wu wrote:
> > src/master/master.cpp
> > Lines 11361-11385 (original), 11378-11402 (patched)
> > 
> >
> > I'm curious what would be the proper way to handle operation 
> > cleanup/removal.
> > 
> > When an operation is transitioned into a terminal state, the master 
> > will usually `removeOperation(...)` shortly afterwards.  Since we don't 
> > decrement the metrics in this case, the number of terminal operations will 
> > continue to grow.  This seems like the proper behavior.
> > 
> > However, in this code, it is possible to remove an agent with 
> > non-terminal operations.  This means the non-terminal metrics will never be 
> > decremented.  So you can have a cluster with 0 operations, but the metric 
> > for pending operations might be non-zero.
> 
> Benno Evers wrote:
> Hm, good question. I think the only ways a slave gets removed while it 
> still has operations pending is by either being marked gone, or becoming 
> unreachable.
> 
> In both cases we already transition the counters to the correct 
> `OPERATION_GONE`/`OPERATION_UNREACHABLE` states. (although unfortunately in a 
> somewhat non-local manner, that's what https://reviews.apache.org/r/70185/ is 
> all about)
> 
> For gone operations, this should be fine. However, the problem is that 
> when an unreachable slave reregisters, we re-add all operations as new 
> operations without decrementing the `operations_unreachable` metric, since at 
> the time the `UpdateSlaveMessage` arrives the master already forgot that the 
> slave was previously unreachable.
> 
> So as far as I can see, the metrics for pending operations should always 
> be correct, but it is possible to overcount unreachable operations.
> 
> It's not clear if this can be fixed without quite far-reaching 
> refactoring in the master. So I think the best course might be to either 
> document this behaviour, or remove the `operations_unreachable` metric 
> altogether.
> 
> What do you think?
> 
> Joseph Wu wrote:
> It may be worthwhile to add a CHECK to make sure we only ever remove 
> terminal (including GONE) or UNREACHABLE operations.
> 
> ---
> 
> Per endless (possible) recounting of unreachable operations, I'd lean 
> towards overcounting & documenting how/when this is possible.  Probably 
> inside the operator document that lists/describes all the metrics.
> 
> Greg Mann wrote:
> "the only ways a slave gets removed while it still has operations pending 
> is by either being marked gone, or becoming unreachable"
> 
> This isn't true - for example, an agent could send an 
> `UnregisterSlaveMessage` when it has pending operations and it will be 
> removed via `Master::_removeSlave()`. Unfortunately, agent removal in that 
> method currently does not correspond to any well-defined agent states; see 
> https://issues.apache.org/jira/browse/MESOS-9556.
> 
> I think that we need to decrement non-terminal operation states when 
> operations are removed in `_removeSlave()`.

I'm now decrementing non-terminal states (and completely removed 
https://reviews.apache.org/r/70185/ while re-working the accounting).


- Benno


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70116/#review213632
---


On March 26, 2019, 5:56 p.m., Benno Evers wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70116/
> ---
> 
> (Updated March 26, 2019, 5:56 p.m.)
> 
> 
> Review request for mesos, Gastón Kleiman, Greg Mann, and Joseph Wu.
> 
> 
> Bugs: MESOS-8241
> https://issues.apache.org/jira/browse/MESOS-8241
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> This commit adds additional metrics counting the
> number of operations in each state.
> 
> Unit tests are added in the subsequent commit.
> 
> 
> Diffs
> -
> 
>   docs/monitoring.md 54b872f579dbc68ca5f67f4cc1ba34065a09aee2 
>   src/master/master.cpp 9c4a9e83da94535873d72c902835f229c4f96320 
>   src/master/metrics.hpp 4495e65b6bb11f7236335a702c4f61e7c3f9b0aa 
>   src/master/metrics.cpp 4dd73fb18a06ce8f75c4c1435dba84ade123bee9 
> 
> 
> Diff: https://reviews.apache.org/r/70116/diff/5/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Benno Evers
> 
>



Re: Review Request 70116: Added metrics for offer operation feedback.

2019-03-26 Thread Joseph Wu

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70116/#review214068
---


Fix it, then Ship it!




LGTM.

Two small-ish comments to tweak before commit.


docs/monitoring.md
Lines 684-685 (patched)


You can mention that terminal states, like FINISHED, FAILED, ERROR, 
DROPPED, or GONE are counters.  Non-terminal states are gauges.

Here and a couple lines below too.

The distinction is (sometimes) important when consuming these metrics 
because Counters are usually consumed as differences over time (e.g. how many 
more finished operations are there since the last minute?) whereas gauges are 
consumed as-is.



src/master/metrics.hpp
Lines 82-84 (patched)


The framework-wide metrics are not implemented in this review, so you could 
remove mention of them for now.


- Joseph Wu


On March 26, 2019, 10:56 a.m., Benno Evers wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70116/
> ---
> 
> (Updated March 26, 2019, 10:56 a.m.)
> 
> 
> Review request for mesos, Gastón Kleiman, Greg Mann, and Joseph Wu.
> 
> 
> Bugs: MESOS-8241
> https://issues.apache.org/jira/browse/MESOS-8241
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> This commit adds additional metrics counting the
> number of operations in each state.
> 
> Unit tests are added in the subsequent commit.
> 
> 
> Diffs
> -
> 
>   docs/monitoring.md 54b872f579dbc68ca5f67f4cc1ba34065a09aee2 
>   src/master/master.cpp 9c4a9e83da94535873d72c902835f229c4f96320 
>   src/master/metrics.hpp 4495e65b6bb11f7236335a702c4f61e7c3f9b0aa 
>   src/master/metrics.cpp 4dd73fb18a06ce8f75c4c1435dba84ade123bee9 
> 
> 
> Diff: https://reviews.apache.org/r/70116/diff/5/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Benno Evers
> 
>



Re: Review Request 70116: Added metrics for offer operation feedback.

2019-03-27 Thread Benno Evers


> On March 15, 2019, 12:03 a.m., Greg Mann wrote:
> > src/master/metrics.hpp
> > Lines 82-84 (patched)
> > 
> >
> > Hmm am I missing something, or are there no per-framework metrics 
> > added? I'm fine with not adding them, but this comment seems to be 
> > incorrect?

They're not added in this review because they already exist :) See line 816-836 
inside `FrameworkMetrics::FrameworkMetrics()`.


- Benno


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70116/#review213734
---


On March 26, 2019, 5:56 p.m., Benno Evers wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70116/
> ---
> 
> (Updated March 26, 2019, 5:56 p.m.)
> 
> 
> Review request for mesos, Gastón Kleiman, Greg Mann, and Joseph Wu.
> 
> 
> Bugs: MESOS-8241
> https://issues.apache.org/jira/browse/MESOS-8241
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> This commit adds additional metrics counting the
> number of operations in each state.
> 
> Unit tests are added in the subsequent commit.
> 
> 
> Diffs
> -
> 
>   docs/monitoring.md 54b872f579dbc68ca5f67f4cc1ba34065a09aee2 
>   src/master/master.cpp 9c4a9e83da94535873d72c902835f229c4f96320 
>   src/master/metrics.hpp 4495e65b6bb11f7236335a702c4f61e7c3f9b0aa 
>   src/master/metrics.cpp 4dd73fb18a06ce8f75c4c1435dba84ade123bee9 
> 
> 
> Diff: https://reviews.apache.org/r/70116/diff/5/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Benno Evers
> 
>



Re: Review Request 70116: Added metrics for offer operation feedback.

2019-03-27 Thread Greg Mann


> On March 15, 2019, 12:03 a.m., Greg Mann wrote:
> > src/master/metrics.hpp
> > Lines 82-84 (patched)
> > 
> >
> > Hmm am I missing something, or are there no per-framework metrics 
> > added? I'm fine with not adding them, but this comment seems to be 
> > incorrect?
> 
> Benno Evers wrote:
> They're not added in this review because they already exist :) See line 
> 816-836 inside `FrameworkMetrics::FrameworkMetrics()`.

Those metrics are slightly different - they provide counters for the offer 
operations submitted by a particular framework, but don't track the operation 
states. It might be worth mentioning in this comment that offer operation 
counts are tracked on a per-framework basis.


- Greg


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70116/#review213734
---


On March 26, 2019, 5:56 p.m., Benno Evers wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70116/
> ---
> 
> (Updated March 26, 2019, 5:56 p.m.)
> 
> 
> Review request for mesos, Gastón Kleiman, Greg Mann, and Joseph Wu.
> 
> 
> Bugs: MESOS-8241
> https://issues.apache.org/jira/browse/MESOS-8241
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> This commit adds additional metrics counting the
> number of operations in each state.
> 
> Unit tests are added in the subsequent commit.
> 
> 
> Diffs
> -
> 
>   docs/monitoring.md 54b872f579dbc68ca5f67f4cc1ba34065a09aee2 
>   src/master/master.cpp 9c4a9e83da94535873d72c902835f229c4f96320 
>   src/master/metrics.hpp 4495e65b6bb11f7236335a702c4f61e7c3f9b0aa 
>   src/master/metrics.cpp 4dd73fb18a06ce8f75c4c1435dba84ade123bee9 
> 
> 
> Diff: https://reviews.apache.org/r/70116/diff/5/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Benno Evers
> 
>



Re: Review Request 70116: Added metrics for offer operation feedback.

2019-03-27 Thread Benno Evers

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70116/
---

(Updated March 27, 2019, 4:39 p.m.)


Review request for mesos, Gastón Kleiman, Greg Mann, and Joseph Wu.


Changes
---

Clarified comment; clearly distinguished between gauges and counters in 
documentation.


Bugs: MESOS-8241
https://issues.apache.org/jira/browse/MESOS-8241


Repository: mesos


Description
---

This commit adds additional metrics counting the
number of operations in each state.

Unit tests are added in the subsequent commit.


Diffs (updated)
-

  docs/monitoring.md 54b872f579dbc68ca5f67f4cc1ba34065a09aee2 
  src/master/master.cpp 9c4a9e83da94535873d72c902835f229c4f96320 
  src/master/metrics.hpp 4495e65b6bb11f7236335a702c4f61e7c3f9b0aa 
  src/master/metrics.cpp 4dd73fb18a06ce8f75c4c1435dba84ade123bee9 


Diff: https://reviews.apache.org/r/70116/diff/6/

Changes: https://reviews.apache.org/r/70116/diff/5-6/


Testing
---


Thanks,

Benno Evers



Re: Review Request 70116: Added metrics for offer operation feedback.

2019-03-27 Thread Benno Evers


> On March 15, 2019, 12:03 a.m., Greg Mann wrote:
> > src/master/metrics.hpp
> > Lines 82-84 (patched)
> > 
> >
> > Hmm am I missing something, or are there no per-framework metrics 
> > added? I'm fine with not adding them, but this comment seems to be 
> > incorrect?
> 
> Benno Evers wrote:
> They're not added in this review because they already exist :) See line 
> 816-836 inside `FrameworkMetrics::FrameworkMetrics()`.
> 
> Greg Mann wrote:
> Those metrics are slightly different - they provide counters for the 
> offer operations submitted by a particular framework, but don't track the 
> operation states. It might be worth mentioning in this comment that offer 
> operation counts are tracked on a per-framework basis.

Oh, right. Updated the comment.


- Benno


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70116/#review213734
---


On March 27, 2019, 4:39 p.m., Benno Evers wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70116/
> ---
> 
> (Updated March 27, 2019, 4:39 p.m.)
> 
> 
> Review request for mesos, Gastón Kleiman, Greg Mann, and Joseph Wu.
> 
> 
> Bugs: MESOS-8241
> https://issues.apache.org/jira/browse/MESOS-8241
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> This commit adds additional metrics counting the
> number of operations in each state.
> 
> Unit tests are added in the subsequent commit.
> 
> 
> Diffs
> -
> 
>   docs/monitoring.md 54b872f579dbc68ca5f67f4cc1ba34065a09aee2 
>   src/master/master.cpp 9c4a9e83da94535873d72c902835f229c4f96320 
>   src/master/metrics.hpp 4495e65b6bb11f7236335a702c4f61e7c3f9b0aa 
>   src/master/metrics.cpp 4dd73fb18a06ce8f75c4c1435dba84ade123bee9 
> 
> 
> Diff: https://reviews.apache.org/r/70116/diff/6/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Benno Evers
> 
>