Change in osmo-bsc[master]: add time_cc API: cumlative counter for time, reported as rate_ctr

neels Thu, 04 Nov 2021 04:28:46 -0700

neels has posted comments on this change. ( 
https://gerrit.osmocom.org/c/osmo-bsc/+/25973 )


Change subject: add time_cc API: cumlative counter for time, reported as 
rate_ctr
......................................................................


Patch Set 1:

Sorry for writing so much, but it seems necessary...


First off, a still-open point, orthogonal to the rate_ctr vs stat_item decision:

Should I remove the configurability features and reduce to one fixed counting 
behavior? e.g. fix to a granularity of seconds, and to the round() rounding 
scheme? (The customer expressed that either round() or ceil() would be 
suitable, and floor() is not desirable) ... I think it makes more sense to keep 
that configurability stuff, now that the code already works correctly. It is a 
bit of feature creep; the only two arguments to keep it are a handwavy "maybe 
useful at some point in the future"/"maybe some user likes it idk", and a 
concrete "it would require investing even more effort to remove the features".
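
To make the rounding choice concrete, here is a minimal sketch of the three schemes applied to an accumulated time. This is not the time_cc code from the patch; the function name `count_units` and its parameters are made up for illustration.

```python
def count_units(elapsed_usec, granularity_usec=1_000_000, scheme="round"):
    """Convert accumulated microseconds to whole granularity units.

    Hypothetical helper, integer arithmetic only.
    """
    if elapsed_usec < 0 or granularity_usec <= 0:
        raise ValueError("elapsed must be >= 0 and granularity > 0")
    if scheme == "floor":
        # Count a unit only once it has fully elapsed.
        return elapsed_usec // granularity_usec
    if scheme == "ceil":
        # Count a unit as soon as any part of it has elapsed.
        return -(-elapsed_usec // granularity_usec)
    if scheme == "round":
        # Count a unit once half of it has elapsed (round half up).
        return (elapsed_usec + granularity_usec // 2) // granularity_usec
    raise ValueError("unknown scheme: %r" % (scheme,))


for scheme in ("floor", "round", "ceil"):
    print(scheme, count_units(1_400_000, scheme=scheme))
```

With 1.4 seconds accumulated, floor() and round() count 1 second and ceil() counts 2. With only 0.4 seconds accumulated, ceil() is the only scheme that reports anything at all.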

> Well the question would then be: Can one still use the same external tools 
> (grafana, elastic search, etc) with rate_ctr? I'm not sure how are those 
> exported over statsd.

Yes, of course! The customer expressed the preferred way of reporting would be 
a rate counter.
Let me explain why, hopefully making more sense this time:

In the stats exporting, the main difference is that

- a stat_item is exported as the current value in each report. stat_item makes 
sense for values that rise and (possibly) fall, and where you want to read the 
actual current value, like number of active cells or say CPU load in percent, 
or uptime.

- a rate_ctr is exported as nr of increments since the last stat report. Makes 
sense for counting spike events over time. It is suitable where one is 
interested in increments of a value per time, to see how busy a constantly 
rising value currently is, rather than the absolute value itself.

A stat_item used for this cumulative time counter looks like a slope ramping 
up, like a staircase: staying a horizontal line at times of no chan exhaustion, 
and rising at times of chan exhaustion; taken to relative infinity, the value 
would at some point wrap the integer range. Imagine osmo-bsc ran for weeks, 
then the value could be a flat line at, say, 123000, and increment to 123001 
when chan exhaustion occurs. When displaying such a graph with the entire y 
axis range shown, the increment is hardly visible. You need to zoom in on the y axis 
range say 123000 to 123100 to even be able to see that channels are currently 
exhausted. Or you need to employ math to graph the gradient of the line 
instead, so that you get 0 for non-exhausted times and spikes for exhausted 
times. We're interested in the gradient's spikes, i.e. in the "1", and not in 
the "123000" baseline.

A rate_ctr looks like a city skyline, flat line at value 0 for no exhaustion, 
with spikes at times where chan exhaustion occurs, going back to zero when 
exhaustion is over.
IOW it already *is* the gradient of the chan exhaustion time counter, which is 
exactly the interesting information: the number of exhausted seconds since the 
last stat report.
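
The staircase vs. skyline difference can be sketched with a few made-up samples. This only models the two reporting semantics described above; it is not the libosmocore stats reporter, and the function names and sample values are assumptions for illustration.

```python
def report_as_stat_item(samples):
    """Each report carries the current absolute value (the staircase)."""
    return list(samples)


def report_as_rate_ctr(samples, initial):
    """Each report carries the increment since the previous report
    (the skyline), i.e. the discrete gradient of the staircase."""
    reports = []
    prev = initial
    for value in samples:
        reports.append(value - prev)
        prev = value
    return reports


# Cumulative exhausted seconds as seen at five consecutive report times,
# after weeks of uptime (made-up numbers).
cumulative = [123000, 123000, 123001, 123004, 123004]

print(report_as_stat_item(cumulative))
# [123000, 123000, 123001, 123004, 123004]
print(report_as_rate_ctr(cumulative, initial=123000))
# [0, 0, 1, 3, 0]
```

The second list is what a graph needs to show: zero while there is no exhaustion, spikes while there is, and no 123000 baseline to zoom past.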

A stat_item *would* make sense if it reported, say, the current percent of time 
where channels are exhausted. If you see the stat showing 100, you know the 
channels are currently all exhausted for all of the time. So something where 
the current value is the interesting metric. This however introduces complex 
design decisions: over what amount of time do we calculate the percentage? 
should that be configurable? when/how do we degrade the percentage when 
exhaustion is over?

A rate_ctr *is* the simplest, least convoluted and true way of passing the 
actual useful information to an external stats tool, "and letting grafana 
figure it out" if the user wants some exhaustion percentage graph; and letting 
the user's infrastructure figure out whether to evaluate exhaustion over 
5/15/30/60 minutes as the spec suggests, without actually introducing these 
choices to the osmo-bsc code base.
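
As a rough illustration of "letting grafana figure it out", this is the kind of arithmetic an external tool could do on the reported increments. It is a sketch under assumptions: the function name, the fixed 10 second report interval and the sample data are all made up, and a real dashboard would express this in its own query language.

```python
def exhaustion_percent(deltas, report_interval_s, window_s):
    """Percent of the most recent window during which channels were
    exhausted, given per-report increments in seconds."""
    # Number of reports that make up the window (at least one).
    n = max(1, window_s // report_interval_s)
    recent = deltas[-n:]
    covered_s = len(recent) * report_interval_s
    if covered_s == 0:
        return 0.0
    return 100.0 * sum(recent) / covered_s


# 15 reports without exhaustion, then 15 reports fully exhausted,
# one report every 10 seconds (made-up data).
deltas = [0] * 15 + [10] * 15

print(exhaustion_percent(deltas, report_interval_s=10, window_s=300))  # 50.0
print(exhaustion_percent(deltas, report_interval_s=10, window_s=100))  # 100.0
```

The window length is purely a choice of the evaluating side, so 5/15/30/60 minute views can all be derived from the same exported counter.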

> In general I think the main difference on how we see it, is that your focus 
> is to have it look nice when using VTY

Not at all. Looking useful on the VTY is just a side argument.
It is a compelling argument nevertheless, the main point being to visualize to 
you that a rate counter is the proper design choice. Read: "even the VTY output 
becomes more useful".
If I want to quickly check channel exhaustion without grafana, a stat item is 
very much harder to interpret than a rate counter: you need to repeatedly watch 
the value change. A rate counter gives you instant information about the 
gradient, as explained earlier.

I have asked a number of times, but you have still not explained to me how a 
stat_item value should be designed in a useful and simple way. As I'm pointing 
out, a forever rising value is not very useful.

> while my point is that it should in first place be usable for external tools.

It *is* more useful to external tools as a rate counter.

> Moreover, I have the feeling you are just abusing the rate_ctr infrastructure 
> with some logic
> just to get some output in VTY which you can understand (rate_ctr is aimed at 
> tick events, not counting time).

Please understand that this is exactly what a rate counter is designed for.
We are interested in the current gradient, not the current value.
The "tick event" here being "channel exhaustion occurred for one entire second". 
Do not be confused by the fact that we are counting time over time. Time is 
involved twice in this metric! The metric is: "exhausted time, over time". Not 
simply "time since X", like e.g. uptime would be. That is an important 
difference that needs to be acknowledged.
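
To illustrate "exhausted time, over time": a sketch of a counter that accumulates time only while a flag is set and hands out whole-second ticks at each report. The class and method names are hypothetical and this is not the time_cc implementation; timestamps are plain integer milliseconds passed in by the caller, and round() is assumed as the rounding scheme.

```python
class ExhaustedTimeCounter:
    """Hypothetical sketch: counts seconds during which a flag was set."""

    def __init__(self):
        self.flag = False
        self.last_ms = 0      # time of the last state evaluation
        self.accum_ms = 0     # total time spent with the flag set
        self.reported_s = 0   # ticks already handed out

    def _advance(self, now_ms):
        # Only time spent with the flag set is accumulated.
        if self.flag:
            self.accum_ms += now_ms - self.last_ms
        self.last_ms = now_ms

    def set_flag(self, now_ms, flag):
        self._advance(now_ms)
        self.flag = flag

    def ticks_since_last_report(self, now_ms):
        """Return the number of new whole-second ticks (round half up)."""
        self._advance(now_ms)
        total_s = (self.accum_ms + 500) // 1000
        delta = total_s - self.reported_s
        self.reported_s = total_s
        return delta


ctr = ExhaustedTimeCounter()
print(ctr.ticks_since_last_report(10_000))  # 0: no exhaustion so far
ctr.set_flag(12_000, True)                  # exhaustion starts
ctr.set_flag(15_500, False)                 # ... and ends 3.5 s later
print(ctr.ticks_since_last_report(20_000))  # 4
```

An uptime-like value would be "now minus start" and only ever grow with the clock. Here the clock advances all the time, but ticks are produced only for the flagged part of it.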

I started out a long time ago thinking that a stat item would be best, and the 
math about it as well as a customer discussion convinced me otherwise. I would 
appreciate it if you could acknowledge these arguments, and, if the argument is 
flawed, actually suggest a detailed way of reporting as stat item in a useful 
way. What I am reading so far is merely generally brushing over my argument, 
and I read a dismissive tone, hope I'm wrong there. I would appreciate it if we 
could keep this technical and detailed.

I'm happy to change this and make it more useful, if there is a compelling 
argument to do so. Haven't seen one yet. What part am I not getting?


--
To view, visit https://gerrit.osmocom.org/c/osmo-bsc/+/25973

Gerrit-Project: osmo-bsc
Gerrit-Branch: master
Gerrit-Change-Id: Icdd36f27cb54b2e1b940c9e6404ba9dd3692a310
Gerrit-Change-Number: 25973
Gerrit-PatchSet: 1
Gerrit-Owner: neels <nhofm...@sysmocom.de>
Gerrit-Reviewer: Jenkins Builder
Gerrit-Reviewer: laforge <lafo...@osmocom.org>
Gerrit-CC: pespin <pes...@sysmocom.de>
Gerrit-Comment-Date: Thu, 04 Nov 2021 11:27:46 +0000
Gerrit-HasComments: No
Gerrit-Has-Labels: No
Gerrit-MessageType: comment
