Change in osmo-bsc[master]: add time_cc API: cumlative counter for time, reported as rate_ctr

neels Mon, 01 Nov 2021 05:32:29 -0700

neels has posted comments on this change. ( 
https://gerrit.osmocom.org/c/osmo-bsc/+/25973 )


Change subject: add time_cc API: cumlative counter for time, reported as 
rate_ctr
......................................................................


Patch Set 1:

> > Well maybe then the question is why are you using rate_ctr and not 
> > stat_items here, it really confuses me.
> 
> At least at first sight, I agree. The resulting metric computed by this new 
> code base renders a single value which matches a stat_item better than a 
> rate_ctr. Any particular argument to go for rate_ctr, Neels?

The decision to use a rate_ctr is based on discussion with the customer,
and it also makes a lot of sense in practice.

Logically, a stat_item is not actually a good choice. We can of course report 
the total time of all-allocated, and thus get for example the complete amount 
of seconds that all SDCCH channels were allocated since osmo-bsc started. But 
it's not interesting to get an arbitrary amount of time of all-allocated since 
forever; instead, it is important to qualify in which period of elapsed time 
this amount was accumulated. A rate_ctr is well suited since it also provides 
the "per time" aspect. All rate_ctr stats reflect a number-of-events-per-time. 
For all_allocated, it is the number of seconds that all channels were allocated 
within a given amount of time. For example, if the VTY shows all_allocated:sdcch 
of 10/min, it means all channels were allocated for 10 seconds of the last 
minute. For a stat item, getting this "per time" part is a complex problem.
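
To illustrate the "per time" reading, here is a minimal sketch in Python. It
is only a model of how a value like "10/min" comes about, not the libosmocore
rate_ctr implementation; the function name and the input format are
hypothetical.

```python
# Illustrative model only; this is not the libosmocore rate_ctr code.
def per_minute_counts(all_allocated_by_second):
    """Given one boolean per elapsed second (True = all channels allocated),
    return the number of all-allocated seconds in each 60-second chunk."""
    counts = []
    for start in range(0, len(all_allocated_by_second), 60):
        minute = all_allocated_by_second[start:start + 60]
        counts.append(sum(1 for flag in minute if flag))
    return counts

# One minute in which all SDCCH were allocated during seconds 20..29.
timeline = [20 <= s < 30 for s in range(60)]
print(per_minute_counts(timeline))  # [10], i.e. "10/min"
```

The counter itself only ever increments once per all-allocated second; the
"per minute" aspect comes from looking at how much it grew within the minute.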

When reporting as a stat_item, we open a new dimension of options:
The spec defines different reporting periods, suggesting at least the options 
of 5 minutes, 15 minutes, 30 minutes, 60 minutes. We could periodically clear 
the stat item based on user config.
The customer requesting this feature already implements these reporting periods 
outside of osmo-bsc, based on stats received from osmo-bsc. So instead of 
introducing these reporting periods to osmo-bsc and choosing some method of 
adding a per-time aspect to stat_item, it is best to just trigger a count for 
each second of all-allocated-channels.

> simply a counter value changing over time.

When I started on it, I thought it would take half an hour.
When thinking about the exact implementation, the options and complexity 
unfolded...
This patch is the result that ensures correct counts with minimal complexity.

> So I'm not really following on why you need all this infrastructure sorry,

I would appreciate it if your criticism could be qualified as well as constructive.
What do you mean by "all this"? What do you suggest instead?

> this all looks super complicated for no reason (I'm able to see). Maybe 
> someone else can also shed some light on it.

It's straightforward:

The aim is to report for how many seconds per given time period all channels of 
a type were allocated.
To achieve that, we need to count free/allocated lchans.
When a count reveals that all chans of type X are allocated, we set a flag to 
true.
Based on that flag, a time counter increments. The flag-per-time counter is 
a generalized API (time_cc).
In order to periodically report that time counter to stats, an osmo_timer is 
involved.
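
The steps above can be sketched roughly as follows. This is a simplified
Python model under my own assumptions, not the time_cc code from the patch:
the class, its fields, and the explicit `now` timestamps (standing in for the
osmo_timer) are all hypothetical.

```python
# Simplified model; names and structure are hypothetical, not the real time_cc.
def all_allocated(lchans):
    """lchans: list of booleans, True = lchan is allocated."""
    return len(lchans) > 0 and all(lchans)

class TimeCounterSketch:
    def __init__(self):
        self.flag = False        # True while all channels are allocated
        self.last_update = 0.0   # time of the last accounting, in seconds
        self.accumulated = 0.0   # total time during which the flag was True
        self.reported = 0        # whole seconds already added to the counter
        self.counter = 0         # stands in for the rate_ctr

    def _account(self, now):
        # Add the time elapsed since the last accounting if the flag was set.
        if self.flag:
            self.accumulated += now - self.last_update
        self.last_update = now

    def set_flag(self, now, flag):
        self._account(now)
        self.flag = flag

    def on_timer(self, now):
        # Periodic report: add newly completed whole seconds to the counter.
        self._account(now)
        whole = int(self.accumulated)
        self.counter += whole - self.reported
        self.reported = whole

tc = TimeCounterSketch()
tc.set_flag(0.0, all_allocated([True, True]))   # all allocated from t=0
tc.on_timer(2.5)                                # 2 whole seconds so far
tc.set_flag(4.0, all_allocated([True, False]))  # one lchan released at t=4
tc.on_timer(10.0)
print(tc.counter)  # 4
```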

I am open to simplifications, if possible.

There are some additional options to configure time_cc with different 
granularity, and to allow tweaking the counter precision vs response time.
These options aren't strictly necessary. I think they make sense to keep 
time_cc generally useful.

> So the question remains: Should the result be exposed as rate_ctr or as 
> stat_item?

We could do both, in fact. All the complex parts are already implemented and 
working correctly.

Next to the rate_ctr, we can just add a stat_item to time_cc, and publish the 
time count as stat item. But then we need to define the time periods and exact 
meaning of the stat_item values.
I encourage you to practically imagine the solution and you should see how the 
problem is not as trivial as it sounds at first. It is easy to add the 
stat_item, as soon as it is clear which value the stat_item should reflect. We 
already have a value implemented that counts all seconds where all channels 
were allocated since osmo-bsc started. But does it make sense to publish that 
as stat_item?

Here are the various ideas I had before we decided on a rate_ctr as the 
simplest and most effective solution:

"
I am thinking about the allAvailable{TCH,SDCCH}AllocatedTime indicators:

In 3GPP TS 52.402, there is a defined Granularity Period, which is configurable,
and suggested to have at least the settings of 5, 15, 30, 60 minutes.
The allAvailableXxxAllocatedTime indicators are defined as a cumulative counter
(CC), which I interpret as the number of seconds that all channels of the given
kind were occupied.

A "problem" is that the meaning of this cumulative value depends on the
Granularity Period.
For example, if the granularity period is 30 minutes, a cumulative value of 5
minutes for "all channels allocated" means that the cell was congested roughly
17% of the time.
If the granularity period is only 5 minutes, then the exact same value means
100% congestion.
So it appears to me that it is less confusing / more meaningful to report the
value in % of time?
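
The arithmetic behind these two percentages, as a small Python sketch (the
function name is hypothetical and only serves the illustration):

```python
def congestion_percent(all_allocated_seconds, granularity_period_minutes):
    """Share of the granularity period during which all channels were
    allocated, in percent."""
    period_seconds = granularity_period_minutes * 60
    return 100.0 * all_allocated_seconds / period_seconds

# The same cumulative value of 5 minutes (300 s), two granularity periods:
print(round(congestion_percent(300, 30), 1))  # 16.7
print(round(congestion_percent(300, 5), 1))   # 100.0
```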

Looking at details of how to implement this, it appears that we need to first
introduce this concept of a Granularity Period to our statistics API. We have a
stats reporter interval, which is usually a lot shorter than 5 minutes. Also,
this interval so far only affects the times at which an independently defined
value gets reported. IIUC we so far don't have any values that are dependent on
the reporting interval itself, where some cumulative counter value gets reset
to zero whenever a reporting period has elapsed.

Here are my ideas to implement such cumulative counters:

variant 1:
Internally, we clearly define a Granularity Period, as described in the spec.
Let's say it is set to 5 minutes.
This Granularity Period is implemented completely independently from the stats
reporting period.
At first, the cumulative counter is zero. For the next 5 minutes, we add up the
times (in seconds) where all channels were occupied. When the five minutes have
elapsed, we "push" the cumulative value to a stat item and reset the counter.
So only one value will be published in a stat item every 5 minutes, and the
value does not change while we are busy accumulating the counter value for the
next 5 minutes.
This seems most spec conforming. But this also seems kind of low resolution /
slowly responsive.
The 5 minute period would be independent from the stat reporting period, i.e.
there would be N stat reporting periods where the stat does not change at all,
e.g. for 5 minutes, and only then would we get a sum of the last 5 minutes,
again staying fixed on the dashboard for the next 5 minutes.
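
A rough Python sketch of variant 1, to show the "push and reset" behavior.
All names are hypothetical, and the once-per-second tick() is an assumption
made to keep the model short:

```python
# Hypothetical model of variant 1; not taken from any existing code.
class GranularityPeriodSketch:
    def __init__(self, period_seconds=300):
        self.period_seconds = period_seconds
        self.elapsed = 0        # seconds elapsed in the current period
        self.accumulating = 0   # all-allocated seconds in the current period
        self.published = 0      # stands in for the stat item value

    def tick(self, all_allocated):
        """Call once per second."""
        if all_allocated:
            self.accumulating += 1
        self.elapsed += 1
        if self.elapsed == self.period_seconds:
            # Period is over: publish the sum and start from zero again.
            self.published = self.accumulating
            self.accumulating = 0
            self.elapsed = 0

gp = GranularityPeriodSketch(period_seconds=300)
for second in range(299):
    gp.tick(second < 40)   # congested during the first 40 seconds
print(gp.published)        # 0, nothing published before the period ends
gp.tick(False)             # second number 300 completes the period
print(gp.published)        # 40
```

The published value stays at 0 for the whole first period, which is the slow
response described above.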

variant 2:
We have two rate counters, one incrementing for each second where all channels
were occupied (A), one incrementing for each second where at least one channel
was still available (B). These get reported continuously and also degrade as
rate counters do. Comparing one to the other, e.g. A / (A + B), gives a
continuous indication of congestion rate.
So the value will gradually rise and fall as the seconds pass, and we don't
have to wait five minutes to see that congestion has occurred.
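
The A / (A + B) comparison, sketched in Python with hypothetical names. The
per-second input list is an assumption for the sake of the example; how rate
counters age their values is not modeled here:

```python
def count_seconds(all_allocated_by_second):
    """Return (A, B): seconds with all channels occupied, and seconds with
    at least one channel still available."""
    a = sum(1 for flag in all_allocated_by_second if flag)
    b = len(all_allocated_by_second) - a
    return a, b

def congestion_ratio(a, b):
    """A / (A + B), or 0.0 if no seconds have been counted yet."""
    total = a + b
    if total == 0:
        return 0.0
    return a / total

# Last minute: congested for 15 seconds, not congested for 45 seconds.
a, b = count_seconds([s < 15 for s in range(60)])
print(a, b, congestion_ratio(a, b))  # 15 45 0.25
```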

variant 2b:
It should actually suffice to have only one rate counter incrementing for each
second where all channels were occupied.
Since rate counters implicitly count events per second, per minute, per hour,
we can see that e.g. a rate of 60 per minute means that we have been
continuously congested for the last minute.

variant 3:
We introduce a new kind of cumulative stat item which gets reset to zero
whenever a stat reporting period has elapsed.
We have two such stat items, one counting the seconds congested (A), one
counting seconds not congested (B), and a meaningful statistic comes from
comparing A to A+B. (The reporting period may then fluctuate without ill
effects.)

variant 3b:
Such a new cumulative stat item as in 3 could always implicitly report the
percentage of the elapsed reporting period.

variant 3c:
Just use a normal stat item, and introduce some callback function that can be
set up to clear the stat item to zero every time the stat report has been sent
out.

For variant 2 (rate counters), we don't need to introduce configuration of a
granularity period, nor invent a new kind of stat item. But this is also the
farthest away from how the performance indicator is defined in the spec.

We could also implement multiple variants. To me it would make sense to
implement both variant 1 and 2b, to have a most spec conforming stat item that
reports less frequently, as well as a "running congestion counter" as a rate
counter that continuously shows a curve of congestion seen per time.
"


--
To view, visit https://gerrit.osmocom.org/c/osmo-bsc/+/25973

Gerrit-Project: osmo-bsc
Gerrit-Branch: master
Gerrit-Change-Id: Icdd36f27cb54b2e1b940c9e6404ba9dd3692a310
Gerrit-Change-Number: 25973
Gerrit-PatchSet: 1
Gerrit-Owner: neels <nhofm...@sysmocom.de>
Gerrit-Reviewer: Jenkins Builder
Gerrit-CC: laforge <lafo...@osmocom.org>
Gerrit-CC: pespin <pes...@sysmocom.de>
Gerrit-Comment-Date: Mon, 01 Nov 2021 12:32:21 +0000
Gerrit-HasComments: No
Gerrit-Has-Labels: No
Gerrit-MessageType: comment
