Re: Providing near-time averaged monitoring data for mod_systemd and mod_status

Rainer Jung Mon, 27 Apr 2020 09:13:42 -0700

Am 27.04.2020 um 15:57 schrieb Joe Orton:

On Sat, Apr 25, 2020 at 08:10:40PM +0200, Rainer Jung wrote:

Patch available at


home.apache.org/~rjung/patches/httpd-trunk-mon-snaps-v1_2.patch


Very nice!  +1 from me.

Does the times_per_thread logic still make any sense?  It's always been
wrong for Linux AFAICT so maybe can just be dropped.


I will do some code archeology to find out who added it and why.

A minor nit, I think the snap_index should be read/written using
apr_atomic_t since it's going to be accessed concurrently across
threads?

Yup. Since we only have concurrent reads, it should suffice to check thevalue with apr_atomic_read32 and then use apr_atomic_set32 to set it the0 or 1.

I wonder how we best make sure, that the new monitor hook is not calledfrom modules? Currently it is no AP_DECLARE'd but contained in httpd.h.And the hook registration is in server/core.c, but the implementation inserver/scoreboard.c. No idea (not tried) whether it would be better tohave it in server/core.c as a static function or whether that's notneeded and it would suffice to move the function declaration to aprivate header file.


Thanks and regards,

Rainer

Some notes:

- in order to use the data from mod_systemd (monitor hook), which runs in
the maim process, and also from mod_status, so from child processes, it
needs to sit in shared memory. Since most of the data is tightly coupled to
the scoreboard, I added the new cached data to the global part of the
scoreboard.

- it was then IMHO a good fit to move the existing ap_get_sload(ap_sload_t
*ld) from server/util.c to server/scoreboard.c.

- ap_sload_t is used as a collection of data which can be used to generate
averaged data from a pair of ap_sload_t structs. It was extended to also
contain the total request duration plus total usr and sys CPU and the
timestamp when the data was taken. So it should now contain all data from
which mod-systemd and mod_status currently like to display derived averaged
values.

- busy and idle in ap_sload_t have been changed from int to float, because
they were already only used a percentages. IMHO it doesn't make sense to
express idle and busy as percentages based on a total of busy plus idle,
because that sum is not a meaningful total. The server can grow by creating
new processes and suddenly your busy percentage might shrink, although the
absolute number of busy threads increases. That's counter intuitive. So I
added the unused process slots to the total count, and we have three
counters that sum up to this total, busy, idle and dead. We need a better
name than "dead" for these unused process slots that might get used later.
"unused" is to close to "idle" and I don't have a good name yet.

- the new ap_mon_snap_t contains a pointer to an ap_sload_t, six averaged
values generated from two ap_sload_t and a member that conatins the time
delta between those two ap_sload_t. The ap_sload_t referenced by
ap_mon_snap_t contains the data at time t1. During the next monitor run, new
t1 data will be collected and the previous t1 data will be used as t0 data
to generate the new averages.

- the scoreboard contains two ap_mon_snap_t (plus the two ap_sload_t
referenced by them) to be able to update one of them without breaking access
by consumers to the other one. After the update the roles get switched.

- both structs, ap_sload_t and ap_mon_snap_t are declared in httpd.h,
because ap_sload_t already had been there. t might be a better fit to move
them to scoreboard.h, but I'm not sure about that and I don't know if that
would be allowed by the compatibility rules.

- also in httpd.h there are now three new function declarations. Two only
used by server/core.c as hook functions:

   int ap_scoreboard_monitor(apr_pool_t *p, server_rec *s);
   void ap_scoreboard_child_init(apr_pool_t *p, server_rec *s);

and one for public consumption:

   AP_DECLARE(void) ap_get_mon_snap(ap_mon_snap_t *ms);

- mod_systemd and mod_status now call ap_get_mon_snap(&snap) to retrieve the
latest averaged data. mod_status still uses the scoreboard in addition to
retrieve up-to-date current scalar metrics. Small adjustments to the
mod_status output plus additions to mod_systemd notification messages. Tne
STATUS in the notification of mod_systemd now looks liek this:

STATUS=Pct Idle workers 28.4; Pct Busy workers 10.6; Pct Dead workers 60.9;
Requests/sec: 5030; Bytes served/sec: 2.9MB/sec; Bytes served/request: 596
B/req; Avg Duration(ms): 5.78; Avg Concurrency: 29.1; Cpu Pct: 40.5

- scoreboard.c changes:

   - take over ap_get_sload(ap_sload_t *sl) from server/util.c.
     Added copied code from mod_status to that function to also add up the
request duration and the usr and sys CPU times.

   - ap_scoreboard_monitor() runs in the monitor hook. It calls
ap_get_sload() and then a static utility function calc_mon_data() to
calculate the new averages.

- Minor mmn change not yet part of the patch.

It compiles fine (maintainer mode) on RHEL 7 x86_64 and on Solaris 10 Sparc
and I did some tests with mod_status and mod_systemd.

Regards,

Rainer

Am 24.04.2020 um 18:32 schrieb Rainer Jung:

Am 24.04.2020 um 16:21 schrieb Joe Orton:

On Fri, Apr 24, 2020 at 12:17:19PM +0200, Rainer Jung wrote:

Thinking further: I think it would make sense to have a module or core
implement the monitor hook to generate that derived data (requests/sec,
bytes/sec, durationMs/request, avgConcurrency) in the last
monitor interval
and to provide that data to consumers like mod_systemd or - new
- mod_status
- instead of the long term averages since start. It could
probably be added
to the code that already provides "sload". That way mod_status
would also
profit from the more precise average values (taken over the last monitor
interval).


I definitely like the patch, it has bothered me that the "per second"
stats are not very useful but wasn't sure how to make it better.

This is also an interesting idea.

So you would suggest having a new monitor hook which runs REALLY_LAST in
the order, calls ap_get_sload() and stores it in a global, and then we'd
have an ap_get_cached_sload() (or whatever) which gives you the cached
data from the last iteration?  Or are you thinking of a more
sophisticated API which does the "diff" between intervals internally?


Thanks for the positive feedback.

The averaged metrics IMHO only make sense as cached data, updated in
regular intervals and provided for use by various modules (probably only
mod_systemd and mod_status).

I would like to provide the already averaged data in a struct, each
metric as a float or double. The bytes/request probably not already
human readably scaled, because it makes its use less flexible. Since we
already also have the absolute counters at that point, we can easily add
them to the same struct as 32 or 64 bit counters and return a consistent
set of data (five old values, five new values, five averages and two
time stamps). [idle(32), busy(32), requests(64), bytes(64),
duration(64); req/s, bytes/s, bytes/req, dur/s, dur/req]. So consumers
needing a consistent view can get it.

Even more so since the absolute metrics are currently not cheap to
access. We collect all of them by iterating over the scoreboard and
summing up. By adding them to the cached data, the consuming code could
decide, whether such near-time data is good enough or it needs to
acquire new and curent counters. For mod_systemd, cached data (10 second
interval) might be OK.

For some modules - like mod_status - cached averages are fine, but I
think the counters should be correct for the point in time the status
request was handled by the module. So the scoreboard statistics code in
mod_status unfortunately would not go away.´, but the data quality for
the averages would become better.

Implementation wise I am thinking about adding

    ap_hook_monitor(mon_avg_monitor, NULL, NULL, ???);

to server/util.c, which calculates the new averages and

    ap_get_mon_avg(ap_mon_avg_t *ma)

which returns the four averages in a struct similar to the existing
ap_get_loadavg() and ap_get_sload().

We might have a little hassle to make the statistics update
atomic/thread-safe (eg. two instances of the internal data struct, so
that we only need to make the switch between them after the new
calculation atomic).

About REALLY_LAST: why last? If other modules collect data via this API
and wasn't to do it in the monitor hook as well, shouldn't run the
caching of data REALLY_FIRST, so you get the new averages?

I'll try to draft and test something along these lines later today. Fun
stuff. And more comments are very welcome.

And I like, that mod_status will profit by showing betrer averages as well.

Regards,

Rainer

Am 23.04.2020 um 21:29 schrieb Rainer Jung:

Hi all,

triggered by the new mod_systemd I drafted a patch to enhance the
monitoring data it provides during the monitor hook run.

Currently it publishes important data, like idle and busy slots and
total request count, but also not so useful info like requests/second
and bytes/second as a long term average (since start). These two figues
tend to become near constant after a longer time of operation.

Since the monitor hook of the module always seems to run in the same
(parent) process, it is easy to remember the previous request and byte
count data and average only over the last monitor hook interval. This
should give more meaningful data. And is a change local to mod_systemd.

In addition we have a third metric available in the scoreboard, namely
the total request duration. From that we can get the average request
duration and the average request concurrency. This part also needs a
change to the sload structure. Maybe we need a minor MMN bump for that.

I scetched a patch under

home.apache.org/~rjung/patches/httpd-trunk-mod_systemd-interval-stats.patch


Any comments, likes or dislikes?

Thanks and regards,

Rainer


--
kippdata
informationstechnologie GmbH   Tel: 0228 98549 -0
Bornheimer Str. 33a            Fax: 0228 98549 -50
53111 Bonn                     www.kippdata.de

HRB 8018 Amtsgericht Bonn / USt.-IdNr. DE 196 457 417
Geschäftsführer: Dr. Thomas Höfer, Rainer Jung, Sven Maurmann

Re: Providing near-time averaged monitoring data for mod_systemd and mod_status

Reply via email to