Hi Marcin!

On Tue, May 12, 2020 at 12:17:19PM +0200, Marcin Deranek wrote:
> Hi,
> 
> Not long ago, while investigating some issues with one of the services, I
> stumbled across Connect Time (based on the ctime metric) graphs (see
> ctime-problem.png). It turned out that the metrics were nowhere near
> reality - they kept trying to reach the real average value, but never got
> there as each reload reset them back to 0. Keep in mind this was happening
> on a multi-process HAProxy setup for a service with a relatively
> low/medium amount of traffic. I did a bit of investigation, tested
> different scenarios / algorithms, selected the most suitable one and
> deployed it on one of our load balancers (see low-traffic-compare.png and
> medium-traffic-compare.png). ctime represents the current algorithm and
> ctime2 the new algorithm for calculating moving averages. As you can see,
> the differences can be dramatic over long periods of time.

The improvements are indeed really nice!

> Drops of the ctime
> metric represent reloads of the HAProxy instance. Spikes of ctime2 are due
> to the http-reuse option - after a reload the new instance cannot reuse
> existing connections, so it has to establish new ones and the timing goes
> up (this is expected).
> See the proposed patch. A few comments about it:
> - The patch changes the behaviour of moving average metrics (eg. ctime,
> ttime) so that the metric is not generated if there is no data. Currently
> metrics start at 0 and it's not possible to distinguish whether the
> latency is 0 (or close to 0) or there is no data. You can always check
> req_tot or lbconn (depending on the mode), but that makes things much more
> complicated, thus I decided to only expose these metrics if there is data
> (at least 1 request has been made). Gaps on the low-traffic-compare.png
> graph indicate that there were no requests during that period and thus we
> return no data.

In fact it's a different metric (though very useful). I've had the same
needs recently. The current ctime reports the avg time experienced by the
last 1024 *requests* and is documented as such, so when you want to think
in terms of user experience it's the one to consider. For example, if you
have a 99% reuse rate and 1% connect rate, even a DC far away at 100ms
will only add 1ms on average to the request time, because 99% of the time,
the connect time really is zero for the request. Ditto if you're using
the cache. Your timer reports the average connect time per *connection*,
and there it's much more suitable to analyse the infrastructure's impact
on performance. Both are equally useful but do not report the same metric.
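
Just to spell out the arithmetic of that example:

    avg per request    = 0.99 * 0ms + 0.01 * 100ms = 1ms    (current ctime)
    avg per connection = 100ms                               (your new metric)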

I'd be in favor of creating a new one for yours. Anyway we're still missing
a number of other ones like the average TLS handshake cost on each side,
which should also be useful both per connection and per request. I'm saying
this in case it helps figure out a naming pattern for this one :-)

> - I haven't changed the similar swrate_add_scaled function as it's not
> used yet and its description feels a bit misleading to me.

So I'd rather have a different set of functions for this (and we'd update
the stats code to use these). The reason is that these ones were made to
be usable at extremely low cost in the lower layers. For example we do use
them in the activity measurements that are at the heart of the polling loop
and I also have a pending patch to use them in pool_free() under a thread
lock where we definitely don't want to spend our time adding an integer
divide when the current code only requires one addition and a right shift.
Typically, for these core operations that can easily be called billions of
times a day, adding extra instructions just to improve the accuracy of the
measurement during the very first millisecond would be a waste.
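
As an illustration, when n is a compile-time constant power of two (say
n == 512, purely as an example), the compiler turns that divide into exactly
that, an addition and a right shift:

                new_sum = old_sum - ((old_sum + 511) >> 9) + v;

so there is no integer division at run time on these hot paths.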

We could call the new ones swrate_add_range() and swrate_avg_range(), or
_partial for example, or anything more suitable you'd come up with.

Such a patch could have a low enough impact to be backported to the recent
versions (2.1 and 2.0) I guess.

I like your approach, because I've very recently been thinking about this
as well but came up with a bad idea which I didn't want to implement: I
thought we could have the first sample fill the whole average. But that
required tracking the fact that it was the first sample, which I thought
was not easy. Actually, for each of them you managed to figure out from a
companion metric whether there are samples or not, and your approach is
more accurate than mine, which gave too much importance to the first
sample.
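
Roughly something like this, using the req_tot counter you mentioned as the
companion metric (the field names below are illustrative only):

                /* only report the sliding average when at least one request
                 * was accounted for, otherwise report no data at all */
                if (counters->req_tot)
                        avg = swrate_avg(counters->ctime, TIME_STATS_SAMPLES);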

I've seen a few things however in your patch:

> -static inline unsigned int swrate_add(unsigned int *sum, unsigned int n, unsigned int v)
> +static inline unsigned int swrate_add(unsigned int *sum, unsigned int n, unsigned int v, int all)
>  {
>       unsigned int new_sum, old_sum;
>  
>       old_sum = *sum;
>       do {
> -             new_sum = old_sum - (old_sum + n - 1) / n + v;
> +             new_sum = old_sum + v - (all ? old_sum / n : 0);

Be careful above, you're losing precision by not adding n-1 to the value,
you're systematically rounding it down to the nearest multiple of n while
it was rounded up, so your older values will fade away quicker. If you want
to maintain accuracy, you need to keep the fractional part of the operation.
It's even possible to further improve the accuracy by only adding n/2:

                new_sum = old_sum + v - (all ? (old_sum + n / 2) / n : 0);

In addition, since the sole point of "all" above is to ignore the use of
"n" applied to the past value, it'd be better to fuse it into "n" by
declaring that if n==0 we simply don't fade out the old value, which
further simplifies the output code:

                new_sum = old_sum + v - (n ? (old_sum + n / 2) / n : 0);

Last, in terms of readability I find it preferable to have the old sum
computation on one side and the new sample (v) on the other side. i.e.

                new_sum = v + old_sum - (n ? (old_sum + n / 2) / n : 0);

(or v at the end, I don't care, all produce the exact same code anyway).

>  static inline unsigned int swrate_avg(unsigned int sum, unsigned int n)
>  {
> -     return (sum + n - 1) / n;
> +     return sum / n;
>  }

Same point for this one, values are systematically rounded down while they
should not be. That makes me realize that your variant would not be
affected by keeping the original one here, which means you'd just need one
extra function to add samples over a partial range of samples, just like
we have swrate_add_scaled() that's not used anymore.
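
For the record, here is a rough sketch of what I have in mind for that
extra function, using the swrate_add_range() name suggested above and the
n==0 convention to mean "don't fade out the old value"; the atomic CAS loop
is assumed to be the same as in the existing swrate_add(), so this is not a
final implementation:

        /* Add sample <v> to the sliding sum <sum> configured for <n> samples.
         * If <n> is zero, the previous value is not faded out at all, which
         * is what we want while the window is not full yet (e.g. right after
         * a reload). Rounding with n/2 keeps the accuracy discussed above.
         */
        static inline unsigned int swrate_add_range(unsigned int *sum,
                                                    unsigned int n, unsigned int v)
        {
                unsigned int new_sum, old_sum;

                old_sum = *sum;
                do {
                        new_sum = v + old_sum - (n ? (old_sum + n / 2) / n : 0);
                } while (!HA_ATOMIC_CAS(sum, &old_sum, new_sum));
                return new_sum;
        }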
 
I'd also ask you to split your patch in 2 :

  - one which improves the freq_ctr with the new function(s)
  - one which adds the new metric, its update and reporting to the
    various call places.

In addition, I think there will be fewer changes with a separate metric.

Thanks!
Willy
