On 26/07/2016 06:56 PM, Willy Tarreau wrote:
> On Tue, Jul 26, 2016 at 05:51:08PM +0200, Pavlos Parissis wrote:
>> In all my setups I have nbproc > 1, and after a lot of changes to how I
>> aggregate HAProxy stats and to what most people want to see on graphs, I
>> came up with something like the following:
>>
>> {
>>     "frontend": {
>>         "www.haproxy.org": {
>>             "bin": "999999999999",
>>             "lbtot": "555555",
>>             ...
>>         },
>>         "www.haproxy.com": {
>>             "bin": "999999999999",
>>             "lbtot": "555555",
>>             ...
>>         },
>>     },
>>     "backend": {
>>         "www.haproxy.org": {
>>             "bin": "999999999999",
>>             "lbtot": "555555",
>>             ....
>>             "server": {
>>                 "srv1": {
>>                     "bin": "999999999999",
>>                     "lbtot": "555555",
>>                    ....
>>                 },
>>                 ...
>>             },
>>         },
>>     },
>>     "haproxy": {
>>         "PipesFree": "555",
>>         ...
>>         "per_process": {
>>             "id1": {
>>                 "PipesFree": "555",
>>                 "Process_num": "1",
>>                 ...
>>             },
>>             "id2": {
>>                 "PipesFree": "555",
>>                 "Process_num": "2",
>>                 ...
>>             },
>>             ...
>>         },
>>     },
>>     "server": {
>>         "srv1": {
>>             "bin": "999999999999",
>>             "lbtot": "555555",
>>             ...
>>         },
>>         ...
>>     },
>> }
>>
>>
>> Let me explain a bit:
>>
>> - It is very useful and handy to know stats for a server per backend but
>> also across all backends. Thus, I include a top-level key 'server' which
>> holds stats for each server across all backends. A few server stats have to
>> be excluded as they are meaningless in this context, for example status,
>> lastchg, check_duration, check_code and a few others. For those which aren't
>> counters but fixed numbers, you want to either sum them (slim) or take the
>> average (weight). I don't do the latter in my setup.
> 
> You probably have not looked at the output of "show stats typed", it
> gives you the nature of each value letting you know how to aggregate
> them (min, max, avg, sum, pick any, etc).
> 

I have seen it, but it isn't available in 1.6. It could simplify my code; I
should give it a try.
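
As a rough sketch of what aggregating by value nature could look like on my
side, assuming each field is tagged with one of the rules you list (sum, avg,
min, max, pick any). The RULES table below is hypothetical and hand-written; in
practice it would be derived from the typed output rather than hardcoded, and
the rule names are my own labels, not HAProxy's:

```python
from statistics import mean

# Hypothetical field -> rule table. Field names are taken from this thread;
# the rule names are labels assumed for this sketch only.
RULES = {
    "bin": "sum",
    "lbtot": "sum",
    "slim": "sum",
    "weight": "avg",
    "status": "skip",   # meaningless once merged across processes
    "lastchg": "skip",
}

REDUCERS = {"sum": sum, "avg": mean, "min": min, "max": max}


def aggregate(per_process, rules=RULES):
    """Merge a list of {field: value} dicts (one per process) into one dict."""
    collected = {}
    for stats in per_process:
        for name, value in stats.items():
            collected.setdefault(name, []).append(value)

    result = {}
    for name, values in collected.items():
        rule = rules.get(name, "skip")  # unknown fields are left out
        if rule == "skip":
            continue
        if rule == "any":
            # Same value in every process, so keep the first one as-is.
            result[name] = values[0]
        else:
            result[name] = REDUCERS[rule]([int(v) for v in values])
    return result
```

Fields without a rule are dropped on purpose, so a new metric never gets summed
by accident.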

>> - Aggregation across multiple processes for haproxy stats (show info output)
> 
> It's not only "show info", this one reports only the process health.
> 
>> As you can see, I provide stats per process and across all processes.
>> It has proven very useful to know the CPU utilization per process. We
>> depend on the kernel to distribute incoming connections to all processes,
>> and so far it works very well, but sometimes you see a single process
>> consuming a lot of CPU, and if you don't provide percentiles or stats per
>> process then you are going to miss it. The metrics about uptime, version,
>> description and a few others can be excluded from the aggregation.
> 
> These last ones are in the "pick any" type of aggregation I was talking about.
> 
>> - nbproc > 1 and aggregation for frontend/backend/server
>> My proposal doesn't cover stats for frontend/backend/server per haproxy
>> process.
> 
> But that's precisely the limitation I'm reporting :-)
> 
>> The stats are already aggregated and a few metrics are excluded, for
>> example all the status stuff. Each process performs healthchecking, so they
>> act as little brains which never agree on the status of a server, as they
>> run their checks at different intervals.
> 
> Absolutely, but at least you want to see their stats. For example how many
> times a server has switched state per process then in total (meaning a
> proportional amount of possibly visible issues).
> 

True, but in setups with ECMP in front of N HAProxy nodes which run in nbproc
mode, you offload application healthchecking to a dedicated daemon which runs
on the servers (service discovery + service availability with
consul/zookeeper stuff), and you only run TCP checks from HAProxy.

In our setup we don't really care about how many times a server flapped; it
doesn't tell us anything we don't already know: the application is in a
broken state.

But, other people may find it useful.

> My issue is that if the *format* doesn't support per-process stats, we'll have
> to emit a new format 3 months later for all the people who want to process it.
> We've reworked the stats dump to put an end to the problem where depending on
> the output format you used to have different types of information, and there
> was no single representation carrying them all at once. For me now it's
> essential that if we prepare a new format it's not stripped down from the
> info people need, otherwise it will automatically engender yet another format.
> 

Agreed. I am fine with giving per-process stats for servers/frontends/backends.
Adding another top-level key 'per_process' to my proposal should be a good
start:

{
    "per_process": {
        "proc1": {
            "frontend": {
                "www.haproxy.org": {
                    "bin": "999999999999",
                    "lbtot": "555555",
                    ...
                },
                "www.haproxy.com": {
                    "bin": "999999999999",
                    "lbtot": "555555",
                    ...
                },
            },
            "backend": {
                "www.haproxy.org": {
                    "bin": "999999999999",
                    "lbtot": "555555",
                    ....
                    "server": {
                        "srv1": {
                            "bin": "999999999999",
                            "lbtot": "555555",
                        ....
                        },
                        ...
                    },
                },
            },
            "haproxy": {
                "PipesFree": "555",
                ...
            },
            "server": {
                "srv1": {
                    "bin": "999999999999",
                    "lbtot": "555555",
                    ...
                },
                ...
            },
        },
        ...
    },
    "frontend": {
        "www.haproxy.org": {
            "bin": "999999999999",
            "lbtot": "555555",
            ...
        },
        "www.haproxy.com": {
            "bin": "999999999999",
            "lbtot": "555555",
            ...
        },
    },
    "backend": {
        "www.haproxy.org": {
            "bin": "999999999999",
            "lbtot": "555555",
            ....
            "server": {
                "srv1": {
                    "bin": "999999999999",
                    "lbtot": "555555",
                   ....
                },
                ...
            },
        },
    },
    "haproxy": {
        "PipesFree": "555",
        ...
    },
    "server": {
        "srv1": {
            "bin": "999999999999",
            "lbtot": "555555",
            ...
        },
        ...
    },
}
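
To make the layout above concrete, here is a minimal sketch of how the
top-level sections could be derived from the per-process ones. It assumes every
leaf is a counter that can simply be summed, which is not true for all metrics
(status, weight and so on need their own rule), and the function names are made
up for this example:

```python
def merge_sum(target, source):
    """Recursively add the numeric leaves of source into target."""
    for key, value in source.items():
        if isinstance(value, dict):
            merge_sum(target.setdefault(key, {}), value)
        else:
            target[key] = target.get(key, 0) + int(value)
    return target


def build_report(per_process):
    """Build the proposed layout from {"proc1": {...}, "proc2": {...}}.

    Each process dict holds sections such as "frontend", "backend",
    "haproxy" and "server", mirroring the structure proposed above.
    """
    report = {}
    for stats in per_process.values():
        merge_sum(report, stats)
    # Keep the raw per-process data next to the aggregated sections.
    report["per_process"] = per_process
    return report
```

Note that the aggregated leaves come out as integers here, while the
per-process values stay as the original strings.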

What do you think?

Cheers,
Pavlos

