Hi Emeric,

On 4/12/19 5:26 PM, Emeric Brun wrote:

Do you have ssl enabled on the server side?

Yes, SSL is enabled on both the frontend and the backend, with SSL checks enabled.

If that is the case, could you replace the health check with a simple TCP check 
(without SSL)?

What I noticed before is that if I (re)start HAProxy and reload immediately, no stuck processes are present. If I wait before reloading, stuck processes show up. After disabling checks (I still keep SSL enabled for normal traffic) reloads work just fine (tried many times). Do you know how to enable TCP health checks while keeping SSL for non-healthcheck requests?
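
I suppose something along these lines would do it, though I have not verified it yet (backend name and address are made up; 'no-check-ssl' is supposed to strip SSL from the check only):

backend be_app
    option tcp-check
    # 'ssl' keeps regular traffic encrypted; 'no-check-ssl' should turn
    # the health check into a plain TCP connect instead of a TLS handshake
    server srv1 10.0.0.1:443 check ssl verify none no-check-ssl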

Regarding the 'show info'/lsof output, it seems there are no more sessions on the 
client side, but there are remaining SSL jobs (CurrSslConns), and I suspect the 
health checks miss a cleanup of their SSL sessions using the QAT (this is just an 
assumption).

In general, the instance where I test QAT does not have any "real" client traffic, except for a small amount of healthcheck requests per frontend, which are handled internally by HAProxy itself. Still, a TLS handshake needs to take place for those. There are many more backend healthchecks. Looks like your assumption was correct.
Regards,

Marcin Deranek

On 4/12/19 4:43 PM, Marcin Deranek wrote:
Hi Emeric,

On 4/10/19 2:20 PM, Emeric Brun wrote:

On 4/10/19 1:02 PM, Marcin Deranek wrote:
Hi Emeric,

Our process limit in the QAT configuration is quite high (128) and I was able to 
run 100+ openssl processes without a problem. According to Joel from Intel, the 
problem is in the cleanup code - presumably when HAProxy exits and frees up QAT 
resources. I will try to see if I can get more debug information.
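
For reference, the limit I mean is NumProcesses in the QAT device configuration, roughly like the excerpt below (the device file, e.g. /etc/dh895xcc_dev0.conf, and the section name depend on the platform and on the process section the engine uses; ours differs slightly):

[SSL]
NumberCyInstances = 1
NumberDcInstances = 0
NumProcesses = 128
LimitDevAccess = 0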

I've just taken a look.

The engine deinit is called here:

haproxy/src/ssl_sock.c
#ifndef OPENSSL_NO_ENGINE
void ssl_free_engines(void) {
          struct ssl_engine_list *wl, *wlb;
          /* free up engine list */
          list_for_each_entry_safe(wl, wlb, &openssl_engines, list) {
                  ENGINE_finish(wl->e);
                  ENGINE_free(wl->e);
                  LIST_DEL(&wl->list);
                  free(wl);
          }
}
#endif
...
#ifndef OPENSSL_NO_ENGINE
          hap_register_post_deinit(ssl_free_engines);
#endif
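
If it helps, a quick way to confirm whether this deinit is reached (and on which call it hangs) would be to add traces around the engine calls, e.g. (untested sketch of the same function; it additionally needs <stdio.h> and <unistd.h>):

void ssl_free_engines(void) {
          struct ssl_engine_list *wl, *wlb;
          /* free up engine list, logging each step to stderr */
          list_for_each_entry_safe(wl, wlb, &openssl_engines, list) {
                  fprintf(stderr, "[%d] ENGINE_finish(%s)\n",
                          getpid(), ENGINE_get_id(wl->e));
                  ENGINE_finish(wl->e);
                  fprintf(stderr, "[%d] ENGINE_free(%s)\n",
                          getpid(), ENGINE_get_id(wl->e));
                  ENGINE_free(wl->e);
                  LIST_DEL(&wl->list);
                  free(wl);
          }
          fprintf(stderr, "[%d] all engines freed\n", getpid());
}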

I don't know how many haproxy processes you are running, but if I describe the 
complete process scenario, you may note that we reach a limit:

- The master sends a signal to the older processes; those processes unbind and 
stop accepting new connections, but continue to serve the remaining sessions 
until the end.
- New processes are started immediately; they init the engine and accept new 
connections.
- When no more sessions remain on an old process, it calls the deinit function 
of the engine before exiting.

It's very unlikely it's the limit, as I lowered the number of HAProxy processes 
(from 10 to 4) while keeping QAT NumProcesses equal to 32. HAProxy would have a 
problem with this limit while spawning new instances, not while tearing down old 
ones. In such a case QAT would not be initialized for some HAProxy instances 
(you would see 1 thread vs 2 threads). More about threads below.

What I noticed is that each HAProxy process with QAT enabled has 2 threads 
(LWPs) - it looks like QAT adds an extra thread to the process itself. Could 
that extra thread possibly mess up the HAProxy termination sequence?
Our setup runs HAProxy in multi-process mode - no threads (or 1 thread per 
process, if you wish).
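
For the record, this is how I count them (34858 is one of the worker PIDs; the NLWP column shows 2 with QAT enabled and 1 without):

[root@externallb-124 ~]# ps -Lp 34858 -o pid,lwp,nlwp,comm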

I also suppose that the old processes are stuck because there are some sessions 
which never ended. Perhaps I'm wrong, but an strace on an old process could be 
interesting, to know why those processes are stuck.

strace only shows the process sitting idle in its epoll loop:

[pid 11392] 23:24:43.164619 epoll_wait(4,  <unfinished ...>
[pid 11392] 23:24:43.164687 <... epoll_wait resumed> [], 200, 0) = 0
[pid 11392] 23:24:43.164761 epoll_wait(4,  <unfinished ...>
[pid 11392] 23:24:43.953203 <... epoll_wait resumed> [], 200, 788) = 0
[pid 11392] 23:24:43.953286 epoll_wait(4,  <unfinished ...>
[pid 11392] 23:24:43.953355 <... epoll_wait resumed> [], 200, 0) = 0
[pid 11392] 23:24:43.953419 epoll_wait(4,  <unfinished ...>
[pid 11392] 23:24:44.010508 <... epoll_wait resumed> [], 200, 57) = 0
[pid 11392] 23:24:44.010589 epoll_wait(4,  <unfinished ...>

There are no connections: the stuck process only has a UDP socket on a random port:

[root@externallb-124 ~]# lsof -p 6307|fgrep IPv4
hapee-lb 6307 lbengine   83u     IPv4         3598779351      0t0 UDP *:19573


You can also use the 'master CLI' via '-S'; there you can check whether any 
sessions remain on those older processes (the doc is available in management.txt).
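
For example, with socat (a sketch, assuming the master CLI listens on 127.0.0.1:1234):

echo "show proc" | socat stdio tcp4-connect:127.0.0.1:1234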

Before reload:
* systemd
  Main PID: 33515 (hapee-lb)
    Memory: 1.6G
    CGroup: /system.slice/hapee-1.8-lb.service
            ├─33515 /opt/hapee-1.8/sbin/hapee-lb -Ws -f /etc/lb_engine/haproxy.cfg -p /run/hapee-lb.pid -S 127.0.0.1:1234
            ├─34858 /opt/hapee-1.8/sbin/hapee-lb -Ws -f /etc/lb_engine/haproxy.cfg -p /run/hapee-lb.pid -S 127.0.0.1:1234
            ├─34859 /opt/hapee-1.8/sbin/hapee-lb -Ws -f /etc/lb_engine/haproxy.cfg -p /run/hapee-lb.pid -S 127.0.0.1:1234
            ├─34860 /opt/hapee-1.8/sbin/hapee-lb -Ws -f /etc/lb_engine/haproxy.cfg -p /run/hapee-lb.pid -S 127.0.0.1:1234
            └─34861 /opt/hapee-1.8/sbin/hapee-lb -Ws -f /etc/lb_engine/haproxy.cfg -p /run/hapee-lb.pid -S 127.0.0.1:1234
* master CLI
show proc
#<PID>          <type>          <relative PID>  <reloads>       <uptime>
33515           master          0               0               0d 00h00m31s
# workers
34858           worker          1               0               0d 00h00m31s
34859           worker          2               0               0d 00h00m31s
34860           worker          3               0               0d 00h00m31s
34861           worker          4               0               0d 00h00m31s

After reload:
* systemd
  Main PID: 33515 (hapee-lb)
    Memory: 3.1G
    CGroup: /system.slice/hapee-1.8-lb.service
            ├─33515 /opt/hapee-1.8/sbin/hapee-lb -Ws -f /etc/lb_engine/haproxy.cfg -p /run/hapee-lb.pid -S 127.0.0.1:1234 -sf 34858 34859 34860 34861 -x /run/lb_engine/process-1.sock
            ├─34858 /opt/hapee-1.8/sbin/hapee-lb -Ws -f /etc/lb_engine/haproxy.cfg -p /run/hapee-lb.pid -S 127.0.0.1:1234
            ├─34859 /opt/hapee-1.8/sbin/hapee-lb -Ws -f /etc/lb_engine/haproxy.cfg -p /run/hapee-lb.pid -S 127.0.0.1:1234
            ├─34860 /opt/hapee-1.8/sbin/hapee-lb -Ws -f /etc/lb_engine/haproxy.cfg -p /run/hapee-lb.pid -S 127.0.0.1:1234
            ├─34861 /opt/hapee-1.8/sbin/hapee-lb -Ws -f /etc/lb_engine/haproxy.cfg -p /run/hapee-lb.pid -S 127.0.0.1:1234
            ├─41871 /opt/hapee-1.8/sbin/hapee-lb -Ws -f /etc/lb_engine/haproxy.cfg -p /run/hapee-lb.pid -S 127.0.0.1:1234 -sf 34858 34859 34860 34861 -x /run/lb_engine/process-1.sock
            ├─41872 /opt/hapee-1.8/sbin/hapee-lb -Ws -f /etc/lb_engine/haproxy.cfg -p /run/hapee-lb.pid -S 127.0.0.1:1234 -sf 34858 34859 34860 34861 -x /run/lb_engine/process-1.sock
            ├─41873 /opt/hapee-1.8/sbin/hapee-lb -Ws -f /etc/lb_engine/haproxy.cfg -p /run/hapee-lb.pid -S 127.0.0.1:1234 -sf 34858 34859 34860 34861 -x /run/lb_engine/process-1.sock
            └─41874 /opt/hapee-1.8/sbin/hapee-lb -Ws -f /etc/lb_engine/haproxy.cfg -p /run/hapee-lb.pid -S 127.0.0.1:1234 -sf 34858 34859 34860 34861 -x /run/lb_engine/process-1.sock
* master CLI
show proc
#<PID>          <type>          <relative PID>  <reloads>       <uptime>
33515           master          0               1               0d 00h01m33s
# workers
41871           worker          1               0               0d 00h00m45s
41872           worker          2               0               0d 00h00m45s
41873           worker          3               0               0d 00h00m45s
41874           worker          4               0               0d 00h00m45s
# old workers
34858           worker          [was: 1]        1               0d 00h01m33s
34859           worker          [was: 2]        1               0d 00h01m33s
34860           worker          [was: 3]        1               0d 00h01m33s
34861           worker          [was: 4]        1               0d 00h01m33s

and

@!34858 show info
Name: HAProxy
Version: 1.8.0-2.0.0-195.793
Release_date: 2019/03/19
Nbthread: 1
Nbproc: 4
Process_num: 1
Pid: 34858
Uptime: 0d 0h03m24s
Uptime_sec: 204
Memmax_MB: 0
PoolAlloc_MB: 1
PoolUsed_MB: 1
PoolFailed: 0
Ulimit-n: 2006423
CurrConns: 0
CumConns: 354
CumReq: 342
CurrSslConns: 20
CumSslConns: 35928
Maxpipes: 0
PipesUsed: 0
PipesFree: 0
ConnRate: 0
ConnRateLimit: 0
MaxConnRate: 65
SessRate: 0
SessRateLimit: 0
MaxSessRate: 62
SslRate: 0
SslRateLimit: 0
MaxSslRate: 52
SslFrontendKeyRate: 0
SslFrontendMaxKeyRate: 52
SslFrontendSessionReuse_pct: 0
SslBackendKeyRate: 0
SslBackendMaxKeyRate: 2988
SslCacheLookups: 0
SslCacheMisses: 0
CompressBpsIn: 0
CompressBpsOut: 0
CompressBpsRateLim: 0
Tasks: 5849
Run_queue: 1
Idle_pct: 100
Stopping: 1
Jobs: 25
Unstoppable Jobs: 4
Listeners: 4
DroppedLogs: 0

Regards,

Marcin Deranek

