Hi Emeric,
The process limit in our QAT configuration is quite high (128), and I was
able to run 100+ openssl processes without a problem. According to Joel
from Intel, the problem is in the cleanup code, presumably when HAProxy
exits and frees up QAT resources. I will try to get more debug
information.
Regards,
Marcin Deranek
On 4/9/19 5:17 PM, Emeric Brun wrote:
Hi Marcin,
On 4/9/19 3:07 PM, Marcin Deranek wrote:
Hi Emeric,
I have followed all the instructions and got to the point where HAProxy starts and does its job using
QAT (backend health checks work and the frontend can serve content over HTTPS). The problems start
when HAProxy gets reloaded. With our current configuration, old HAProxy processes do not exit on
reload, so after a reload you end up with 2 generations of HAProxy processes: those started before
the reload and those started after it. I tried to find out under what conditions HAProxy processes
get "stuck", but I was not able to replicate it consistently. In one case it was related to the
number of backend servers with 'ssl' on their line, but adding 'ssl' to some other servers elsewhere
had no effect. Interestingly, in some cases, for example with a simple configuration (1 frontend + 1
backend), HAProxy produced errors on reload (see attachment). In those cases processes rarely got
"stuck" even though errors were present.
/dev/qat_adf_ctl is group writable for the group HAProxy runs as. Any help to
get this fixed / resolved would be welcome.
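
For anyone wanting to double-check the permission claim above, here is a
minimal sketch that tests the group-write bit on a path. The helper name
group_can_write is made up for illustration. It only inspects the mode bits
and does not verify that the HAProxy user actually belongs to the owning group.

```python
import os
import stat


def group_can_write(path: str) -> bool:
    """Return True if the group-write bit is set in the file's mode."""
    return bool(os.stat(path).st_mode & stat.S_IWGRP)


def describe(path: str) -> str:
    """Return a short summary: symbolic mode plus owning gid."""
    st = os.stat(path)
    return f"{path}: mode={stat.filemode(st.st_mode)} gid={st.st_gid}"
```

On the QAT host this would be called as group_can_write("/dev/qat_adf_ctl").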
Regards,
Marcin Deranek
I've checked errors.txt: all the messages were written by the engine and
are not part of the haproxy code. I can only speculate for now, but I think
the first error we face is due to a limit on the number of processes trying
to access the engine: the reload doubles the number of processes trying to
attach to the engine. Perhaps this issue can be bypassed by tweaking the qat
configuration file (some advice from Intel would be welcome).
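
To make the arithmetic behind this hypothesis concrete, here is a small
sketch. It assumes every HAProxy process attaches to the engine and that old
and new generations coexist during a reload. The function names and the
per-generation process counts are hypothetical (the thread does not state
how many processes HAProxy runs). 128 is the limit quoted in the thread.

```python
QAT_PROCESS_LIMIT = 128  # limit quoted in the thread


def peak_processes(per_generation: int, generations: int = 2) -> int:
    """Processes attached at once while old and new generations coexist."""
    return per_generation * generations


def fits_limit(per_generation: int, limit: int = QAT_PROCESS_LIMIT,
               generations: int = 2) -> bool:
    """True if the peak process count during a reload stays within the limit."""
    return peak_processes(per_generation, generations) <= limit
```

With these assumptions, 64 processes per generation would just fit during a
reload (2 * 64 = 128), while 65 would not.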
As for the old stuck processes: I think the growth in the number of processes
also triggers errors on the already attached ones in the qat engine, but for
now I do not know how these errors are, or should be, raised to the
application. It appears that they are currently not handled, and that is why
the processes would get stuck (sessions may still appear valid to haproxy, so
the old process keeps waiting for them to end). We expected the errors to be
raised by the openssl API, but that appears not to be the case. We have to
check whether we fail to handle an error when polling events on the file
descriptor used to communicate with the engine.
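
As a generic illustration of the last point (this is not haproxy or QAT
engine code), the sketch below polls a file descriptor and checks whether the
kernel reported an error or hang-up condition on it. The helper names
poll_once and has_error are made up for this example, and a socket pair
stands in for the engine's descriptor. It relies on select.poll, so it
assumes a Unix-like system.

```python
import select
import socket

# Conditions that indicate the descriptor is no longer usable.
ERROR_EVENTS = select.POLLERR | select.POLLHUP | select.POLLNVAL


def poll_once(fd: int, timeout_ms: int = 0) -> int:
    """Return the event mask reported for fd, or 0 if nothing is pending."""
    poller = select.poll()
    # POLLERR and POLLHUP are always reported; only POLLIN must be requested.
    poller.register(fd, select.POLLIN)
    for ready_fd, events in poller.poll(timeout_ms):
        if ready_fd == fd:
            return events
    return 0


def has_error(events: int) -> bool:
    """True if the mask carries an error or hang-up condition."""
    return bool(events & ERROR_EVENTS)


if __name__ == "__main__":
    ours, peer = socket.socketpair()
    print("before peer close:", has_error(poll_once(ours.fileno())))
    peer.close()  # simulate the other side going away
    print("after peer close: ", has_error(poll_once(ours.fileno(), 100)))
    ours.close()
```

An event loop that only reacts to readable/writable events and never looks at
these bits could keep waiting on a descriptor that is already dead, which is
the kind of missed error suspected above.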
So we have to dig deeper, and any help from the Intel guys or QAT-aware devs
will be appreciated.
Emeric