Re: QAT intermittent healthcheck errors

Emeric Brun Tue, 09 Apr 2019 08:18:06 -0700

Hi Marcin,

On 4/9/19 3:07 PM, Marcin Deranek wrote:
> Hi Emeric,
> 
> I have followed all instructions and I got to the point where HAProxy starts 
> and does the job using QAT (backend healthchecks work and I frontend can 
> provide content over HTTPS). The problems starts when HAProxy gets reloaded. 
> With our current configuration on reload old HAProxy processes do not exit, 
> so after reload you end up with 2 generations of HAProxy processes: before 
> reload and after reload. I tried to find out what are conditions in which 
> HAProxy processes get "stuck" and I was not able to replicate it 
> consistently. In one case it was related to amount of backend servers with 
> 'ssl' on their line, but trying to add 'ssl' to some other servers in other 
> place had no effect. Interestingly in some cases for example with simple 
> configuration (1 frontend + 1 backend) HAProxy produced errors on reload (see 
> attachment) - in those cases processes rarely got "stuck" even though errors 
> were present.
> /dev/qat_adf_ctl is group writable for the group HAProxy runs on. Any help to 
> get this fixed / resolved would be welcome.
> Regards,
> 
> Marcin Deranek


I've checked the errors.txt and all the messages were written by the engine and 
are not part of the haproxy code. I can only do supposition for now but I think 
we face a first error due to a limitation of the amount of processes trying to 
access the engine: the reload will double the number of processes trying to 
attach the engine. Perhaps this issue can be bypassed tweaking the qat 
configuration file (some advise, from intel would be wellcome).

For the old stucked processes: I think the grow of processes also triggers 
errors on already attached ones in the qat engine but currently I ignore the 
way this errors are/should be raised to the application, it appears that they 
are currently not handled and that's why processes would be stuck (sessions may 
appear still valid for haproxy so the old process continues to wait for their 
end). We expected they were raised by the openssl API but it appears to not be 
the case. We have to check if we miss to handle an error  polling events on the 
file descriptor used to communicate with engine.


So we have to dig deeper and any help from Intel's guy or Qat aware devs will 
be appreciate.

Emeric

Re: QAT intermittent healthcheck errors

Reply via email to