>> I checked the archives and found a patch from some time ago that was
>> never merged.  It wasn't verified to resolve the "pause timeout" problem,
>> but it could indeed solve it.  It wasn't merged because we lacked
>> verification that it resolved the problem.
> 
> Great, I'll try it in the next few days; the good news is that the
> problem should be easily reproducible.

Hmm...
Not so easily...

I applied that patch to all physical hosts and have not seen that message
for two days, regardless of the number of RX buffers in the adapter.

But I also do not see it if I downgrade to the previous image (without
that patch) :( although I did not test that configuration for very long,
only several hours.

I didn't apply the patch to the VM, and do not see that message there
either. What I did instead:
* Rescheduled the VM to a higher CPU priority (actually real-time)
* Assigned a higher blkio priority to that VM
* Assigned a low blkio priority to bulk workloads on the node where that VM runs.
So the original problem seems to have different causes in the bare-metal
and VM cases.
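For reference, the VM tuning steps above can be sketched roughly like
this (the PID variable, cgroup names, and weight values are illustrative
assumptions, not what I actually ran; this assumes a cgroup v1 blkio
controller and a qemu-style VM process):

```shell
# Put the VM's process into the real-time round-robin scheduling class
# ($VM_PID is hypothetical; in practice it would come from e.g. a pidfile).
chrt --rr -p 1 "$VM_PID"

# Raise the VM's blkio weight and lower the bulk workload's weight
# (cgroup v1 blkio controller; the "vm" and "bulk" group names are made up).
echo 800 > /sys/fs/cgroup/blkio/vm/blkio.weight
echo 100 > /sys/fs/cgroup/blkio/bulk/blkio.weight
```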

In the former case the patch seems to help; it should help in the VM
case too.

There were lots of '[TOTEM ] Retransmit List:' messages on the
bare-metal hosts until I returned the eth RX ring size back to 256
buffers (from 4096). After some thinking, this is probably expected,
because more buffers add some latency, which is bad for corosync. I am
not sure why that would affect the NAPI polling rate, though.
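For completeness, the ring-size change above is done with ethtool
(the interface name eth0 is an assumption):

```shell
# Show the current and maximum RX/TX ring sizes for the interface.
ethtool -g eth0

# Shrink the RX ring back to 256 descriptors (down from 4096).
ethtool -G eth0 rx 256
```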

I'll try to upgrade the igb driver (newer versions have the
InterruptThrottleRate tuning parameter) and experiment again with the
ring buffers and that rate.
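If the upgrade works out, the plan is to set that parameter at module
load time, along these lines (the throttle value is an illustrative
guess, not a recommendation; one value applies per port):

```shell
# Load the newer igb driver with a fixed interrupt throttle rate
# (8000 interrupts/sec here is only an example value).
modprobe igb InterruptThrottleRate=8000,8000

# Or make it persistent across reboots (file name is arbitrary):
echo "options igb InterruptThrottleRate=8000,8000" > /etc/modprobe.d/igb.conf
```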

Again, the driver version I currently have may have bugs when operating
with big ring buffers, which lead to 500 ms blocking under high load.

BTW, are those 'Retransmit List:' messages harmful?


Best,
Vladislav
_______________________________________________
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais