Edwin,

On 20/02/2019 13:08, Jan Friesse wrote:
Edwin Török wrote:
On 20/02/2019 07:57, Jan Friesse wrote:
Edwin,


On 19/02/2019 17:02, Klaus Wenninger wrote:
On 02/19/2019 05:41 PM, Edwin Török wrote:
On 19/02/2019 16:26, Edwin Török wrote:
On 18/02/2019 18:27, Edwin Török wrote:
Did a test today with CentOS 7.6 with upstream kernel and with
4.20.10-1.el7.elrepo.x86_64 (tested both with upstream SBD, and our
patched [1] SBD) and was not able to reproduce the issue yet.
I was finally able to reproduce this using only upstream components
(it seems easier to reproduce if we use our patched SBD, but I was
able to reproduce it with upstream packages alone, unpatched by us):

Just out of curiosity: What did you patch in SBD?
Sorry if I missed the answer in the previous communication.

It is mostly this PR, which calls getquorate quite often (a more
efficient impl. would be to use the quorum notification API like
dlm/pacemaker do, although see concerns in
https://lists.clusterlabs.org/pipermail/users/2019-February/016249.html):

https://github.com/ClusterLabs/sbd/pull/27
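For reference, the notification-based alternative would look roughly like
this; a minimal sketch assuming the corosync 2.x quorum API (quorum.h),
with most error handling omitted, not the actual sbd or pacemaker code:

#include <stdio.h>
#include <corosync/corotypes.h>
#include <corosync/quorum.h>

/* Called by libquorum whenever quorum state or membership changes, so
 * there is no need to poll quorum_getquorate() on a timer. */
static void quorum_notify(quorum_handle_t handle, uint32_t quorate,
                          uint64_t ring_seq, uint32_t view_list_entries,
                          uint32_t *view_list)
{
        (void)handle; (void)view_list;
        printf("quorate=%u ring_seq=%llu members=%u\n",
               quorate, (unsigned long long)ring_seq, view_list_entries);
}

int main(void)
{
        quorum_handle_t handle;
        quorum_callbacks_t callbacks = { .quorum_notify_fn = quorum_notify };
        uint32_t quorum_type;

        if (quorum_initialize(&handle, &callbacks, &quorum_type) != CS_OK)
                return 1;
        quorum_trackstart(handle, CS_TRACK_CHANGES);
        /* A real daemon would add the fd from quorum_fd_get() to its own
         * poll loop instead of blocking here. */
        quorum_dispatch(handle, CS_DISPATCH_BLOCKING);
        quorum_finalize(handle);
        return 0;
}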

We have also added our own servant for watching the health of our
control plane, but that is not relevant to this bug (it reproduces with
that watcher turned off too).


I was also able to get a corosync blackbox from one of the stuck VMs
that showed something interesting:
https://clbin.com/d76Ha

It is looping on:
debug   Feb 19 16:37:24 mcast_sendmsg(408):12: sendmsg(mcast) failed
(non-critical): Resource temporarily unavailable (11)

Hmm ... something like tx-queue of the device full, or no buffers
available anymore and kernel-thread doing the cleanup isn't
scheduled ...

Yes that is very plausible. Perhaps it'd be nicer if corosync went back
to the epoll_wait loop when it gets too many EAGAINs from sendmsg.

But this is exactly what happens. Corosync calls sendmsg for all
active udpu members and then returns to the main loop -> epoll_wait.
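
Roughly speaking, the send path treats EAGAIN as non-critical and simply
falls back to the poll loop; a simplified sketch of the pattern (not the
actual totemudpu code):

#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Simplified: on EAGAIN/EWOULDBLOCK the message is not retried here;
 * control returns to the caller and eventually to the main epoll/poll
 * loop, which is why the log line above is only a debug-level
 * "non-critical" message and totem's own retransmission handles the
 * dropped message. */
static int send_to_member(int fd, const struct msghdr *msg)
{
        ssize_t res = sendmsg(fd, msg, MSG_NOSIGNAL | MSG_DONTWAIT);

        if (res < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
                return 0;       /* tx queue full: drop and carry on */

        return (res < 0) ? -1 : 0;
}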

(although this seems different from the original bug where it got stuck
in epoll_wait)

I'm pretty sure it is.

Anyway, let's try the "sched_yield" idea. Could you please try the included
patch and see if it makes any difference (only for udpu)?
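
(The attached patch isn't reproduced in this excerpt; the general idea is
along these lines, purely as an illustration:)

#include <errno.h>
#include <sched.h>

/* Illustration only, not the actual patch: after a sendmsg() burst that
 * hit EAGAIN, voluntarily give up the CPU so the SCHED_RR corosync
 * process does not starve the kernel threads that would free tx buffers. */
static void maybe_yield(int send_errno)
{
        if (send_errno == EAGAIN || send_errno == EWOULDBLOCK)
                sched_yield();
}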

Thanks for the patch; unfortunately corosync still spins at 106% CPU even
with the yield:
https://clbin.com/CF64x

Yep, it was kind of expected, but at least worth a try. How does strace
look when this happens?


strace for the situation described below (corosync 95%, 1 vCPU):
https://clbin.com/hZL5z

This one again looks quite "normal". If there is a problem (95% corosync CPU usage is a problem), it is (or at least looks like) a different one than in the first report (which was full of epoll_wait immediately returning some fd as ready).


Also, Klaus had an idea to remove sbd from the picture and try a
different RR process to find out what happens. I think that's again
worth a try.

Could you please install/enable/start
https://github.com/jfriesse/spausedd (packages built by copr are at
https://copr.fedorainfracloud.org/coprs/honzaf/spausedd/),
disable/remove sbd, and run your test?


Tried this with 4 vCPUs but it didn't detect any pauses (kind of
expected: there are plenty of other CPUs for spausedd to be scheduled on
even if corosync hogs one).
Then tried with 1 vCPU and it didn't detect any pauses here either. The
95% max realtime runtime protection kicks in and limits corosync to 95%
CPU, since with only one CPU the global 95% is the same as this CPU's 95%.
Interestingly, corosync now stays running at 95% CPU usage, unlike in the
SMP scenarios where the 95% limit was enough to avoid the situation.
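
(For reference, that 95% ceiling matches the kernel's default realtime
throttling sysctls; with a single CPU it effectively caps a runaway
SCHED_RR task at 95% of that CPU:)

# stock kernel defaults: RT tasks may use at most 0.95 s of every 1 s period
kernel.sched_rt_period_us = 1000000
kernel.sched_rt_runtime_us = 950000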

Kind of expected. The reason for this test was to get sbd out of the picture - I'm not convinced we were successful, because it looks like we are solving a different problem than before. spausedd is (mainly) for detecting an overloaded VM host.
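
The idea behind it is basically just: sleep for a fixed interval and
complain when the wall-clock gap is much larger than requested. A rough
sketch of that idea (not the actual spausedd code; the interval and
warning threshold below are just illustrative values):

#include <stdio.h>
#include <time.h>

int main(void)
{
        const long interval_ms = 200, warn_ms = 400;
        struct timespec req = { 0, interval_ms * 1000000L };
        struct timespec before, after;

        for (;;) {
                clock_gettime(CLOCK_MONOTONIC, &before);
                nanosleep(&req, NULL);
                clock_gettime(CLOCK_MONOTONIC, &after);

                /* If we were off the CPU (or the whole VM was paused) for
                 * noticeably longer than the requested sleep, report it. */
                long elapsed_ms = (after.tv_sec - before.tv_sec) * 1000L +
                                  (after.tv_nsec - before.tv_nsec) / 1000000L;
                if (elapsed_ms > warn_ms)
                        fprintf(stderr, "not scheduled for %ld ms\n",
                                elapsed_ms);
        }
}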

There is a visible keyboard lag over SSH, but spausedd does get scheduled:

 4923 root      rt   0  213344 111036  84732 R 93.3  2.8  19:07.50 corosync
    9 root      20   0       0      0      0 S  0.3  0.0   0:02.02 ksoftirqd/0
 4256 root      rt   0   88804  35612  34368 R  0.3  0.9   0:04.36 spausedd

No SBD running, and corosync-blackbox does not work at all now.


That's bad

ifconfig eth0; sleep 1; ifconfig eth0
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
         inet 10.62.98.26  netmask 255.255.240.0  broadcast 10.62.111.255
         inet6 fd06:7768:b9e5:8c50:4c0e:daff:fe14:55f0  prefixlen 64
scopeid 0x0<global>
         inet6 fe80::4c0e:daff:fe14:55f0  prefixlen 64  scopeid 0x20<link>
         ether 4e:0e:da:14:55:f0  txqueuelen 1000  (Ethernet)
         RX packets 4929031  bytes 4644924223 (4.3 GiB)
         RX errors 0  dropped 0  overruns 0  frame 0
         TX packets 15445220  bytes 22485760076 (20.9 GiB)
         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
         inet 10.62.98.26  netmask 255.255.240.0  broadcast 10.62.111.255
         inet6 fd06:7768:b9e5:8c50:4c0e:daff:fe14:55f0  prefixlen 64
scopeid 0x0<global>
         inet6 fe80::4c0e:daff:fe14:55f0  prefixlen 64  scopeid 0x20<link>
         ether 4e:0e:da:14:55:f0  txqueuelen 1000  (Ethernet)
         RX packets 4934042  bytes 4649642415 (4.3 GiB)
         RX errors 0  dropped 0  overruns 0  frame 0
         TX packets 15464987  bytes 22514562910 (20.9 GiB)
         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

Looking at this from the hypervisor side, the VM is not sending many
packets once CPU usage goes up:
https://pasteboard.co/I246a1W.png


On another host corosync failed to start up completely (Denied
connection not ready), and:
https://clbin.com/Z35Gl
(I don't think this is related to the patch; it was already doing that
when I looked at it this morning, kernel 4.20.0 this time)

This one looks kind of normal and I'm pretty sure it's unrelated (I've
seen it before; sadly I was never able to find a "reliable" reproducer)

Got a fairly reliable one now: only 3/16 VMs were able to start
corosync; one got stuck with high CPU usage as shown above, and the others
failed to start with 'Denied connection'.
In the past we've seen this kind of problem when dhclient happened to
bind to and listen on corosync's port, but that's been fixed in recent
CentOS 7.x releases and is not the case now.

Anything you'd like me to try to help debug this?

Just to summarize how I understand the current situation:
- The behavior in the initial report was very different from what you see now.
- It looks like the biggest problem currently is the inability to form a membership.

For now, I would suggest trying:
- corosync without the sched_yield patch
- setting send_join in corosync.conf to 200 (see the snippet below); it was reported that this can help when forming bigger clusters.
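
For reference, that would look roughly like this in corosync.conf (totem
section; the value is in milliseconds):

totem {
        # keep your existing totem options; only send_join is added
        send_join: 200
}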


Honza


Thanks,
--Edwin


_______________________________________________
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
