Hi Sylvain,

I believe we had a similar issue in our configuration. I can dig in more tomorrow, but we had deadlocks with the retroCL plugin.
If you follow the steps outlined on this page,
https://directory.fedoraproject.org/docs/389ds/FAQ/faq.html#debug_hangs
to get a stack trace, I can try to see if you're hitting the same thing. See
https://bugzilla.redhat.com/show_bug.cgi?id=1751295 for some more details on
the issue.

Hope that helps,
Jared

On Sun, Oct 20, 2019, at 3:55 PM, Sylvain Coutant via FreeIPA-users wrote:
> Hello gurus,
>
> We have been running a 3-node FreeIPA cluster for some time without major
> trouble. One server may stall from time to time, but restarting it has never
> been a real problem.
>
> A few days ago, we had to migrate the VMs between two clouds (disk image
> copied from one to the other). They have been renumbered from the old to the
> new IPv4 address space. Not that easy, but we finally got it done with all
> DNS entries in sync. Yet, since the migration, the ns-slapd process hangs
> randomly way more often than before (it went from once every few months to
> several times a day) and is especially hard to restart on any node.
>
> While starting up, the netstat output is like:
>
> Active Internet connections (w/o servers)
> Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
> tcp6 184527 0 10.217.151.3:389 10.217.151.2:52314 ESTABLISHED 29948/ns-slapd
>
> Netstat and tcpdump show it processes the recvq very slowly (sometimes like
> 79 bytes per 1-2 seconds). At some point it just stops processing it and
> hangs (only kill -9 works to take it down). When stalled, strace shows the
> process loops only on:
>
> getpeername(8, 0x7ffe62c49fd0, 0x7ffe62c49f94) = -1 ENOTCONN (Transport endpoint is not connected)
> poll([{fd=50, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN},
> {fd=9, events=POLLIN}, {fd=117, events=POLLIN}, {fd=116, events=POLLIN},
> {fd=115, events=POLLIN}, {fd=114, events=POLLIN}, {fd=89, events=POLLIN},
> {fd=85, events=POLLIN}, {fd=83, events=POLLIN}, {fd=82, events=POLLIN},
> {fd=81, events=POLLIN}, {fd=80, events=POLLIN}, {fd=79, events=POLLIN},
> {fd=78, events=POLLIN}, {fd=77, events=POLLIN}, {fd=76, events=POLLIN},
> {fd=67, events=POLLIN}, {fd=72, events=POLLIN}, {fd=69, events=POLLIN},
> {fd=64, events=POLLIN}, {fd=66, events=POLLIN}], 23, 250) = 0 (Timeout)
>
> If it can get through startup replication, one of the servers will hang a
> little bit later, freezing the whole cluster and forcing us to restart the
> faulty node to unlock things.
>
> When stalled, the dirsrv access log only contains entries like:
> [20/Oct/2019:17:52:46.950029525 +0100] conn=86 fd=131 slot=131 connection from 10.217.151.4 to 10.217.151.4
> [20/Oct/2019:17:52:51.280412883 +0100] conn=87 fd=132 slot=132 SSL connection from 10.217.151.10 to 10.217.151.4
> [20/Oct/2019:17:52:54.956204031 +0100] conn=88 fd=133 slot=133 connection from 10.217.151.4 to 10.217.151.4
> [20/Oct/2019:17:53:04.966542441 +0100] conn=89 fd=134 slot=134 connection from 10.217.151.2 to 10.217.151.4
> [20/Oct/2019:17:53:22.659053020 +0100] conn=90 fd=135 slot=135 SSL connection from 10.217.151.10 to 10.217.151.4
> [20/Oct/2019:17:53:51.006707605 +0100] conn=91 fd=136 slot=136 connection from 10.217.151.4 to 10.217.151.4
> [20/Oct/2019:17:53:54.514162543 +0100] conn=92 fd=137 slot=137 SSL connection from 10.217.151.10 to 10.217.151.4
> [20/Oct/2019:17:53:59.011602776 +0100] conn=93 fd=138 slot=138 connection from 10.217.151.3 to 10.217.151.4
> [20/Oct/2019:17:54:09.019296900 +0100] conn=94 fd=139 slot=139 connection from 10.217.151.4 to 10.217.151.4
>
> And netstat lists tens of accepted network connections that are stalled, like:
> tcp6 286 0 10.217.151.4:389 10.217.151.10:32512 ESTABLISHED 29948/ns-slapd
>
> The underlying network seems clean and uses jumbo frames. tcpdump and ping
> show 0 packet loss and no retransmits. Being afraid it could be a jumbo frame
> issue, we even forced the MTU down to 1500, without success.
>
> Entropy seems fine as well:
> # cat /proc/sys/kernel/random/entropy_avail
> 3138
>
> Versions running on all servers:
> ipa-client-4.6.5-11.el7.centos.x86_64
> ipa-client-common-4.6.5-11.el7.centos.noarch
> ipa-common-4.6.5-11.el7.centos.noarch
> ipa-server-4.6.5-11.el7.centos.x86_64
> ipa-server-common-4.6.5-11.el7.centos.noarch
> ipa-server-dns-4.6.5-11.el7.centos.noarch
>
> I'd happily listen to any hint regarding this critical problem.
>
> /Sylvain.
> _______________________________________________
> FreeIPA-users mailing list -- [email protected]
> To unsubscribe send an email to [email protected]
> Fedora Code of Conduct:
> https://docs.fedoraproject.org/en-US/project/code-of-conduct/
> List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
> List Archives:
> https://lists.fedorahosted.org/archives/list/[email protected]
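
P.S. Since the hang shows up as ns-slapd sockets with a growing Recv-Q, it may help to know the moment it starts so the stack trace can be taken right away. Below is a rough Python sketch that picks such sockets out of netstat text. It is only an illustration: the function name `backlogged_sockets`, the `Backlog` tuple and the `threshold` parameter are made up for this example, and it assumes the seven-column layout shown in the netstat output quoted above (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State, PID/Program name).

```python
from typing import List, NamedTuple


class Backlog(NamedTuple):
    """One ESTABLISHED socket with unread bytes (hypothetical helper type)."""
    recv_q: int
    local: str
    peer: str


def backlogged_sockets(netstat_output: str,
                       program: str = "ns-slapd",
                       threshold: int = 1) -> List[Backlog]:
    """Return ESTABLISHED TCP sockets of `program` with Recv-Q >= threshold.

    Assumes the column layout quoted in this thread:
    Proto Recv-Q Send-Q Local Foreign State PID/Program
    """
    hits = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip banners, headers and anything that is not a TCP row.
        if len(fields) < 7 or not fields[0].startswith("tcp"):
            continue
        if not fields[1].isdigit():
            continue
        if fields[5] != "ESTABLISHED":
            continue
        # The last column looks like "29948/ns-slapd".
        if not fields[6].endswith("/" + program):
            continue
        recv_q = int(fields[1])
        if recv_q >= threshold:
            hits.append(Backlog(recv_q, fields[3], fields[4]))
    # Largest backlog first.
    return sorted(hits, key=lambda b: b.recv_q, reverse=True)
```

One way to use it would be to feed it the text captured from something like `netstat -tnp` (run as root so the program column is filled in) every few seconds, and grab the gdb stack trace as soon as it returns a non-empty list. The exact netstat flags and the polling interval are assumptions on my side, so adjust them to your setup.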
