Hi Sylvain,

I believe we had a similar issue in our configuration. I can dig in more 
tomorrow, but we had deadlocks with the retroCL plugin. 

If you follow the steps outlined at 
https://directory.fedoraproject.org/docs/389ds/FAQ/faq.html#debug_hangs to get 
a stack trace, I can try to see whether you're hitting the same thing. 

See https://bugzilla.redhat.com/show_bug.cgi?id=1751295 for more details on 
the issue. 

Hope that helps,
Jared

On Sun, Oct 20, 2019, at 3:55 PM, Sylvain Coutant via FreeIPA-users wrote:
> Hello gurus,
> 
> We have been running a three-node FreeIPA cluster for some time without 
> major trouble. One server may stall from time to time, and restarting it has 
> never been a real problem.
> 
> A few days ago, we had to migrate the VMs between two clouds (disk images 
> copied from one to the other). They were renumbered from the old to the new 
> IPv4 address space. Not that easy, but we finally got it done with all DNS 
> entries in sync. Yet, since the migration, the ns-slapd process hangs 
> randomly far more often than before (from once every few months to several 
> times a day) and is especially hard to restart on any node.
> 
> While starting up, the netstat output looks like:
> 
> Active Internet connections (w/o servers)
> Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
> tcp6 184527 0 10.217.151.3:389 10.217.151.2:52314 ESTABLISHED 29948/ns-slapd 
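
The Recv-Q backlog in that netstat line can be sampled over time with a short sketch. This is a hypothetical helper, not something from the thread; it just reads the second column of a netstat line like the one above:

```python
# Hypothetical helper: extract the Recv-Q byte count from a netstat line
# so the backlog can be sampled over time (e.g. in a watch loop).
def recv_q(netstat_line: str) -> int:
    # netstat fields: Proto, Recv-Q, Send-Q, Local, Foreign, State, PID/Program
    return int(netstat_line.split()[1])

line = "tcp6 184527 0 10.217.151.3:389 10.217.151.2:52314 ESTABLISHED 29948/ns-slapd"
print(recv_q(line))  # 184527
```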
> 
> Netstat and tcpdump show that it drains the Recv-Q very slowly (sometimes 
> around 79 bytes every 1-2 seconds). At some point it simply stops processing 
> it and hangs (only kill -9 can take it down). When stalled, strace shows the 
> process looping only on:
> 
> getpeername(8, 0x7ffe62c49fd0, 0x7ffe62c49f94) = -1 ENOTCONN (Transport 
> endpoint is not connected)
> poll([{fd=50, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, 
> {fd=9, events=POLLIN}, {fd=117, events=POLLIN}, {fd=116, events=POLLIN}, 
> {fd=115, events=POLLIN}, {fd=114, events=POLLIN}, {fd=89, events=POLLIN}, 
> {fd=85, events=POLLIN}, {fd=83, events=POLLIN}, {fd=82, events=POLLIN}, 
> {fd=81, events=POLLIN}, {fd=80, events=POLLIN}, {fd=79, events=POLLIN}, 
> {fd=78, events=POLLIN}, {fd=77, events=POLLIN}, {fd=76, events=POLLIN}, 
> {fd=67, events=POLLIN}, {fd=72, events=POLLIN}, {fd=69, events=POLLIN}, 
> {fd=64, events=POLLIN}, {fd=66, events=POLLIN}], 23, 250) = 0 (Timeout)
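
To see which sockets that looping thread is still watching while the queue sits unread, the descriptors can be pulled out of the strace poll(2) line. A minimal sketch, not from the thread:

```python
import re

# Hypothetical helper: list the file descriptors named in an strace
# poll(2) line, to see which sockets the polling thread still watches.
def polled_fds(strace_line: str) -> list[int]:
    return [int(m) for m in re.findall(r"fd=(\d+)", strace_line)]

line = "poll([{fd=50, events=POLLIN}, {fd=7, events=POLLIN}], 2, 250) = 0 (Timeout)"
print(polled_fds(line))  # [50, 7]
```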
> 
> If it does get through startup replication, one of the servers will hang a 
> little later, freezing the whole cluster and forcing us to restart the 
> faulty node to unlock things.
> 
> When stalled, the dirsrv access log contains only entries like:
> [20/Oct/2019:17:52:46.950029525 +0100] conn=86 fd=131 slot=131 connection 
> from 10.217.151.4 to 10.217.151.4
> [20/Oct/2019:17:52:51.280412883 +0100] conn=87 fd=132 slot=132 SSL connection 
> from 10.217.151.10 to 10.217.151.4
> [20/Oct/2019:17:52:54.956204031 +0100] conn=88 fd=133 slot=133 connection 
> from 10.217.151.4 to 10.217.151.4
> [20/Oct/2019:17:53:04.966542441 +0100] conn=89 fd=134 slot=134 connection 
> from 10.217.151.2 to 10.217.151.4
> [20/Oct/2019:17:53:22.659053020 +0100] conn=90 fd=135 slot=135 SSL connection 
> from 10.217.151.10 to 10.217.151.4
> [20/Oct/2019:17:53:51.006707605 +0100] conn=91 fd=136 slot=136 connection 
> from 10.217.151.4 to 10.217.151.4
> [20/Oct/2019:17:53:54.514162543 +0100] conn=92 fd=137 slot=137 SSL connection 
> from 10.217.151.10 to 10.217.151.4
> [20/Oct/2019:17:53:59.011602776 +0100] conn=93 fd=138 slot=138 connection 
> from 10.217.151.3 to 10.217.151.4
> [20/Oct/2019:17:54:09.019296900 +0100] conn=94 fd=139 slot=139 connection 
> from 10.217.151.4 to 10.217.151.4
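
To spot which peer is piling up stalled connections, the source addresses in those access-log lines can be tallied. A minimal sketch under the assumption that the lines keep the `connection from X to Y` shape shown above:

```python
import re
from collections import Counter

# Hypothetical helper: count new connections per source address in
# 389-ds access-log lines, to see which peer accumulates connections.
CONN_RE = re.compile(r"(?:SSL )?connection from (\S+) to")

def count_sources(lines):
    return Counter(m.group(1) for line in lines if (m := CONN_RE.search(line)))

logs = [
    "[...] conn=86 fd=131 slot=131 connection from 10.217.151.4 to 10.217.151.4",
    "[...] conn=87 fd=132 slot=132 SSL connection from 10.217.151.10 to 10.217.151.4",
    "[...] conn=88 fd=133 slot=133 connection from 10.217.151.4 to 10.217.151.4",
]
print(count_sources(logs))  # Counter({'10.217.151.4': 2, '10.217.151.10': 1})
```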
> 
> And netstat lists dozens of accepted network connections that are stale, like:
> tcp6 286 0 10.217.151.4:389 10.217.151.10:32512 ESTABLISHED 29948/ns-slapd 
> 
> 
> The underlying network seems clean and uses jumbo frames. tcpdump and ping 
> show zero packet loss and no retransmits. Fearing a jumbo-frame issue, we 
> even forced the MTU down to 1500, without success.
> 
> Entropy seems fine as well:
> # cat /proc/sys/kernel/random/entropy_avail 
> 3138
> 
> Versions running on all servers:
> ipa-client-4.6.5-11.el7.centos.x86_64
> ipa-client-common-4.6.5-11.el7.centos.noarch
> ipa-common-4.6.5-11.el7.centos.noarch
> ipa-server-4.6.5-11.el7.centos.x86_64
> ipa-server-common-4.6.5-11.el7.centos.noarch
> ipa-server-dns-4.6.5-11.el7.centos.noarch
> 
> 
> I'd welcome any hints regarding this critical problem.
> 
> /Sylvain.
> _______________________________________________
> FreeIPA-users mailing list -- [email protected]
> To unsubscribe send an email to [email protected]
> Fedora Code of Conduct: 
> https://docs.fedoraproject.org/en-US/project/code-of-conduct/
> List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
> List Archives: 
> https://lists.fedorahosted.org/archives/list/[email protected]
> 