On Wed, 09 Aug 2023, Harry G Coin wrote:

On 8/9/23 01:00, Alexander Bokovoy wrote:
On Tue, 08 Aug 2023, Harry G Coin wrote:
Thanks for your help.  Details below.  The problem 'moved' in, I hope, a diagnostically useful way, but the system remains broken.

On 8/8/23 08:54, Alexander Bokovoy wrote:
On Tue, 08 Aug 2023, Harry G Coin wrote:

On 8/8/23 02:43, Alexander Bokovoy wrote:
pstack $(pgrep ns-slapd) > ns-slapd.log
I tried an upgrade from 4.9.10 to 4.9.11; the "writeback to ldap failed" error moved from the primary instance (on which the DNS records were being added) to the replica, which hung in the same fashion.   Here's the log you asked for from attempting 'systemctl restart dirsrv@...': it just hangs at 100% CPU for about 10 minutes.

Thank you. Are you using schema compat for some legacy clients?


This is a fresh install of 4.9.10 about a week ago, upgraded to 4.9.11 yesterday, just two freeipa instances and no appreciable user load, using the install defaults.  The 'in house' system then starts loading lots of dns records via the python ldap2 interface on the first of two systems installed, the replica produced what you see in this post.   There is no 'private' information involved of any sort, it's supposed to field DNS calls from the public but was so unreliable I had to implement unbound on other servers, so all freeipa does is IXFR to unbound for the heavy load.  I suppose there may be <16 other in-house lab systems, maybe 2 or 3 with any activity, that use it for dns.   The only other clue is these are running on VMs in older servers and have no other software packages installed other than freeipa and what freeipa needs to run, and the in-house program that loads the dns.

Just to exclude potential problems with schema compat, it can be
disabled if you are not using it.

How?  The installs just use all the defaults, other than enabling dnssec and PTR records for all a/aaaa.

I'm officially in 'desperation mode' as not being able to populate DNS in freeipa reduces everyone to pencil and paper and coffee with full project stoppage until it's fixed or at least 'worked around'.   So anything that 'might help' can be sacrificed so at least 'something' works 'somewhat'.   If old AD needs to be 'broken' or 'off' but mostly the rest of it 'works sort of' then how do I do it?

Really this can't be hard to reproduce: it's just two instances with a 1G link between them, each with a pair of old rusty hard drives in an LVM mirror using a COW file system, dnssec on, and one of them loading lots of DNS records with reverse pointers for each A/AAAA, with maybe 200 to 600 PTR records per *arpa zone and maybe 10-200 records per subdomain, maybe 200 domains total.    A couple of Python for loops and hey presto, you'll see freeipa lock up without notice in your lab as well.  I can't imagine it is difficult to make these race conditions appear when the only significant load is DNS adds/finds/shows.

I appreciate the help, and have become officially fearful about freeipa.  Maybe it's seldom used extensively for DNS and so my use case is an outlier?   Why are so few seeing this?  It's a fully default package install, no custom changes to the OS, freeipa, other packages.   I don't get it.

We don't see these problems in our tests; maybe we are doing something
different.

As I said, disabling the compat tree should help to avoid any potential
issues related to that plugin. See the ipa-compat-manage and ipa-nis-manage
commands; they individually switch the two slapi-nis plugins on and off.
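
For reference, a minimal sketch of the disable step. It assumes both tools accept a `disable` argument (as their man pages describe), that the directory server needs a restart afterwards, and that NIS should be switched off first because the compat tool may refuse to disable while the NIS plugin is still enabled. The Python wrapper, its name, and the `dry_run` default are illustrative only; the commands can just as well be typed by hand as root on each IPA server.

```python
import shutil
import subprocess

# Tool names come from the message above. The "disable" argument and the
# follow-up "ipactl restart" are assumptions based on the tools' man pages.
DISABLE_COMMANDS = [
    ["ipa-nis-manage", "disable"],
    ["ipa-compat-manage", "disable"],
    ["ipactl", "restart"],
]


def disable_slapi_nis(dry_run=True):
    """Print the commands, or run them in order when dry_run is False."""
    for cmd in DISABLE_COMMANDS:
        if dry_run:
            print("would run:", " ".join(cmd))
            continue
        if shutil.which(cmd[0]) is None:
            raise FileNotFoundError(f"{cmd[0]} not found; run this on an IPA server")
        # Both manage tools may prompt for the Directory Manager password.
        subprocess.run(cmd, check=True)
    return DISABLE_COMMANDS


if __name__ == "__main__":
    disable_slapi_nis(dry_run=True)
```

With `dry_run=True` nothing is changed; the function only lists what it would execute.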



Thanks for any leads or help!



I don't think it is about named per se; it is a bit of an unfortunate
interop inside ns-slapd between different plugins. bind-dyndb-ldap
relies on the syncrepl extension, whose implementation in ns-slapd
uses the retro changelog content. The retro changelog plugin triggers some
updates that cause the schema compatibility plugin to lock itself up,
depending on the order of updates that the retro changelog captures. We
fixed that in the slapi-nis package some time ago and it *should* be
ignoring the retro changelog changes, but somehow they still propagate
into it. There are a few places in ns-slapd which were addressed just
recently and those updates might help (out later this year in RHEL).
Disabling schema compat would be the best.

What's worse, every reboot attempt waits the full '9 min 29 secs' before systemd forcibly terminates ns-slapd to finish the 'stop job'.

That's why I'm so troubled by all this, it's not like there is any interference from anything other than what freeipa puts out there, and it just locks with a message that gives no indication of what to do about it, with nothing in any logs and 'systemctl is-system-running' reports 'running'.

You could easily replicate this:  imagine a simple validation test that sets up two freeipa nodes, turns on dnssec, creates some domains, then adds A, AAAA and *.arpa records using the ldap2 api on one of the nodes.  Maybe limit the network between the nodes to a typical 1G link, with at most 4 processor cores of some older vintage and 5GB of memory.  It takes less than 2 minutes after the DNS load starts to lock up.
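
A minimal sketch of the record-generation half of such a test, using only the Python standard library. It is not the in-house loader: the function names and the example hostname are hypothetical, and the step that actually writes the records (through ldap2 or the ipa CLI) is deliberately left out.

```python
import ipaddress


def ptr_name(address):
    """Fully qualified reverse-lookup owner name for an IPv4 or IPv6 address."""
    return ipaddress.ip_address(address).reverse_pointer + "."


def forward_and_reverse(hostname, addresses):
    """Yield (owner, rtype, value): one A/AAAA and one PTR per address."""
    for address in addresses:
        ip = ipaddress.ip_address(address)
        rtype = "A" if ip.version == 4 else "AAAA"
        yield (hostname, rtype, str(ip))
        yield (ptr_name(address), "PTR", hostname)


if __name__ == "__main__":
    # Documentation-range addresses; the hostname is made up for illustration.
    for record in forward_and_reverse("host1.example.test.",
                                      ["192.0.2.5", "2001:db8::1"]):
        print(record)
```

Feeding a few hundred such tuples per zone into the loader would approximate the load described above.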

What's really odd is bind9 / named keeps blasting out change notifications for some of the updated domains, then a few lines later, with no intervening activity in any log or by any program affecting the zone, will publish further change notifications with a new serial number for the same zone.  This happens for all the zones that get modifications.  I'm thinking 'rr' computations?  I wonder if those entries-- being auto-generated internally -- are creating a 'flow control' issue between the primary and replica.

This is something that retro changelog is responsible for as it is the
data store used by the syncrepl protocol implementation. If these
'changes' appear again and again, it means retro changelog plugin marks
them as new for this particular syncrepl client (bind-dyndb-ldap).

All threads other than thread 30 are normal (idle) threads, but
this one blocks the database backend in the log flush sequence while
writing the retro changelog entry for this updated DNS record:

Thread 30 (Thread 0x7f0e583ff700 (LWP 1438)):
#0  0x00007f0e9bf7d8af in fdatasync () at target:/lib64/libc.so.6
#1  0x00007f0e91cbe6b5 in __os_fsync () at target:/lib64/libdb-5.3.so
#2  0x00007f0e91ca598c in __log_flush_int () at target:/lib64/libdb-5.3.so
#3  0x00007f0e91ca7dd0 in __log_flush () at target:/lib64/libdb-5.3.so
#4  0x00007f0e91ca7f73 in __log_flush_pp () at target:/lib64/libdb-5.3.so
#5  0x00007f0e8afe1304 in bdb_txn_commit (li=<optimized out>, txn=0x7f0e583fd028, use_lock=1) at ldap/servers/slapd/back-ldbm/db-bdb/bdb_layer.c:2772
#6  0x00007f0e8af95515 in dblayer_txn_commit (be=0x7f0e88424f00, txn=<optimized out>) at ldap/servers/slapd/back-ldbm/dblayer.c:736
#7  0x00007f0e8afa7ebe in ldbm_back_add (pb=0x7f0e85748860) at ldap/servers/slapd/back-ldbm/ldbm_add.c:1242
#8  0x00007f0e9d7d7728 in op_shared_add (pb=pb@entry=0x7f0e85748860) at ldap/servers/slapd/add.c:692
#9  0x00007f0e9d7d7bbe in add_internal_pb (pb=pb@entry=0x7f0e85748860) at ldap/servers/slapd/add.c:407
#10 0x00007f0e9d7d8975 in slapi_add_internal_pb (pb=pb@entry=0x7f0e85748860) at ldap/servers/slapd/add.c:331
#11 0x00007f0e8960f8bf in write_replog_db (newsuperior=0x0, modrdn_mods=0x0, newrdn=0x0, post_entry=<optimized out>, log_e=0x7f0e4df5b9c0, curtime=1691511446, flag=0, log_m=0x7f0e5ec0d440, dn=0x7f0e57a40740 "idnsname=8.0.f.0.0.0.0.0.0.0.0.1.0.0.c.f.ip6.arpa.,cn=dns,dc=1,dc=quietfountain,dc=com", optype=<optimized out>, pb=0x7f0e66a09580) at ldap/servers/plugins/retrocl/retrocl_po.c:369
#12 0x00007f0e8960f8bf in retrocl_postob (pb=0x7f0e66a09580, optype=<optimized out>) at ldap/servers/plugins/retrocl/retrocl_po.c:697
#13 0x00007f0e9d83cc79 in plugin_call_func (list=0x7f0e924aae00, operation=operation@entry=561, pb=pb@entry=0x7f0e66a09580, call_one=call_one@entry=0) at ldap/servers/slapd/plugin.c:2032
#14 0x00007f0e9d83cec4 in plugin_call_list (pb=0x7f0e66a09580, operation=561, list=<optimized out>) at ldap/servers/slapd/plugin.c:1973
#15 0x00007f0e9d83cec4 in plugin_call_plugins (pb=pb@entry=0x7f0e66a09580, whichfunction=whichfunction@entry=561) at ldap/servers/slapd/plugin.c:442
#16 0x00007f0e8afc3658 in ldbm_back_modify (pb=<optimized out>) at ldap/servers/slapd/back-ldbm/ldbm_modify.c:1002
#17 0x00007f0e9d828300 in op_shared_modify (pb=pb@entry=0x7f0e66a09580, pw_change=pw_change@entry=0, old_pw=0x0) at ldap/servers/slapd/modify.c:1025
#18 0x00007f0e9d829a00 in do_modify (pb=pb@entry=0x7f0e66a09580) at ldap/servers/slapd/modify.c:380
#19 0x0000564ed703475b in connection_dispatch_operation (pb=0x7f0e66a09580, op=<optimized out>, conn=<optimized out>) at ldap/servers/slapd/connection.c:651
#20 0x0000564ed703475b in connection_threadmain (arg=<optimized out>) at ldap/servers/slapd/connection.c:1803
#21 0x00007f0e9a24b968 in _pt_root () at target:/lib64/libnspr4.so
#22 0x00007f0e99be61ca in start_thread () at target:/lib64/libpthread.so.0
#23 0x00007f0e9be90e73 in clone () at target:/lib64/libc.so.6
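
When a pstack dump holds dozens of threads, a small script can single out the ones sitting in the log flush. A minimal sketch, assuming the dump format shown above ("Thread N (...)" headers followed by "#n" frames); the marker function names are taken from this trace, while the helper name and the idle thread in the sample are hypothetical.

```python
import re

THREAD_HEADER = re.compile(r"^Thread (\d+) \(")


def threads_containing(dump, markers=("fdatasync", "write_replog_db")):
    """Return {thread_number: [markers found]} for threads that mention a marker."""
    hits = {}
    current = None
    for line in dump.splitlines():
        match = THREAD_HEADER.match(line)
        if match:
            current = int(match.group(1))
            continue
        if current is None:
            continue  # text before the first thread header
        for marker in markers:
            if marker in line and marker not in hits.setdefault(current, []):
                hits[current].append(marker)
    return hits


# Abbreviated sample: two frames from thread 30 above (arguments elided)
# plus one made-up idle thread for contrast.
SAMPLE = """\
Thread 30 (Thread 0x7f0e583ff700 (LWP 1438)):
#0  0x00007f0e9bf7d8af in fdatasync () at target:/lib64/libc.so.6
#11 0x00007f0e8960f8bf in write_replog_db (...) at ldap/servers/plugins/retrocl/retrocl_po.c:369
Thread 29 (hypothetical idle thread):
#0  0x0000000000000000 in pthread_cond_wait () at target:/lib64/libpthread.so.0
"""

if __name__ == "__main__":
    print(threads_containing(SAMPLE))
```

Run against a full dump (for example the file written by the pstack command earlier in this thread), it lists only the threads worth reading in detail.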

Mark, Thierry, any hints here? (For full trace see thread
https://lists.fedorahosted.org/archives/list/freeipa-users@lists.fedorahosted.org/thread/TMRXHCORFU3QRQL6FSZTS4OIHYOAVXWF/)






--
/ Alexander Bokovoy
Sr. Principal Software Engineer
Security / Identity Management Engineering
Red Hat Limited, Finland