On Wed, 09 Aug 2023, Harry G Coin wrote:
On 8/9/23 01:00, Alexander Bokovoy wrote:
On Tue, 08 Aug 2023, Harry G Coin wrote:
Thanks for your help. Details below. The problem 'moved' in, I
hope, a diagnostically useful way, but the system remains broken.
On 8/8/23 08:54, Alexander Bokovoy wrote:
On Tue, 08 Aug 2023, Harry G Coin wrote:
On 8/8/23 02:43, Alexander Bokovoy wrote:
pstack $(pgrep ns-slapd) > ns-slapd.log
Tried an upgrade from 4.9.10 to 4.9.11, the "writeback to ldap
failed" error moved from the primary instance (on which the
dns records were being added) to the replica which hung in the
same fashion. Here's the log you asked for, from attempting
'systemctl restart dirsrv@...': it just hangs at 100% cpu for
about 10 minutes.
Thank you. Are you using schema compat for some legacy clients?
This is a fresh install of 4.9.10 about a week ago, upgraded to
4.9.11 yesterday, just two freeipa instances and no appreciable
user load, using the install defaults. The 'in house' system then
starts loading lots of dns records via the python ldap2 interface
on the first of two systems installed, the replica produced what
you see in this post. There is no 'private' information involved
of any sort, it's supposed to field DNS calls from the public but
was so unreliable I had to implement unbound on other servers, so
all freeipa does is IXFR to unbound for the heavy load. I suppose
there may be <16 other in-house lab systems, maybe 2 or 3 with any
activity, that use it for dns. The only other clue is these are
running on VMs in older servers and have no other software
packages installed other than freeipa and what freeipa needs to
run, and the in-house program that loads the dns.
Just to exclude potential problems with schema compat, it can be
disabled if you are not using it.
How? The installs just use all the defaults, other than enabling
dnssec and PTR records for all a/aaaa.
I'm officially in 'desperation mode' as not being able to populate DNS
in freeipa reduces everyone to pencil and paper and coffee with full
project stoppage until it's fixed or at least 'worked around'. So
anything that 'might help' can be sacrificed so at least 'something'
works 'somewhat'. If old AD needs to be 'broken' or 'off' but mostly
the rest of it 'works sort of' then how do I do it?
Really this can't be hard to reproduce, it's just two instances with a
1G link between them, each with a pair of old rusty hard drives in an
lvm mirror using a COW file system, dnssec on, and one of them loading
lots of dns with reverse pointers for each A/AAAA with maybe 200 to
600 PTR records per *arpa and maybe 10-200 records per subdomain,
maybe 200 domains total. A couple python for loops and hey presto
you'll see freeipa lock up without notice in your lab as well. I just
can't imagine that it should be difficult to make these race
conditions appear when the only important load is DNS
adds/finds/shows.
I appreciate the help, and have become officially fearful about
freeipa. Maybe it's seldom used extensively for DNS and so my use
case is an outlier? Why are so few seeing this? It's a fully
default package install, no custom changes to the OS, freeipa, other
packages. I don't get it.
We don't see these problems in our tests, maybe we are doing something
different.
As I said, disabling the compat tree should help to avoid any potential
issues related to that plugin. See the ipa-compat-manage and
ipa-nis-manage commands; each switches one of the two slapi-nis plugins
on or off.
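A minimal sketch of the sequence those two commands suggest, written in
Python only so it can be dry-run first. The 'disable' subcommand is the
documented way to switch each plugin off; the instance name EXAMPLE-TEST is
hypothetical, and the final directory server restart is an assumption about
what is needed for the change to take effect. Both tools normally prompt for
the Directory Manager password.

```python
import subprocess


def compat_disable_commands(instance):
    """Return the commands that would switch off both slapi-nis plugins.

    `instance` is the dirsrv instance name (hypothetical here).
    """
    return [
        ["ipa-compat-manage", "disable"],
        ["ipa-nis-manage", "disable"],
        # Assumption: ns-slapd must be restarted for the change to apply.
        ["systemctl", "restart", f"dirsrv@{instance}.service"],
    ]


def run_all(commands, dry_run=True):
    """Print the commands, or execute them in order when dry_run is False."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops at the first command that fails.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Dry run: only prints what would be executed.
    run_all(compat_disable_commands("EXAMPLE-TEST"))
```

With dry_run left at its default nothing is changed on the system, so the
list can be reviewed before running it as root on each server.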
Thanks for any leads or help!
I don't think it is about named per se, it is a bit of an unfortunate
interop inside ns-slapd between different plugins. bind-dyndb-ldap
relies on the syncrepl extension, whose implementation in ns-slapd
uses the retro changelog content. The retro changelog plugin triggers some
updates that cause the schema compatibility plugin to lock itself up,
depending on the order of updates that the retro changelog captures. We
fixed that in the slapi-nis package some time ago and it *should* be
ignoring the retro changelog changes, but somehow they still propagate
into it. There are a few places in ns-slapd which were addressed just
recently and those updates might help (out later this year in RHEL).
Disabling schema compat would be the best.
What's worse, every reboot attempt waits the full '9 min 29 secs'
before systemd forcibly terminates ns-slapd to finish the 'stop
job'.
That's why I'm so troubled by all this, it's not like there is any
interference from anything other than what freeipa puts out there,
and it just locks with a message that gives no indication of what
to do about it, with nothing in any logs and 'systemctl
is-system-running' reports 'running'.
You could easily replicate this: imagine a simple validation test
that sets up two freeipa nodes, turns on dnssec, creates some
domains, then adds A AAAA and *.arpa records using the ldap2 api
on one of the nodes. Maybe limit the net speed between the nodes
to a typical 1G link, with maybe at most 4 processor cores of some
older vintage and 5GB memory. It takes less than 2 minutes after
dns load start to lock up.
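As a rough illustration of the data such a validation test would feed in,
here is a sketch that generates host names, addresses and the matching
*.arpa owner names using only the Python standard library. The zone name,
prefix and 'hostN' naming are hypothetical, and the actual add calls through
the ldap2 API are deliberately left out, since their exact form is not shown
in this thread.

```python
import ipaddress


def ptr_name(addr):
    """Return the fully qualified reverse (PTR) owner name for an address."""
    return ipaddress.ip_address(addr).reverse_pointer + "."


def make_records(zone, prefix, count):
    """Yield (fqdn, address, ptr_owner_name) for the first `count` hosts.

    Works for both IPv4 (A / in-addr.arpa) and IPv6 (AAAA / ip6.arpa).
    """
    hosts = ipaddress.ip_network(prefix).hosts()
    for i, ip in zip(range(count), hosts):
        yield f"host{i}.{zone}.", str(ip), ptr_name(str(ip))


if __name__ == "__main__":
    # Hypothetical zone and documentation prefixes, for illustration only.
    for rec in make_records("example.test", "192.0.2.0/24", 3):
        print(rec)
    for rec in make_records("example.test", "2001:db8::/64", 3):
        print(rec)
```

A load generator would loop over tuples like these for a couple hundred
zones and add the forward and reverse records on one node.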
What's really odd is bind9 / named keeps blasting out change
notifications for some of the updated domains, then a few lines
later, with no intervening activity in any log or by any program
affecting the zone, will publish further change notifications with
a new serial number for the same zone. This happens for all the
zones that get modifications. I'm thinking 'rr' computations? I
wonder if those entries-- being auto-generated internally -- are
creating a 'flow control' issue between the primary and replica.
This is something that retro changelog is responsible for as it is the
data store used by the syncrepl protocol implementation. If these
'changes' appear again and again, it means retro changelog plugin marks
them as new for this particular syncrepl client (bind-dyndb-ldap).
All threads other than thread 30 are normal (idle) threads, but this
one blocks the database backend in the log flush sequence while
writing the retro changelog entry for this updated DNS record:
Thread 30 (Thread 0x7f0e583ff700 (LWP 1438)):
#0 0x00007f0e9bf7d8af in fdatasync () at target:/lib64/libc.so.6
#1 0x00007f0e91cbe6b5 in __os_fsync () at target:/lib64/libdb-5.3.so
#2 0x00007f0e91ca598c in __log_flush_int () at
target:/lib64/libdb-5.3.so
#3 0x00007f0e91ca7dd0 in __log_flush () at target:/lib64/libdb-5.3.so
#4 0x00007f0e91ca7f73 in __log_flush_pp () at
target:/lib64/libdb-5.3.so
#5 0x00007f0e8afe1304 in bdb_txn_commit (li=<optimized out>,
txn=0x7f0e583fd028, use_lock=1) at
ldap/servers/slapd/back-ldbm/db-bdb/bdb_layer.c:2772
#6 0x00007f0e8af95515 in dblayer_txn_commit (be=0x7f0e88424f00,
txn=<optimized out>) at ldap/servers/slapd/back-ldbm/dblayer.c:736
#7 0x00007f0e8afa7ebe in ldbm_back_add (pb=0x7f0e85748860) at
ldap/servers/slapd/back-ldbm/ldbm_add.c:1242
#8 0x00007f0e9d7d7728 in op_shared_add
(pb=pb@entry=0x7f0e85748860) at ldap/servers/slapd/add.c:692
#9 0x00007f0e9d7d7bbe in add_internal_pb
(pb=pb@entry=0x7f0e85748860) at ldap/servers/slapd/add.c:407
#10 0x00007f0e9d7d8975 in slapi_add_internal_pb
(pb=pb@entry=0x7f0e85748860) at ldap/servers/slapd/add.c:331
#11 0x00007f0e8960f8bf in write_replog_db (newsuperior=0x0,
modrdn_mods=0x0, newrdn=0x0, post_entry=<optimized out>,
log_e=0x7f0e4df5b9c0, curtime=1691511446, flag=0,
log_m=0x7f0e5ec0d440, dn=0x7f0e57a40740 "idnsname=8.0.f.0.0.0.0.0.0.0.0.1.0.0.c.f.ip6.arpa.,cn=dns,dc=1,dc=quietfountain,dc=com",
optype=<optimized out>, pb=0x7f0e66a09580) at
ldap/servers/plugins/retrocl/retrocl_po.c:369
#12 0x00007f0e8960f8bf in retrocl_postob (pb=0x7f0e66a09580,
optype=<optimized out>) at
ldap/servers/plugins/retrocl/retrocl_po.c:697
#13 0x00007f0e9d83cc79 in plugin_call_func (list=0x7f0e924aae00,
operation=operation@entry=561, pb=pb@entry=0x7f0e66a09580,
call_one=call_one@entry=0) at ldap/servers/slapd/plugin.c:2032
#14 0x00007f0e9d83cec4 in plugin_call_list (pb=0x7f0e66a09580,
operation=561, list=<optimized out>) at
ldap/servers/slapd/plugin.c:1973
#15 0x00007f0e9d83cec4 in plugin_call_plugins
(pb=pb@entry=0x7f0e66a09580,
whichfunction=whichfunction@entry=561) at
ldap/servers/slapd/plugin.c:442
#16 0x00007f0e8afc3658 in ldbm_back_modify (pb=<optimized out>) at
ldap/servers/slapd/back-ldbm/ldbm_modify.c:1002
#17 0x00007f0e9d828300 in op_shared_modify
(pb=pb@entry=0x7f0e66a09580, pw_change=pw_change@entry=0,
old_pw=0x0) at ldap/servers/slapd/modify.c:1025
#18 0x00007f0e9d829a00 in do_modify (pb=pb@entry=0x7f0e66a09580)
at ldap/servers/slapd/modify.c:380
#19 0x0000564ed703475b in connection_dispatch_operation
(pb=0x7f0e66a09580, op=<optimized out>, conn=<optimized out>) at
ldap/servers/slapd/connection.c:651
#20 0x0000564ed703475b in connection_threadmain (arg=<optimized
out>) at ldap/servers/slapd/connection.c:1803
#21 0x00007f0e9a24b968 in _pt_root () at target:/lib64/libnspr4.so
#22 0x00007f0e99be61ca in start_thread () at
target:/lib64/libpthread.so.0
#23 0x00007f0e9be90e73 in clone () at target:/lib64/libc.so.6
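For anyone sifting through a full pstack dump like this one, a small sketch
of how the interesting thread can be picked out automatically. It assumes
the layout seen above ("Thread N (...)" headers followed by "#n 0x... in
func" frames); the function name threads_with_frame is made up for this
example. Continuation lines produced by mail wrapping are simply skipped.

```python
import re

THREAD_RE = re.compile(r"^Thread (\d+) \(")
# Frame lines look like "#5 0x00007f... in bdb_txn_commit (...) at file.c:1"
FRAME_RE = re.compile(r"^#(\d+)\s+(?:0x[0-9a-f]+ in )?(\S+)")


def threads_with_frame(pstack_text, func):
    """Return {thread_number: [function names]} for stacks containing func."""
    stacks = {}
    current = None
    for raw in pstack_text.splitlines():
        line = raw.strip()
        m = THREAD_RE.match(line)
        if m:
            current = int(m.group(1))
            stacks[current] = []
            continue
        m = FRAME_RE.match(line)
        if m and current is not None:
            stacks[current].append(m.group(2))
    return {t: frames for t, frames in stacks.items() if func in frames}


if __name__ == "__main__":
    # Assumes the dump was saved to ns-slapd.log in the current directory.
    with open("ns-slapd.log") as fh:
        for thread, frames in threads_with_frame(fh.read(), "fdatasync").items():
            print(thread, " <- ".join(frames[:6]))
```

Searching for fdatasync or write_replog_db narrows a dump with dozens of
idle threads down to the ones that are inside a database commit.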
Mark, Thierry, any hints here? (For full trace see thread
https://lists.fedorahosted.org/archives/list/freeipa-users@lists.fedorahosted.org/thread/TMRXHCORFU3QRQL6FSZTS4OIHYOAVXWF/)
--
/ Alexander Bokovoy
Sr. Principal Software Engineer
Security / Identity Management Engineering
Red Hat Limited, Finland