Ok, another track (trying to compensate for not being able to use selfcheck).

Can you try sticking some file in the profile's syncfiles, then do:
nodeapply -F <node>

And see if any errors happen, either in the output or in the /var/log/confluent area.
________________________________
From: David Magda <dmagda+x...@ee.torontomu.ca>
Sent: Friday, January 26, 2024 2:01 PM
To: xCAT Users Mailing list <xcat-user@lists.sourceforge.net>
Subject: Re: [xcat-user] [External] Ansible and Confluent

We have Confluent installed on a RH/CentOS 7 system that originally had/has 
xCat installed for deployment of our Lenovo hardware/HPC solution. I just 
installed it there as it was/is our 'install server'. (We don't want to touch 
it too much, as it was a previous team of folks that set things up, and there's 
been a lot of team churn.)

I've attached the "hangtraces" to this message; hopefully the mailing list 
software will pass it along. I noticed “ipmi” in some of the paths, and for the 
record this is a VM running under Proxmox, and does not have any LOM configured:

"""
# nodeattrib dm-boot1
dm-boot1: crypted.selfapikey: ********
dm-boot1: deployment.apiarmed:
dm-boot1: deployment.pendingprofile: ubuntu-22.04.3-x86_64-test1
dm-boot1: deployment.profile:
dm-boot1: deployment.sealedapikey:
dm-boot1: deployment.stagedprofile:
dm-boot1: deployment.state:
dm-boot1: deployment.state_detail:
dm-boot1: deployment.useinsecureprotocols: always
dm-boot1: dns.servers: 172.17.15.252,172.17.15.247,172.17.15.254
dm-boot1: groups: everything
dm-boot1: net.hwaddr: 4e:78:df:d3:8d:59
dm-boot1: net.ipv4_address: 172.17.15.222/21
dm-boot1: net.ipv4_gateway: 172.17.8.254
"""

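As a rough aid for reading the nodeattrib listing above, here is a minimal sketch that parses `node: attribute: value` lines and lists which attributes are empty. The helper name `parse_nodeattrib` is made up for illustration and is not part of Confluent; it assumes only the text layout shown in the listing.

```python
def parse_nodeattrib(text):
    """Parse 'node: attribute: value' lines into {node: {attribute: value}}."""
    nodes = {}
    for line in text.splitlines():
        # Split on the first two colons only, so values such as MAC
        # addresses keep their own colons.
        parts = line.split(":", 2)
        if len(parts) < 3:
            continue
        node, attr, value = (p.strip() for p in parts)
        nodes.setdefault(node, {})[attr] = value
    return nodes


sample = """\
dm-boot1: deployment.pendingprofile: ubuntu-22.04.3-x86_64-test1
dm-boot1: deployment.profile:
dm-boot1: deployment.state:
dm-boot1: net.hwaddr: 4e:78:df:d3:8d:59
"""

attrs = parse_nodeattrib(sample)["dm-boot1"]
empty = sorted(name for name, value in attrs.items() if not value)
print("pending profile:", attrs["deployment.pendingprofile"])
print("empty attributes:", empty)
```

In the listing, deployment.pendingprofile is set while deployment.profile is still empty, which fits an install that never reported completion.
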
Running an strace(1) on the 'apiclient' that runs as part of the "post.sh" 
process, we have a continuous poll/read/write stream:

"""
[…]
write(3, "\27\3\3\0\371Sm2\233\337\222n\221\377vZs\21\22S\10\351\232\321I7Y$R\370]\312"..., 254) = 254
read(3, 0x560b6949e8f3, 5)              = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=3, events=POLLIN}], 1, 15000) = 1 ([{fd=3, revents=POLLIN}])
read(3, "\27\3\3\0\226", 5)             = 5
read(3, "\0055\271\274&\2464\237\242h\341\30\231\274\327g\224\344g\306\313\206\326\355x\307\341\331C\366H\331"..., 150) = 150
poll([{fd=3, events=POLLOUT}], 1, 15000) = 1 ([{fd=3, revents=POLLOUT}])
write(3, "\27\3\3\0\371Sm2\233\337\222n\222\334e\336f\353u\343p\22\215:\264e\30a\3172\245\361"..., 254) = 254
read(3, 0x560b6949e8f3, 5)              = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=3, events=POLLIN}], 1, 15000) = 1 ([{fd=3, revents=POLLIN}])
read(3, "\27\3\3\0\226", 5)             = 5
read(3, "\0055\271\274&\2464\240\326\202\347(\213\311\260|\333\230\372A\235\341\273U\201\223\2209ah\325J"..., 150) = 150
poll([{fd=3, events=POLLOUT}], 1, 15000) = 1 ([{fd=3, revents=POLLOUT}])
write(3, "\27\3\3\0\371Sm2\233\337\222n\223\240\341<\3602\323\177Y\311\317/\371\336P/s\301t8"..., 254) = 254
read(3, 0x560b6949e8f3, 5)              = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=3, events=POLLIN}], 1, 15000^Cstrace: Process 27477 detached
<detached ...>
"""

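For what it is worth, the first five bytes of each write and read in the strace above are a TLS record header, so the loop can be read without decrypting anything. A minimal sketch that decodes the two headers seen in the trace (the helper name `tls_record_header` is hypothetical, chosen just for this example):

```python
import struct

# Record content types from the TLS record layer.
CONTENT_TYPES = {
    20: "change_cipher_spec",
    21: "alert",
    22: "handshake",
    23: "application_data",
}


def tls_record_header(first5):
    """Decode a 5-byte TLS record header: type, (major, minor), payload length."""
    ctype, major, minor, length = struct.unpack("!BBBH", first5[:5])
    return CONTENT_TYPES.get(ctype, "unknown"), (major, minor), length


# Byte strings copied from the strace output (octal escapes as strace prints them).
request_header = tls_record_header(b"\27\3\3\0\371")
response_header = tls_record_header(b"\27\3\3\0\226")
print("write:", request_header)
print("read: ", response_header)
```

The decoded request length of 249 plus the 5 header bytes matches the 254-byte write, and the decoded response length matches the 150-byte read. Each iteration therefore looks like one small encrypted request followed by one small encrypted reply, which is consistent with apiclient polling the same URL over and over.
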
Per lsof(1), FD 3 is:

"""
python3 27477 root    3u  IPv6             158157      0t0    TCP [fe80::[EUI-64_client]]:44800->[fe80::[EUI-64_server]]:https (ESTABLISHED)
"""



On Thu, January 25, 2024 16:34, Jarrod Johnson wrote:
> What is the OS of the deployment server?
>
> kill -USR1 $(cat /var/run/confluent/pid)
>
> This should produce a /var/log/confluent/hangtraces
>
> Would be interesting to see if there are any Ansible-related stacks in
> hangtraces that seem stuck...
>
>
> ________________________________
> From: David Magda <dmagda+x...@ee.torontomu.ca>
> Sent: Thursday, January 25, 2024 4:25 PM
> To: xCAT Users Mailing list <xcat-user@lists.sourceforge.net>
> Subject: Re: [xcat-user] [External] Ansible and Confluent
>
> First suggested command:
>
> """
> #   confluent_selfcheck
> OS Deployment: Initialized
> Confluent UUID: Consistent
> Web Server: Running
> Web Certificate: Traceback (most recent call last):
> File "/opt/confluent/bin/confluent_selfcheck", line 178, in <module>
>   cert = certificates_missing_ips(conn)
> File "/opt/confluent/bin/confluent_selfcheck", line 57, in
> certificates_missing_ips
>   ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
> AttributeError: 'module' object has no attribute 'PROTOCOL_TLS_CLIENT'
> """
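
The wording of the AttributeError above ("'module' object has no attribute") is what Python 2 and very old Python 3 releases print, and `ssl.PROTOCOL_TLS_CLIENT` only exists from Python 3.6 onward, so selfcheck was probably run by an older interpreter on this host. A small sketch for checking what a given interpreter provides; `tls_client_context` is a hypothetical helper for illustration, not something taken from confluent_selfcheck:

```python
import ssl
import sys


def tls_client_context():
    """Return a client-side SSLContext, or None if PROTOCOL_TLS_CLIENT is missing."""
    proto = getattr(ssl, "PROTOCOL_TLS_CLIENT", None)
    if proto is None:
        # Interpreters older than Python 3.6 land here.
        return None
    return ssl.SSLContext(proto)


ctx = tls_client_context()
print(sys.version_info[:2], "PROTOCOL_TLS_CLIENT available:", ctx is not None)
```

Running this with the same interpreter that the selfcheck script's first line points at should show whether the constant is available there.
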
>
> On the being-installed system, ignoring the typical Linux stuff, the
> output of 'ps -elfH' has:
>
> """
>
> 4 S root        1247       1  0  80   0 -  7499 do_pol 17:53 ? 00:00:00   /usr/bin/python3 /usr/bin/networkd-dispatcher --run-startup-triggers
> 4 S root        1248       1  0  80   0 - 58623 do_pol 17:53 ? 00:00:00   /usr/libexec/polkitd --no-debug
> 4 S syslog      1250       1  0  80   0 - 55600 do_sel 17:53 ? 00:00:00   /usr/sbin/rsyslogd -n -iNONE
> 4 S root        1252       1  0  80   0 - 385081 futex_ 17:53 ? 00:00:03   /usr/lib/snapd/snapd
> 4 S root        1253       1  0  80   0 -  3831 ep_pol 17:53 ? 00:00:00   /lib/systemd/systemd-logind
> 4 S root        1255       1  0  80   0 - 98198 do_pol 17:53 ? 00:00:02   /usr/libexec/udisks2/udisksd
> 4 S root        1283       1  0  80   0 - 26778 do_pol 17:53 ? 00:00:00   /usr/bin/python3 /usr/share/unattended-upgrades/unattended-upgrade-shutdown --wait-for-signal
> 4 S root        1291       1  0  80   0 - 61055 do_pol 17:53 ? 00:00:00   /usr/sbin/ModemManager
> 4 S root        2042       1  0  80   0 -   722 do_wai 17:53 ? 00:00:00   /bin/sh /snap/subiquity/5004/usr/bin/subiquity-server
> 4 S root        2086    2042  0  80   0 - 149574 ep_pol 17:53 ? 00:00:07     /snap/subiquity/5004/usr/bin/python3.10 -m subiquity.cmd.server
> 4 S root       27499    2086  0  80   0 -   722 do_wai 18:09 ? 00:00:00       sh -c /custom-installation/post.sh
> 4 S root       27501   27499  0  80   0 -  1150 do_wai 18:09 ? 00:00:00         /bin/bash /custom-installation/post.sh
> 4 S root       27588   27501  4  80   0 -  7403 do_pol 18:09 ? 00:03:16           /usr/bin/python3 /opt/confluent/bin/apiclient /confluent-api/self/remoteconfig/status -w 204
> 4 S root        2049       1  0  80   0 - 24167 ep_pol 17:53 tty1 00:00:05   /snap/subiquity/5004/usr/bin/python3.10 /snap/subiquity/5004/usr/bin/subiquity
> 4 S root        2137       1  0  80   0 -  3855 do_pol 17:53 ? 00:00:00   sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups
> 4 S root       37842    2137  0  80   0 -  4310 -      19:15 ? 00:00:00     sshd: root@pts/0
> 4 S root       37952   37842  0  80   0 -  1543 do_wai 19:15 ? 00:00:00       -bash
> 4 R root       38032   37952  0  80   0 -  1911 -      19:16 ? 00:00:00         ps -elfH
> 4 S root        2206       1  0  80   0 -  3266 ep_pol 17:53 ? 00:00:00   /lib/netplan/netplan-dbus
> 4 S root        2570       1  0  80   0 - 73244 do_pol 17:53 ? 00:00:00   /usr/libexec/packagekitd
> 4 S root       37848       1  1  80   0 -  4301 ep_pol 19:15 ? 00:00:00   /lib/systemd/systemd --user
> 5 S root       37850   37848  0  80   0 - 26271 do_sig 19:15 ? 00:00:00     (sd-pam)
> """
>
> While 'ps axf' produces (trimmed):
>
> """
>  2042 ?        Ss     0:00 /bin/sh /snap/subiquity/5004/usr/bin/subiquity-server
>  2086 ?        Sl     0:07  \_ /snap/subiquity/5004/usr/bin/python3.10 -m subiquity.cmd.server
> 27499 ?        S      0:00      \_ sh -c /custom-installation/post.sh
> 27501 ?        S      0:00          \_ /bin/bash /custom-installation/post.sh
> 27588 ?        S      3:21              \_ /usr/bin/python3 /opt/confluent/bin/apiclient /confluent-api/self/remoteconfig/status -w 204
>  2049 tty1     Ss+    0:05 /snap/subiquity/5004/usr/bin/python3.10 /snap/subiquity/5004/usr/bin/subiquity
> """
>
> Doing a "kill -9 27588" (on apiclient) causes the installation to
> 'finish'. After the reboot, and after "firstboot.sh" does its thing, we
> have the following from 'ps axf':
>
> """
>  1372 ?        Ss     0:00 /usr/bin/python3 /usr/bin/cloud-init modules --mode=final
>  1376 ?        S      0:00  \_ /bin/sh -c tee -a /var/log/cloud-init-output.log
>  1377 ?        S      0:00  |   \_ tee -a /var/log/cloud-init-output.log
>  1378 ?        S      0:00  \_ /bin/sh /var/lib/cloud/instance/scripts/runcmd
>  1379 ?        S      0:00      \_ /bin/bash /etc/confluent/firstboot.sh
>  1429 ?        S      0:01          \_ /usr/bin/python3 /opt/confluent/bin/apiclient /confluent-api/self/remoteconfig/status -w 204
> """
>
> This causes the "/var/log/httpd/ssl_access_log" to start filling up. A
> subsequent reboot, where "firstboot.sh" is not run, has the system
> coming up without "apiclient" running, and so there's no longer 'spam' in
> "ssl_access_log".
>
> Running "apiclient" manually from the CLI with the exact options causes a
> bunch of stuff in "ssl_access_log":
>
> """
> fe80::[EUI-64] - - [25/Jan/2024:14:52:15 -0500] "GET /confluent-api/self/remoteconfig/status HTTP/1.1" 200 -
> """
>
> At the same time as the above is being generated, there is nothing in
> "/var/log/confluent/trace" or "stderr".
>
>
> On Thu, January 25, 2024 07:52, Jarrod Johnson wrote:
>> Anything in /var/log/confluent/stderr or /var/log/confluent/trace?  Also
>> would be tempted to see if 'confluent_selfcheck' has any suggestions. You
>> can also ssh into the node during that phase to confirm what it is doing
>> while it is seemingly hung, e.g. looking at ps axf.
>> ________________________________
>> From: David Magda <dmagda+x...@ee.torontomu.ca>
>> Sent: Wednesday, January 24, 2024 9:37 PM
>> To: xCAT-user@lists.sourceforge.net <xCAT-user@lists.sourceforge.net>
>> Subject: [External] [xcat-user] Ansible and Confluent
>>
>> Hello,
>>
>> I'm trying to get Ansible working with Confluent 3.8.0. (Using an older
>> version due to legacy OS reasons.)
>>
>> In /var/lib/confluent/public/os/ I created a new profile called
>> ubuntu-22.04.3-x86_64-test1/, and this seems to work just fine: I took the
>> provided "autoinstall/user-data" file, added some partition stanzas, some
>> packages, etc.
>>
>> Once I sorted out a 'basic' automated Ubuntu install I tried creating a
>> "ansible/post.d/01-packages.yaml" file within the profile directory with
>> the following contents:
>>
>> """
>> - name: install chrony
>>   apt:
>>     pkg:
>>       - chrony
>> """
>>
>> The Ubuntu (subiquity) installer seems to 'hang' at:
>>
>> """
>> start: subiquity/Late/run/command_1: /custom-installation/post.sh
>> """
>>
>> which probably corresponds to this part of the "user-data" file:
>>
>> """
>> late-commands:
>>  - chroot /target apt-get -y -q purge snapd modemmanager
>>  - /custom-installation/post.sh
>> """
>>
>> When the 'hang' occurs the following starts filling up the
>> "/var/log/httpd/ssl_access_log" file of the Confluent/xcat server:
>>
>> """
>> fe80::[EUI-64] - - [24/Jan/2024:11:15:08 -0500] "GET /confluent-api/self/remoteconfig/status HTTP/1.1" 200 -
>> fe80::[EUI-64] - - [24/Jan/2024:11:15:08 -0500] "GET /confluent-api/self/remoteconfig/status HTTP/1.1" 200 -
>> fe80::[EUI-64] - - [24/Jan/2024:11:15:08 -0500] "GET /confluent-api/self/remoteconfig/status HTTP/1.1" 200 -
>> fe80::[EUI-64] - - [24/Jan/2024:11:15:08 -0500] "GET /confluent-api/self/remoteconfig/status HTTP/1.1" 200 -
>> fe80::[EUI-64] - - [24/Jan/2024:11:15:08 -0500] "GET /confluent-api/self/remoteconfig/status HTTP/1.1" 200 -
>> fe80::[EUI-64] - - [24/Jan/2024:11:15:08 -0500] "GET /confluent-api/self/remoteconfig/status HTTP/1.1" 200 -
>> """
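
To put a number on the flooding shown above, a sketch like the following could count hits per second for the status URL in the access log. It assumes the log line layout in the excerpt (timestamp in square brackets, then the quoted request); the function name `hits_per_second` is made up for this example.

```python
import re
from collections import Counter

# Timestamp in [...] immediately followed by the quoted GET request.
LOG_LINE = re.compile(r'\[(?P<ts>[^\]]+)\] "GET (?P<path>\S+) ')


def hits_per_second(lines, path):
    """Count requests for `path`, keyed by the log timestamp (1-second resolution)."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match and match.group("path") == path:
            counts[match.group("ts")] += 1
    return counts


# Six identical entries, as in the excerpt from the thread.
sample = [
    'fe80::[EUI-64] - - [24/Jan/2024:11:15:08 -0500] '
    '"GET /confluent-api/self/remoteconfig/status HTTP/1.1" 200 -'
] * 6

counts = hits_per_second(sample, "/confluent-api/self/remoteconfig/status")
for timestamp, hits in counts.most_common(3):
    print(timestamp, hits)
```

Against the real file, `lines` could simply be the open file object for /var/log/httpd/ssl_access_log. The excerpt alone already shows at least six requests within the same second.
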
>>
>> When I force a restart of the system/VM, it can boot off the disk, and
>> goes through the regular start-up process, including a bunch of cloud-init
>> stuff. Though after it runs "/etc/confluent/firstboot.sh", the
>> "ssl_access_log" file once again starts filling with the
>> "remoteconfig/status" stuff per above.
>>
>> Renaming "ansible/" to "ansible_off/" seems to make the problem go away.
>> Similar behaviour with Ubuntu 20.04.
>>
>> I'm wondering what's going on with the 'hang' when "post.sh" is executed, and
>> the flooding after "firstboot.sh".
>>
>> Regards,
>> David
>
>

_______________________________________________
xCAT-user mailing list
xCAT-user@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/xcat-user