One thing to mention, since I’m in the exact same sinking boat (not doing 
deployments though): confluent_selfcheck doesn’t work that reliably on RHEL7.
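
On the RHEL7 point: the traceback quoted further down ends in "'module' object has no attribute 'PROTOCOL_TLS_CLIENT'". That wording is what older interpreters such as Python 2.7 print, and ssl.PROTOCOL_TLS_CLIENT was only added in Python 3.6, so the script looks like it is being run by an older Python than it expects. Below is a minimal sketch of a version-tolerant way to build a client context. make_client_context is a hypothetical helper name, not code taken from confluent_selfcheck, and the fallback branch is an assumption about what the older interpreter provides.

```python
import ssl
import sys


def make_client_context():
    """Build a TLS client context, even where PROTOCOL_TLS_CLIENT is missing."""
    proto = getattr(ssl, "PROTOCOL_TLS_CLIENT", None)
    if proto is not None:
        # Python 3.6+: certificate verification and hostname checking
        # are enabled by default for this protocol constant.
        return ssl.SSLContext(proto)
    # Assumed fallback for older interpreters that ship
    # ssl.create_default_context(); it also verifies certificates and
    # hostnames, and additionally loads the system CA store.
    return ssl.create_default_context()


if __name__ == "__main__":
    ctx = make_client_context()
    print("python:", sys.version.split()[0])
    print("has PROTOCOL_TLS_CLIENT:", hasattr(ssl, "PROTOCOL_TLS_CLIENT"))
    print("verifies certificates:", ctx.verify_mode == ssl.CERT_REQUIRED)
```

Running the __main__ part with the same interpreter that confluent_selfcheck uses should show quickly whether the constant is available there.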

--
#BlackLivesMatter
____
|| \\UTGERS,     |---------------------------*O*---------------------------
||_// the State  |         Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\    of NJ  | Office of Advanced Research Computing - MSB A555B, Newark
     `'

On Jan 25, 2024, at 16:25, David Magda <dmagda+x...@ee.torontomu.ca> wrote:

First suggested command:

"""
#   confluent_selfcheck
OS Deployment: Initialized
Confluent UUID: Consistent
Web Server: Running
Web Certificate: Traceback (most recent call last):
  File "/opt/confluent/bin/confluent_selfcheck", line 178, in <module>
    cert = certificates_missing_ips(conn)
  File "/opt/confluent/bin/confluent_selfcheck", line 57, in certificates_missing_ips
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
AttributeError: 'module' object has no attribute 'PROTOCOL_TLS_CLIENT'
"""

On the being-installed system, ignoring the typical Linux stuff, the output of 
'ps -elfH' has:

"""

4 S root        1247       1  0  80   0 -  7499 do_pol 17:53 ?        00:00:00 /usr/bin/python3 /usr/bin/networkd-dispatcher --run-startup-triggers
4 S root        1248       1  0  80   0 - 58623 do_pol 17:53 ?        00:00:00 /usr/libexec/polkitd --no-debug
4 S syslog      1250       1  0  80   0 - 55600 do_sel 17:53 ?        00:00:00 /usr/sbin/rsyslogd -n -iNONE
4 S root        1252       1  0  80   0 - 385081 futex_ 17:53 ?       00:00:03 /usr/lib/snapd/snapd
4 S root        1253       1  0  80   0 -  3831 ep_pol 17:53 ?        00:00:00 /lib/systemd/systemd-logind
4 S root        1255       1  0  80   0 - 98198 do_pol 17:53 ?        00:00:02 /usr/libexec/udisks2/udisksd
4 S root        1283       1  0  80   0 - 26778 do_pol 17:53 ?        00:00:00 /usr/bin/python3 /usr/share/unattended-upgrades/unattended-upgrade-shutdown --wait-for-signal
4 S root        1291       1  0  80   0 - 61055 do_pol 17:53 ?        00:00:00 /usr/sbin/ModemManager
4 S root        2042       1  0  80   0 -   722 do_wai 17:53 ?        00:00:00 /bin/sh /snap/subiquity/5004/usr/bin/subiquity-server
4 S root        2086    2042  0  80   0 - 149574 ep_pol 17:53 ?       00:00:07   /snap/subiquity/5004/usr/bin/python3.10 -m subiquity.cmd.server
4 S root       27499    2086  0  80   0 -   722 do_wai 18:09 ?        00:00:00     sh -c /custom-installation/post.sh
4 S root       27501   27499  0  80   0 -  1150 do_wai 18:09 ?        00:00:00       /bin/bash /custom-installation/post.sh
4 S root       27588   27501  4  80   0 -  7403 do_pol 18:09 ?        00:03:16         /usr/bin/python3 /opt/confluent/bin/apiclient /confluent-api/self/remoteconfig/status -w 204
4 S root        2049       1  0  80   0 - 24167 ep_pol 17:53 tty1     00:00:05 /snap/subiquity/5004/usr/bin/python3.10 /snap/subiquity/5004/usr/bin/subiquity
4 S root        2137       1  0  80   0 -  3855 do_pol 17:53 ?        00:00:00 sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups
4 S root       37842    2137  0  80   0 -  4310 -      19:15 ?        00:00:00   sshd: root@pts/0
4 S root       37952   37842  0  80   0 -  1543 do_wai 19:15 ?        00:00:00     -bash
4 R root       38032   37952  0  80   0 -  1911 -      19:16 ?        00:00:00       ps -elfH
4 S root        2206       1  0  80   0 -  3266 ep_pol 17:53 ?        00:00:00 /lib/netplan/netplan-dbus
4 S root        2570       1  0  80   0 - 73244 do_pol 17:53 ?        00:00:00 /usr/libexec/packagekitd
4 S root       37848       1  1  80   0 -  4301 ep_pol 19:15 ?        00:00:00 /lib/systemd/systemd --user
5 S root       37850   37848  0  80   0 - 26271 do_sig 19:15 ?        00:00:00   (sd-pam)
"""

While 'ps axf' produces (trimmed):

"""
 2042 ?        Ss     0:00 /bin/sh /snap/subiquity/5004/usr/bin/subiquity-server
 2086 ?        Sl     0:07  \_ /snap/subiquity/5004/usr/bin/python3.10 -m subiquity.cmd.server
27499 ?        S      0:00      \_ sh -c /custom-installation/post.sh
27501 ?        S      0:00          \_ /bin/bash /custom-installation/post.sh
27588 ?        S      3:21              \_ /usr/bin/python3 /opt/confluent/bin/apiclient /confluent-api/self/remoteconfig/status -w 204
 2049 tty1     Ss+    0:05 /snap/subiquity/5004/usr/bin/python3.10 /snap/subiquity/5004/usr/bin/subiquity
"""

Doing a "kill -9 27588" (on apiclient) causes the installation to 'finish'. 
After the reboot, and after "firstboot.sh" does its thing, we have the 
following from 'ps axf':

"""
 1372 ?        Ss     0:00 /usr/bin/python3 /usr/bin/cloud-init modules --mode=final
 1376 ?        S      0:00  \_ /bin/sh -c tee -a /var/log/cloud-init-output.log
 1377 ?        S      0:00  |   \_ tee -a /var/log/cloud-init-output.log
 1378 ?        S      0:00  \_ /bin/sh /var/lib/cloud/instance/scripts/runcmd
 1379 ?        S      0:00      \_ /bin/bash /etc/confluent/firstboot.sh
 1429 ?        S      0:01          \_ /usr/bin/python3 /opt/confluent/bin/apiclient /confluent-api/self/remoteconfig/status -w 204
"""

This causes the "/var/log/httpd/ssl_access_log" to start filling up. On a 
subsequent reboot, where "firstboot.sh" is not run, the system comes 
up without "apiclient" running, and so there's no longer 'spam' in 
"ssl_access_log".

Running "apiclient" manually from the CLI with the exact options causes a bunch 
of stuff in "ssl_access_log":

"""
fe80::[EUI-64] - - [25/Jan/2024:14:52:15 -0500] "GET /confluent-api/self/remoteconfig/status HTTP/1.1" 200 -
"""

While the above is being generated, nothing is written to 
"/var/log/confluent/trace" or "/var/log/confluent/stderr".

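One possible reading, offered as a guess and not as a statement about how apiclient is implemented: if "-w 204" means "keep requesting until the server answers 204", then a server that keeps answering 200 would produce exactly this kind of endless stream of GETs. The sketch below only simulates such a loop. poll_until, fetch_status and the delay values are all hypothetical, and the capped backoff is added just to show how a wait like this can avoid flooding the access log.

```python
import time


def poll_until(fetch_status, wanted, max_attempts,
               base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call fetch_status() until it returns `wanted`.

    Returns the number of attempts used, or None if max_attempts
    requests were made without seeing the wanted status.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        if fetch_status() == wanted:
            return attempt
        # Wait before asking again, doubling the pause up to max_delay.
        sleep(delay)
        delay = min(max_delay, delay * 2)
    return None


if __name__ == "__main__":
    waits = []
    # A server that always answers 200 never satisfies a wait for 204.
    print(poll_until(lambda: 200, 204, max_attempts=5, sleep=waits.append))
    print("pauses between requests:", waits)
    # A server that answers 204 on the third request ends the wait.
    answers = iter([200, 200, 204])
    print(poll_until(lambda: next(answers), 204, max_attempts=5,
                     sleep=lambda seconds: None))
```

If that reading is right, the question becomes why the server never returns 204 once the ansible/ directory is present.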

On Thu, January 25, 2024 07:52, Jarrod Johnson wrote:
Anything in /var/log/confluent/stderr or /var/log/confluent/trace?  Also
would be tempted to see if 'confluent_selfcheck' has any suggestions.  You
can also ssh into the node during that phase to confirm what it is doing
while it is seemingly hung, e.g. looking at ps axf
________________________________
From: David Magda <dmagda+x...@ee.torontomu.ca>
Sent: Wednesday, January 24, 2024 9:37 PM
To: xCAT-user@lists.sourceforge.net <xCAT-user@lists.sourceforge.net>
Subject: [External] [xcat-user] Ansible and Confluent

Hello,

I'm trying to get Ansible working with Confluent 3.8.0. (Using an older
version due to legacy OS reasons.)

In /var/lib/confluent/public/os/ I created a new profile called
ubuntu-22.04.3-x86_64-test1/, and this seems to work just fine: I took the
provided "autoinstall/user-data" file, added some partition stanzas, some
packages, etc.

Once I sorted out a 'basic' automated Ubuntu install I tried creating an
"ansible/post.d/01-packages.yaml" file within the profile directory with
the following contents:

"""
- name: install chrony
  apt:
    pkg:
      - chrony
"""

The Ubuntu (subiquity) installer seems to 'hang' at:

"""
start: subiquity/Late/run/command_1: /custom-installation/post.sh
"""

which probably corresponds to this part of the "user-data" file:

"""
late-commands:
 - chroot /target apt-get -y -q purge snapd modemmanager
 - /custom-installation/post.sh
"""

When the 'hang' occurs the following starts filling up the
"/var/log/httpd/ssl_access_log" file of the Confluent/xcat server:

"""
fe80::[EUI-64] - - [24/Jan/2024:11:15:08 -0500] "GET /confluent-api/self/remoteconfig/status HTTP/1.1" 200 -
fe80::[EUI-64] - - [24/Jan/2024:11:15:08 -0500] "GET /confluent-api/self/remoteconfig/status HTTP/1.1" 200 -
fe80::[EUI-64] - - [24/Jan/2024:11:15:08 -0500] "GET /confluent-api/self/remoteconfig/status HTTP/1.1" 200 -
fe80::[EUI-64] - - [24/Jan/2024:11:15:08 -0500] "GET /confluent-api/self/remoteconfig/status HTTP/1.1" 200 -
fe80::[EUI-64] - - [24/Jan/2024:11:15:08 -0500] "GET /confluent-api/self/remoteconfig/status HTTP/1.1" 200 -
fe80::[EUI-64] - - [24/Jan/2024:11:15:08 -0500] "GET /confluent-api/self/remoteconfig/status HTTP/1.1" 200 -
"""

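To put a number on the flooding, the access log can be summarized per client and per second. The following is a rough sketch that assumes each entry occupies one line in the actual log file and has the shape quoted above. count_status_polls is a hypothetical helper, and the sample lines use fe80::1 as a placeholder for the redacted link-local address.

```python
import re
from collections import Counter

# Matches: client, two ignored fields, [timestamp], "METHOD path ...", status
LINE_RE = re.compile(
    r'^(?P<client>\S+) \S+ \S+ \[(?P<when>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) '
)


def count_status_polls(lines, path="/confluent-api/self/remoteconfig/status"):
    """Count requests for `path`, keyed by (client, timestamp, HTTP status)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match and match.group("path") == path:
            key = (match.group("client"), match.group("when"),
                   int(match.group("status")))
            counts[key] += 1
    return counts


if __name__ == "__main__":
    sample = [
        'fe80::1 - - [24/Jan/2024:11:15:08 -0500] '
        '"GET /confluent-api/self/remoteconfig/status HTTP/1.1" 200 -'
    ] * 6
    for (client, when, status), hits in count_status_polls(sample).items():
        print(client, when, status, hits)
```

On the server, passing an open handle for /var/log/httpd/ssl_access_log as `lines` would give the per-second request rate for each node.
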
When I force a restart of the system/VM, it can boot off the disk, and
goes through the regular start-up process, including a bunch of cloud-init
stuff. Though after it runs "/etc/confluent/firstboot.sh", the
"ssl_access_log" file once again starts filling with the
"remoteconfig/status" stuff per above.

Renaming "ansible/" to "ansible_off/" seems to make the problem go away.
Similar behaviour with Ubuntu 20.04.

I'm wondering what's going on with the 'hang' when "post.sh" is executed, and
the flooding after "firstboot.sh".

Regards,
David


_______________________________________________
xCAT-user mailing list
xCAT-user@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/xcat-user
