What is the OS of the deployment server?

kill -USR1 $(cat /var/run/confluent/pid)

This should produce a /var/log/confluent/hangtraces file.

It would be interesting to see whether there are any Ansible-related stacks in 
hangtraces that seem stuck...
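
A minimal sketch of that check, assuming the hangtraces file is plain text and that any Ansible-related frames mention "ansible" somewhere in the stack. The helper name show_ansible_stacks is made up for illustration:

```shell
# Hypothetical helper: print lines mentioning "ansible" (case-insensitive),
# with a little surrounding context, from a trace dump.
# Assumption: the dump is plain text at the path given above.
show_ansible_stacks() {
    local tracefile="${1:-/var/log/confluent/hangtraces}"
    grep -i -n -B 2 -A 10 'ansible' "$tracefile"
}

# Usage on the deployment server, after sending USR1 as above:
#   show_ansible_stacks
```

If nothing matches, grep exits non-zero and prints nothing, which would suggest looking at the other stacks in the file instead.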


________________________________
From: David Magda <dmagda+x...@ee.torontomu.ca>
Sent: Thursday, January 25, 2024 4:25 PM
To: xCAT Users Mailing list <xcat-user@lists.sourceforge.net>
Subject: Re: [xcat-user] [External] Ansible and Confluent

First suggested command:

"""
#   confluent_selfcheck
OS Deployment: Initialized
Confluent UUID: Consistent
Web Server: Running
Web Certificate: Traceback (most recent call last):
 File "/opt/confluent/bin/confluent_selfcheck", line 178, in <module>
   cert = certificates_missing_ips(conn)
 File "/opt/confluent/bin/confluent_selfcheck", line 57, in 
certificates_missing_ips
   ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
AttributeError: 'module' object has no attribute 'PROTOCOL_TLS_CLIENT'
"""

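For what it's worth, ssl.PROTOCOL_TLS_CLIENT was added in Python 3.6, and the "'module' object has no attribute" wording is how Python 2 phrases this error, so the traceback suggests confluent_selfcheck is being run by an older interpreter. A rough way to check, as a sketch only: the function name is made up, and which interpreter the script really uses is an assumption to confirm from its shebang line.

```shell
# Hypothetical helper: succeed if the given Python interpreter's ssl module
# provides PROTOCOL_TLS_CLIENT (available from Python 3.6 onward).
has_tls_client() {
    "$1" -c 'import ssl, sys; sys.exit(0 if hasattr(ssl, "PROTOCOL_TLS_CLIENT") else 1)'
}

# Usage: see which interpreter the script names, then test that one, e.g.
#   head -n 1 /opt/confluent/bin/confluent_selfcheck
#   has_tls_client python3 && echo "has PROTOCOL_TLS_CLIENT"
```
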
On the system being installed, ignoring the typical Linux stuff, the output of 
'ps -elfH' has:

"""

4 S root        1247       1  0  80   0 -  7499 do_pol 17:53 ?        00:00:00  
 /usr/bin/python3 /usr/bin/networkd-dispatcher --run-startup-triggers
4 S root        1248       1  0  80   0 - 58623 do_pol 17:53 ?        00:00:00  
 /usr/libexec/polkitd --no-debug
4 S syslog      1250       1  0  80   0 - 55600 do_sel 17:53 ?        00:00:00  
 /usr/sbin/rsyslogd -n -iNONE
4 S root        1252       1  0  80   0 - 385081 futex_ 17:53 ?       00:00:03  
 /usr/lib/snapd/snapd
4 S root        1253       1  0  80   0 -  3831 ep_pol 17:53 ?        00:00:00  
 /lib/systemd/systemd-logind
4 S root        1255       1  0  80   0 - 98198 do_pol 17:53 ?        00:00:02  
 /usr/libexec/udisks2/udisksd
4 S root        1283       1  0  80   0 - 26778 do_pol 17:53 ?        00:00:00  
 /usr/bin/python3 /usr/share/unattended-upgrades/unattended-upgrade-shutdown 
--wait-for-signal
4 S root        1291       1  0  80   0 - 61055 do_pol 17:53 ?        00:00:00  
 /usr/sbin/ModemManager
4 S root        2042       1  0  80   0 -   722 do_wai 17:53 ?        00:00:00  
 /bin/sh /snap/subiquity/5004/usr/bin/subiquity-server
4 S root        2086    2042  0  80   0 - 149574 ep_pol 17:53 ?       00:00:07  
   /snap/subiquity/5004/usr/bin/python3.10 -m subiquity.cmd.server
4 S root       27499    2086  0  80   0 -   722 do_wai 18:09 ?        00:00:00  
     sh -c /custom-installation/post.sh
4 S root       27501   27499  0  80   0 -  1150 do_wai 18:09 ?        00:00:00  
       /bin/bash /custom-installation/post.sh
4 S root       27588   27501  4  80   0 -  7403 do_pol 18:09 ?        00:03:16  
         /usr/bin/python3 /opt/confluent/bin/apiclient 
/confluent-api/self/remoteconfig/status -w 204
4 S root        2049       1  0  80   0 - 24167 ep_pol 17:53 tty1     00:00:05  
 /snap/subiquity/5004/usr/bin/python3.10 /snap/subiquity/5004/usr/bin/subiquity
4 S root        2137       1  0  80   0 -  3855 do_pol 17:53 ?        00:00:00  
 sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups
4 S root       37842    2137  0  80   0 -  4310 -      19:15 ?        00:00:00  
   sshd: root@pts/0
4 S root       37952   37842  0  80   0 -  1543 do_wai 19:15 ?        00:00:00  
     -bash
4 R root       38032   37952  0  80   0 -  1911 -      19:16 ?        00:00:00  
       ps -elfH
4 S root        2206       1  0  80   0 -  3266 ep_pol 17:53 ?        00:00:00  
 /lib/netplan/netplan-dbus
4 S root        2570       1  0  80   0 - 73244 do_pol 17:53 ?        00:00:00  
 /usr/libexec/packagekitd
4 S root       37848       1  1  80   0 -  4301 ep_pol 19:15 ?        00:00:00  
 /lib/systemd/systemd --user
5 S root       37850   37848  0  80   0 - 26271 do_sig 19:15 ?        00:00:00  
   (sd-pam)
"""

While 'ps axf' produces (trimmed):

"""
  2042 ?        Ss     0:00 /bin/sh 
/snap/subiquity/5004/usr/bin/subiquity-server
  2086 ?        Sl     0:07  \_ /snap/subiquity/5004/usr/bin/python3.10 -m 
subiquity.cmd.server
 27499 ?        S      0:00      \_ sh -c /custom-installation/post.sh
 27501 ?        S      0:00          \_ /bin/bash /custom-installation/post.sh
 27588 ?        S      3:21              \_ /usr/bin/python3 
/opt/confluent/bin/apiclient /confluent-api/self/remoteconfig/status -w 204
  2049 tty1     Ss+    0:05 /snap/subiquity/5004/usr/bin/python3.10 
/snap/subiquity/5004/usr/bin/subiquity
"""

Doing a "kill -9 27588" (on apiclient) causes the installation to 'finish'. 
After the reboot, and after "firstboot.sh" does its thing, we have the 
following from 'ps axf':

"""
1372 ?        Ss     0:00 /usr/bin/python3 /usr/bin/cloud-init modules 
--mode=final
  1376 ?        S      0:00  \_ /bin/sh -c tee -a /var/log/cloud-init-output.log
  1377 ?        S      0:00  |   \_ tee -a /var/log/cloud-init-output.log
  1378 ?        S      0:00  \_ /bin/sh /var/lib/cloud/instance/scripts/runcmd
  1379 ?        S      0:00      \_ /bin/bash /etc/confluent/firstboot.sh
  1429 ?        S      0:01          \_ /usr/bin/python3 
/opt/confluent/bin/apiclient /confluent-api/self/remoteconfig/status -w 204
"""

This causes the "/var/log/httpd/ssl_access_log" to start filling up. On a 
subsequent reboot, where "firstboot.sh" is not run, the system comes 
up without "apiclient" running, and so there's no longer 'spam' in 
"ssl_access_log".

Running "apiclient" manually from the CLI with the exact options causes a bunch 
of stuff in "ssl_access_log":

"""
fe80::[EUI-64] - - [25/Jan/2024:14:52:15 -0500] "GET 
/confluent-api/self/remoteconfig/status HTTP/1.1" 200 -
"""

At the same time as the above is being generated, there is nothing in 
"/var/log/confluent/trace" or "stderr".

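To put a number on the 'spam', something like the following could count the status polls per second. This is only a sketch: the function name is made up, and it assumes each entry sits on a single line in the real log (the mail client wrapped them above) with the timestamp in the fourth whitespace-separated field.

```shell
# Hypothetical helper: count remoteconfig/status requests per timestamp
# (one-second resolution) in an Apache access log.
# Assumption: entries look like the sample above, one per line.
count_status_polls() {
    local logfile="${1:-/var/log/httpd/ssl_access_log}"
    awk '/GET \/confluent-api\/self\/remoteconfig\/status/ { print substr($4, 2) }' "$logfile" \
        | sort | uniq -c
}

# Usage on the Confluent server:
#   count_status_polls | tail
```

Each output line is a count followed by the timestamp it applies to, so a tight polling loop shows up as large counts for consecutive seconds.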

On Thu, January 25, 2024 07:52, Jarrod Johnson wrote:
> Anything in /var/log/confluent/stderr or /var/log/confluent/trace?  Also
> would be tempted to see if 'confluent_selfcheck' has any suggestions.  You
> can also ssh into the node during that phase to confirm what it is doing
> while it is seemingly hung, e.g. looking at ps axf
> ________________________________
> From: David Magda <dmagda+x...@ee.torontomu.ca>
> Sent: Wednesday, January 24, 2024 9:37 PM
> To: xCAT-user@lists.sourceforge.net <xCAT-user@lists.sourceforge.net>
> Subject: [External] [xcat-user] Ansible and Confluent
>
> Hello,
>
> I'm trying to get Ansible working with Confluent 3.8.0. (Using an older
> version due to legacy OS reasons.)
>
> In /var/lib/confluent/public/os/ I created a new profile called
> ubuntu-22.04.3-x86_64-test1/, and this seems to work just fine: I took the
> provided "autoinstall/user-data" file, added some partition stanzas, some
> packages, etc.
>
> Once I sorted out a 'basic' automated Ubuntu install I tried creating a
> "ansible/post.d/01-packages.yaml" file within the profile directory with
> the following contents:
>
> """
> - name: install chrony
> apt:
>   pkg:
>     - chrony
> """
>
> The Ubuntu (subiquity) installer seems to 'hang' at:
>
> """
> start: subiquity/Late/run/command_1: /custom-installation/post.sh
> """
>
> which probably corresponds to this part of the "user-data" file:
>
> """
> late-commands:
>   - chroot /target apt-get -y -q purge snapd modemmanager
>   - /custom-installation/post.sh
> """
>
> When the 'hang' occurs the following starts filling up the
> "/var/log/httpd/ssl_access_log" file of the Confluent/xcat server:
>
> """
> fe80::[EUI-64] - - [24/Jan/2024:11:15:08 -0500] "GET
> /confluent-api/self/remoteconfig/status HTTP/1.1" 200 -
> fe80::[EUI-64] - - [24/Jan/2024:11:15:08 -0500] "GET
> /confluent-api/self/remoteconfig/status HTTP/1.1" 200 -
> fe80::[EUI-64] - - [24/Jan/2024:11:15:08 -0500] "GET
> /confluent-api/self/remoteconfig/status HTTP/1.1" 200 -
> fe80::[EUI-64] - - [24/Jan/2024:11:15:08 -0500] "GET
> /confluent-api/self/remoteconfig/status HTTP/1.1" 200 -
> fe80::[EUI-64] - - [24/Jan/2024:11:15:08 -0500] "GET
> /confluent-api/self/remoteconfig/status HTTP/1.1" 200 -
> fe80::[EUI-64] - - [24/Jan/2024:11:15:08 -0500] "GET
> /confluent-api/self/remoteconfig/status HTTP/1.1" 200 -
> """
>
> When I force a restart of the system/VM, it can boot off the disk, and
> goes through the regular start-up process, including a bunch of cloud-init
> stuff. Though after it runs "/etc/confluent/firstboot.sh", the
> "ssl_access_log" file once again starts filling with the
> "remoteconfig/status" stuff per above.
>
> Renaming "ansible/" to "ansible_off/" seems to make the problem go away.
> Similar behaviour with Ubuntu 20.04.
>
> I'm wondering what's going on with the 'hang' when "post.sh" is executed, and
> the flooding after "firstboot.sh".
>
> Regards,
> David


_______________________________________________
xCAT-user mailing list
xCAT-user@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/xcat-user
