There's no "syncfiles" in the default Ubuntu profile, nor anything in the web 
docs about its format, but I found a template in 
"/opt/confluent/lib/osdeploy/el9/profiles/default/syncfiles".

So I created one in the Ubuntu profile, with the single line:

        /etc/hosts -> /etc/hosts_test
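
Concretely, something like this (the profile path is mine, and putting 
syncfiles at the top of the profile directory is an assumption on my part, 
mirroring the el9 layout):

"""
# push the server's /etc/hosts to /etc/hosts_test on the node:
echo '/etc/hosts -> /etc/hosts_test' > \
    /var/lib/confluent/public/os/ubuntu-22.04.3-x86_64-test1/syncfiles
"""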

Running "nodeapply -F" then produced:

"""
#  nodeapply -F dm-boot1
dm-boot1: 
dm-boot1: 
---------------------------------------------------------------------------
dm-boot1: Running python script 'syncfileclient' from 
https://[fe80::749f:43ff:fe72:55e4%2]/confluent-public/os/ubuntu-22.04.3-x86_64-test1/scripts/
dm-boot1: Executing in /tmp/confluentscripts.HUGo3sMtt
dm-boot1: Traceback (most recent call last):
dm-boot1:   File "/tmp/confluentscripts.HUGo3sMtt/syncfileclient", line 286, in 
<module>
dm-boot1:     synchronize()
dm-boot1:   File "/tmp/confluentscripts.HUGo3sMtt/syncfileclient", line 233, in 
synchronize
dm-boot1:     status, rsp = 
ac.grab_url_with_status('/confluent-api/self/remotesyncfiles')
dm-boot1:   File "/opt/confluent/bin/apiclient", line 413, in 
grab_url_with_status
dm-boot1:     raise Exception(rsp.read())
dm-boot1: Exception: b"500 - Command '['rsync', '-rvLD', 
'/tmp/tmpSUbmoD.synctodm-boot1/', 'root@[172.17.15.222]:/']' returned non-zero 
exit status 255"
dm-boot1: 'syncfileclient' exited with code 1
"""

In "/var/log/confluent/stderr” we have:

"""
Jan 26 15:28:53   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator): Traceback (most recent call last):
Jan 26 15:28:53   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator):   File 
"/usr/lib/python2.7/site-packages/eventlet/hubs/poll.py", line 111, in wait
Jan 26 15:28:53   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator):     listener.cb(fileno)
Jan 26 15:28:53   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator):   File 
"/usr/lib/python2.7/site-packages/eventlet/green/select.py", line 53, in on_read
Jan 26 15:28:53   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator):     current.switch(([original], [], []))
Jan 26 15:28:53   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator):   File 
"/usr/lib/python2.7/site-packages/eventlet/greenthread.py", line 221, in main
Jan 26 15:28:53   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator):     result = function(*args, **kwargs)
Jan 26 15:28:53   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator):   File 
"/opt/confluent/lib/python/confluent/syncfiles.py", line 197, in 
sync_list_to_node
Jan 26 15:28:53   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator):     ['rsync', '-rvLD', targdir + '/', 
'root@[{}]:/'.format(targip)])[0]
Jan 26 15:28:53   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator):   File 
"/opt/confluent/lib/python/confluent/util.py", line 45, in run
Jan 26 15:28:53   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator):     raise 
subprocess.CalledProcessError(retcode, process.args, output=stdout)
Jan 26 15:28:53   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator): CalledProcessError: Command '['rsync', '-rvLD', 
'/tmp/tmpVbi9YY.synctodm-boot1/', 'root@[172.17.15.222]:/']' returned non-zero 
exit status 255
Jan 26 15:28:53   File "/usr/lib/python2.7/site-packages/eventlet/hubs/hub.py", 
line 317, in squelch_exception
    sys.stderr.write("Removing descriptor: %r\n" % (fileno,)): Removing 
descriptor: 65
"""

And in "/var/log/confluent/trace" we have:

"""
Jan 26 15:28:53 Traceback (most recent call last):
  File "/opt/confluent/lib/python/confluent/httpapi.py", line 612, in 
resourcehandler
    for rsp in resourcehandler_backend(env, start_response):
  File "/opt/confluent/lib/python/confluent/httpapi.py", line 636, in 
resourcehandler_backend
    for res in selfservice.handle_request(env, start_response):
  File "/opt/confluent/lib/python/confluent/selfservice.py", line 502, in 
handle_request
    status, output = syncfiles.get_syncresult(nodename)
  File "/opt/confluent/lib/python/confluent/syncfiles.py", line 321, in 
get_syncresult
    result = syncrunners[nodename].wait()
  File "/usr/lib/python2.7/site-packages/eventlet/greenthread.py", line 181, in 
wait
    return self._exit_event.wait()
  File "/usr/lib/python2.7/site-packages/eventlet/event.py", line 132, in wait
    current.throw(*self._exc)
  File "/usr/lib/python2.7/site-packages/eventlet/greenthread.py", line 221, in 
main
    result = function(*args, **kwargs)
  File "/opt/confluent/lib/python/confluent/syncfiles.py", line 197, in 
sync_list_to_node
    ['rsync', '-rvLD', targdir + '/', 'root@[{}]:/'.format(targip)])[0]
  File "/opt/confluent/lib/python/confluent/util.py", line 45, in run
    raise subprocess.CalledProcessError(retcode, process.args, output=stdout)
CalledProcessError: Command '['rsync', '-rvLD', 
'/tmp/tmpVbi9YY.synctodm-boot1/', 'root@[172.17.15.222]:/']' returned non-zero 
exit status 255

"""

> On Jan 26, 2024, at 15:01, Jarrod Johnson <jjohns...@lenovo.com> wrote:
> 
> Ok, another track (trying to compensate for not being able to use selfcheck).
> 
> Can you try sticking some file in the profile's syncfiles, then do:
> nodeapply -F <node>
> 
> And see if any errors happen, either in output or in the /var/log/confluent 
> area.
> 
>> From: David Magda <dmagda+x...@ee.torontomu.ca>
>> Sent: Friday, January 26, 2024 2:01 PM
>> To: xCAT Users Mailing list <xcat-user@lists.sourceforge.net>
>> Subject: Re: [xcat-user] [External] Ansible and Confluent
>>  
>> We have Confluent installed on a RH/CentOS 7 system that originally had/has 
>> xCat installed for deployment of our Lenovo hardware/HPC solution. I just 
>> installed it there as it was/is our 'install server'. (We don't want to 
>> touch it too much, as it was a previous team of folks that set things up, 
>> and there's been a lot of team churn.)
>> 
>> I've attached the "hangtraces" to this message; hopefully the mailing list 
>> software will pass it along. I noticed “ipmi” in some of the paths, and for 
>> the record this is a VM running under Proxmox, and does not have any LOM 
>> configured:
>> 
>> """
>> # nodeattrib dm-boot1
>> dm-boot1: crypted.selfapikey: ********
>> dm-boot1: deployment.apiarmed:
>> dm-boot1: deployment.pendingprofile: ubuntu-22.04.3-x86_64-test1
>> dm-boot1: deployment.profile:
>> dm-boot1: deployment.sealedapikey:
>> dm-boot1: deployment.stagedprofile:
>> dm-boot1: deployment.state:
>> dm-boot1: deployment.state_detail:
>> dm-boot1: deployment.useinsecureprotocols: always
>> dm-boot1: dns.servers: 172.17.15.252,172.17.15.247,172.17.15.254
>> dm-boot1: groups: everything
>> dm-boot1: net.hwaddr: 4e:78:df:d3:8d:59
>> dm-boot1: net.ipv4_address: 172.17.15.222/21
>> dm-boot1: net.ipv4_gateway: 172.17.8.254
>> """
>> 
>> Running an strace(1) on the 'apiclient' that runs as part of the "post.sh" 
>> process, we have a continuous poll/read/write stream:
>> 
>> """
>> […]
>> write(3, 
>> "\27\3\3\0\371Sm2\233\337\222n\221\377vZs\21\22S\10\351\232\321I7Y$R\370]\312"...,
>>  254) = 254
>> read(3, 0x560b6949e8f3, 5)              = -1 EAGAIN (Resource temporarily 
>> unavailable)
>> poll([{fd=3, events=POLLIN}], 1, 15000) = 1 ([{fd=3, revents=POLLIN}])
>> read(3, "\27\3\3\0\226", 5)             = 5
>> read(3, 
>> "\0055\271\274&\2464\237\242h\341\30\231\274\327g\224\344g\306\313\206\326\355x\307\341\331C\366H\331"...,
>>  150) = 150
>> poll([{fd=3, events=POLLOUT}], 1, 15000) = 1 ([{fd=3, revents=POLLOUT}])
>> write(3, 
>> "\27\3\3\0\371Sm2\233\337\222n\222\334e\336f\353u\343p\22\215:\264e\30a\3172\245\361"...,
>>  254) = 254
>> read(3, 0x560b6949e8f3, 5)              = -1 EAGAIN (Resource temporarily 
>> unavailable)
>> poll([{fd=3, events=POLLIN}], 1, 15000) = 1 ([{fd=3, revents=POLLIN}])
>> read(3, "\27\3\3\0\226", 5)             = 5
>> read(3, 
>> "\0055\271\274&\2464\240\326\202\347(\213\311\260|\333\230\372A\235\341\273U\201\223\2209ah\325J"...,
>>  150) = 150
>> poll([{fd=3, events=POLLOUT}], 1, 15000) = 1 ([{fd=3, revents=POLLOUT}])
>> write(3, 
>> "\27\3\3\0\371Sm2\233\337\222n\223\240\341<\3602\323\177Y\311\317/\371\336P/s\301t8"...,
>>  254) = 254
>> read(3, 0x560b6949e8f3, 5)              = -1 EAGAIN (Resource temporarily 
>> unavailable)
>> poll([{fd=3, events=POLLIN}], 1, 15000^Cstrace: Process 27477 detached
>> <detached ...>
>> """
>> 
>> Per lsof(1), FD 3 is:
>> 
>> """
>> python3 27477 root    3u  IPv6             158157      0t0    TCP 
>> [fe80::[EUI-64_client]]:44800->[fe80::[EUI-64_server]]:https (ESTABLISHED)
>> """
>> 
>> 
>> 
>> On Thu, January 25, 2024 16:34, Jarrod Johnson wrote:
>> > What is the OS of the deployment server?
>> > 
>> > kill -USR1 $(cat /var/run/confluent/pid)
>> > 
>> > This should produce a /var/log/confluent/hangtraces
>> > 
>> > Would be interesting to see if there's ansible related stacks in
>> > hangtraces that seem stuck...
>> > 
>> > 
>> > ________________________________
>> > From: David Magda <dmagda+x...@ee.torontomu.ca>
>> > Sent: Thursday, January 25, 2024 4:25 PM
>> > To: xCAT Users Mailing list <xcat-user@lists.sourceforge.net>
>> > Subject: Re: [xcat-user] [External] Ansible and Confluent
>> > 
>> > First suggested command:
>> > 
>> > """
>> > #   confluent_selfcheck
>> > OS Deployment: Initialized
>> > Confluent UUID: Consistent
>> > Web Server: Running
>> > Web Certificate: Traceback (most recent call last):
>> > File "/opt/confluent/bin/confluent_selfcheck", line 178, in <module>
>> >   cert = certificates_missing_ips(conn)
>> > File "/opt/confluent/bin/confluent_selfcheck", line 57, in
>> > certificates_missing_ips
>> >   ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
>> > AttributeError: 'module' object has no attribute 'PROTOCOL_TLS_CLIENT'
>> > """
>> > 
>> > On the being-installed system, ignoring the typical Linux stuff, the
>> > output of 'ps -elfH' has:
>> > 
>> > """
>> > 
>> > 4 S root        1247       1  0  80   0 -  7499 do_pol 17:53 ?       
>> > 00:00:00   /usr/bin/python3 /usr/bin/networkd-dispatcher
>> > --run-startup-triggers
>> > 4 S root        1248       1  0  80   0 - 58623 do_pol 17:53 ?       
>> > 00:00:00   /usr/libexec/polkitd --no-debug
>> > 4 S syslog      1250       1  0  80   0 - 55600 do_sel 17:53 ?       
>> > 00:00:00   /usr/sbin/rsyslogd -n -iNONE
>> > 4 S root        1252       1  0  80   0 - 385081 futex_ 17:53 ?      
>> > 00:00:03   /usr/lib/snapd/snapd
>> > 4 S root        1253       1  0  80   0 -  3831 ep_pol 17:53 ?       
>> > 00:00:00   /lib/systemd/systemd-logind
>> > 4 S root        1255       1  0  80   0 - 98198 do_pol 17:53 ?       
>> > 00:00:02   /usr/libexec/udisks2/udisksd
>> > 4 S root        1283       1  0  80   0 - 26778 do_pol 17:53 ?       
>> > 00:00:00   /usr/bin/python3
>> > /usr/share/unattended-upgrades/unattended-upgrade-shutdown
>> > --wait-for-signal
>> > 4 S root        1291       1  0  80   0 - 61055 do_pol 17:53 ?       
>> > 00:00:00   /usr/sbin/ModemManager
>> > 4 S root        2042       1  0  80   0 -   722 do_wai 17:53 ?       
>> > 00:00:00   /bin/sh /snap/subiquity/5004/usr/bin/subiquity-server
>> > 4 S root        2086    2042  0  80   0 - 149574 ep_pol 17:53 ?      
>> > 00:00:07     /snap/subiquity/5004/usr/bin/python3.10 -m
>> > subiquity.cmd.server
>> > 4 S root       27499    2086  0  80   0 -   722 do_wai 18:09 ?       
>> > 00:00:00       sh -c /custom-installation/post.sh
>> > 4 S root       27501   27499  0  80   0 -  1150 do_wai 18:09 ?       
>> > 00:00:00         /bin/bash /custom-installation/post.sh
>> > 4 S root       27588   27501  4  80   0 -  7403 do_pol 18:09 ?       
>> > 00:03:16           /usr/bin/python3 /opt/confluent/bin/apiclient
>> > /confluent-api/self/remoteconfig/status -w 204
>> > 4 S root        2049       1  0  80   0 - 24167 ep_pol 17:53 tty1    
>> > 00:00:05   /snap/subiquity/5004/usr/bin/python3.10
>> > /snap/subiquity/5004/usr/bin/subiquity
>> > 4 S root        2137       1  0  80   0 -  3855 do_pol 17:53 ?       
>> > 00:00:00   sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups
>> > 4 S root       37842    2137  0  80   0 -  4310 -      19:15 ?       
>> > 00:00:00     sshd: root@pts/0
>> > 4 S root       37952   37842  0  80   0 -  1543 do_wai 19:15 ?       
>> > 00:00:00       -bash
>> > 4 R root       38032   37952  0  80   0 -  1911 -      19:16 ?       
>> > 00:00:00         ps -elfH
>> > 4 S root        2206       1  0  80   0 -  3266 ep_pol 17:53 ?       
>> > 00:00:00   /lib/netplan/netplan-dbus
>> > 4 S root        2570       1  0  80   0 - 73244 do_pol 17:53 ?       
>> > 00:00:00   /usr/libexec/packagekitd
>> > 4 S root       37848       1  1  80   0 -  4301 ep_pol 19:15 ?       
>> > 00:00:00   /lib/systemd/systemd --user
>> > 5 S root       37850   37848  0  80   0 - 26271 do_sig 19:15 ?       
>> > 00:00:00     (sd-pam)
>> > """
>> > 
>> > While 'ps axf' produces (trimmed):
>> > 
>> > """
>> >  2042 ?        Ss     0:00 /bin/sh
>> > /snap/subiquity/5004/usr/bin/subiquity-server
>> >  2086 ?        Sl     0:07  \_ /snap/subiquity/5004/usr/bin/python3.10 -m
>> > subiquity.cmd.server
>> > 27499 ?        S      0:00      \_ sh -c /custom-installation/post.sh
>> > 27501 ?        S      0:00          \_ /bin/bash
>> > /custom-installation/post.sh
>> > 27588 ?        S      3:21              \_ /usr/bin/python3
>> > /opt/confluent/bin/apiclient /confluent-api/self/remoteconfig/status -w
>> > 204
>> >  2049 tty1     Ss+    0:05 /snap/subiquity/5004/usr/bin/python3.10
>> > /snap/subiquity/5004/usr/bin/subiquity
>> > """
>> > 
>> > Doing a "kill -9 27588" (on apiclient) causes the installation to
>> > 'finish'. After the reboot, and after "firstboot.sh" does its thing, we
>> > have the following from 'ps axf':
>> > 
>> > """
>> > 1372 ?        Ss     0:00 /usr/bin/python3 /usr/bin/cloud-init modules
>> > --mode=final
>> >  1376 ?        S      0:00  \_ /bin/sh -c tee -a
>> > /var/log/cloud-init-output.log
>> >  1377 ?        S      0:00  |   \_ tee -a /var/log/cloud-init-output.log
>> >  1378 ?        S      0:00  \_ /bin/sh
>> > /var/lib/cloud/instance/scripts/runcmd
>> >  1379 ?        S      0:00      \_ /bin/bash /etc/confluent/firstboot.sh
>> >  1429 ?        S      0:01          \_ /usr/bin/python3
>> > /opt/confluent/bin/apiclient /confluent-api/self/remoteconfig/status -w
>> > 204
>> > """
>> > 
>> > This causes the "/var/log/httpd/ssl_access_log" to start filling up. A
>> > subsequent reboot, where "firstboot.sh" is not run, has the system
>> > coming up without "apiclient" running, and so there's no longer 'spam' in
>> > "ssl_access_log".
>> > 
>> > Running "apiclient" manually from the CLI with the exact options causes a
>> > bunch of stuff in "ssl_access_log":
>> > 
>> > """
>> > fe80::[EUI-64] - - [25/Jan/2024:14:52:15 -0500] "GET
>> > /confluent-api/self/remoteconfig/status HTTP/1.1" 200 -
>> > """
>> > 
>> > at the same time as the above is being generated, there is nothing in
>> > "/var/log/confluent/trace" or "stderr�.
>> > 
>> > 
>> > On Thu, January 25, 2024 07:52, Jarrod Johnson wrote:
>> >> Anything in /var/log/confluent/stderr or /var/log/confluent/trace?  Also
>> >> would be tempted to see if 'confluent_selfcheck' has any suggestions. 
>> >> You
>> >> can also ssh into the node during that phase to confirm what it is doing
>> >> while it is seemingly hung, e.g. looking at ps axf
>> >> ________________________________
>> >> From: David Magda <dmagda+x...@ee.torontomu.ca>
>> >> Sent: Wednesday, January 24, 2024 9:37 PM
>> >> To: xCAT-user@lists.sourceforge.net <xCAT-user@lists.sourceforge.net>
>> >> Subject: [External] [xcat-user] Ansible and Confluent
>> >> 
>> >> Hello,
>> >> 
>> >> I'm trying to get Ansible working with Confluent 3.8.0. (Using an older
>> >> version due to legacy OS reasons.)
>> >> 
>> >> In /var/lib/confluent/public/os/ I created a new profile called
>> >> ubuntu-22.04.3-x86_64-test1/, and this seems to work just fine: I took
>> >> the
>> >> provided "autoinstall/user-data" file, added some partition stanzas,
>> >> some
>> >> packages, etc.
>> >> 
>> >> Once I sorted out a 'basic' automated Ubuntu install, I tried creating an
>> >> "ansible/post.d/01-packages.yaml" file within the profile directory with
>> >> the following contents:
>> >> 
>> >> """
>> >> - name: install chrony
>> >>   apt:
>> >>     pkg:
>> >>       - chrony
>> >> """
>> >> 
>> >> The Ubuntu (subiquity) installer seems to 'hang' at:
>> >> 
>> >> """
>> >> start: subiquity/Late/run/command_1: /custom-installation/post.sh
>> >> """
>> >> 
>> >> which probably corresponds to this part of the "user-data" file:
>> >> 
>> >> """
>> >> late-commands:
>> >>  - chroot /target apt-get -y -q purge snapd modemmanager
>> >>  - /custom-installation/post.sh
>> >> """
>> >> 
>> >> When the 'hang' occurs the following starts filling up the
>> >> "/var/log/httpd/ssl_access_log" file of the Confluent/xcat server:
>> >> 
>> >> """
>> >> fe80::[EUI-64] - - [24/Jan/2024:11:15:08 -0500] "GET
>> >> /confluent-api/self/remoteconfig/status HTTP/1.1" 200 -
>> >> fe80::[EUI-64] - - [24/Jan/2024:11:15:08 -0500] "GET
>> >> /confluent-api/self/remoteconfig/status HTTP/1.1" 200 -
>> >> fe80::[EUI-64] - - [24/Jan/2024:11:15:08 -0500] "GET
>> >> /confluent-api/self/remoteconfig/status HTTP/1.1" 200 -
>> >> fe80::[EUI-64] - - [24/Jan/2024:11:15:08 -0500] "GET
>> >> /confluent-api/self/remoteconfig/status HTTP/1.1" 200 -
>> >> fe80::[EUI-64] - - [24/Jan/2024:11:15:08 -0500] "GET
>> >> /confluent-api/self/remoteconfig/status HTTP/1.1" 200 -
>> >> fe80::[EUI-64] - - [24/Jan/2024:11:15:08 -0500] "GET
>> >> /confluent-api/self/remoteconfig/status HTTP/1.1" 200 -
>> >> """
>> >> 
>> >> When I force a restart of the system/VM, it can boot off the disk, and
>> >> goes through the regular start-up process, including a bunch of
>> >> cloud-init
>> >> stuff. Though after it runs "/etc/confluent/firstboot.sh", the
>> >> "ssl_access_log" file once again starts filling with the
>> >> "remoteconfig/status" stuff per above.
>> >> 
>> >> Renaming "ansible/" to "ansible_off/" seems to make the problem go away.
>> >> Similar behaviour with Ubuntu 20.04.
>> >> 
>> >> I'm wondering what's going on with the 'hang' when "post.sh" is executed,
>> >> and
>> >> the flooding after "firstboot.sh".
>> >> 
>> >> Regards,
>> >> David
>> > 
>> > 
>>  
> 


_______________________________________________
xCAT-user mailing list
xCAT-user@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/xcat-user
