[ceph-users] Re: Failing to restart mon and mgr daemons on Pacific
Hi Adam!

I guess you only want the output for the 9100 port?

[root@darkside1]# ss -tulpn | grep 9100
tcp  LISTEN  0  128  [::]:9100  [::]:*  users:(("node_exporter",pid=9103,fd=3))

Also, this:

[root@darkside1 ~]# ps aux | grep 9103
nfsnobo+  9103 38.4  0.0 152332 105760 ?  Ssl  10:12  82:35 /bin/node_exporter --no-collector.timex

Cordially,
Renata.

On 7/25/23 13:22, Adam King wrote:
Okay, not much info on the mon failure. The other one at least seems to be a simple port conflict. What does `sudo netstat -tulpn` give you on that host?

On Tue, Jul 25, 2023 at 12:00 PM Renata Callado Borges wrote:
Hi Adam!

Thank you for your response, but I am still trying to figure out the issue. I am pretty sure the problem occurs "inside" the container, and I don't know how to get logs from there.

Just in case, this is what systemd sees:

Jul 25 12:36:32 darkside1 systemd[1]: Stopped Ceph mon.darkside1 for 920740ee-cf2d-11ed-9097-08c0eb320eda.
Jul 25 12:36:32 darkside1 systemd[1]: Starting Ceph mon.darkside1 for 920740ee-cf2d-11ed-9097-08c0eb320eda...
Jul 25 12:36:33 darkside1 bash[52271]: ceph-920740ee-cf2d-11ed-9097-08c0eb320eda-mon.darkside1
Jul 25 12:36:33 darkside1 podman[52311]: 2023-07-25 12:36:33.32233695 -0300 -03 m=+0.131005321 container create 7cf1d340e0a9658b2c9dac3a039a6f5b4fd2a5581bb60213b0ab708d24b69f54 (image=quay.io/ceph/ceph:v15, name=ceph-920740ee-cf2d-11ed-9097-08c0eb320eda-mon.darkside1)
Jul 25 12:36:33 darkside1 podman[52311]: 2023-07-25 12:36:33.526241218 -0300 -03 m=+0.334909578 container init 7cf1d340e0a9658b2c9dac3a039a6f5b4fd2a5581bb60213b0ab708d24b69f54 (image=quay.io/ceph/ceph:v15, name=ceph-920740ee-cf2d-11ed-9097-08c0eb320eda-mon.darkside1)
Jul 25 12:36:33 darkside1 podman[52311]: 2023-07-25 12:36:33.556646854 -0300 -03 m=+0.365315225 container start 7cf1d340e0a9658b2c9dac3a039a6f5b4fd2a5581bb60213b0ab708d24b69f54 (image=quay.io/ceph/ceph:v15, name=ceph-920740ee-cf2d-11ed-9097-08c0eb320eda-mon.darkside1)
Jul 25 12:36:33 darkside1 bash[52271]: 7cf1d340e0a9658b2c9dac3a039a6f5b4fd2a5581bb60213b0ab708d24b69f54
Jul 25 12:36:33 darkside1 systemd[1]: Started Ceph mon.darkside1 for 920740ee-cf2d-11ed-9097-08c0eb320eda.
Jul 25 12:36:43 darkside1 systemd[1]: ceph-920740ee-cf2d-11ed-9097-08c0eb320eda@mon.darkside1.service: main process exited, code=exited, status=1/FAILURE
Jul 25 12:36:43 darkside1 systemd[1]: Unit ceph-920740ee-cf2d-11ed-9097-08c0eb320eda@mon.darkside1.service entered failed state.
Jul 25 12:36:43 darkside1 systemd[1]: ceph-920740ee-cf2d-11ed-9097-08c0eb320eda@mon.darkside1.service failed.
Jul 25 12:36:53 darkside1 systemd[1]: ceph-920740ee-cf2d-11ed-9097-08c0eb320eda@mon.darkside1.service holdoff time over, scheduling restart.
Jul 25 12:36:53 darkside1 systemd[1]: Stopped Ceph mon.darkside1 for 920740ee-cf2d-11ed-9097-08c0eb320eda.
Jul 25 12:36:53 darkside1 systemd[1]: start request repeated too quickly for ceph-920740ee-cf2d-11ed-9097-08c0eb320eda@mon.darkside1.service
Jul 25 12:36:53 darkside1 systemd[1]: Failed to start Ceph mon.darkside1 for 920740ee-cf2d-11ed-9097-08c0eb320eda.
Jul 25 12:36:53 darkside1 systemd[1]: Unit ceph-920740ee-cf2d-11ed-9097-08c0eb320eda@mon.darkside1.service entered failed state.
Jul 25 12:36:53 darkside1 systemd[1]: ceph-920740ee-cf2d-11ed-9097-08c0eb320eda@mon.darkside1.service failed.
Also, I get the following error every 10 minutes or so on "ceph -W cephadm --watch-debug":

2023-07-25T12:35:38.115146-0300 mgr.darkside3.ujjyun [INF] Deploying daemon node-exporter.darkside1 on darkside1
2023-07-25T12:35:38.612569-0300 mgr.darkside3.ujjyun [ERR] cephadm exited with an error code: 1, stderr:Deploy daemon node-exporter.darkside1 ...
Verifying port 9100 ...
Cannot bind to IP 0.0.0.0 port 9100: [Errno 98] Address already in use
ERROR: TCP Port(s) '9100' required for node-exporter already in use
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/module.py", line 1029, in _remote_connection
    yield (conn, connr)
  File "/usr/share/ceph/mgr/cephadm/module.py", line 1185, in _run_cephadm
    code, '\n'.join(err)))
orchestrator._interface.OrchestratorError: cephadm exited with an error code: 1, stderr:Deploy daemon node-exporter.darkside1 ...
Verifying port 9100 ...
Cannot bind to IP 0.0.0.0 port 9100: [Errno 98] Address already in use
ERROR: TCP Port(s) '9100' required for node-exporter already in use

And finally I get th
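For reference, a minimal sketch of how one might track down and clear a node-exporter port conflict like the one above. The pid 9103 comes from the ss output earlier in the thread; the stray-exporter unit name and the "leftover host-level node_exporter" cause are assumptions, not a confirmed diagnosis of this cluster:

    systemctl status 9103                    # which systemd unit/cgroup owns the process listening on 9100?
    cephadm ls | grep -A 5 node-exporter     # does cephadm already manage a node-exporter on this host?

If the listener turns out to be a non-cephadm node_exporter left over from an earlier install, stopping and disabling its unit and then redeploying the managed daemon should clear the error:

    systemctl disable --now node_exporter.service    # hypothetical unit name for the stray exporter
    ceph orch daemon redeploy node-exporter.darkside1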
[ceph-users] Re: Failing to restart mon and mgr daemons on Pacific
For cephadm deployments, the systemd unit is run through a "unit.run" file in /var/lib/ceph/<fsid>/<daemon-name>/unit.run. If you go to the very end of that file, which will be a very long podman or docker run command, add in the "--debug_ms 20", and then restart the systemd unit for that daemon, it should cause that daemon to emit the extra debug logging. I would say first check whether there are useful errors in the journal logs mentioned above before trying that, though.

On Mon, Jul 24, 2023 at 9:47 AM Renata Callado Borges wrote:
Dear all,

How are you?

I have a cluster on Pacific with 3 hosts, each one with 1 mon, 1 mgr and 12 OSDs.

One of the hosts, darkside1, has been out of quorum according to ceph status. Systemd showed 4 services dead: two mons and two mgrs. I managed to systemctl restart one mon and one mgr, but even after several attempts, the remaining mon and mgr services, when asked to restart, keep returning to a failed state after a few seconds. They try to auto-restart and then go into a failed state where systemd requires me to manually "reset-failed" them before trying to start again. But they never stay up.

There are no clear messages about the issue in /var/log/ceph/cephadm.log. The host is still out of quorum.

I have failed to "turn on debug" as per https://docs.ceph.com/en/pacific/rados/troubleshooting/log-and-debug/. It seems I do not know the proper incantation for "ceph daemon X config show"; no string for X seems to satisfy this command.

I have tried adding this to my ceph.conf:

[mon]
debug mon = 20

but no additional lines of log are sent to /var/log/ceph/cephadm.log, so I'm sorry I can't provide more details.

Could someone help me debug this situation? I am sure that if I just reboot the machine, it will start up the services properly, as it always has, but I would prefer to fix this without that step.

Cordially,
Renata.
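As a concrete sketch of the procedure Adam describes, using the fsid and daemon name visible in the journal output above (double-check the exact flag placement against the unit.run file itself):

    vi /var/lib/ceph/920740ee-cf2d-11ed-9097-08c0eb320eda/mon.darkside1/unit.run
    # append " --debug_ms 20" to the very end of the long podman/docker run command on the last line
    systemctl reset-failed ceph-920740ee-cf2d-11ed-9097-08c0eb320eda@mon.darkside1.service
    systemctl restart ceph-920740ee-cf2d-11ed-9097-08c0eb320eda@mon.darkside1.service
    journalctl -fu ceph-920740ee-cf2d-11ed-9097-08c0eb320eda@mon.darkside1.service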
[ceph-users] Failing to restart mon and mgr daemons on Pacific
Dear all,

How are you?

I have a cluster on Pacific with 3 hosts, each one with 1 mon, 1 mgr and 12 OSDs.

One of the hosts, darkside1, has been out of quorum according to ceph status. Systemd showed 4 services dead: two mons and two mgrs. I managed to systemctl restart one mon and one mgr, but even after several attempts, the remaining mon and mgr services, when asked to restart, keep returning to a failed state after a few seconds. They try to auto-restart and then go into a failed state where systemd requires me to manually "reset-failed" them before trying to start again. But they never stay up.

There are no clear messages about the issue in /var/log/ceph/cephadm.log. The host is still out of quorum.

I have failed to "turn on debug" as per https://docs.ceph.com/en/pacific/rados/troubleshooting/log-and-debug/. It seems I do not know the proper incantation for "ceph daemon X config show"; no string for X seems to satisfy this command.

I have tried adding this to my ceph.conf:

[mon]
debug mon = 20

but no additional lines of log are sent to /var/log/ceph/cephadm.log, so I'm sorry I can't provide more details.

Could someone help me debug this situation? I am sure that if I just reboot the machine, it will start up the services properly, as it always has, but I would prefer to fix this without that step.

Cordially,
Renata.
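For what it's worth, on a cephadm/containerized deployment the "ceph daemon X config show" form generally needs to be run from inside the daemon's container, and the daemon logs go to the journal rather than to /var/log/ceph/cephadm.log. A rough sketch of the usual commands (the daemon name mon.darkside1 is taken from this thread):

    cephadm ls                              # list the daemon names cephadm knows about on this host
    cephadm logs --name mon.darkside1       # the mon's own log, via journalctl
    cephadm enter --name mon.darkside1      # shell inside the container; "ceph daemon mon.darkside1 config show" works from there
    ceph tell mon.darkside1 config set debug_mon 20   # runtime debug level, but only for a daemon that is already running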
[ceph-users] Re: Unable to change port range for OSDs in Pacific
Good morning Eugen!

Thank you, this allowed me to successfully migrate my OSDs to ports above 6830. This in turn prevents the conflict with slurmd.

Cordially,
Renata.

On 5/18/23 18:26, Eugen Block wrote:
Hi,

the config options you mention should work, but not in the ceph.conf. You should set them via 'ceph config set ...' and then restart the daemons (ceph orch daemon restart osd).

Quoting Renata Callado Borges:
Dear all,

How are you?

I have a 3-node Pacific cluster, and the machines do double duty as Ceph nodes and as Slurm clients. (I am well aware that this is not desirable, but my client wants it like this anyway.)

Our Slurm install uses port 6818 for slurmd everywhere. On one of our Ceph/Slurm nodes, Ceph decided that port 6818 is great for an OSD. This prevents slurmd from running properly, and changing the slurmd port causes the Slurm master, slurmctld, to misread the OSD communication as Slurm "Insane length messages".

I have tried unsuccessfully to change this port in Slurm and in Ceph. I wonder if someone here can help me limit the ports Ceph uses for its OSDs.

I have tried this with no success:

[root@darkside1 ~]# cat /etc/ceph/ceph.conf~
# minimal ceph.conf for 1902a026-496d-11ed-b43e-08c0eb320ec2
[global]
fsid = 1902a026-496d-11ed-b43e-08c0eb320ec2
mon_host = [v2:172.22.132.188:3300/0,v1:172.22.132.188:6789/0]
[osd]
ms_bind_port_min = 6830
ms_bind_port_max = 7300

Then I restarted with "systemctl restart ceph.target", but the OSD keeps being re-bound to 6818. I also tried the same config with the options under the [global] section; no luck there either. I also tried rebooting the Ceph/Slurm machine, and the OSD is still re-bound to 6818.

Could someone help? Thanks in advance!

Cordially,
Renata.
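For anyone finding this thread later, a minimal sketch of the approach Eugen describes, using the port values from this thread (osd.0 is just an example id):

    ceph config set osd ms_bind_port_min 6830
    ceph config set osd ms_bind_port_max 7300
    ceph config get osd ms_bind_port_min      # confirm the option landed in the central config
    ceph orch daemon restart osd.0            # repeat per OSD; restarting them one at a time is gentler than ceph.target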
[ceph-users] Unable to change port range for OSDs in Pacific
Dear all,

How are you?

I have a 3-node Pacific cluster, and the machines do double duty as Ceph nodes and as Slurm clients. (I am well aware that this is not desirable, but my client wants it like this anyway.)

Our Slurm install uses port 6818 for slurmd everywhere. On one of our Ceph/Slurm nodes, Ceph decided that port 6818 is great for an OSD. This prevents slurmd from running properly, and changing the slurmd port causes the Slurm master, slurmctld, to misread the OSD communication as Slurm "Insane length messages".

I have tried unsuccessfully to change this port in Slurm and in Ceph. I wonder if someone here can help me limit the ports Ceph uses for its OSDs.

I have tried this with no success:

[root@darkside1 ~]# cat /etc/ceph/ceph.conf~
# minimal ceph.conf for 1902a026-496d-11ed-b43e-08c0eb320ec2
[global]
fsid = 1902a026-496d-11ed-b43e-08c0eb320ec2
mon_host = [v2:172.22.132.188:3300/0,v1:172.22.132.188:6789/0]
[osd]
ms_bind_port_min = 6830
ms_bind_port_max = 7300

Then I restarted with "systemctl restart ceph.target", but the OSD keeps being re-bound to 6818. I also tried the same config with the options under the [global] section; no luck there either. I also tried rebooting the Ceph/Slurm machine, and the OSD is still re-bound to 6818.

Could someone help? Thanks in advance!

Cordially,
Renata.
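A quick way to confirm which process actually holds the slurmd port, and which address/port a given OSD has registered, might be something like the following (the OSD id is just an example):

    ss -tlnp | grep 6818          # which daemon is bound to the slurmd port right now?
    ceph osd metadata 0 | grep addr   # front/back addresses (and ports) the OSD reports to the cluster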
[ceph-users] Re: Octopus mgr doesn't resume after boot
Hi all!

I want to register in this thread the debugging and solution of this problem, for future reference. Thanks to Murilo Morais, who did all the debugging!

The issue happened because, when the machine rebooted, it applied a new sshd configuration that prevented root ssh connections. Specifically, I had added an "AllowGroups" line to /etc/ssh/sshd_config to prevent my users from ssh'ing into this machine. For root to still be allowed to log in over ssh with this parameter in place, I added the "root" group to the "AllowGroups" variable. Perhaps there are better solutions, but this works.

After fixing this, I restarted ceph.target on the problematic host and ran "ceph orch host add darkside3" on the main mgr. Now everything works.

Cordially,
Renata.

On 24/01/2023 14:31, Renata Callado Borges wrote:
Dear all,

I have a two-host setup, and I recently rebooted a mgr machine without the "set noout" and "set norebalance" commands. The "darkside2" machine is the cephadm machine, and "darkside3" is the improperly rebooted mgr. Now the darkside3 machine does not resume its ceph configuration:

[root@darkside2 ~]# ceph orch host ls
HOST       ADDR            LABELS  STATUS
darkside2  darkside2
darkside3  172.22.132.189          Offline

If I understood the docs correctly, I should run

[root@darkside2 ~]# ceph orch host add darkside3

but this fails because darkside3 doesn't accept root ssh connections.

I presume this has been discussed before, but I couldn't find the correct thread. Could someone please point me in the right direction?

Cordially,
Renata.

--
Renata Callado Borges
Senior Systems Analyst - InCor
+55 (11) 2661-4283
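A rough sketch of the resulting configuration, for reference (the non-root group name is just an example; only the "root" entry is what this fix actually required):

    # /etc/ssh/sshd_config (excerpt)
    AllowGroups admins root

    systemctl reload sshd
    ceph orch host add darkside3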
[ceph-users] Octopus mgr doesn't resume after boot
Dear all,

I have a two-host setup, and I recently rebooted a mgr machine without the "set noout" and "set norebalance" commands. The "darkside2" machine is the cephadm machine, and "darkside3" is the improperly rebooted mgr. Now the darkside3 machine does not resume its ceph configuration:

[root@darkside2 ~]# ceph orch host ls
HOST       ADDR            LABELS  STATUS
darkside2  darkside2
darkside3  172.22.132.189          Offline

If I understood the docs correctly, I should run

[root@darkside2 ~]# ceph orch host add darkside3

but this fails because darkside3 doesn't accept root ssh connections.

I presume this has been discussed before, but I couldn't find the correct thread. Could someone please point me in the right direction?

Cordially,
Renata.
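In case it helps anyone debugging a similar "Offline" host, a minimal sketch of checks that are commonly useful here (nothing below is specific to this cluster beyond the hostname):

    ceph cephadm get-pub-key > ~/ceph.pub        # the public key the orchestrator uses for ssh
    ssh-copy-id -f -i ~/ceph.pub root@darkside3  # only possible once root ssh logins are allowed again
    ceph cephadm check-host darkside3            # ask the orchestrator to verify it can reach the host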