[ceph-users] Re: Failing to restart mon and mgr daemons on Pacific

2023-07-25 Thread Renata Callado Borges

Hi Adam!


I guess you only want the output for the 9100 port?

[root@darkside1]# ss -tulpn | grep 9100
tcp    LISTEN 0  128    [::]:9100 [::]:*    users:(("node_exporter",pid=9103,fd=3))


Also, this:

[root@darkside1 ~]# ps aux | grep 9103
nfsnobo+   9103 38.4  0.0 152332 105760 ?   Ssl  10:12  82:35 /bin/node_exporter --no-collector.timex
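
(As an aside: to see which systemd unit, if any, is managing that PID, something along these lines should work; 9103 is simply the PID reported above.)

[root@darkside1 ~]# systemctl status 9103    # systemctl accepts a PID and shows the owning unit
[root@darkside1 ~]# cat /proc/9103/cgroup    # the cgroup path also reveals a container or unit name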



Cordially,

Renata.

On 7/25/23 13:22, Adam King wrote:
okay, not much info on the mon failure. The other one at least seems 
to be a simple port conflict. What does `sudo netstat -tulpn` give you 
on that host?


On Tue, Jul 25, 2023 at 12:00 PM Renata Callado Borges wrote:


Hi Adam!


Thank you for your response, but I am still trying to figure out the
issue. I am pretty sure the problem occurs "inside" the container,
and I don't know how to get logs from there.

Just in case, this is what systemd sees:


Jul 25 12:36:32 darkside1 systemd[1]: Stopped Ceph mon.darkside1 for 920740ee-cf2d-11ed-9097-08c0eb320eda.
Jul 25 12:36:32 darkside1 systemd[1]: Starting Ceph mon.darkside1 for 920740ee-cf2d-11ed-9097-08c0eb320eda...
Jul 25 12:36:33 darkside1 bash[52271]: ceph-920740ee-cf2d-11ed-9097-08c0eb320eda-mon.darkside1
Jul 25 12:36:33 darkside1 podman[52311]: 2023-07-25 12:36:33.32233695 -0300 -03 m=+0.131005321 container create 7cf1d340e0a9658b2c9dac3a039a6f5b4fd2a5581bb60213b0ab708d24b69f54 (image=quay.io/ceph/ceph:v15, name=ceph-920740ee-cf2d-11ed-9097-08c0eb320eda-mon.darkside1)
Jul 25 12:36:33 darkside1 podman[52311]: 2023-07-25 12:36:33.526241218 -0300 -03 m=+0.334909578 container init 7cf1d340e0a9658b2c9dac3a039a6f5b4fd2a5581bb60213b0ab708d24b69f54 (image=quay.io/ceph/ceph:v15, name=ceph-920740ee-cf2d-11ed-9097-08c0eb320eda-mon.darkside1)
Jul 25 12:36:33 darkside1 podman[52311]: 2023-07-25 12:36:33.556646854 -0300 -03 m=+0.365315225 container start 7cf1d340e0a9658b2c9dac3a039a6f5b4fd2a5581bb60213b0ab708d24b69f54 (image=quay.io/ceph/ceph:v15, name=ceph-920740ee-cf2d-11ed-9097-08c0eb320eda-mon.darkside1)
Jul 25 12:36:33 darkside1 bash[52271]: 7cf1d340e0a9658b2c9dac3a039a6f5b4fd2a5581bb60213b0ab708d24b69f54
Jul 25 12:36:33 darkside1 systemd[1]: Started Ceph mon.darkside1 for 920740ee-cf2d-11ed-9097-08c0eb320eda.
Jul 25 12:36:43 darkside1 systemd[1]: ceph-920740ee-cf2d-11ed-9097-08c0eb320eda@mon.darkside1.service: main process exited, code=exited, status=1/FAILURE
Jul 25 12:36:43 darkside1 systemd[1]: Unit ceph-920740ee-cf2d-11ed-9097-08c0eb320eda@mon.darkside1.service entered failed state.
Jul 25 12:36:43 darkside1 systemd[1]: ceph-920740ee-cf2d-11ed-9097-08c0eb320eda@mon.darkside1.service failed.
Jul 25 12:36:53 darkside1 systemd[1]: ceph-920740ee-cf2d-11ed-9097-08c0eb320eda@mon.darkside1.service holdoff time over, scheduling restart.
Jul 25 12:36:53 darkside1 systemd[1]: Stopped Ceph mon.darkside1 for 920740ee-cf2d-11ed-9097-08c0eb320eda.
Jul 25 12:36:53 darkside1 systemd[1]: start request repeated too quickly for ceph-920740ee-cf2d-11ed-9097-08c0eb320eda@mon.darkside1.service
Jul 25 12:36:53 darkside1 systemd[1]: Failed to start Ceph mon.darkside1 for 920740ee-cf2d-11ed-9097-08c0eb320eda.
Jul 25 12:36:53 darkside1 systemd[1]: Unit ceph-920740ee-cf2d-11ed-9097-08c0eb320eda@mon.darkside1.service entered failed state.
Jul 25 12:36:53 darkside1 systemd[1]: ceph-920740ee-cf2d-11ed-9097-08c0eb320eda@mon.darkside1.service failed.


Also, I get the following error every 10 minutes or so on "ceph -W cephadm --watch-debug":


2023-07-25T12:35:38.115146-0300 mgr.darkside3.ujjyun [INF] Deploying daemon node-exporter.darkside1 on darkside1
2023-07-25T12:35:38.612569-0300 mgr.darkside3.ujjyun [ERR] cephadm exited with an error code: 1, stderr:Deploy daemon node-exporter.darkside1 ...
Verifying port 9100 ...
Cannot bind to IP 0.0.0.0 port 9100: [Errno 98] Address already in use
ERROR: TCP Port(s) '9100' required for node-exporter already in use
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/module.py", line 1029, in _remote_connection
    yield (conn, connr)
  File "/usr/share/ceph/mgr/cephadm/module.py", line 1185, in _run_cephadm
    code, '\n'.join(err)))
orchestrator._interface.OrchestratorError: cephadm exited with an error code: 1, stderr:Deploy daemon node-exporter.darkside1 ...
Verifying port 9100 ...
Cannot bind to IP 0.0.0.0 port 9100: [Errno 98] Address already in use
ERROR: TCP Port(s) '9100' required for node-exporter already in use
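
(A minimal sketch of how this kind of port conflict is usually cleared, assuming the listener on 9100 is a host-level node_exporter service outside cephadm's control; the unit name "node_exporter" is an assumption and may differ per distribution.)

[root@darkside1 ~]# ss -tulpn | grep 9100                  # confirm which process still holds the port
[root@darkside1 ~]# systemctl disable --now node_exporter  # stop a host-level exporter, if that is what it is
# once the port is free, cephadm should retry the node-exporter.darkside1 deployment on its own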

And finally I get th

[ceph-users] Re: Failing to restart mon and mgr daemons on Pacific

2023-07-25 Thread Renata Callado Borges
For cephadm deployments, the systemd unit is run 
through a "unit.run" file in 
/var/lib/ceph///unit.run. If you go to the 
very end of that file, which will be a very long podman or docker run 
command, add in the "--debug_ms 20" and then restart the systemd unit 
for that daemon, it should cause the extra debug logging to happen 
from that daemon. I would say first check if there are useful errors 
in the journal logs mentioned above before trying that though.
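
(For this cluster, a rough sketch of what that edit would look like; the fsid and daemon name are taken from the journal output above, and the podman command is heavily abbreviated, since the real line is much longer.)

# /var/lib/ceph/920740ee-cf2d-11ed-9097-08c0eb320eda/mon.darkside1/unit.run  (last line, abbreviated)
/usr/bin/podman run ... quay.io/ceph/ceph:v15 -n mon.darkside1 ... --debug_ms 20

[root@darkside1 ~]# systemctl restart ceph-920740ee-cf2d-11ed-9097-08c0eb320eda@mon.darkside1.service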


On Mon, Jul 24, 2023 at 9:47 AM Renata Callado Borges wrote:


Dear all,


How are you?

I have a cluster on Pacific with 3 hosts, each one with 1 mon, 1 mgr, and 12 OSDs.

One of the hosts, darkside1, has been out of quorum according to ceph
status.

Systemd showed 4 services dead, two mons and two mgrs.

I managed to systemctl restart one mon and one mgr, but even after
several attempts, the remaining mon and mgr services, when asked to
restart, keep returning to a failed state after a few seconds.
They try
to auto-restart and then go into a failed state where systemd
requires
me to manually set them to "reset-failed" before trying to start
again.
But they never stay up. There are no clear messages about the
issue in
/var/log/ceph/cephadm.log.

The host is still out of quorum.


I have failed to "turn on debug" as per
https://docs.ceph.com/en/pacific/rados/troubleshooting/log-and-debug/.

It seems I do not know the proper incantation for "ceph daemon X config show"; no string for X seems to satisfy this command. I have tried adding this:

[mon]

  debug mon = 20


to my ceph.conf, but no additional log lines are sent to /var/log/ceph/cephadm.log.


So I'm sorry I can't provide more details.
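
(For cephadm-managed daemons the admin socket lives inside the container, so a plain "ceph daemon X config show" run on the host usually fails. A sketch of the usual alternatives, assuming the daemon name mon.darkside1 and an admin keyring on the node; the prompts are illustrative.)

[root@darkside1 ~]# ceph tell mon.darkside1 config show | grep debug_mon   # query the running daemon remotely
[root@darkside1 ~]# ceph config set mon debug_mon 20                       # centralized replacement for the ceph.conf edit
[root@darkside1 ~]# journalctl -u ceph-920740ee-cf2d-11ed-9097-08c0eb320eda@mon.darkside1.service   # container logs land here
[root@darkside1 ~]# cephadm enter --name mon.darkside1                     # or open a shell inside the mon container
[root@darkside1 /]# ceph daemon mon.darkside1 config show                  # the admin socket is reachable from in there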


Could someone help me debug this situation? I am sure that if I just
reboot the machine, it will start up the services properly, as it always
has done, but I would prefer to fix this without rebooting.


Cordially,

Renata.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io




[ceph-users] Failing to restart mon and mgr daemons on Pacific

2023-07-24 Thread Renata Callado Borges

Dear all,


How are you?

I have a cluster on Pacific with 3 hosts, each one with 1 mon, 1 mgr, and 12 OSDs.


One of the hosts, darkside1, has been out of quorum according to ceph 
status.


Systemd showed 4 services dead, two mons and two mgrs.

I managed to systemctl restart one mon and one mgr, but even after 
several attempts, the remaining mon and mgr services, when asked to 
restart, keep returning to a failed state after a few seconds. They try 
to auto-restart and then go into a failed state where systemd requires 
me to manually set them to "reset-failed" before trying to start again. 
But they never stay up. There are no clear messages about the issue in 
/var/log/ceph/cephadm.log.


The host is still out of quorum.


I have failed to "turn on debug" as per 
https://docs.ceph.com/en/pacific/rados/troubleshooting/log-and-debug/. 
It seems I do not know the proper incantation for "ceph daemon X config show"; no string for X seems to satisfy this command. I have tried adding this:


[mon]

 debug mon = 20


to my ceph.conf, but no additional log lines are sent to /var/log/ceph/cephadm.log.



So I'm sorry I can't provide more details.


Could someone help me debug this situation? I am sure that if I just reboot the machine, it will start up the services properly, as it always has done, but I would prefer to fix this without rebooting.



Cordially,

Renata.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Unable to change port range for OSDs in Pacific

2023-05-23 Thread Renata Callado Borges

Good morning Eugen!


Thank you, this allowed me to successfully migrate my OSDs to ports above 6830. This in turn prevents the conflict with slurmd.



Cordially,

Renata.

On 5/18/23 18:26, Eugen Block wrote:

Hi,
the config options you mention should work, but not in ceph.conf. You should set them via 'ceph config set …' and then restart the daemons (ceph orch daemon restart osd).
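
(Concretely, a minimal sketch using the port range from the ceph.conf attempt quoted below; "osd.12" is just a placeholder daemon id.)

[root@darkside1 ~]# ceph config set osd ms_bind_port_min 6830
[root@darkside1 ~]# ceph config set osd ms_bind_port_max 7300
[root@darkside1 ~]# ceph config get osd ms_bind_port_min     # verify the value landed in the mon config store
[root@darkside1 ~]# ceph orch daemon restart osd.12          # repeat for each OSD daemon on the affected host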


Quoting Renata Callado Borges:


Dear all,


How are you?

I have a 3-node Pacific cluster, and the machines do double duty as 
Ceph nodes and as Slurm clients. (I am well aware that this is not 
desirable, but my client wants it like this anyway.)


Our Slurm install uses the port 6818 for slurmd everywhere.

In one of our Ceph/Slurm nodes, Ceph decided that port 6818 is great 
for an OSD. This prevents slurmd from running properly. Changing the 
slurmd port causes the Slurm master, slurmctld, to misread the OSD 
communication as Slurm "Insane length messages".


I have tried unsuccessfully to change this port in Slurm and in Ceph. 
I wonder if someone here can help me limit the ports Ceph uses for 
its OSDs.


I have tried this with no success:

[root@darkside1 ~]# cat /etc/ceph/ceph.conf~
# minimal ceph.conf for 1902a026-496d-11ed-b43e-08c0eb320ec2
[global]
    fsid = 1902a026-496d-11ed-b43e-08c0eb320ec2
    mon_host = [v2:172.22.132.188:3300/0,v1:172.22.132.188:6789/0]

[osd]
    ms_bind_port_min = 6830
    ms_bind_port_max = 7300


Then I restarted with "systemctl restart ceph.target", but the OSD 
keeps being re-bound to 6818. I also tried the same config with the 
options under the [global] section. No luck there either. I also tried 
rebooting the Ceph/Slurm machine; the OSD is re-bound to 6818 as well.


Could someone help? Thanks in advance!


Cordially,

Renata.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



[ceph-users] Unable to change port range for OSDs in Pacific

2023-05-18 Thread Renata Callado Borges

Dear all,


How are you?

I have a 3-node Pacific cluster, and the machines do double duty as 
Ceph nodes and as Slurm clients. (I am well aware that this is not 
desirable, but my client wants it like this anyway.)


Our Slurm install uses the port 6818 for slurmd everywhere.

In one of our Ceph/Slurm nodes, Ceph decided that port 6818 is great for 
an OSD. This prevents slurmd from running properly. Changing the slurmd 
port causes the Slurm master, slurmctld, to misread the OSD 
communication as Slurm "Insane length messages".


I have tried unsuccessfully to change this port in Slurm and in Ceph. I 
wonder if someone here can help me limit the ports Ceph uses for its OSDs.


I have tried this with no success:

[root@darkside1 ~]# cat /etc/ceph/ceph.conf~
# minimal ceph.conf for 1902a026-496d-11ed-b43e-08c0eb320ec2
[global]
    fsid = 1902a026-496d-11ed-b43e-08c0eb320ec2
    mon_host = [v2:172.22.132.188:3300/0,v1:172.22.132.188:6789/0]

[osd]
    ms_bind_port_min = 6830
    ms_bind_port_max = 7300


Then I restarted with "systemctl restart ceph.target", but the OSD keeps 
being re-bound to 6818. I also tried the same config with the options 
under the [global] section. No luck there either. I also tried rebooting 
the Ceph/Slurm machine; the OSD is re-bound to 6818 as well.


Could someone help? Thanks in advance!


Cordially,

Renata.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Octopus mgr doesn't resume after boot

2023-01-26 Thread Renata Callado Borges

Hi all!


I want to record in this thread the debugging and solution of this 
problem, for future reference.


Thanks to Murilo Morais who did all the debugging!


The issue happened because when the machine rebooted, it applied a new 
sshd configuration that prevented root ssh connections. Specifically, I 
had added an "AllowGroups" line on /etc/ssh/sshd_config to prevent my 
users from ssh'ing into this machine.


For root to still be allowed to log in over ssh with this parameter set, I 
added the "root" group to the "AllowGroups" directive. Perhaps there are 
better solutions, but this works.
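
(A sketch of the relevant /etc/ssh/sshd_config line, assuming "mygroup" stands in for whatever groups the site already allows; the service unit may be called "ssh" on Debian-based systems.)

# /etc/ssh/sshd_config
AllowGroups root mygroup     # keeping "root" here is what lets cephadm's root ssh sessions through

[root@darkside3 ~]# sshd -t && systemctl reload sshd   # validate the config, then reload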


After fixing this, I restarted ceph.target on the problematic host and 
did "ceph orch host add darkside3" on the main mgr. Now everything works.



Cordially,

Renata.

On 24/01/2023 14:31, Renata Callado Borges wrote:

Dear all,


I have a two-host setup, and I recently rebooted a mgr machine 
without running the "set noout" and "set norebalance" commands.


The "darkside2" machine is the cephadm machine, and "darkside3" is the 
improperly rebooted mgr.


Now the darkside3 machine does not resume ceph configuration:

[root@darkside2 ~]# ceph orch host ls
HOST   ADDR    LABELS  STATUS
darkside2  darkside2
darkside3  172.22.132.189  Offline

If I understood the docs correctly, I should

[root@darkside2 ~]# ceph orch host add darkside3

But this fails because darkside3 doesn't accept root ssh connections.

I presume this has been discussed before, but I couldn't find the 
correct thread. Could someone please point me in the right direction?



Cordially,

Renata.


--
Renata Callado Borges
Senior Systems Analyst - InCor
+55 (11) 2661-4283
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Octopus mgr doesn't resume after boot

2023-01-24 Thread Renata Callado Borges

Dear all,


I have a two-host setup, and I recently rebooted a mgr machine without 
running the "set noout" and "set norebalance" commands.


The "darkside2" machine is the cephadm machine, and "darkside3" is the 
improperly rebooted mgr.


Now the darkside3 machine does not resume ceph configuration:

[root@darkside2 ~]# ceph orch host ls
HOST   ADDR    LABELS  STATUS
darkside2  darkside2
darkside3  172.22.132.189  Offline

If I understood the docs correctly, I should

[root@darkside2 ~]# ceph orch host add darkside3

But this fails because darkside3 doesn't accept root ssh connections.

I presume this has been discussed before, but I couldn't find the 
correct thread. Could someone please point me in the right direction?



Cordially,

Renata.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io