[ceph-users] Re: docker restarting lost all managers accidentally

2023-05-10 Thread Adam King
In /var/lib/ceph/<fsid>/<daemon-name>/ on the host with that mgr
reporting the error, there should be a unit.run file that shows what is
being done to start the mgr, as well as a few files that get mounted into
the mgr on startup, notably the "config" and "keyring" files. That config
file should include the mon host addresses. E.g.

[root@vm-01 ~]# cat
/var/lib/ceph/5a72983c-ef57-11ed-a389-525400e42d74/mgr.vm-01.ilfvis/config
# minimal ceph.conf for 5a72983c-ef57-11ed-a389-525400e42d74
[global]
fsid = 5a72983c-ef57-11ed-a389-525400e42d74
mon_host = [v2:192.168.122.75:3300/0,v1:192.168.122.75:6789/0] [v2:192.168.122.246:3300/0,v1:192.168.122.246:6789/0] [v2:192.168.122.97:3300/0,v1:192.168.122.97:6789/0]

The first thing I'd do is probably make sure that array of addresses is
correct.
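
If you want to cross-check that list against what the monitors themselves
report, something like the following should work as long as the mons still
have quorum (the fsid/daemon path here is just from my test cluster, so
adjust it for yours):

[root@vm-01 ~]# grep mon_host /var/lib/ceph/5a72983c-ef57-11ed-a389-525400e42d74/mgr.vm-01.ilfvis/config
[root@vm-01 ~]# cephadm shell -- ceph mon dump | grep -E '^[0-9]+:'

The rank lines from "ceph mon dump" should list the same v2/v1 addresses
as the mon_host entry in the daemon's config.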

Then you could probably check the keyring file as well and see if it
matches up with what you get running "ceph auth get <daemon-name>".
E.g. here

[root@vm-01 ~]# cat
/var/lib/ceph/5a72983c-ef57-11ed-a389-525400e42d74/mgr.vm-01.ilfvis/keyring
[mgr.vm-01.ilfvis]
key = AQDf01tk7mn/IRAAvZ+ZhUgT77uZsFBSzLGPyQ==

the key matches with

[ceph: root@vm-00 /]# ceph auth get mgr.vm-01.ilfvis
[mgr.vm-01.ilfvis]
key = AQDf01tk7mn/IRAAvZ+ZhUgT77uZsFBSzLGPyQ==
caps mds = "allow *"
caps mon = "profile mgr"
caps osd = "allow *"

Normally I wouldn't post keys for obvious reasons (these are just on a test
cluster I'll tear back down, so it's fine for me), but those are the first
couple of things I'd check. You could also try making adjustments directly
to the unit.run file if there are other things you'd like to try.
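
If you do go down that route, the unit.run file is just a shell script that
cephadm generates, so after editing it you'd restart the daemon through its
systemd unit, roughly like this (the fsid and daemon name are placeholders):

[root@vm-01 ~]# less /var/lib/ceph/<fsid>/mgr.<host>.<id>/unit.run
[root@vm-01 ~]# systemctl restart ceph-<fsid>@mgr.<host>.<id>.service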

On Wed, May 10, 2023 at 11:09 AM Ben  wrote:

> Hi,
> This cluster was deployed with cephadm 17.2.5, containerized.
> It ended up in this state (no active mgr):
> [root@8cd2c0657c77 /]# ceph -s
>   cluster:
> id: ad3a132e-e9ee-11ed-8a19-043f72fb8bf9
> health: HEALTH_WARN
> 6 hosts fail cephadm check
> no active mgr
> 1/3 mons down, quorum h18w,h19w
> Degraded data redundancy: 781908/2345724 objects degraded
> (33.333%), 101 pgs degraded, 209 pgs undersized
>
>   services:
> mon: 3 daemons, quorum h18w,h19w (age 19m), out of quorum: h15w
> mgr: no daemons active (since 5h)
> mds: 1/1 daemons up, 1 standby
> osd: 9 osds: 6 up (since 5h), 6 in (since 5h)
> rgw: 2 daemons active (2 hosts, 1 zones)
>
>   data:
> volumes: 1/1 healthy
> pools:   8 pools, 209 pgs
> objects: 781.91k objects, 152 GiB
> usage:   312 GiB used, 54 TiB / 55 TiB avail
> pgs: 781908/2345724 objects degraded (33.333%)
>  108 active+undersized
>  101 active+undersized+degraded
>
> I checked host h20w; there is a manager container running with this log:
>
> debug 2023-05-10T12:43:23.315+ 7f5e152ec000  0 monclient(hunting):
> authenticate timed out after 300
>
> debug 2023-05-10T12:48:23.318+ 7f5e152ec000  0 monclient(hunting):
> authenticate timed out after 300
>
> debug 2023-05-10T12:53:23.318+ 7f5e152ec000  0 monclient(hunting):
> authenticate timed out after 300
>
> debug 2023-05-10T12:58:23.319+ 7f5e152ec000  0 monclient(hunting):
> authenticate timed out after 300
>
> debug 2023-05-10T13:03:23.319+ 7f5e152ec000  0 monclient(hunting):
> authenticate timed out after 300
>
> debug 2023-05-10T13:08:23.319+ 7f5e152ec000  0 monclient(hunting):
> authenticate timed out after 300
>
> debug 2023-05-10T13:13:23.319+ 7f5e152ec000  0 monclient(hunting):
> authenticate timed out after 300
>
>
> Any ideas on how to get a mgr up and running again through cephadm?
>
> Thanks,
> Ben
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: docker restarting lost all managers accidentally

2023-05-11 Thread Ben
Along the path you mentioned, it was fixed by changing the owner of
/var/lib/ceph from root to 167:167. The cluster was deployed with a non-root
user, and the file permissions were a bit of a mess. After the change, a
systemctl daemon-reload and restart brought it up.
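
For anyone hitting the same issue, the steps were roughly the following
(the fsid and mgr daemon name below are placeholders for your own cluster;
167:167 is the ceph uid/gid used inside the containers):

chown -R 167:167 /var/lib/ceph
systemctl daemon-reload
systemctl restart ceph-<fsid>@mgr.<host>.<id>.service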

For another manager, on the bootstrap host, the journal log complains as follows:
May 11 16:52:50 h15w bash[1434578]: debug 2023-05-11T08:52:50.858+
7f6b9bba5000 -1 monclient: keyring not found
May 11 16:52:50 h15w bash[1434578]: debug 2023-05-11T08:52:50.858+
7f6b9bba5000 -1 auth: failed to load
/var/lib/ceph/mgr/ceph-h15w.vuhzxy/keyring: (5) Input/output error
May 11 16:52:50 h15w bash[1434578]: debug 2023-05-11T08:52:50.858+
7f6b9bba5000 -1 auth: error parsing file
/var/lib/ceph/mgr/ceph-h15w.vuhzxy/keyring: error setting modifi>
May 11 16:52:50 h15w bash[1434578]: debug 2023-05-11T08:52:50.858+
7f6b9bba5000 -1 auth: failed to load
/var/lib/ceph/mgr/ceph-h15w.vuhzxy/keyring: (5) Input/output error
May 11 16:52:50 h15w bash[1434578]: debug 2023-05-11T08:52:50.858+
7f6b9bba5000 -1 auth: error parsing file
/var/lib/ceph/mgr/ceph-h15w.vuhzxy/keyring: error setting modifi>
May 11 16:52:50 h15w bash[1434578]: debug 2023-05-11T08:52:50.858+
7f6b9bba5000 -1 auth: failed to load
/var/lib/ceph/mgr/ceph-h15w.vuhzxy/keyring: (5) Input/output error
May 11 16:52:50 h15w bash[1434578]: debug 2023-05-11T08:52:50.858+
7f6b9bba5000 -1 auth: error parsing file
/var/lib/ceph/mgr/ceph-h15w.vuhzxy/keyring: error setting modifi>


The keyring contains a base64 key string; after restoring it to the original,
that mgr is up as well. There seems to be something inconsistent in how the
cluster was bootstrapped.
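
In case it helps, a minimal mgr keyring only needs the entity name and its
base64 key, which you can pull with "ceph auth get". Something along these
lines, with placeholder names and the host-side cephadm path (adjust for
your cluster), and the same 167:167 ownership:

cat > /var/lib/ceph/<fsid>/mgr.<host>.<id>/keyring <<EOF
[mgr.<host>.<id>]
key = <base64 key from "ceph auth get mgr.<host>.<id>">
EOF
chown 167:167 /var/lib/ceph/<fsid>/mgr.<host>.<id>/keyring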

Thank you all for the help. It is back to normal now.
