[ceph-users] Re: Ceph Pacific mon is not starting after host reboot

2021-08-12 Thread David Orman
https://github.com/ceph/ceph/pull/42690 looks like it might be a fix,
but it's pending review.

On Thu, Aug 12, 2021 at 7:46 AM André Gemünd
 wrote:
>
> We're seeing the same here with v16.2.5 on CentOS 8.3
>
> Do you know of any progress?
>
> Best Greetings
> André
>
> - On 9 Aug 2021 at 18:15, David Orman orma...@corenode.com wrote:
>
> > Hi,
> >
> > We are seeing very similar behavior on 16.2.5, and also have noticed
> > that an undeploy/deploy cycle fixes things. Before we go rummaging
> > through the source code trying to determine the root cause, has
> > anybody else figured this out? It seems odd that a repeatable issue
> > (I've seen other mailing list posts about this same issue) impacting
> > 16.2.4/16.2.5, at least, on reboots hasn't been addressed yet, so
> > wanted to check.
> >
> > Here's one of the other thread titles that appears related:
> > "[ceph-users] mons assigned via orch label 'committing suicide' upon
> > reboot."
> >
> > Respectfully,
> > David
> >
> >
> > On Sun, May 23, 2021 at 3:40 AM Adrian Nicolae
> >  wrote:
> >>
> >> Hi guys,
> >>
> >> I'm testing Ceph Pacific 16.2.4 in my lab before deciding if I will put
> >> it in production on a 1PB+ storage cluster with rgw-only access.
> >>
> >> I noticed a weird issue with my mons :
> >>
> >> - if I reboot a mon host, the ceph-mon container is not starting after
> >> reboot
> >>
> >> - I can see with 'ceph orch ps' the following output :
> >>
> >> mon.node01   node01   running (20h)   4m ago   20h   16.2.4  8d91d370c2b8  0a2e86af94b2
> >> mon.node02   node02   running (115m)  12s ago  115m  16.2.4  8d91d370c2b8  51f4885a1b06
> >> mon.node03   node03   stopped         4m ago   19h
> >>
> >> (where node03 is the host which was rebooted).
> >>
> >> - I tried to start the mon container manually on node03 with '/bin/bash
> >> /var/lib/ceph/c2d41ac4-baf5-11eb-865d-2dc838a337a3/mon.node03/unit.run'
> >> and I've got the following output :
> >>
> >> debug 2021-05-23T08:24:25.192+ 7f9a9e358700  0
> >> mon.node03@-1(???).osd e408 crush map has features 3314933069573799936,
> >> adjusting msgr requires
> >> debug 2021-05-23T08:24:25.192+ 7f9a9e358700  0
> >> mon.node03@-1(???).osd e408 crush map has features 43262930805112,
> >> adjusting msgr requires
> >> debug 2021-05-23T08:24:25.192+ 7f9a9e358700  0
> >> mon.node03@-1(???).osd e408 crush map has features 43262930805112,
> >> adjusting msgr requires
> >> debug 2021-05-23T08:24:25.192+ 7f9a9e358700  0
> >> mon.node03@-1(???).osd e408 crush map has features 43262930805112,
> >> adjusting msgr requires
> >> cluster 2021-05-23T08:07:12.189243+ mgr.node01.ksitls (mgr.14164)
> >> 36380 : cluster [DBG] pgmap v36392: 417 pgs: 417 active+clean; 33 KiB
> >> data, 605 MiB used, 651 GiB / 652 GiB avail; 9.6 KiB/s rd, 0 B/s wr, 15 
> >> op/s
> >> debug 2021-05-23T08:24:25.196+ 7f9a9e358700  1
> >> mon.node03@-1(???).paxosservice(auth 1..51) refresh upgraded, format 0 -> 3
> >> debug 2021-05-23T08:24:25.208+ 7f9a88176700  1 heartbeat_map
> >> reset_timeout 'Monitor::cpu_tp thread 0x7f9a88176700' had timed out
> >> after 0.0s
> >> debug 2021-05-23T08:24:25.208+ 7f9a9e358700  0
> >> mon.node03@-1(probing) e5  my rank is now 1 (was -1)
> >> debug 2021-05-23T08:24:25.212+ 7f9a87975700  0 mon.node03@1(probing)
> >> e6  removed from monmap, suicide.
> >>
> >> root@node03:/home/adrian# systemctl status
> >> ceph-c2d41ac4-baf5-11eb-865d-2dc838a337a3@mon.node03.service
> >> ● ceph-c2d41ac4-baf5-11eb-865d-2dc838a337a3@mon.node03.service - Ceph
> >> mon.node03 for c2d41ac4-baf5-11eb-865d-2dc838a337a3
> >>   Loaded: loaded
> >> (/etc/systemd/system/ceph-c2d41ac4-baf5-11eb-865d-2dc838a337a3@.service;
> >> enabled; vendor preset: enabled)
> >>   Active: inactive (dead) since Sun 2021-05-23 08:10:00 UTC; 16min ago
> >>  Process: 1176 ExecStart=/bin/bash
> >> /var/lib/ceph/c2d41ac4-baf5-11eb-865d-2dc838a337a3/mon.node03/unit.run
> >> (code=exited, status=0/SUCCESS)
> >>  Process: 1855 ExecStop=/usr/bin/docker stop
> >> ceph-c2d41ac4-baf5-11eb-865d-2dc838a337a3-mon.node03 (code=exited,
> >> status=1/FAILURE)
> >>  Process: 1861 ExecStopPost=/bin/bash
> >> /var/lib/ceph/c2d41ac4-baf5-11eb-865d-2dc838a337a3/mon.node03/unit.poststop
> >> (code=exited, status=0/SUCCESS)
> >> Main PID: 1176 (code=exited, status=0/SUCCESS)
> >>
> >> The only fix I could find was to redeploy the mon with :
> >>
> >> ceph orch daemon rm  mon.node03 --force
> >> ceph orch daemon add mon node03
> >>
> >> However, even if it's working after redeploy, it's not giving me a lot
> >> of trust to use it in a production environment having an issue like
> >> that.  I could reproduce it with 2 different mons so it's not just an
> >> exception.
> >>
> >> My setup is based on Ubuntu 20.04 and docker instead of podman :
> >>
> >> root@node01:~# docker -v
> >> Docker version 20.10.6, build 370c289

[ceph-users] Re: Ceph Pacific mon is not starting after host reboot

2021-08-12 Thread André Gemünd
We're seeing the same here with v16.2.5 on CentOS 8.3

Do you know of any progress?

Best Greetings
André

- On 9 Aug 2021 at 18:15, David Orman orma...@corenode.com wrote:

> Hi,
> 
> We are seeing very similar behavior on 16.2.5, and also have noticed
> that an undeploy/deploy cycle fixes things. Before we go rummaging
> through the source code trying to determine the root cause, has
> anybody else figured this out? It seems odd that a repeatable issue
> (I've seen other mailing list posts about this same issue) impacting
> 16.2.4/16.2.5, at least, on reboots hasn't been addressed yet, so
> wanted to check.
> 
> Here's one of the other thread titles that appears related:
> "[ceph-users] mons assigned via orch label 'committing suicide' upon
> reboot."
> 
> Respectfully,
> David
> 
> 
> On Sun, May 23, 2021 at 3:40 AM Adrian Nicolae
>  wrote:
>>
>> Hi guys,
>>
>> I'm testing Ceph Pacific 16.2.4 in my lab before deciding if I will put
>> it in production on a 1PB+ storage cluster with rgw-only access.
>>
>> I noticed a weird issue with my mons :
>>
>> - if I reboot a mon host, the ceph-mon container is not starting after
>> reboot
>>
>> - I can see with 'ceph orch ps' the following output :
>>
>> mon.node01   node01   running (20h)   4m ago   20h   16.2.4  8d91d370c2b8  0a2e86af94b2
>> mon.node02   node02   running (115m)  12s ago  115m  16.2.4  8d91d370c2b8  51f4885a1b06
>> mon.node03   node03   stopped         4m ago   19h
>>
>> (where node03 is the host which was rebooted).
>>
>> - I tried to start the mon container manually on node03 with '/bin/bash
>> /var/lib/ceph/c2d41ac4-baf5-11eb-865d-2dc838a337a3/mon.node03/unit.run'
>> and I've got the following output :
>>
>> debug 2021-05-23T08:24:25.192+ 7f9a9e358700  0
>> mon.node03@-1(???).osd e408 crush map has features 3314933069573799936,
>> adjusting msgr requires
>> debug 2021-05-23T08:24:25.192+ 7f9a9e358700  0
>> mon.node03@-1(???).osd e408 crush map has features 43262930805112,
>> adjusting msgr requires
>> debug 2021-05-23T08:24:25.192+ 7f9a9e358700  0
>> mon.node03@-1(???).osd e408 crush map has features 43262930805112,
>> adjusting msgr requires
>> debug 2021-05-23T08:24:25.192+ 7f9a9e358700  0
>> mon.node03@-1(???).osd e408 crush map has features 43262930805112,
>> adjusting msgr requires
>> cluster 2021-05-23T08:07:12.189243+ mgr.node01.ksitls (mgr.14164)
>> 36380 : cluster [DBG] pgmap v36392: 417 pgs: 417 active+clean; 33 KiB
>> data, 605 MiB used, 651 GiB / 652 GiB avail; 9.6 KiB/s rd, 0 B/s wr, 15 op/s
>> debug 2021-05-23T08:24:25.196+ 7f9a9e358700  1
>> mon.node03@-1(???).paxosservice(auth 1..51) refresh upgraded, format 0 -> 3
>> debug 2021-05-23T08:24:25.208+ 7f9a88176700  1 heartbeat_map
>> reset_timeout 'Monitor::cpu_tp thread 0x7f9a88176700' had timed out
>> after 0.0s
>> debug 2021-05-23T08:24:25.208+ 7f9a9e358700  0
>> mon.node03@-1(probing) e5  my rank is now 1 (was -1)
>> debug 2021-05-23T08:24:25.212+ 7f9a87975700  0 mon.node03@1(probing)
>> e6  removed from monmap, suicide.
>>
>> root@node03:/home/adrian# systemctl status
>> ceph-c2d41ac4-baf5-11eb-865d-2dc838a337a3@mon.node03.service
>> ● ceph-c2d41ac4-baf5-11eb-865d-2dc838a337a3@mon.node03.service - Ceph
>> mon.node03 for c2d41ac4-baf5-11eb-865d-2dc838a337a3
>>   Loaded: loaded
>> (/etc/systemd/system/ceph-c2d41ac4-baf5-11eb-865d-2dc838a337a3@.service;
>> enabled; vendor preset: enabled)
>>   Active: inactive (dead) since Sun 2021-05-23 08:10:00 UTC; 16min ago
>>  Process: 1176 ExecStart=/bin/bash
>> /var/lib/ceph/c2d41ac4-baf5-11eb-865d-2dc838a337a3/mon.node03/unit.run
>> (code=exited, status=0/SUCCESS)
>>  Process: 1855 ExecStop=/usr/bin/docker stop
>> ceph-c2d41ac4-baf5-11eb-865d-2dc838a337a3-mon.node03 (code=exited,
>> status=1/FAILURE)
>>  Process: 1861 ExecStopPost=/bin/bash
>> /var/lib/ceph/c2d41ac4-baf5-11eb-865d-2dc838a337a3/mon.node03/unit.poststop
>> (code=exited, status=0/SUCCESS)
>> Main PID: 1176 (code=exited, status=0/SUCCESS)
>>
>> The only fix I could find was to redeploy the mon with :
>>
>> ceph orch daemon rm  mon.node03 --force
>> ceph orch daemon add mon node03
>>
>> However, even if it's working after redeploy, it's not giving me a lot
>> of trust to use it in a production environment having an issue like
>> that.  I could reproduce it with 2 different mons so it's not just an
>> exception.
>>
>> My setup is based on Ubuntu 20.04 and docker instead of podman :
>>
>> root@node01:~# docker -v
>> Docker version 20.10.6, build 370c289
>>
>> Do you know a workaround for this issue or is this a known bug ? I
>> noticed that there are some other complaints with the same behaviour in
>> Octopus as well and the solution at that time was to delete the
>> /var/lib/ceph/mon folder .
>>
>>
>> Thanks.
>>
>>
>>
>>
>>
>>
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io

[ceph-users] Re: Ceph Pacific mon is not starting after host reboot

2021-08-10 Thread David Orman
Just adding our feedback - this is affecting us as well. We reboot
periodically to test durability of the clusters we run, and this is
fairly impactful. I could see power loss/other scenarios in which this
could end quite poorly for those with less than perfect redundancy in
DCs across multiple racks/PDUs/etc. I see
https://github.com/ceph/ceph/pull/42690 has been submitted, but I'd
definitely make an argument for it being a 'very high' priority, so it
hopefully gets a review in time for 16.2.6. :)

David

On Tue, Aug 10, 2021 at 4:36 AM Sebastian Wagner  wrote:
>
> Good morning Robert,
>
> > On 10.08.21 at 09:53, Robert Sander wrote:
> > Hi,
> >
> > On 09.08.21 at 20:44, Adam King wrote:
> >
> >> This issue looks the same as https://tracker.ceph.com/issues/51027
> >> which is
> >> being worked on. Essentially, it seems that hosts that were being
> >> rebooted
> >> were temporarily marked as offline and cephadm had an issue where it
> >> would
> >> try to remove all daemons (outside of osds I believe) from offline
> >> hosts.
> >
> > Sorry for maybe being rude but how on earth does one come up with the
> > idea to automatically remove components from a cluster where just one
> > node is currently rebooting without any operator intervention?
>
> Obviously no one :-). We already have over 750 tests for the cephadm
> scheduler and I can foresee that we'll get some additional ones for this
> case as well.
>
> Kind regards,
>
> Sebastian
>
>
> >
> > Regards
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph Pacific mon is not starting after host reboot

2021-08-10 Thread Sebastian Wagner

Good morning Robert,

On 10.08.21 at 09:53, Robert Sander wrote:

Hi,

On 09.08.21 at 20:44, Adam King wrote:

This issue looks the same as https://tracker.ceph.com/issues/51027 
which is
being worked on. Essentially, it seems that hosts that were being 
rebooted
were temporarily marked as offline and cephadm had an issue where it 
would
try to remove all daemons (outside of osds I believe) from offline 
hosts.


Sorry for maybe being rude but how on earth does one come up with the 
idea to automatically remove components from a cluster where just one 
node is currently rebooting without any operator intervention?


Obviously no one :-). We already have over 750 tests for the cephadm 
scheduler and I can foresee that we'll get some additional ones for this 
case as well.


Kind regards,

Sebastian




Regards


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph Pacific mon is not starting after host reboot

2021-08-10 Thread Robert Sander

Hi,

On 09.08.21 at 20:44, Adam King wrote:


This issue looks the same as https://tracker.ceph.com/issues/51027 which is
being worked on. Essentially, it seems that hosts that were being rebooted
were temporarily marked as offline and cephadm had an issue where it would
try to remove all daemons (outside of osds I believe) from offline hosts.


Sorry for maybe being rude but how on earth does one come up with the 
idea to automatically remove components from a cluster where just one 
node is currently rebooting without any operator intervention?


Regards
--
Robert Sander
Heinlein Consulting GmbH
Schwedter Str. 8/9b, 10119 Berlin

http://www.heinlein-support.de

Tel: 030 / 405051-43
Fax: 030 / 405051-19

Mandatory disclosures per §35a GmbHG:
HRB 93818 B / Berlin-Charlottenburg Local Court,
Managing Director: Peer Heinlein -- Registered office: Berlin
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph Pacific mon is not starting after host reboot

2021-08-09 Thread Adam King
Wanted to respond to the original thread I saw archived on this topic but I
wasn't subscribed to the mailing list yet so don't have the thread in my
inbox to reply to. Hopefully, those involved in that thread still see this.

This issue looks the same as https://tracker.ceph.com/issues/51027 which is
being worked on. Essentially, it seems that hosts that were being rebooted
were temporarily marked as offline and cephadm had an issue where it would
try to remove all daemons (outside of osds I believe) from offline hosts.
The pre-remove step for monitors was to remove it from the monmap, so this
would happen, but then the daemon itself would not be removed since the
host was temporarily inaccessible due to the reboot. When the host came
back up, the mon was restarted but it had already been removed from the
monmap so it gets stuck in a "stopped" state. A fix for this that stops
cephadm from trying to remove daemons from offline hosts is in the works.

A temporary workaround right now, as mentioned by Harry on that tracker, is
to get cephadm to actually remove the mon daemon by changing the placement
spec to not include the host with the broken mon. Then wait to see the mon
daemon was removed, and finally put the placement spec back to how it was
so the mon gets redeployed (and now hopefully runs normally).
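
In cephadm CLI terms, that workaround looks roughly like this (a sketch only; the hostnames node01-node03 and which of them carries the broken mon are placeholders, not details from the tracker):

  # Shrink the mon placement to the healthy hosts so cephadm removes the broken daemon
  ceph orch apply mon --placement="node01 node02"

  # Check until the broken mon daemon is gone
  ceph orch ps

  # Restore the original placement so the mon is redeployed on the affected host
  ceph orch apply mon --placement="node01 node02 node03"
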
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph Pacific mon is not starting after host reboot

2021-08-09 Thread David Orman
Hi,

We are seeing very similar behavior on 16.2.5, and also have noticed
that an undeploy/deploy cycle fixes things. Before we go rummaging
through the source code trying to determine the root cause, has
anybody else figured this out? It seems odd that a repeatable issue
(I've seen other mailing list posts about this same issue) impacting
16.2.4/16.2.5, at least, on reboots hasn't been addressed yet, so
wanted to check.

Here's one of the other thread titles that appears related:
"[ceph-users] mons assigned via orch label 'committing suicide' upon
reboot."

Respectfully,
David


On Sun, May 23, 2021 at 3:40 AM Adrian Nicolae
 wrote:
>
> Hi guys,
>
> I'm testing Ceph Pacific 16.2.4 in my lab before deciding if I will put
> it in production on a 1PB+ storage cluster with rgw-only access.
>
> I noticed a weird issue with my mons :
>
> - if I reboot a mon host, the ceph-mon container is not starting after
> reboot
>
> - I can see with 'ceph orch ps' the following output :
>
> mon.node01   node01   running (20h)   4m ago   20h   16.2.4  8d91d370c2b8  0a2e86af94b2
> mon.node02   node02   running (115m)  12s ago  115m  16.2.4  8d91d370c2b8  51f4885a1b06
> mon.node03   node03   stopped         4m ago   19h
>
> (where node03 is the host which was rebooted).
>
> - I tried to start the mon container manually on node03 with '/bin/bash
> /var/lib/ceph/c2d41ac4-baf5-11eb-865d-2dc838a337a3/mon.node03/unit.run'
> and I've got the following output :
>
> debug 2021-05-23T08:24:25.192+ 7f9a9e358700  0
> mon.node03@-1(???).osd e408 crush map has features 3314933069573799936,
> adjusting msgr requires
> debug 2021-05-23T08:24:25.192+ 7f9a9e358700  0
> mon.node03@-1(???).osd e408 crush map has features 43262930805112,
> adjusting msgr requires
> debug 2021-05-23T08:24:25.192+ 7f9a9e358700  0
> mon.node03@-1(???).osd e408 crush map has features 43262930805112,
> adjusting msgr requires
> debug 2021-05-23T08:24:25.192+ 7f9a9e358700  0
> mon.node03@-1(???).osd e408 crush map has features 43262930805112,
> adjusting msgr requires
> cluster 2021-05-23T08:07:12.189243+ mgr.node01.ksitls (mgr.14164)
> 36380 : cluster [DBG] pgmap v36392: 417 pgs: 417 active+clean; 33 KiB
> data, 605 MiB used, 651 GiB / 652 GiB avail; 9.6 KiB/s rd, 0 B/s wr, 15 op/s
> debug 2021-05-23T08:24:25.196+ 7f9a9e358700  1
> mon.node03@-1(???).paxosservice(auth 1..51) refresh upgraded, format 0 -> 3
> debug 2021-05-23T08:24:25.208+ 7f9a88176700  1 heartbeat_map
> reset_timeout 'Monitor::cpu_tp thread 0x7f9a88176700' had timed out
> after 0.0s
> debug 2021-05-23T08:24:25.208+ 7f9a9e358700  0
> mon.node03@-1(probing) e5  my rank is now 1 (was -1)
> debug 2021-05-23T08:24:25.212+ 7f9a87975700  0 mon.node03@1(probing)
> e6  removed from monmap, suicide.
>
> root@node03:/home/adrian# systemctl status
> ceph-c2d41ac4-baf5-11eb-865d-2dc838a337a3@mon.node03.service
> ● ceph-c2d41ac4-baf5-11eb-865d-2dc838a337a3@mon.node03.service - Ceph
> mon.node03 for c2d41ac4-baf5-11eb-865d-2dc838a337a3
>   Loaded: loaded
> (/etc/systemd/system/ceph-c2d41ac4-baf5-11eb-865d-2dc838a337a3@.service;
> enabled; vendor preset: enabled)
>   Active: inactive (dead) since Sun 2021-05-23 08:10:00 UTC; 16min ago
>  Process: 1176 ExecStart=/bin/bash
> /var/lib/ceph/c2d41ac4-baf5-11eb-865d-2dc838a337a3/mon.node03/unit.run
> (code=exited, status=0/SUCCESS)
>  Process: 1855 ExecStop=/usr/bin/docker stop
> ceph-c2d41ac4-baf5-11eb-865d-2dc838a337a3-mon.node03 (code=exited,
> status=1/FAILURE)
>  Process: 1861 ExecStopPost=/bin/bash
> /var/lib/ceph/c2d41ac4-baf5-11eb-865d-2dc838a337a3/mon.node03/unit.poststop
> (code=exited, status=0/SUCCESS)
> Main PID: 1176 (code=exited, status=0/SUCCESS)
>
> The only fix I could find was to redeploy the mon with :
>
> ceph orch daemon rm  mon.node03 --force
> ceph orch daemon add mon node03
>
> However, even if it's working after redeploy, it's not giving me a lot
> of trust to use it in a production environment having an issue like
> that.  I could reproduce it with 2 different mons so it's not just an
> exception.
>
> My setup is based on Ubuntu 20.04 and docker instead of podman :
>
> root@node01:~# docker -v
> Docker version 20.10.6, build 370c289
>
> Do you know a workaround for this issue or is this a known bug ? I
> noticed that there are some other complaints with the same behaviour in
> Octopus as well and the solution at that time was to delete the
> /var/lib/ceph/mon folder .
>
>
> Thanks.
>
>
>
>
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph Pacific mon is not starting after host reboot

2021-05-25 Thread Adrian Nicolae

Hi,

On my setup I didn't enable a stretch cluster. It's just a 3 x VM setup 
running on the same Proxmox node, and all the nodes use a single 
network. I installed Ceph using the documented cephadm flow.


> Thanks for the confirmation, Greg! I'll try with a newer release then.
> That's why we're testing, isn't it? ;-)
> Then the OP's issue is probably not resolved yet since he didn't
> mention a stretch cluster. Sorry for hijacking the thread.




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph Pacific mon is not starting after host reboot

2021-05-25 Thread Eugen Block
Thanks for the confirmation, Greg! I'll try with a newer release then.  
That's why we're testing, isn't it? ;-)
Then the OP's issue is probably not resolved yet since he didn't  
mention a stretch cluster. Sorry for hijacking the thread.


Quoting Gregory Farnum :


On Tue, May 25, 2021 at 7:17 AM Eugen Block  wrote:
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.4/rpm/el8/BUILD/ceph-16.2.4/src/osd/OSDMap.cc: In function 'void OSDMap::Incremental::encode(ceph::buffer::v15_2_0::list&, uint64_t) const' thread 7ff3b1aa1700  
time

2021-05-25T13:44:26.732857+
2021-05-25T15:44:26.989087+02:00 pacific1 conmon[5132]:
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.4/rpm/el8/BUILD/ceph-16.2.4/src/osd/OSDMap.cc: 658: FAILED ceph_assert(target_v  
>=

9)
2021-05-25T15:44:26.989163+02:00 pacific1 conmon[5132]:
2021-05-25T15:44:26.989239+02:00 pacific1 conmon[5132]:  ceph version
16.2.4 (3cbe25cde3cfa028984618ad32de9edc4c1eaed0) pacific (stable)
2021-05-25T15:44:26.989314+02:00 pacific1 conmon[5132]:  1:
(ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x158) [0x7ff3bf61a59c]
2021-05-25T15:44:26.989388+02:00 pacific1 conmon[5132]:  2:
/usr/lib64/ceph/libceph-common.so.2(+0x2767b6) [0x7ff3bf61a7b6]
2021-05-25T15:44:26.989489+02:00 pacific1 conmon[5132]:  3:
(OSDMap::Incremental::encode(ceph::buffer::v15_2_0::list&, unsigned
long) const+0x539) [0x7ff3bfa529f9]
2021-05-25T15:44:26.989560+02:00 pacific1 conmon[5132]:  4:
(OSDMonitor::reencode_incremental_map(ceph::buffer::v15_2_0::list&,
unsigned long)+0x1c9) [0x55e377b36df9]
2021-05-25T15:44:26.989627+02:00 pacific1 conmon[5132]:  5:
(OSDMonitor::get_version(unsigned long, unsigned long,
ceph::buffer::v15_2_0::list&)+0x1f4) [0x55e377b37234]
2021-05-25T15:44:26.989693+02:00 pacific1 conmon[5132]:  6:
(OSDMonitor::build_incremental(unsigned int, unsigned int, unsigned
long)+0x301) [0x55e377b3a3c1]
2021-05-25T15:44:26.989759+02:00 pacific1 conmon[5132]:  7:
(OSDMonitor::send_incremental(unsigned int, MonSession*, bool,
boost::intrusive_ptr)+0x104) [0x55e377b3b094]
2021-05-25T15:44:26.989825+02:00 pacific1 conmon[5132]:  8:
(OSDMonitor::check_osdmap_sub(Subscription*)+0x72) [0x55e377b42792]
2021-05-25T15:44:26.989891+02:00 pacific1 conmon[5132]:  9:
(Monitor::handle_subscribe(boost::intrusive_ptr)+0xe82)
[0x55e3779da402]
2021-05-25T15:44:26.989967+02:00 pacific1 conmon[5132]:  10:
(Monitor::dispatch_op(boost::intrusive_ptr)+0x78d)
[0x55e377a002ed]
2021-05-25T15:44:26.990046+02:00 pacific1 conmon[5132]:  11:
(Monitor::_ms_dispatch(Message*)+0x670) [0x55e377a01910]
2021-05-25T15:44:26.990113+02:00 pacific1 conmon[5132]:  12:
(Dispatcher::ms_dispatch2(boost::intrusive_ptr const&)+0x5c)
[0x55e377a2ffdc]
2021-05-25T15:44:26.990179+02:00 pacific1 conmon[5132]:  13:
(DispatchQueue::entry()+0x126a) [0x7ff3bf854b1a]
2021-05-25T15:44:26.990255+02:00 pacific1 conmon[5132]:  14:
(DispatchQueue::DispatchThread::entry()+0x11) [0x7ff3bf904b71]
2021-05-25T15:44:26.990330+02:00 pacific1 conmon[5132]:  15:
/lib64/libpthread.so.0(+0x814a) [0x7ff3bd10a14a]
2021-05-25T15:44:26.990420+02:00 pacific1 conmon[5132]:  16: clone()
2021-05-25T15:44:26.990497+02:00 pacific1 conmon[5132]:
2021-05-25T15:44:26.990573+02:00 pacific1 conmon[5132]: debug  0>
2021-05-25T13:44:26.742+ 7ff3b1aa1700 -1 *** Caught signal
(Aborted) **
2021-05-25T15:44:26.990648+02:00 pacific1 conmon[5132]:  in thread
7ff3b1aa1700 thread_name:ms_dispatch
2021-05-25T15:44:26.990723+02:00 pacific1 conmon[5132]:
2021-05-25T15:44:26.990806+02:00 pacific1 conmon[5132]:  ceph version
16.2.4 (3cbe25cde3cfa028984618ad32de9edc4c1eaed0) pacific (stable)
2021-05-25T15:44:26.990883+02:00 pacific1 conmon[5132]:  1:
/lib64/libpthread.so.0(+0x12b20) [0x7ff3bd114b20]
2021-05-25T15:44:26.990958+02:00 pacific1 conmon[5132]:  2: gsignal()
2021-05-25T15:44:26.991034+02:00 pacific1 conmon[5132]:  3: abort()
2021-05-25T15:44:26.991110+02:00 pacific1 conmon[5132]:  4:
(ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x1a9) [0x7ff3bf61a5ed]
2021-05-25T15:44:26.991176+02:00 pacific1 conmon[5132]:  5:
/usr/lib64/ceph/libceph-common.so.2(+0x2767b6) [0x7ff3bf61a7b6]
2021-05-25T15:44:26.991251+02:00 pacific1 conmon[5132]:  6:
(OSDMap::Incremental::encode(ceph::buffer::v15_2_0::list&, unsigned
long) const+0x539) [0x7ff3bfa529f9]
2021-05-25T15:44:26.991326+02:00 pacific1 conmon[5132]:  7:
(OSDMonitor::reencode_incremental_map(ceph::buffer::v15_2_0::list&,
unsigned long)+0x1c9) [0x55e377b36df9]
2021-05-25T15:44:26.991393+02:00 pacific1 conmon[5132]:  8:
(OSDMonitor::get_version(unsigned long, unsigned long,
ceph::buffer::v15_2_0::list&)+0x1f4) [0x55e377b37234]
2021-05-25T15:44:26.991460+02:00 pacific1 conmon[5132]:  9:
(OSDMonitor::build_incremental(unsigned int, unsigned int, unsigned
long)+0x301) 

[ceph-users] Re: Ceph Pacific mon is not starting after host reboot

2021-05-25 Thread Gregory Farnum
On Tue, May 25, 2021 at 7:17 AM Eugen Block  wrote:
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.4/rpm/el8/BUILD/ceph-16.2.4/src/osd/OSDMap.cc: In function 'void OSDMap::Incremental::encode(ceph::buffer::v15_2_0::list&, uint64_t) const' thread 7ff3b1aa1700 time 2021-05-25T13:44:26.732857+
> 2021-05-25T15:44:26.989087+02:00 pacific1 conmon[5132]:
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.4/rpm/el8/BUILD/ceph-16.2.4/src/osd/OSDMap.cc: 658: FAILED ceph_assert(target_v >= 9)
> 2021-05-25T15:44:26.989163+02:00 pacific1 conmon[5132]:
> 2021-05-25T15:44:26.989239+02:00 pacific1 conmon[5132]:  ceph version
> 16.2.4 (3cbe25cde3cfa028984618ad32de9edc4c1eaed0) pacific (stable)
> 2021-05-25T15:44:26.989314+02:00 pacific1 conmon[5132]:  1:
> (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x158) [0x7ff3bf61a59c]
> 2021-05-25T15:44:26.989388+02:00 pacific1 conmon[5132]:  2:
> /usr/lib64/ceph/libceph-common.so.2(+0x2767b6) [0x7ff3bf61a7b6]
> 2021-05-25T15:44:26.989489+02:00 pacific1 conmon[5132]:  3:
> (OSDMap::Incremental::encode(ceph::buffer::v15_2_0::list&, unsigned
> long) const+0x539) [0x7ff3bfa529f9]
> 2021-05-25T15:44:26.989560+02:00 pacific1 conmon[5132]:  4:
> (OSDMonitor::reencode_incremental_map(ceph::buffer::v15_2_0::list&,
> unsigned long)+0x1c9) [0x55e377b36df9]
> 2021-05-25T15:44:26.989627+02:00 pacific1 conmon[5132]:  5:
> (OSDMonitor::get_version(unsigned long, unsigned long,
> ceph::buffer::v15_2_0::list&)+0x1f4) [0x55e377b37234]
> 2021-05-25T15:44:26.989693+02:00 pacific1 conmon[5132]:  6:
> (OSDMonitor::build_incremental(unsigned int, unsigned int, unsigned
> long)+0x301) [0x55e377b3a3c1]
> 2021-05-25T15:44:26.989759+02:00 pacific1 conmon[5132]:  7:
> (OSDMonitor::send_incremental(unsigned int, MonSession*, bool,
> boost::intrusive_ptr)+0x104) [0x55e377b3b094]
> 2021-05-25T15:44:26.989825+02:00 pacific1 conmon[5132]:  8:
> (OSDMonitor::check_osdmap_sub(Subscription*)+0x72) [0x55e377b42792]
> 2021-05-25T15:44:26.989891+02:00 pacific1 conmon[5132]:  9:
> (Monitor::handle_subscribe(boost::intrusive_ptr)+0xe82)
> [0x55e3779da402]
> 2021-05-25T15:44:26.989967+02:00 pacific1 conmon[5132]:  10:
> (Monitor::dispatch_op(boost::intrusive_ptr)+0x78d)
> [0x55e377a002ed]
> 2021-05-25T15:44:26.990046+02:00 pacific1 conmon[5132]:  11:
> (Monitor::_ms_dispatch(Message*)+0x670) [0x55e377a01910]
> 2021-05-25T15:44:26.990113+02:00 pacific1 conmon[5132]:  12:
> (Dispatcher::ms_dispatch2(boost::intrusive_ptr const&)+0x5c)
> [0x55e377a2ffdc]
> 2021-05-25T15:44:26.990179+02:00 pacific1 conmon[5132]:  13:
> (DispatchQueue::entry()+0x126a) [0x7ff3bf854b1a]
> 2021-05-25T15:44:26.990255+02:00 pacific1 conmon[5132]:  14:
> (DispatchQueue::DispatchThread::entry()+0x11) [0x7ff3bf904b71]
> 2021-05-25T15:44:26.990330+02:00 pacific1 conmon[5132]:  15:
> /lib64/libpthread.so.0(+0x814a) [0x7ff3bd10a14a]
> 2021-05-25T15:44:26.990420+02:00 pacific1 conmon[5132]:  16: clone()
> 2021-05-25T15:44:26.990497+02:00 pacific1 conmon[5132]:
> 2021-05-25T15:44:26.990573+02:00 pacific1 conmon[5132]: debug  0>
> 2021-05-25T13:44:26.742+ 7ff3b1aa1700 -1 *** Caught signal
> (Aborted) **
> 2021-05-25T15:44:26.990648+02:00 pacific1 conmon[5132]:  in thread
> 7ff3b1aa1700 thread_name:ms_dispatch
> 2021-05-25T15:44:26.990723+02:00 pacific1 conmon[5132]:
> 2021-05-25T15:44:26.990806+02:00 pacific1 conmon[5132]:  ceph version
> 16.2.4 (3cbe25cde3cfa028984618ad32de9edc4c1eaed0) pacific (stable)
> 2021-05-25T15:44:26.990883+02:00 pacific1 conmon[5132]:  1:
> /lib64/libpthread.so.0(+0x12b20) [0x7ff3bd114b20]
> 2021-05-25T15:44:26.990958+02:00 pacific1 conmon[5132]:  2: gsignal()
> 2021-05-25T15:44:26.991034+02:00 pacific1 conmon[5132]:  3: abort()
> 2021-05-25T15:44:26.991110+02:00 pacific1 conmon[5132]:  4:
> (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x1a9) [0x7ff3bf61a5ed]
> 2021-05-25T15:44:26.991176+02:00 pacific1 conmon[5132]:  5:
> /usr/lib64/ceph/libceph-common.so.2(+0x2767b6) [0x7ff3bf61a7b6]
> 2021-05-25T15:44:26.991251+02:00 pacific1 conmon[5132]:  6:
> (OSDMap::Incremental::encode(ceph::buffer::v15_2_0::list&, unsigned
> long) const+0x539) [0x7ff3bfa529f9]
> 2021-05-25T15:44:26.991326+02:00 pacific1 conmon[5132]:  7:
> (OSDMonitor::reencode_incremental_map(ceph::buffer::v15_2_0::list&,
> unsigned long)+0x1c9) [0x55e377b36df9]
> 2021-05-25T15:44:26.991393+02:00 pacific1 conmon[5132]:  8:
> (OSDMonitor::get_version(unsigned long, unsigned long,
> ceph::buffer::v15_2_0::list&)+0x1f4) [0x55e377b37234]
> 2021-05-25T15:44:26.991460+02:00 pacific1 conmon[5132]:  9:
> (OSDMonitor::build_incremental(unsigned int, unsigned int, unsigned
> long)+0x301) [0x55e377b3a3c1]
> 2021-05-25T15:44:26.991557+02:00 pacific1 conmon[5132]:  10:
> 

[ceph-users] Re: Ceph Pacific mon is not starting after host reboot

2021-05-25 Thread Eugen Block

Hi,

I wanted to explore the stretch mode in pacific (16.2.4) and see how  
it behaves with a DC failure. It seems as if I'm hitting the same or  
at least a similar issue here. To verify if it's the stretch mode I  
removed the cluster and rebuilt it without stretch mode, three hosts  
in three DCs and started to reboot. First I rebooted one node, the  
cluster came back to HEALTH_OK. Then I rebooted two of the three nodes  
and again everything recovered successfully.
Then I rebuilt a 5 node cluster, two DCs in stretch mode with three  
MONs, one being a tiebreaker in a virtual third DC. The stretch rule  
was applied (4 replicas across all 4 nodes).
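
For readers who haven't tried it, the Pacific stretch-mode setup looks roughly like the following (a sketch, not this cluster's exact configuration; the mon names a/b/c, the datacenter buckets and the pre-created CRUSH rule name 'stretch_rule' are placeholders):

  ceph mon set election_strategy connectivity
  ceph mon set_location a datacenter=dc1
  ceph mon set_location b datacenter=dc2
  ceph mon set_location c datacenter=dc3   # tiebreaker in the third (virtual) DC
  ceph mon enable_stretch_mode c stretch_rule datacenter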


To test a DC failure I simply shut down two nodes from DC2. Although  
the pool's min_size was reduced to 1 by Ceph, I couldn't read or write  
anything to a mapped RBD, although Ceph was still responsive with two  
active MONs.
When I booted the other two nodes again the cluster was not able to  
recover; it ends up in a loop of restarting the MON containers (the  
OSDs recover eventually) until systemd shuts them down due to too many  
restarts.
For a couple of seconds I get a ceph status, but I never get all three  
MONs up. When there are two MONs up and I restart the missing one a  
different MON is shut down.


I also see the error message mentioned here in this thread

heartbeat_map reset_timeout 'Monitor::cpu_tp thread 0x7ff3b3aa5700'  
had timed out after 0.0s


I'll add some more information, a stack trace from MON failure:

---snip---
2021-05-25T15:44:26.988562+02:00 pacific1 conmon[5132]: 5  
mon.pacific1@0(leader).paxos(paxos updating c 9288..9839) is_readable  
= 1 - now=2021-05-25T13:44:26.730359+  
lease_expire=2021-05-25T13:44:30.270907+ has v0 lc 9839
2021-05-25T15:44:26.988638+02:00 pacific1 conmon[5132]: debug -5>  
2021-05-25T13:44:26.726+ 7ff3b1aa1700  2 mon.pacific1@0(leader)  
e13 send_reply 0x55e37aae3860 0x55e37affa9c0 auth_reply(proto 2 0 (0)  
Success) v1
2021-05-25T15:44:26.988714+02:00 pacific1 conmon[5132]: debug -4>  
2021-05-25T13:44:26.726+ 7ff3b1aa1700  5  
mon.pacific1@0(leader).paxos(paxos updating c 9288..9839) is_readable  
= 1 - now=2021-05-25T13:44:26.731084+  
lease_expire=2021-05-25T13:44:30.270907+ has v0 lc 9839
2021-05-25T15:44:26.988790+02:00 pacific1 conmon[5132]: debug -3>  
2021-05-25T13:44:26.726+ 7ff3b1aa1700  2 mon.pacific1@0(leader)  
e13 send_reply 0x55e37b14def0 0x55e37ab11ba0 auth_reply(proto 2 0 (0)  
Success) v1
2021-05-25T15:44:26.988929+02:00 pacific1 conmon[5132]: debug -2>  
2021-05-25T13:44:26.730+ 7ff3b1aa1700  5  
mon.pacific1@0(leader).osd e117 send_incremental [105..117] to  
client.84146
2021-05-25T15:44:26.989012+02:00 pacific1 conmon[5132]: debug -1>  
2021-05-25T13:44:26.734+ 7ff3b1aa1700 -1  
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.4/rpm/el8/BUILD/ceph-16.2.4/src/osd/OSDMap.cc: In function 'void OSDMap::Incremental::encode(ceph::buffer::v15_2_0::list&, uint64_t) const' thread 7ff3b1aa1700 time 2021-05-25T13:44:26.732857+
2021-05-25T15:44:26.989087+02:00 pacific1 conmon[5132]:  
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.4/rpm/el8/BUILD/ceph-16.2.4/src/osd/OSDMap.cc: 658: FAILED ceph_assert(target_v >= 9)

2021-05-25T15:44:26.989163+02:00 pacific1 conmon[5132]:
2021-05-25T15:44:26.989239+02:00 pacific1 conmon[5132]:  ceph version  
16.2.4 (3cbe25cde3cfa028984618ad32de9edc4c1eaed0) pacific (stable)
2021-05-25T15:44:26.989314+02:00 pacific1 conmon[5132]:  1:  
(ceph::__ceph_assert_fail(char const*, char const*, int, char  
const*)+0x158) [0x7ff3bf61a59c]
2021-05-25T15:44:26.989388+02:00 pacific1 conmon[5132]:  2:  
/usr/lib64/ceph/libceph-common.so.2(+0x2767b6) [0x7ff3bf61a7b6]
2021-05-25T15:44:26.989489+02:00 pacific1 conmon[5132]:  3:  
(OSDMap::Incremental::encode(ceph::buffer::v15_2_0::list&, unsigned  
long) const+0x539) [0x7ff3bfa529f9]
2021-05-25T15:44:26.989560+02:00 pacific1 conmon[5132]:  4:  
(OSDMonitor::reencode_incremental_map(ceph::buffer::v15_2_0::list&,  
unsigned long)+0x1c9) [0x55e377b36df9]
2021-05-25T15:44:26.989627+02:00 pacific1 conmon[5132]:  5:  
(OSDMonitor::get_version(unsigned long, unsigned long,  
ceph::buffer::v15_2_0::list&)+0x1f4) [0x55e377b37234]
2021-05-25T15:44:26.989693+02:00 pacific1 conmon[5132]:  6:  
(OSDMonitor::build_incremental(unsigned int, unsigned int, unsigned  
long)+0x301) [0x55e377b3a3c1]
2021-05-25T15:44:26.989759+02:00 pacific1 conmon[5132]:  7:  
(OSDMonitor::send_incremental(unsigned int, MonSession*, bool,  
boost::intrusive_ptr)+0x104) [0x55e377b3b094]
2021-05-25T15:44:26.989825+02:00 pacific1 conmon[5132]:  8:  
(OSDMonitor::check_osdmap_sub(Subscription*)+0x72) [0x55e377b42792]
2021-05-25T15:44:26.989891+02:00 pacific1 conmon[5132]:  9:  

[ceph-users] Re: Ceph Pacific mon is not starting after host reboot

2021-05-23 Thread Adrian Nicolae
I think the orchestrator is trying to bring it up, but it's not 
starting (see the errors from my previous e-mail); the container does not 
start even when I try to start it manually.
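
For completeness, two ways to pull that mon's startup logs on the rebooted host (the systemd unit name and fsid below are the ones from the systemctl output earlier in this thread; cephadm logs is run on node03 itself):

  journalctl -u ceph-c2d41ac4-baf5-11eb-865d-2dc838a337a3@mon.node03.service
  cephadm logs --name mon.node03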


The placement is the default one; Ceph started the mons automatically 
on all my hosts because I only have 3 hosts and the default mon count is 5.


root@node01:/home/adrian# ceph orch ls
NAME   PORTS   RUNNING  REFRESHED  AGE PLACEMENT
alertmanager   1/1  16m ago    5h count:1
crash  3/3  16m ago    5h   *
grafana    1/1  16m ago    5h count:1
mgr    2/2  16m ago    5h count:2
mon    3/5  16m ago    10h count:5
node-exporter  3/3  16m ago    5h   *
osd.all-available-devices    12/15  16m ago    4h   *
prometheus 1/1  16m ago    5h count:1
rgw.digi1      ?:8000  3/3    16m ago    3h   node01;node02;node03;count:3


I've added the hosts using only the hostnames :

root@node01:/home/adrian# ceph orch host ls
HOST    ADDR  LABELS  STATUS
node01  192.168.80.2
node02  node02
node03  node03




On 5/23/2021 7:52 PM, 胡 玮文 wrote:

So the orchestrator is aware that the mon is stopped, but has not tried to bring it 
up again. What is the placement of the mon service shown in “ceph orch ls”? I explicitly 
set it to all host names (e.g. node01;node02;node03), and haven't experienced 
this.


On 24 May 2021, at 00:35, Adrian Nicolae wrote:

Hi,

I waited for more than a day on the first mon failure, it didn't resolve 
automatically.

I checked with 'ceph status'  and also the ceph.conf on that hosts and the 
failed mon was removed from the monmap.  The cluster reported only 2 mons 
(instead of 3) and the third mon was completely removed from config , it wasn't 
reported as a failure on 'ceph status'.



On 5/23/2021 7:30 PM, 胡 玮文 wrote:
Hi Adrian,

I have not tried, but I think it will resolve itself automatically after some 
minutes. How long have you waited before you do the manual redeploy?

Could you also try “ceph mon dump” to see whether mon.node03 is actually 
removed from monmap when it failed to start?


On 23 May 2021, at 16:40, Adrian Nicolae wrote:

Hi guys,

I'm testing Ceph Pacific 16.2.4 in my lab before deciding if I will put it in 
production on a 1PB+ storage cluster with rgw-only access.

I noticed a weird issue with my mons :

- if I reboot a mon host, the ceph-mon container is not starting after reboot

- I can see with 'ceph orch ps' the following output :

mon.node01   node01   running (20h)   4m ago   20h   16.2.4  8d91d370c2b8  0a2e86af94b2
mon.node02   node02   running (115m)  12s ago  115m  16.2.4  8d91d370c2b8  51f4885a1b06
mon.node03   node03   stopped         4m ago   19h

(where node03 is the host which was rebooted).

- I tried to start the mon container manually on node03 with '/bin/bash 
/var/lib/ceph/c2d41ac4-baf5-11eb-865d-2dc838a337a3/mon.node03/unit.run' and 
I've got the following output :

debug 2021-05-23T08:24:25.192+ 7f9a9e358700  0 mon.node03@-1(???).osd e408 
crush map has features 3314933069573799936, adjusting msgr requires
debug 2021-05-23T08:24:25.192+ 7f9a9e358700  0 mon.node03@-1(???).osd e408 
crush map has features 43262930805112, adjusting msgr requires
debug 2021-05-23T08:24:25.192+ 7f9a9e358700  0 mon.node03@-1(???).osd e408 
crush map has features 43262930805112, adjusting msgr requires
debug 2021-05-23T08:24:25.192+ 7f9a9e358700  0 mon.node03@-1(???).osd e408 
crush map has features 43262930805112, adjusting msgr requires
cluster 2021-05-23T08:07:12.189243+ mgr.node01.ksitls (mgr.14164) 36380 : 
cluster [DBG] pgmap v36392: 417 pgs: 417 active+clean; 33 KiB data, 605 MiB 
used, 651 GiB / 652 GiB avail; 9.6 KiB/s rd, 0 B/s wr, 15 op/s
debug 2021-05-23T08:24:25.196+ 7f9a9e358700  1 
mon.node03@-1(???).paxosservice(auth 1..51) refresh upgraded, format 0 -> 3
debug 2021-05-23T08:24:25.208+ 7f9a88176700  1 heartbeat_map reset_timeout 
'Monitor::cpu_tp thread 0x7f9a88176700' had timed out after 0.0s
debug 2021-05-23T08:24:25.208+ 7f9a9e358700  0 mon.node03@-1(probing) e5  
my rank is now 1 (was -1)
debug 2021-05-23T08:24:25.212+ 7f9a87975700  0 mon.node03@1(probing) e6  
removed from monmap, suicide.

root@node03:/home/adrian# systemctl status 
ceph-c2d41ac4-baf5-11eb-865d-2dc838a337a3@mon.node03.service
● ceph-c2d41ac4-baf5-11eb-865d-2dc838a337a3@mon.node03.service - Ceph 
mon.node03 for c2d41ac4-baf5-11eb-865d-2dc838a337a3
  Loaded: loaded 
(/etc/systemd/system/ceph-c2d41ac4-baf5-11eb-865d-2dc838a337a3@.service; 
enabled; vendor preset: enabled)
  Active: inactive (dead) since Sun 2021-05-23 08:10:00 UTC; 16min ago
 Process: 1176 ExecStart=/bin/bash 

[ceph-users] Re: Ceph Pacific mon is not starting after host reboot

2021-05-23 Thread Adrian Nicolae

Hi,

I waited for more than a day on the first mon failure, it didn't resolve 
automatically.


I checked with 'ceph status'  and also the ceph.conf on that hosts and 
the failed mon was removed from the monmap.  The cluster reported only 2 
mons (instead of 3) and the third mon was completely removed from config 
, it wasn't reported as a failure on 'ceph status'.



On 5/23/2021 7:30 PM, 胡 玮文 wrote:

Hi Adrian,

I have not tried, but I think it will resolve itself automatically after some 
minutes. How long have you waited before you do the manual redeploy?

Could you also try “ceph mon dump” to see whether mon.node03 is actually 
removed from monmap when it failed to start?


On 23 May 2021, at 16:40, Adrian Nicolae wrote:

Hi guys,

I'm testing Ceph Pacific 16.2.4 in my lab before deciding if I will put it in 
production on a 1PB+ storage cluster with rgw-only access.

I noticed a weird issue with my mons :

- if I reboot a mon host, the ceph-mon container is not starting after reboot

- I can see with 'ceph orch ps' the following output :

mon.node01   node01   running (20h)   4m ago   20h   16.2.4  8d91d370c2b8  0a2e86af94b2
mon.node02   node02   running (115m)  12s ago  115m  16.2.4  8d91d370c2b8  51f4885a1b06
mon.node03   node03   stopped         4m ago   19h

(where node03 is the host which was rebooted).

- I tried to start the mon container manually on node03 with '/bin/bash 
/var/lib/ceph/c2d41ac4-baf5-11eb-865d-2dc838a337a3/mon.node03/unit.run' and 
I've got the following output :

debug 2021-05-23T08:24:25.192+ 7f9a9e358700  0 mon.node03@-1(???).osd e408 
crush map has features 3314933069573799936, adjusting msgr requires
debug 2021-05-23T08:24:25.192+ 7f9a9e358700  0 mon.node03@-1(???).osd e408 
crush map has features 43262930805112, adjusting msgr requires
debug 2021-05-23T08:24:25.192+ 7f9a9e358700  0 mon.node03@-1(???).osd e408 
crush map has features 43262930805112, adjusting msgr requires
debug 2021-05-23T08:24:25.192+ 7f9a9e358700  0 mon.node03@-1(???).osd e408 
crush map has features 43262930805112, adjusting msgr requires
cluster 2021-05-23T08:07:12.189243+ mgr.node01.ksitls (mgr.14164) 36380 : 
cluster [DBG] pgmap v36392: 417 pgs: 417 active+clean; 33 KiB data, 605 MiB 
used, 651 GiB / 652 GiB avail; 9.6 KiB/s rd, 0 B/s wr, 15 op/s
debug 2021-05-23T08:24:25.196+ 7f9a9e358700  1 
mon.node03@-1(???).paxosservice(auth 1..51) refresh upgraded, format 0 -> 3
debug 2021-05-23T08:24:25.208+ 7f9a88176700  1 heartbeat_map reset_timeout 
'Monitor::cpu_tp thread 0x7f9a88176700' had timed out after 0.0s
debug 2021-05-23T08:24:25.208+ 7f9a9e358700  0 mon.node03@-1(probing) e5  
my rank is now 1 (was -1)
debug 2021-05-23T08:24:25.212+ 7f9a87975700  0 mon.node03@1(probing) e6  
removed from monmap, suicide.

root@node03:/home/adrian# systemctl status 
ceph-c2d41ac4-baf5-11eb-865d-2dc838a337a3@mon.node03.service
● ceph-c2d41ac4-baf5-11eb-865d-2dc838a337a3@mon.node03.service - Ceph 
mon.node03 for c2d41ac4-baf5-11eb-865d-2dc838a337a3
  Loaded: loaded 
(/etc/systemd/system/ceph-c2d41ac4-baf5-11eb-865d-2dc838a337a3@.service; 
enabled; vendor preset: enabled)
  Active: inactive (dead) since Sun 2021-05-23 08:10:00 UTC; 16min ago
 Process: 1176 ExecStart=/bin/bash 
/var/lib/ceph/c2d41ac4-baf5-11eb-865d-2dc838a337a3/mon.node03/unit.run 
(code=exited, status=0/SUCCESS)
 Process: 1855 ExecStop=/usr/bin/docker stop 
ceph-c2d41ac4-baf5-11eb-865d-2dc838a337a3-mon.node03 (code=exited, 
status=1/FAILURE)
 Process: 1861 ExecStopPost=/bin/bash 
/var/lib/ceph/c2d41ac4-baf5-11eb-865d-2dc838a337a3/mon.node03/unit.poststop 
(code=exited, status=0/SUCCESS)
Main PID: 1176 (code=exited, status=0/SUCCESS)

The only fix I could find was to redeploy the mon with :

ceph orch daemon rm  mon.node03 --force
ceph orch daemon add mon node03

However, even if it's working after redeploy, it's not giving me a lot of trust 
to use it in a production environment having an issue like that.  I could 
reproduce it with 2 different mons so it's not just an exception.

My setup is based on Ubuntu 20.04 and docker instead of podman :

root@node01:~# docker -v
Docker version 20.10.6, build 370c289

Do you know a workaround for this issue or is this a known bug ? I noticed that 
there are some other complaints with the same behaviour in Octopus as well and 
the solution at that time was to delete the /var/lib/ceph/mon folder .


Thanks.






___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph Pacific mon is not starting after host reboot

2021-05-23 Thread Adrian Nicolae

It's a fresh Pacific install with the default settings on all hosts :

root@node01:/home/adrian# ceph config show-with-defaults mon.node03 | grep msgr

mon_warn_on_msgr2_not_enabled true default
ms_bind_msgr1 true default
ms_bind_msgr2 true
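
Besides the config defaults, the monmap itself shows what is actually published for each mon; a quick check is:

  ceph mon dump

With both protocols enabled, each mon entry should list a v2 (port 3300) and a v1 (port 6789) address.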



On 5/23/2021 5:50 PM, Szabo, Istvan (Agoda) wrote:
Not sure it's the issue, but it complains about msgr, not msgr2. Do you 
have the v1 and v2 addresses in the ceph.conf on those specific OSDs?


Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com 
---

On 2021. May 23., at 15:40, Adrian Nicolae 
 wrote:


Hi guys,

I'm testing Ceph Pacific 16.2.4 in my lab before deciding if I will 
put it in production on a 1PB+ storage cluster with rgw-only access.


I noticed a weird issue with my mons :

- if I reboot a mon host, the ceph-mon container is not starting 
after reboot


- I can see with 'ceph orch ps' the following output :

mon.node01   node01   running (20h)   4m ago   20h   16.2.4  8d91d370c2b8  0a2e86af94b2
mon.node02   node02   running (115m)  12s ago  115m  16.2.4  8d91d370c2b8  51f4885a1b06
mon.node03   node03   stopped         4m ago   19h


(where node03 is the host which was rebooted).

- I tried to start the mon container manually on node03 with 
'/bin/bash 
/var/lib/ceph/c2d41ac4-baf5-11eb-865d-2dc838a337a3/mon.node03/unit.run' 
and I've got the following output :


debug 2021-05-23T08:24:25.192+ 7f9a9e358700  0 
mon.node03@-1(???).osd e408 crush map has features 
3314933069573799936, adjusting msgr requires
debug 2021-05-23T08:24:25.192+ 7f9a9e358700  0 
mon.node03@-1(???).osd e408 crush map has features 
43262930805112, adjusting msgr requires
debug 2021-05-23T08:24:25.192+ 7f9a9e358700  0 
mon.node03@-1(???).osd e408 crush map has features 
43262930805112, adjusting msgr requires
debug 2021-05-23T08:24:25.192+ 7f9a9e358700  0 
mon.node03@-1(???).osd e408 crush map has features 
43262930805112, adjusting msgr requires
cluster 2021-05-23T08:07:12.189243+ mgr.node01.ksitls (mgr.14164) 
36380 : cluster [DBG] pgmap v36392: 417 pgs: 417 active+clean; 33 KiB 
data, 605 MiB used, 651 GiB / 652 GiB avail; 9.6 KiB/s rd, 0 B/s wr, 
15 op/s
debug 2021-05-23T08:24:25.196+ 7f9a9e358700  1 
mon.node03@-1(???).paxosservice(auth 1..51) refresh upgraded, format 
0 -> 3
debug 2021-05-23T08:24:25.208+ 7f9a88176700  1 heartbeat_map 
reset_timeout 'Monitor::cpu_tp thread 0x7f9a88176700' had timed out 
after 0.0s
debug 2021-05-23T08:24:25.208+ 7f9a9e358700  0 
mon.node03@-1(probing) e5  my rank is now 1 (was -1)
debug 2021-05-23T08:24:25.212+ 7f9a87975700  0 
mon.node03@1(probing) e6  removed from monmap, suicide.


root@node03:/home/adrian# systemctl status 
ceph-c2d41ac4-baf5-11eb-865d-2dc838a337a3@mon.node03.service
● ceph-c2d41ac4-baf5-11eb-865d-2dc838a337a3@mon.node03.service - Ceph 
mon.node03 for c2d41ac4-baf5-11eb-865d-2dc838a337a3
 Loaded: loaded 
(/etc/systemd/system/ceph-c2d41ac4-baf5-11eb-865d-2dc838a337a3@.service; 
enabled; vendor preset: enabled)

 Active: inactive (dead) since Sun 2021-05-23 08:10:00 UTC; 16min ago
    Process: 1176 ExecStart=/bin/bash 
/var/lib/ceph/c2d41ac4-baf5-11eb-865d-2dc838a337a3/mon.node03/unit.run 
(code=exited, status=0/SUCCESS)
    Process: 1855 ExecStop=/usr/bin/docker stop 
ceph-c2d41ac4-baf5-11eb-865d-2dc838a337a3-mon.node03 (code=exited, 
status=1/FAILURE)
    Process: 1861 ExecStopPost=/bin/bash 
/var/lib/ceph/c2d41ac4-baf5-11eb-865d-2dc838a337a3/mon.node03/unit.poststop 
(code=exited, status=0/SUCCESS)

   Main PID: 1176 (code=exited, status=0/SUCCESS)

The only fix I could find was to redeploy the mon with :

ceph orch daemon rm  mon.node03 --force
ceph orch daemon add mon node03

However, even if it's working after redeploy, it's not giving me a 
lot of trust to use it in a production environment having an issue 
like that.  I could reproduce it with 2 different mons so it's not 
just an exception.


My setup is based on Ubuntu 20.04 and docker instead of podman :

root@node01:~# docker -v
Docker version 20.10.6, build 370c289

Do you know a workaround for this issue or is this a known bug ? I 
noticed that there are some other complaints with the same behaviour 
in Octopus as well and the solution at that time was to delete the 
/var/lib/ceph/mon folder .



Thanks.






___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io





  This message is confidential and is for the sole use of the
  intended recipient(s). It may also be privileged or
  otherwise protected by copyright or other legal rules. If
  you 

[ceph-users] Re: Ceph Pacific mon is not starting after host reboot

2021-05-23 Thread Szabo, Istvan (Agoda)
Not sure it's the issue, but it complains about msgr, not msgr2. Do you have the 
v1 and v2 addresses in the ceph.conf on those specific OSDs?

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---

On 2021. May 23., at 15:40, Adrian Nicolae  wrote:

Hi guys,

I'm testing Ceph Pacific 16.2.4 in my lab before deciding if I will put it in 
production on a 1PB+ storage cluster with rgw-only access.

I noticed a weird issue with my mons :

- if I reboot a mon host, the ceph-mon container is not starting after reboot

- I can see with 'ceph orch ps' the following output :

mon.node01   node01   running (20h)   4m ago   20h   16.2.4  8d91d370c2b8  0a2e86af94b2
mon.node02   node02   running (115m)  12s ago  115m  16.2.4  8d91d370c2b8  51f4885a1b06
mon.node03   node03   stopped         4m ago   19h

(where node03 is the host which was rebooted).

- I tried to start the mon container manually on node03 with '/bin/bash 
/var/lib/ceph/c2d41ac4-baf5-11eb-865d-2dc838a337a3/mon.node03/unit.run' and 
I've got the following output :

debug 2021-05-23T08:24:25.192+ 7f9a9e358700  0 mon.node03@-1(???).osd e408 
crush map has features 3314933069573799936, adjusting msgr requires
debug 2021-05-23T08:24:25.192+ 7f9a9e358700  0 mon.node03@-1(???).osd e408 
crush map has features 43262930805112, adjusting msgr requires
debug 2021-05-23T08:24:25.192+ 7f9a9e358700  0 mon.node03@-1(???).osd e408 
crush map has features 43262930805112, adjusting msgr requires
debug 2021-05-23T08:24:25.192+ 7f9a9e358700  0 mon.node03@-1(???).osd e408 
crush map has features 43262930805112, adjusting msgr requires
cluster 2021-05-23T08:07:12.189243+ mgr.node01.ksitls (mgr.14164) 36380 : 
cluster [DBG] pgmap v36392: 417 pgs: 417 active+clean; 33 KiB data, 605 MiB 
used, 651 GiB / 652 GiB avail; 9.6 KiB/s rd, 0 B/s wr, 15 op/s
debug 2021-05-23T08:24:25.196+ 7f9a9e358700  1 
mon.node03@-1(???).paxosservice(auth 1..51) refresh upgraded, format 0 -> 3
debug 2021-05-23T08:24:25.208+ 7f9a88176700  1 heartbeat_map reset_timeout 
'Monitor::cpu_tp thread 0x7f9a88176700' had timed out after 0.0s
debug 2021-05-23T08:24:25.208+ 7f9a9e358700  0 mon.node03@-1(probing) e5  
my rank is now 1 (was -1)
debug 2021-05-23T08:24:25.212+ 7f9a87975700  0 mon.node03@1(probing) e6  
removed from monmap, suicide.

root@node03:/home/adrian# systemctl status 
ceph-c2d41ac4-baf5-11eb-865d-2dc838a337a3@mon.node03.service
● ceph-c2d41ac4-baf5-11eb-865d-2dc838a337a3@mon.node03.service - Ceph 
mon.node03 for c2d41ac4-baf5-11eb-865d-2dc838a337a3
 Loaded: loaded 
(/etc/systemd/system/ceph-c2d41ac4-baf5-11eb-865d-2dc838a337a3@.service; 
enabled; vendor preset: enabled)
 Active: inactive (dead) since Sun 2021-05-23 08:10:00 UTC; 16min ago
Process: 1176 ExecStart=/bin/bash 
/var/lib/ceph/c2d41ac4-baf5-11eb-865d-2dc838a337a3/mon.node03/unit.run 
(code=exited, status=0/SUCCESS)
Process: 1855 ExecStop=/usr/bin/docker stop 
ceph-c2d41ac4-baf5-11eb-865d-2dc838a337a3-mon.node03 (code=exited, 
status=1/FAILURE)
Process: 1861 ExecStopPost=/bin/bash 
/var/lib/ceph/c2d41ac4-baf5-11eb-865d-2dc838a337a3/mon.node03/unit.poststop 
(code=exited, status=0/SUCCESS)
   Main PID: 1176 (code=exited, status=0/SUCCESS)

The only fix I could find was to redeploy the mon with :

ceph orch daemon rm  mon.node03 --force
ceph orch daemon add mon node03

However, even if it's working after redeploy, it's not giving me a lot of trust 
to use it in a production environment having an issue like that.  I could 
reproduce it with 2 different mons so it's not just an exception.

My setup is based on Ubuntu 20.04 and docker instead of podman :

root@node01:~# docker -v
Docker version 20.10.6, build 370c289

Do you know a workaround for this issue or is this a known bug ? I noticed that 
there are some other complaints with the same behaviour in Octopus as well and 
the solution at that time was to delete the /var/lib/ceph/mon folder .


Thanks.






___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


This message is confidential and is for the sole use of the intended 
recipient(s). It may also be privileged or otherwise protected by copyright or 
other legal rules. If you have received it by mistake please let us know by 
reply email and delete it from your system. It is prohibited to copy this 
message or disclose its content to anyone. Any confidentiality or privilege is 
not waived or lost by any mistaken delivery or unauthorized disclosure of the 
message. All messages sent to and from Agoda may be monitored to ensure 
compliance with company 

[ceph-users] Re: Ceph Pacific mon is not starting after host reboot

2021-05-23 Thread 胡 玮文
So the orchestrator is aware that the mon is stopped, but has not tried to bring it 
up again. What is the placement of the mon service shown in “ceph orch ls”? I explicitly 
set it to all host names (e.g. node01;node02;node03), and haven't experienced 
this.
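
A minimal sketch of that explicit placement as a cephadm service spec (the file name mon.yaml is arbitrary; the hostnames are the ones from this thread):

  service_type: mon
  placement:
    hosts:
      - node01
      - node02
      - node03

applied with:

  ceph orch apply -i mon.yaml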

> On 24 May 2021, at 00:35, Adrian Nicolae wrote:
> 
> Hi,
> 
> I waited for more than a day on the first mon failure, it didn't resolve 
> automatically.
> 
> I checked with 'ceph status'  and also the ceph.conf on that hosts and the 
> failed mon was removed from the monmap.  The cluster reported only 2 mons 
> (instead of 3) and the third mon was completely removed from config , it 
> wasn't reported as a failure on 'ceph status'.
> 
> 
>> On 5/23/2021 7:30 PM, 胡 玮文 wrote:
>> Hi Adrian,
>> 
>> I have not tried, but I think it will resolve itself automatically after 
>> some minutes. How long have you waited before you do the manual redeploy?
>> 
>> Could you also try “ceph mon dump” to see whether mon.node03 is actually 
>> removed from monmap when it failed to start?
>> 
On 23 May 2021, at 16:40, Adrian Nicolae wrote:
>>> 
>>> Hi guys,
>>> 
>>> I'm testing Ceph Pacific 16.2.4 in my lab before deciding if I will put it 
>>> in production on a 1PB+ storage cluster with rgw-only access.
>>> 
>>> I noticed a weird issue with my mons :
>>> 
>>> - if I reboot a mon host, the ceph-mon container is not starting after 
>>> reboot
>>> 
>>> - I can see with 'ceph orch ps' the following output :
>>> 
>>> mon.node01   node01   running (20h)   4m ago   20h   16.2.4  8d91d370c2b8  0a2e86af94b2
>>> mon.node02   node02   running (115m)  12s ago  115m  16.2.4  8d91d370c2b8  51f4885a1b06
>>> mon.node03   node03   stopped         4m ago   19h
>>> 
>>> (where node03 is the host which was rebooted).
>>> 
>>> - I tried to start the mon container manually on node03 with '/bin/bash 
>>> /var/lib/ceph/c2d41ac4-baf5-11eb-865d-2dc838a337a3/mon.node03/unit.run' and 
>>> I've got the following output :
>>> 
>>> debug 2021-05-23T08:24:25.192+ 7f9a9e358700  0 mon.node03@-1(???).osd 
>>> e408 crush map has features 3314933069573799936, adjusting msgr requires
>>> debug 2021-05-23T08:24:25.192+ 7f9a9e358700  0 mon.node03@-1(???).osd 
>>> e408 crush map has features 43262930805112, adjusting msgr requires
>>> debug 2021-05-23T08:24:25.192+ 7f9a9e358700  0 mon.node03@-1(???).osd 
>>> e408 crush map has features 43262930805112, adjusting msgr requires
>>> debug 2021-05-23T08:24:25.192+ 7f9a9e358700  0 mon.node03@-1(???).osd 
>>> e408 crush map has features 43262930805112, adjusting msgr requires
>>> cluster 2021-05-23T08:07:12.189243+ mgr.node01.ksitls (mgr.14164) 36380 
>>> : cluster [DBG] pgmap v36392: 417 pgs: 417 active+clean; 33 KiB data, 605 
>>> MiB used, 651 GiB / 652 GiB avail; 9.6 KiB/s rd, 0 B/s wr, 15 op/s
>>> debug 2021-05-23T08:24:25.196+ 7f9a9e358700  1 
>>> mon.node03@-1(???).paxosservice(auth 1..51) refresh upgraded, format 0 -> 3
>>> debug 2021-05-23T08:24:25.208+ 7f9a88176700  1 heartbeat_map 
>>> reset_timeout 'Monitor::cpu_tp thread 0x7f9a88176700' had timed out after 
>>> 0.0s
>>> debug 2021-05-23T08:24:25.208+ 7f9a9e358700  0 mon.node03@-1(probing) 
>>> e5  my rank is now 1 (was -1)
>>> debug 2021-05-23T08:24:25.212+ 7f9a87975700  0 mon.node03@1(probing) e6 
>>>  removed from monmap, suicide.
>>> 
>>> root@node03:/home/adrian# systemctl status 
>>> ceph-c2d41ac4-baf5-11eb-865d-2dc838a337a3@mon.node03.service
>>> ● ceph-c2d41ac4-baf5-11eb-865d-2dc838a337a3@mon.node03.service - Ceph 
>>> mon.node03 for c2d41ac4-baf5-11eb-865d-2dc838a337a3
>>>  Loaded: loaded 
>>> (/etc/systemd/system/ceph-c2d41ac4-baf5-11eb-865d-2dc838a337a3@.service; 
>>> enabled; vendor preset: enabled)
>>>  Active: inactive (dead) since Sun 2021-05-23 08:10:00 UTC; 16min ago
>>> Process: 1176 ExecStart=/bin/bash 
>>> /var/lib/ceph/c2d41ac4-baf5-11eb-865d-2dc838a337a3/mon.node03/unit.run 
>>> (code=exited, status=0/SUCCESS)
>>> Process: 1855 ExecStop=/usr/bin/docker stop 
>>> ceph-c2d41ac4-baf5-11eb-865d-2dc838a337a3-mon.node03 (code=exited, 
>>> status=1/FAILURE)
>>> Process: 1861 ExecStopPost=/bin/bash 
>>> /var/lib/ceph/c2d41ac4-baf5-11eb-865d-2dc838a337a3/mon.node03/unit.poststop 
>>> (code=exited, status=0/SUCCESS)
>>>Main PID: 1176 (code=exited, status=0/SUCCESS)
>>> 
>>> The only fix I could find was to redeploy the mon with :
>>> 
>>> ceph orch daemon rm  mon.node03 --force
>>> ceph orch daemon add mon node03
>>> 
>>> However, even if it's working after redeploy, it's not giving me a lot of 
>>> trust to use it in a production environment having an issue like that.  I 
>>> could reproduce it with 2 different mons so it's not just an exception.
>>> 
>>> My setup is based on Ubuntu 20.04 and docker instead of podman :
>>> 
>>> root@node01:~# docker -v
>>> Docker version 20.10.6, build 370c289
>>> 
>>> Do you know a workaround 

[ceph-users] Re: Ceph Pacific mon is not starting after host reboot

2021-05-23 Thread 胡 玮文
Hi Adrian,

I have not tried, but I think it will resolve itself automatically after some 
minutes. How long have you waited before you do the manual redeploy?

Could you also try “ceph mon dump” to see whether mon.node03 is actually 
removed from monmap when it failed to start?

> On 23 May 2021, at 16:40, Adrian Nicolae wrote:
> 
> Hi guys,
> 
> I'm testing Ceph Pacific 16.2.4 in my lab before deciding if I will put it in 
> production on a 1PB+ storage cluster with rgw-only access.
> 
> I noticed a weird issue with my mons :
> 
> - if I reboot a mon host, the ceph-mon container is not starting after reboot
> 
> - I can see with 'ceph orch ps' the following output :
> 
> mon.node01   node01   running (20h)   4m ago   20h   16.2.4  8d91d370c2b8  0a2e86af94b2
> mon.node02   node02   running (115m)  12s ago  115m  16.2.4  8d91d370c2b8  51f4885a1b06
> mon.node03   node03   stopped         4m ago   19h
> 
> (where node03 is the host which was rebooted).
> 
> - I tried to start the mon container manually on node03 with '/bin/bash 
> /var/lib/ceph/c2d41ac4-baf5-11eb-865d-2dc838a337a3/mon.node03/unit.run' and 
> I've got the following output :
> 
> debug 2021-05-23T08:24:25.192+ 7f9a9e358700  0 mon.node03@-1(???).osd 
> e408 crush map has features 3314933069573799936, adjusting msgr requires
> debug 2021-05-23T08:24:25.192+ 7f9a9e358700  0 mon.node03@-1(???).osd 
> e408 crush map has features 43262930805112, adjusting msgr requires
> debug 2021-05-23T08:24:25.192+ 7f9a9e358700  0 mon.node03@-1(???).osd 
> e408 crush map has features 43262930805112, adjusting msgr requires
> debug 2021-05-23T08:24:25.192+ 7f9a9e358700  0 mon.node03@-1(???).osd 
> e408 crush map has features 43262930805112, adjusting msgr requires
> cluster 2021-05-23T08:07:12.189243+ mgr.node01.ksitls (mgr.14164) 36380 : 
> cluster [DBG] pgmap v36392: 417 pgs: 417 active+clean; 33 KiB data, 605 MiB 
> used, 651 GiB / 652 GiB avail; 9.6 KiB/s rd, 0 B/s wr, 15 op/s
> debug 2021-05-23T08:24:25.196+ 7f9a9e358700  1 
> mon.node03@-1(???).paxosservice(auth 1..51) refresh upgraded, format 0 -> 3
> debug 2021-05-23T08:24:25.208+ 7f9a88176700  1 heartbeat_map 
> reset_timeout 'Monitor::cpu_tp thread 0x7f9a88176700' had timed out after 
> 0.0s
> debug 2021-05-23T08:24:25.208+ 7f9a9e358700  0 mon.node03@-1(probing) e5  
> my rank is now 1 (was -1)
> debug 2021-05-23T08:24:25.212+ 7f9a87975700  0 mon.node03@1(probing) e6  
> removed from monmap, suicide.
> 
> root@node03:/home/adrian# systemctl status 
> ceph-c2d41ac4-baf5-11eb-865d-2dc838a337a3@mon.node03.service
> ● ceph-c2d41ac4-baf5-11eb-865d-2dc838a337a3@mon.node03.service - Ceph 
> mon.node03 for c2d41ac4-baf5-11eb-865d-2dc838a337a3
>  Loaded: loaded 
> (/etc/systemd/system/ceph-c2d41ac4-baf5-11eb-865d-2dc838a337a3@.service; 
> enabled; vendor preset: enabled)
>  Active: inactive (dead) since Sun 2021-05-23 08:10:00 UTC; 16min ago
> Process: 1176 ExecStart=/bin/bash 
> /var/lib/ceph/c2d41ac4-baf5-11eb-865d-2dc838a337a3/mon.node03/unit.run 
> (code=exited, status=0/SUCCESS)
> Process: 1855 ExecStop=/usr/bin/docker stop 
> ceph-c2d41ac4-baf5-11eb-865d-2dc838a337a3-mon.node03 (code=exited, 
> status=1/FAILURE)
> Process: 1861 ExecStopPost=/bin/bash 
> /var/lib/ceph/c2d41ac4-baf5-11eb-865d-2dc838a337a3/mon.node03/unit.poststop 
> (code=exited, status=0/SUCCESS)
>Main PID: 1176 (code=exited, status=0/SUCCESS)
> 
> The only fix I could find was to redeploy the mon with :
> 
> ceph orch daemon rm  mon.node03 --force
> ceph orch daemon add mon node03
> 
> However, even if it's working after redeploy, it's not giving me a lot of 
> trust to use it in a production environment having an issue like that.  I 
> could reproduce it with 2 different mons so it's not just an exception.
> 
> My setup is based on Ubuntu 20.04 and docker instead of podman :
> 
> root@node01:~# docker -v
> Docker version 20.10.6, build 370c289
> 
> Do you know a workaround for this issue or is this a known bug ? I noticed 
> that there are some other complaints with the same behaviour in Octopus as 
> well and the solution at that time was to delete the /var/lib/ceph/mon folder 
> .
> 
> 
> Thanks.
> 
> 
> 
> 
> 
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io