[ceph-users] Re: ceph progress bar stuck and 3rd manager not deploying

2021-09-08 Thread David Orman
This sounds a lot like: https://tracker.ceph.com/issues/51027 which is
fixed in https://github.com/ceph/ceph/pull/42690
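
If you want to confirm you are hitting that bug before upgrading, something
like the following should show the symptom (a rough sketch, not specific to
your cluster; "ceph log last cephadm" assumes the cephadm module still logs
to the cluster log, which is the default):

# recent messages from the cephadm mgr module, including any serve-loop traceback
ceph log last cephadm

# force the orchestrator to refresh its daemon inventory
ceph orch ps --refresh

# running version of every daemon, to compare against the release carrying the fix
ceph versions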

David

On Tue, Sep 7, 2021 at 7:31 AM mabi  wrote:
>
> Hello
>
> I have a test Ceph Pacific 16.2.5 cluster deployed with cephadm on 7
> bare-metal Ubuntu 20.04 LTS nodes. I just upgraded each node's kernel and
> performed a rolling reboot, and now the ceph -s output is stuck somehow and
> the manager service is only deployed to two nodes instead of three. Here is
> the ceph -s output:
>
>   cluster:
>     id:     fb48d256-f43d-11eb-9f74-7fd39d4b232a
>     health: HEALTH_WARN
>             OSD count 1 < osd_pool_default_size 3
>
>   services:
>     mon: 2 daemons, quorum ceph1a,ceph1c (age 25m)
>     mgr: ceph1a.guidwn(active, since 25m), standbys: ceph1c.bttxuu
>     osd: 1 osds: 1 up (since 30m), 1 in (since 3w)
>
>   data:
>     pools:   0 pools, 0 pgs
>     objects: 0 objects, 0 B
>     usage:   5.3 MiB used, 7.0 TiB / 7.0 TiB avail
>     pgs:
>
>   progress:
>     Updating crash deployment (-1 -> 6) (0s)
>       []
>
> Ignore the HEALTH_WARN about the OSD count; I have not finished deploying
> all 3 OSDs. But you can see that the progress bar is stuck and I only have
> 2 managers; the third manager does not seem to start, as can be seen here:
>
> $ ceph orch ps|grep stopped
> mon.ceph1b  ceph1b  stopped  4m ago  4w  -  2048M
>
> It looks like the orchestrator is stuck and does not continue its job. Any
> idea how I can get it unstuck?
>
> Best regards,
> Mabi
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph progress bar stuck and 3rd manager not deploying

2021-09-08 Thread David Orman
I forgot to mention that the progress bar not updating is a separate bug;
you can fail the mgr (ceph mgr fail ceph1a.guidwn in your example) to
resolve that. On the monitor side, I assume you deployed using labels?
If so, just remove the label from the host where the monitor did not
start, let it fully undeploy, then re-add the label, and it will
redeploy.
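
Roughly, that sequence would look like this (just a sketch, assuming the mon
service really is placed by a "mon" label and the affected host is ceph1b,
as your ceph orch ps output suggests):

# fail over to the standby mgr so the progress/cephadm modules restart
ceph mgr fail ceph1a.guidwn

# drop the label so cephadm undeploys the stopped mon on ceph1b
ceph orch host label rm ceph1b mon

# wait until mon.ceph1b disappears from the daemon list
ceph orch ps | grep mon.ceph1b

# re-add the label so cephadm redeploys the mon
ceph orch host label add ceph1b mon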

On Wed, Sep 8, 2021 at 7:03 AM David Orman  wrote:
>
> This sounds a lot like: https://tracker.ceph.com/issues/51027 which is
> fixed in https://github.com/ceph/ceph/pull/42690
>
> David
>
> On Tue, Sep 7, 2021 at 7:31 AM mabi  wrote:
> >
> > Hello
> >
> > I have a test Ceph Pacific 16.2.5 cluster deployed with cephadm on 7
> > bare-metal Ubuntu 20.04 LTS nodes. I just upgraded each node's kernel and
> > performed a rolling reboot, and now the ceph -s output is stuck somehow
> > and the manager service is only deployed to two nodes instead of three.
> > Here is the ceph -s output:
> >
> >   cluster:
> >     id:     fb48d256-f43d-11eb-9f74-7fd39d4b232a
> >     health: HEALTH_WARN
> >             OSD count 1 < osd_pool_default_size 3
> >
> >   services:
> >     mon: 2 daemons, quorum ceph1a,ceph1c (age 25m)
> >     mgr: ceph1a.guidwn(active, since 25m), standbys: ceph1c.bttxuu
> >     osd: 1 osds: 1 up (since 30m), 1 in (since 3w)
> >
> >   data:
> >     pools:   0 pools, 0 pgs
> >     objects: 0 objects, 0 B
> >     usage:   5.3 MiB used, 7.0 TiB / 7.0 TiB avail
> >     pgs:
> >
> >   progress:
> >     Updating crash deployment (-1 -> 6) (0s)
> >       []
> >
> > Ignore the HEALTH_WARN about the OSD count; I have not finished deploying
> > all 3 OSDs. But you can see that the progress bar is stuck and I only
> > have 2 managers; the third manager does not seem to start, as can be
> > seen here:
> >
> > $ ceph orch ps|grep stopped
> > mon.ceph1b  ceph1b  stopped  4m ago  4w  -  2048M
> >
> > It looks like the orchestrator is stuck and does not continue its job.
> > Any idea how I can get it unstuck?
> >
> > Best regards,
> > Mabi
> >
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph progress bar stuck and 3rd manager not deploying

2021-09-09 Thread mabi
Hello,

A few days later the ceph status progress bar is still stuck and the third mon 
is for some unknown reason still not deploying itself as can be seen from the 
"ceph orch ls" output below:

 ceph orch ls
NAME           PORTS        RUNNING  REFRESHED  AGE  PLACEMENT
alertmanager   ?:9093,9094  1/1      3m ago     5w   count:1
crash                       7/7      3m ago     5w   *
grafana        ?:3000       1/1      3m ago     5w   count:1
mgr                         2/2      3m ago     4w   count:2;label:mgr
mon                         2/3      3m ago     16h  count:3;label:mon
node-exporter  ?:9100       7/7      3m ago     5w   *
osd                         1/1      3m ago     -
prometheus     ?:9095       1/1      3m ago     5w   count:1
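
For reference, the mon placement spec and the host labels can be
cross-checked with something like this (a sketch; it assumes the label
really is "mon", as the placement column above suggests):

# full service specification for the mon service, including its placement
ceph orch ls mon --export

# hosts and their labels, to confirm the third mon host still carries the label
ceph orch host ls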

Is this a bug in cephadm, and is there a workaround?

Thanks for any hints.

‐‐‐ Original Message ‐‐‐

On Tuesday, September 7th, 2021 at 2:30 PM, mabi  wrote:

> Hello
>
> I have a test Ceph Pacific 16.2.5 cluster deployed with cephadm on 7
> bare-metal Ubuntu 20.04 LTS nodes. I just upgraded each node's kernel and
> performed a rolling reboot, and now the ceph -s output is stuck somehow and
> the manager service is only deployed to two nodes instead of three. Here is
> the ceph -s output:
>
>   cluster:
>     id:     fb48d256-f43d-11eb-9f74-7fd39d4b232a
>     health: HEALTH_WARN
>             OSD count 1 < osd_pool_default_size 3
>
>   services:
>     mon: 2 daemons, quorum ceph1a,ceph1c (age 25m)
>     mgr: ceph1a.guidwn(active, since 25m), standbys: ceph1c.bttxuu
>     osd: 1 osds: 1 up (since 30m), 1 in (since 3w)
>
>   data:
>     pools:   0 pools, 0 pgs
>     objects: 0 objects, 0 B
>     usage:   5.3 MiB used, 7.0 TiB / 7.0 TiB avail
>     pgs:
>
>   progress:
>     Updating crash deployment (-1 -> 6) (0s)
>       []
>
>
> Ignore the HEALTH_WARN about the OSD count; I have not finished deploying
> all 3 OSDs. But you can see that the progress bar is stuck and I only have
> 2 managers; the third manager does not seem to start, as can be seen here:
>
> $ ceph orch ps|grep stopped
>
> mon.ceph1b ceph1b stopped 4m ago 4w - 2048M   
>
> It looks like the orchestrator is stuck and does not continue its job. Any
> idea how I can get it unstuck?
>
> Best regards,
>
> Mabi
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph progress bar stuck and 3rd manager not deploying

2021-09-09 Thread Eugen Block

You must have missed the response to your thread, I suppose:

https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/YA5KLI5MFJRKVQBKUBG7PJG4RFYLBZFA/


Zitat von mabi :


Hello,

A few days later the ceph status progress bar is still stuck and the  
third mon is for some unknown reason still not deploying itself as  
can be seen from the "ceph orch ls" output below:


 ceph orch ls
NAME           PORTS        RUNNING  REFRESHED  AGE  PLACEMENT
alertmanager   ?:9093,9094  1/1      3m ago     5w   count:1
crash                       7/7      3m ago     5w   *
grafana        ?:3000       1/1      3m ago     5w   count:1
mgr                         2/2      3m ago     4w   count:2;label:mgr
mon                         2/3      3m ago     16h  count:3;label:mon
node-exporter  ?:9100       7/7      3m ago     5w   *
osd                         1/1      3m ago     -
prometheus     ?:9095       1/1      3m ago     5w   count:1

Is this a bug in cephadm, and is there a workaround?

Thanks for any hints.

‐‐‐ Original Message ‐‐‐

On Tuesday, September 7th, 2021 at 2:30 PM, mabi  wrote:


Hello

I have a test Ceph Pacific 16.2.5 cluster deployed with cephadm on 7
bare-metal Ubuntu 20.04 LTS nodes. I just upgraded each node's kernel and
performed a rolling reboot, and now the ceph -s output is stuck somehow and
the manager service is only deployed to two nodes instead of three. Here is
the ceph -s output:


  cluster:
    id:     fb48d256-f43d-11eb-9f74-7fd39d4b232a
    health: HEALTH_WARN
            OSD count 1 < osd_pool_default_size 3

  services:
    mon: 2 daemons, quorum ceph1a,ceph1c (age 25m)
    mgr: ceph1a.guidwn(active, since 25m), standbys: ceph1c.bttxuu
    osd: 1 osds: 1 up (since 30m), 1 in (since 3w)

  data:
    pools:   0 pools, 0 pgs
    objects: 0 objects, 0 B
    usage:   5.3 MiB used, 7.0 TiB / 7.0 TiB avail
    pgs:

  progress:
    Updating crash deployment (-1 -> 6) (0s)
      []


Ignore the HEALTH_WARN about the OSD count; I have not finished deploying
all 3 OSDs. But you can see that the progress bar is stuck and I only have
2 managers; the third manager does not seem to start, as can be seen here:


$ ceph orch ps|grep stopped

mon.ceph1b ceph1b stopped 4m ago 4w - 2048M   

It looks like the orchestrator is stuck and does not continue its job. Any
idea how I can get it unstuck?


Best regards,

Mabi

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph progress bar stuck and 3rd manager not deploying

2021-09-09 Thread David Orman
No problem, and it looks like they will. Glad it worked out for you!

David

On Thu, Sep 9, 2021 at 9:31 AM mabi  wrote:
>
> Thank you Eugen. Indeed the answer went to Spam :(
>
> So thanks to David for his workaround; it worked like a charm. Hopefully
> these patches can make it into the next pacific release.
>
> ‐‐‐ Original Message ‐‐‐
>
> On Thursday, September 9th, 2021 at 2:33 PM, Eugen Block  
> wrote:
>
> > You must have missed the response to your thread, I suppose:
> >
> > https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/YA5KLI5MFJRKVQBKUBG7PJG4RFYLBZFA/
> >
> > Zitat von mabi m...@protonmail.ch:
> >
> > > Hello,
> > >
> > > A few days later the ceph status progress bar is still stuck and the
> > > third mon is for some unknown reason still not deploying itself as
> > > can be seen from the "ceph orch ls" output below:
> > >
> > > ceph orch ls
> > > NAME           PORTS        RUNNING  REFRESHED  AGE  PLACEMENT
> > > alertmanager   ?:9093,9094  1/1      3m ago     5w   count:1
> > > crash                       7/7      3m ago     5w   *
> > > grafana        ?:3000       1/1      3m ago     5w   count:1
> > > mgr                         2/2      3m ago     4w   count:2;label:mgr
> > > mon                         2/3      3m ago     16h  count:3;label:mon
> > > node-exporter  ?:9100       7/7      3m ago     5w   *
> > > osd                         1/1      3m ago     -
> > > prometheus     ?:9095       1/1      3m ago     5w   count:1
> > >
> > > Is this a bug in cephadm, and is there a workaround?
> > >
> > > Thanks for any hints.
> > >
> > > ‐‐‐ Original Message ‐‐‐
> > >
> > > On Tuesday, September 7th, 2021 at 2:30 PM, mabi m...@protonmail.ch wrote:
> > >
> > > > Hello
> > > >
> > > > I have a test Ceph Pacific 16.2.5 cluster deployed with cephadm on
> > > > 7 bare-metal Ubuntu 20.04 LTS nodes. I just upgraded each node's
> > > > kernel and performed a rolling reboot, and now the ceph -s output
> > > > is stuck somehow and the manager service is only deployed to two
> > > > nodes instead of three. Here is the ceph -s output:
> > > >
> > > >   cluster:
> > > >     id:     fb48d256-f43d-11eb-9f74-7fd39d4b232a
> > > >     health: HEALTH_WARN
> > > >             OSD count 1 < osd_pool_default_size 3
> > > >
> > > >   services:
> > > >     mon: 2 daemons, quorum ceph1a,ceph1c (age 25m)
> > > >     mgr: ceph1a.guidwn(active, since 25m), standbys: ceph1c.bttxuu
> > > >     osd: 1 osds: 1 up (since 30m), 1 in (since 3w)
> > > >
> > > >   data:
> > > >     pools:   0 pools, 0 pgs
> > > >     objects: 0 objects, 0 B
> > > >     usage:   5.3 MiB used, 7.0 TiB / 7.0 TiB avail
> > > >     pgs:
> > > >
> > > >   progress:
> > > >     Updating crash deployment (-1 -> 6) (0s)
> > > >       []
> > > >
> > > > Ignore the HEALTH_WARN about the OSD count; I have not finished
> > > > deploying all 3 OSDs. But you can see that the progress bar is
> > > > stuck and I only have 2 managers; the third manager does not seem
> > > > to start, as can be seen here:
> > > >
> > > > $ ceph orch ps|grep stopped
> > > > mon.ceph1b  ceph1b  stopped  4m ago  4w  -  2048M
> > > >
> > > > It looks like the orchestrator is stuck and does not continue its
> > > > job. Any idea how I can get it unstuck?
> > > >
> > > > Best regards,
> > > > Mabi
> > >
> > > ceph-users mailing list -- ceph-users@ceph.io
> > >
> > > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
> > ceph-users mailing list -- ceph-users@ceph.io
> >
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph progress bar stuck and 3rd manager not deploying

2021-09-09 Thread mabi
Thank you Eugen. Indeed the answer went to Spam :(

So thanks to David for his workaround; it worked like a charm. Hopefully these
patches can make it into the next pacific release.

‐‐‐ Original Message ‐‐‐

On Thursday, September 9th, 2021 at 2:33 PM, Eugen Block  wrote:

> You must have missed the response to your thread, I suppose:
>
> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/YA5KLI5MFJRKVQBKUBG7PJG4RFYLBZFA/
>
> Zitat von mabi m...@protonmail.ch:
>
> > Hello,
> >
> > A few days later the ceph status progress bar is still stuck and the
> > third mon is for some unknown reason still not deploying itself as
> > can be seen from the "ceph orch ls" output below:
> >
> > ceph orch ls
> > NAME           PORTS        RUNNING  REFRESHED  AGE  PLACEMENT
> > alertmanager   ?:9093,9094  1/1      3m ago     5w   count:1
> > crash                       7/7      3m ago     5w   *
> > grafana        ?:3000       1/1      3m ago     5w   count:1
> > mgr                         2/2      3m ago     4w   count:2;label:mgr
> > mon                         2/3      3m ago     16h  count:3;label:mon
> > node-exporter  ?:9100       7/7      3m ago     5w   *
> > osd                         1/1      3m ago     -
> > prometheus     ?:9095       1/1      3m ago     5w   count:1
> >
> > Is this a bug in cephadm, and is there a workaround?
> >
> > Thanks for any hints.
> >
> > ‐‐‐ Original Message ‐‐‐
> >
> > On Tuesday, September 7th, 2021 at 2:30 PM, mabi m...@protonmail.ch wrote:
> >
> > > Hello
> > >
> > > I have a test Ceph Pacific 16.2.5 cluster deployed with cephadm on
> > > 7 bare-metal Ubuntu 20.04 LTS nodes. I just upgraded each node's
> > > kernel and performed a rolling reboot, and now the ceph -s output
> > > is stuck somehow and the manager service is only deployed to two
> > > nodes instead of three. Here is the ceph -s output:
> > >
> > >   cluster:
> > >     id:     fb48d256-f43d-11eb-9f74-7fd39d4b232a
> > >     health: HEALTH_WARN
> > >             OSD count 1 < osd_pool_default_size 3
> > >
> > >   services:
> > >     mon: 2 daemons, quorum ceph1a,ceph1c (age 25m)
> > >     mgr: ceph1a.guidwn(active, since 25m), standbys: ceph1c.bttxuu
> > >     osd: 1 osds: 1 up (since 30m), 1 in (since 3w)
> > >
> > >   data:
> > >     pools:   0 pools, 0 pgs
> > >     objects: 0 objects, 0 B
> > >     usage:   5.3 MiB used, 7.0 TiB / 7.0 TiB avail
> > >     pgs:
> > >
> > >   progress:
> > >     Updating crash deployment (-1 -> 6) (0s)
> > >       []
> > >
> > > Ignore the HEALTH_WARN about the OSD count; I have not finished
> > > deploying all 3 OSDs. But you can see that the progress bar is
> > > stuck and I only have 2 managers; the third manager does not seem
> > > to start, as can be seen here:
> > >
> > > $ ceph orch ps|grep stopped
> > > mon.ceph1b  ceph1b  stopped  4m ago  4w  -  2048M
> > >
> > > It looks like the orchestrator is stuck and does not continue its
> > > job. Any idea how I can get it unstuck?
> > >
> > > Best regards,
> > > Mabi
> >
> > ceph-users mailing list -- ceph-users@ceph.io
> >
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
> ceph-users mailing list -- ceph-users@ceph.io
>
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io