[ceph-users] Re: cephadm orchestrator does not restart daemons [was: ceph orch upgrade stuck between 16.2.7 and 16.2.13]

2023-08-16 Thread Adam King
I've seen this before: a hanging ceph-volume process causes the whole
cephadm serve loop to get stuck. (We have a patch that makes it time out
properly in Reef and are backporting it to Quincy, but unfortunately
there is nothing for Pacific.) That's why I was asking about the
REFRESHED column in the "ceph orch ps" / "ceph orch device ls" output.
Typically, when this happens, the REFRESHED column shows that nothing has
been refreshed since the ceph-volume process started hanging. Either way,
if you killed those ceph-volume processes, any new ones aren't hanging,
and the serve loop is running okay, I'd expect the issues to clear up.
The hang could (and most likely did) cause both the daemon restarts not
to happen and the upgrade not to progress.
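
To check an OSD host for the kind of leftover ceph-volume processes
described above, something like the following sketch might help. It
assumes a procps-style ps that supports the etimes field. The function
name find_stale_ceph_volume and the 1800-second cut-off are made up for
illustration and are not part of cephadm.

```shell
#!/bin/sh
# Sketch only: print ceph-volume processes that have been running longer
# than a threshold. Input is read from stdin in the format produced by
#   ps -eo pid=,etimes=,args=
# (PID, elapsed seconds, full command line).
# The default of 1800 seconds is an assumed cut-off, not a cephadm setting.
find_stale_ceph_volume() {
    max_age="${1:-1800}"
    awk -v max="$max_age" '
        # Field 2 is the elapsed time in seconds; the pattern is matched
        # against the whole line, which includes the command text.
        $2 + 0 > max && /ceph-volume/ { print }
    '
}

# Usage on an OSD host (review the list before killing anything):
#   ps -eo pid=,etimes=,args= | find_stale_ceph_volume 1800
```

Because of the age threshold, the short-lived awk process that carries
"ceph-volume" in its own command line does not show up in the result.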

On Wed, Aug 16, 2023 at 8:50 AM Robert Sander wrote:

> On 8/16/23 12:10, Eugen Block wrote:
> > I don't really have a good idea right now, but there was a thread [1]
> > about ssh sessions that are not removed, maybe that could have such an
> > impact? And if you crank up the debug level to 30, do you see anything
> > else?
>
> It was something similar. There were leftover ceph-volume processes
> running on some of the OSD nodes. After killing them, the cephadm
> orchestrator is now able to resume the upgrade.
>
> As we also restarted the MGR processes (with systemctl restart
> CONTAINER), there were no leftover SSH sessions.
>
> But the ceph-volume processes that were still running must have held
> a lock that blocked new cephadm commands.
>
> Regards
> --
> Robert Sander
> Heinlein Consulting GmbH
> Schwedter Str. 8/9b, 10119 Berlin
>
> https://www.heinlein-support.de
>
> Tel: 030 / 405051-43
> Fax: 030 / 405051-19
>
> Amtsgericht Berlin-Charlottenburg - HRB 220009 B
> Geschäftsführer: Peer Heinlein - Sitz: Berlin
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>


[ceph-users] Re: cephadm orchestrator does not restart daemons [was: ceph orch upgrade stuck between 16.2.7 and 16.2.13]

2023-08-16 Thread Eugen Block
Great, thanks for the update! Just yesterday I wanted to clean up a
couple of test clusters and remove some old container images which
seemed to still be in use although several upgrades had been
processed. What was still using them were quite old ceph-volume
inventory processes, dating back to the initial cluster bootstrap. But
obviously they didn't have the kind of impact you describe. Anyway,
good to know that it's not a major issue, so I can upgrade our cluster
as well. Although I'm still waiting for a PR that hasn't made it into
the latest Pacific release yet, so maybe I'll wait a bit longer.


Thanks!
Eugen



[ceph-users] Re: cephadm orchestrator does not restart daemons [was: ceph orch upgrade stuck between 16.2.7 and 16.2.13]

2023-08-16 Thread Robert Sander

On 8/16/23 12:10, Eugen Block wrote:

> I don't really have a good idea right now, but there was a thread [1]
> about ssh sessions that are not removed, maybe that could have such an
> impact? And if you crank up the debug level to 30, do you see anything
> else?

It was something similar. There were leftover ceph-volume processes
running on some of the OSD nodes. After killing them, the cephadm
orchestrator is now able to resume the upgrade.

As we also restarted the MGR processes (with systemctl restart
CONTAINER), there were no leftover SSH sessions.

But the ceph-volume processes that were still running must have held
a lock that blocked new cephadm commands.


Regards
--
Robert Sander


[ceph-users] Re: cephadm orchestrator does not restart daemons [was: ceph orch upgrade stuck between 16.2.7 and 16.2.13]

2023-08-16 Thread Eugen Block
I don't really have a good idea right now, but there was a thread [1]  
about ssh sessions that are not removed, maybe that could have such an  
impact? And if you crank up the debug level to 30, do you see anything  
else?


ceph config set mgr debug_mgr 30
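
Since that much debug output is noisy, it is probably worth removing
the override again once the logs have been collected. A minimal sketch
of a toggle follows; the helper name mgr_debug and the CEPH variable
(which allows a dry run with CEPH=echo) are assumptions made for
illustration, while "ceph config set" and "ceph config rm" are the
regular config commands.

```shell
#!/bin/sh
# Sketch only: toggle the mgr debug level suggested above.
# CEPH is a hypothetical override so the commands can be previewed
# (CEPH=echo) instead of being run against a cluster.
CEPH="${CEPH:-ceph}"

mgr_debug() {
    case "$1" in
        on)
            # Raise the mgr debug level (30 as suggested in this thread).
            $CEPH config set mgr debug_mgr "${2:-30}"
            ;;
        off)
            # Drop the override so the mgr falls back to its default level.
            $CEPH config rm mgr debug_mgr
            ;;
        *)
            echo "usage: mgr_debug on [level] | off" >&2
            return 2
            ;;
    esac
}

# Usage:
#   mgr_debug on      # then reproduce the problem and collect the mgr log
#   mgr_debug off
```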


[1] https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/I452F3PWBAU47R6RXMVSLIHKS4YWCFKT/


Quoting Robert Sander:

> On 8/15/23 16:36, Adam King wrote:
>
> > with the log to cluster level already on debug, if you do a "ceph
> > mgr fail" what does cephadm log to the cluster before it reports
> > sleeping? It should at least be doing something if it's responsive
> > at all. Also, in "ceph orch ps" and "ceph orch device ls" are the
> > REFRESHED columns reporting that they've refreshed the info
> > recently (last 10 minutes for daemons, last 30 minutes for devices)?
>
> They have been refreshed very recently.
>
> The issue seems to be a bit larger than just the stalled upgrade.
>
> We are now not even able to restart a daemon.
>
> When I issue the command
>
> # ceph orch daemon restart crash.cephmon01
>
> these two lines show up in the cephadm log but nothing else happens:
>
> 2023-08-16T10:35:41.640027+0200 mgr.cephmon01 [INF] Schedule restart daemon crash.cephmon01
> 2023-08-16T10:35:41.640497+0200 mgr.cephmon01 [DBG] _kick_serve_loop
>
> The container for crash.cephmon01 does not get restarted.
>
> It looks like the serve loop does not get executed.
>
> Can we see what jobs are in this queue and why they do not get executed?
>
> Regards
> --
> Robert Sander