[ceph-users] Re: Orchestration seems not to work

2023-06-07 Thread Thomas Widhalm
It looks like an old answer from the list just solved my problem! I found https://www.mail-archive.com/ceph-users@ceph.io/msg14418.html . So I tried:

ceph config rm mds.mds01.ceph03.xqwdjy container_image
ceph config rm mgr.ceph06.xbduuf container_image

And BOOM. It worked. Thanks for all the
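[Editor's note: for anyone hitting the same symptom, a minimal sketch of how to find and clear stray per-daemon container_image overrides; the daemon name below is the one from this thread, yours will differ:]

    # list every container_image setting, including ones pinned to single daemons
    ceph config dump | grep container_image

    # remove a per-daemon override so that daemon follows the global image again
    ceph config rm mds.mds01.ceph03.xqwdjy container_image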

[ceph-users] Re: Orchestration seems not to work

2023-06-07 Thread Thomas Widhalm
I found something else that might help with identifying the problem. When I look at which container images are used, I see the following: global: quay.io/ceph/ceph@sha256:0560b16bec6e84345f29fb6693cd2430884e6efff16a95d5bdd0bb06d7661c45, mon:
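[Editor's note: a sketch of how to produce that per-section image listing and compare it with what the daemons actually run; output layout varies by release:]

    # image pinned per config section (global, mon, mgr, ...)
    ceph config dump | grep container_image

    # image each running daemon was actually deployed from
    ceph orch ps --format yaml | grep -i image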

[ceph-users] Re: Orchestration seems not to work

2023-05-25 Thread Thomas Widhalm
I now ran the command on every host. And I did find two that couldn't connect. They were the last two I added and they never got any daemons. I fixed that (copied /etc/ceph and installed cephadm) and rebooted them, but it didn't change a thing for now. All others could connect to all others
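[Editor's note: one way to verify a host is reachable and prepared for cephadm; the hostname is a placeholder, and check-host exists both as a cephadm subcommand and as an orchestrator command:]

    # on the suspect node itself: check podman/docker, time sync, lvm, etc.
    cephadm check-host

    # from a node with a cluster keyring: can the orchestrator reach the host?
    ceph cephadm check-host ceph07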

[ceph-users] Re: Orchestration seems not to work

2023-05-25 Thread Thomas Widhalm
Hi, So sorry I didn't see your reply. Had some tough weeks (my father-in-law died and that gave us some turmoil). I just came back to debugging and didn't realize until now that you did in fact answer my e-mail. I just ran your script on the host that is running the active manager. Thanks a lot

[ceph-users] Re: Orchestration seems not to work

2023-05-25 Thread Thomas Widhalm
What caught my eye is that this is also true for disks on hosts. I added another disk to an OSD host. I can zap it with cephadm, I can even make it an OSD with "ceph orch daemon add osd ceph06:/dev/sdb" and it will be listed as a new OSD in the Ceph Dashboard. But, when I look at the "Physical
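[Editor's note: for context, the zap-then-add sequence being described; host and device path are the ones from the mail, adjust for your setup:]

    # wipe the device so ceph-volume will accept it
    ceph orch device zap ceph06 /dev/sdb --force

    # create an OSD on the freshly wiped device
    ceph orch daemon add osd ceph06:/dev/sdb

    # with a working refresh loop the device shows up here shortly after
    ceph orch device ls ceph06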

[ceph-users] Re: Orchestration seems not to work

2023-05-15 Thread Adam King
Okay, thanks for verifying that bit, sorry to have gone on about it so long. I guess we could look at connection issues next. I wrote a short python script that tries to connect to hosts using asyncssh, close to how cephadm does it (
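[Editor's note: Adam's script isn't reproduced in this snippet, but a rough shell equivalent of the same connectivity test, reusing the SSH identity cephadm stores; the key location is standard for cephadm, while the user and host list here are assumptions:]

    # export the private key the orchestrator uses for SSH
    ceph config-key get mgr/cephadm/ssh_identity_key > /tmp/cephadm_id
    chmod 600 /tmp/cephadm_id

    # try a trivial command on every host, roughly as the orchestrator would
    for host in ceph01 ceph02 ceph03 ceph04 ceph05 ceph06; do
      ssh -i /tmp/cephadm_id -o ConnectTimeout=5 root@"$host" true \
        && echo "$host: ok" || echo "$host: FAILED"
    done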

[ceph-users] Re: Orchestration seems not to work

2023-05-15 Thread Thomas Widhalm
I just checked every single host. The only processes of cephadm running were "cephadm shell" from debugging. I closed all of them, so now I can verify there's not a single cephadm process running on any of my ceph hosts. (and since I found the shell processes, I can verify I didn't have a

[ceph-users] Re: Orchestration seems not to work

2023-05-15 Thread Adam King
If it persisted through a full restart, it's possible the conditions that caused the hang are still present after the fact. The two known causes I'm aware of are lack of space in the root partition and hanging mount points. Both would show up as processes in "ps aux | grep cephadm" though. The
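[Editor's note: a quick per-host sweep for both known causes; a sketch, nothing here is cephadm-specific beyond the process name:]

    # leftover cephadm processes from a hung check ([c] keeps grep out of its own output)
    ps aux | grep [c]ephadm

    # root partition out of space?
    df -h /

    # a hanging mount point stalls even a plain stat; timeout flags it
    for m in $(awk '{print $2}' /proc/mounts); do
      timeout 5 stat "$m" >/dev/null 2>&1 || echo "hanging: $m"
    done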

[ceph-users] Re: Orchestration seems not to work

2023-05-15 Thread Thomas Widhalm
This is why I even tried a full cluster shutdown. All hosts were down, so there's no possibility that any process was left hanging. After I started the nodes, it's just the same as before. All refresh times show "4 weeks". Like it stopped simultaneously on all nodes. Some time ago we had a

[ceph-users] Re: Orchestration seems not to work

2023-05-15 Thread Adam King
This is sort of similar to what I said in a previous email, but the only way I've seen this happen in other setups is through hanging cephadm commands. The debug process has been: do a mgr failover, wait a few minutes, then see in "ceph orch ps" and "ceph orch device ls" which hosts have and have not
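[Editor's note: the failover-and-watch sequence from that debug process, spelled out; the wait time is approximate:]

    # make the standby mgr take over; cephadm restarts its refresh loop
    ceph mgr fail

    # wait a few minutes, then compare the per-host REFRESHED values
    ceph orch ps
    ceph orch device ls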

[ceph-users] Re: Orchestration seems not to work

2023-05-15 Thread Thomas Widhalm
Hi, I tried a lot of different approaches but I didn't have any success so far. "ceph orch ps" still doesn't get refreshed. Some examples:

mds.mds01.ceph06.huavsw  ceph06  starting  -  ---
mds.mds01.ceph06.rrxmks  ceph06  error

[ceph-users] Re: Orchestration seems not to work

2023-05-04 Thread Thomas Widhalm
I uploaded the output there: https://nextcloud.widhalm.or.at/nextcloud/s/FCqPM8zRsix3gss IP 192.168.23.62 is one of my OSDs that was still booting when the reconnect attempts happened. What makes me wonder is that it's the only one listed, when there are a few similar ones in the cluster. On

[ceph-users] Re: Orchestration seems not to work

2023-05-04 Thread Adam King
What specifically does `ceph log last 200 debug cephadm` spit out? The log lines you've posted so far don't look like they're generated by the orchestrator, so I'm curious what the last actions it took were (and how long ago). On Thu, May 4, 2023 at 10:35 AM Thomas Widhalm wrote: > To completely rule out

[ceph-users] Re: Orchestration seems not to work

2023-05-04 Thread Thomas Widhalm
To completely rule out hung processes, I managed to get another short shutdown. Now I'm seeing lots of:

mgr.server handle_open ignoring open from mds.mds01.ceph01.usujbi v2:192.168.23.61:6800/2922006253; not ready for session (expect reconnect)
mgr finish mon failed to return metadata for

[ceph-users] Re: Orchestration seems not to work

2023-05-04 Thread Thomas Widhalm
Hi, What I'm seeing a lot is this: "[stats WARNING root] cmdtag not found in client metadata" I can't make anything of it, but I guess it's not showing the initial issue. Now that I think of it - I started the cluster with 3 nodes which are now only used for OSDs. Could it be there's something

[ceph-users] Re: Orchestration seems not to work

2023-05-04 Thread Thomas Widhalm
Thanks. I'll set the log level to debug, try a few steps, and then come back. On 04.05.23 14:48, Eugen Block wrote: Hi, try setting debug logs for the mgr: ceph config set mgr mgr/cephadm/log_level debug This should provide more details on what the mgr is trying and where it's failing, hopefully.

[ceph-users] Re: Orchestration seems not to work

2023-05-04 Thread Thomas Widhalm
Thanks for the reply. "Refreshed" is "3 weeks ago" on most lines. The running mds and osd.cost_capacity are both "-" in this column. I'm already done with "mgr fail"; that didn't do anything. And I even tried a complete shutdown during a maintenance window that was not 3 weeks ago but

[ceph-users] Re: Orchestration seems not to work

2023-05-04 Thread Adam King
The first thing I always check when it seems like orchestrator commands aren't doing anything is "ceph orch ps" and "ceph orch device ls", specifically the REFRESHED column. If it's well above 10 minutes for orch ps or 30 minutes for orch device ls, then it means the orchestrator is most likely hanging on
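[Editor's note: a quick way to eyeball that, plus a forced refresh for comparison; the --refresh flag is standard in cephadm-based clusters:]

    # REFRESHED is a default column in both outputs
    ceph orch ps
    ceph orch device ls

    # ask the orchestrator to refresh now instead of waiting for the loop
    ceph orch ps --refresh
    ceph orch device ls --refresh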

[ceph-users] Re: Orchestration seems not to work

2023-05-04 Thread Eugen Block
Hi, try setting debug logs for the mgr: ceph config set mgr mgr/cephadm/log_level debug This should provide more details on what the mgr is trying and where it's failing, hopefully. Last week this helped me identify an issue on a lower Pacific release. Do you see anything in the
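[Editor's note: the debug-logging recipe from this reply, as commands; `ceph log last ... debug cephadm` is the same command Adam quotes earlier in the thread:]

    # raise the cephadm mgr module's log level to debug
    ceph config set mgr mgr/cephadm/log_level debug

    # read the most recent cephadm debug messages back from the cluster log
    ceph log last 200 debug cephadm

    # revert to the default level when done
    ceph config rm mgr mgr/cephadm/log_level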