It looks like an old answer from the list just solved my problem!
I found https://www.mail-archive.com/ceph-users@ceph.io/msg14418.html .
So I tried
ceph config rm mds.mds01.ceph03.xqwdjy container_image
ceph config rm mgr.ceph06.xbduuf container_image
And BOOM. It worked.
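In case anyone else runs into this: something like the following should show whether any per-daemon image pins are still set; the daemon name in the rm below is just a placeholder, it will differ per cluster.
ceph config dump | grep container_image
ceph config rm <daemon-name> container_image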
Thanks for all the
I found something else, that might help with identifying the problem.
When I look into which containers are used I see the following:
global:
quay.io/ceph/ceph@sha256:0560b16bec6e84345f29fb6693cd2430884e6efff16a95d5bdd0bb06d7661c45,
mon:
quay.io/ceph/ceph@sha256:1161e35e4e02cf377c93b913ce78773
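For reference, this is roughly how I'd check which image each daemon actually reports versus what is configured; the yaml field names are from memory, so treat it as a sketch:
ceph orch ps --format yaml | grep -E 'daemon_type|daemon_id|container_image_name'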
I now ran the command on every host. And I did find two that couldn't
connect. They were the last two I added and never got any daemons. I
fixed that (copied /etc/ceph and installed cephadm) and rebooted them,
but it didn't change a thing for now.
All others could connect to all others withou
Hi,
So sorry I didn't see your reply. Had some tough weeks (my father-in-law
died and that gave us some turmoil). I just came back to debugging and
didn't realize until now that you did in fact answer my e-mail.
I just ran your script on the host that is running the active manager.
Thanks a lot
What caught my eye is that this is also true for Disks on Hosts.
I added another disk to an OSD host. I can zap it with cephadm, and I can
even make it an OSD with "ceph orch daemon add osd ceph06:/dev/sdb"; it
will then be listed as a new OSD in the Ceph Dashboard.
But, when I look at the "Physical Disk
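For completeness, a rough sketch of the commands involved; the --refresh flag on device ls is only my guess at forcing the inventory that the dashboard's Physical Disks page reads from, and host/device names are from my setup:
ceph orch device zap ceph06 /dev/sdb --force
ceph orch daemon add osd ceph06:/dev/sdb
ceph orch device ls --refresh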
Okay, thanks for verifying that bit, sorry to have gone about it so long. I
guess we could look at connection issues next. I wrote a short Python
script that tries to connect to hosts using asyncssh, closely mirroring how
cephadm does it (
https://github.com/adk3798/testing_scripts/blob/main/asyncssh-conn
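If running the script is inconvenient, a rough manual equivalent is to pull the SSH config and key that cephadm uses and try the connection by hand; this is just a sketch and the target host is an example:
ceph cephadm get-ssh-config > /tmp/cephadm_ssh_config
ceph config-key get mgr/cephadm/ssh_identity_key > /tmp/cephadm_key
chmod 0600 /tmp/cephadm_key
ssh -F /tmp/cephadm_ssh_config -i /tmp/cephadm_key root@ceph06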
I just checked every single host. The only cephadm processes running
were "cephadm shell" sessions from debugging. I closed all of them, so now I can
verify there's not a single cephadm process running on any of my ceph
hosts. (and since I found the shell processes, I can verify I didn't
have a typ
If it persisted through a full restart, it's possible the conditions that
caused the hang are still present after the fact. The two known causes I'm
aware of are lack of space in the root partition and hanging mount points.
Both would show up as processes in "ps aux | grep cephadm" though. The
latt
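Quick ways to check for both on each host, just as a sketch:
df -h /                                   # space left on the root partition
ps aux | grep -v grep | grep cephadm      # leftover cephadm processes
timeout 10 df > /dev/null || echo "df hung - likely a stale mount"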
This is why I even tried a full cluster shutdown. All hosts were down, so
there's no possibility that any process was left hanging. After I
started the nodes, it's just the same as before. All refresh times show
"4 weeks", like it stopped simultaneously on all nodes.
Some time ago we had a sm
This is sort of similar to what I said in a previous email, but the only
way I've seen this happen in other setups is through hanging cephadm
commands. The debug process has been: do a mgr failover, wait a few
minutes, see in "ceph orch ps" and "ceph orch device ls" which hosts have
and have not be
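Roughly, as a sketch, the loop looks like this:
ceph mgr fail
# wait ~10 minutes, then:
ceph orch ps           # check the REFRESHED column (should be < ~10 min)
ceph orch device ls    # same idea, threshold here is ~30 min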
Hi,
I tried a lot of different approaches but haven't had any success so far.
"ceph orch ps" still doesn't get refreshed.
Some examples:
mds.mds01.ceph06.huavsw ceph06 starting -
---
mds.mds01.ceph06.rrxmks ceph06 error
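To look at those MDS daemons directly on the host, bypassing the orchestrator, I'd try something like this on ceph06 (daemon name taken from the listing above):
cephadm ls
cephadm logs --name mds.mds01.ceph06.rrxmks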
I uploaded the output there:
https://nextcloud.widhalm.or.at/nextcloud/s/FCqPM8zRsix3gss
IP 192.168.23.62 is one of my OSD hosts that was still booting when the
reconnect attempts happened. What makes me wonder is that it's the only one
listed when there are a few similar ones in the cluster.
On 04
What specifically does `ceph log last 200 debug cephadm` spit out? The log
lines you've posted so far don't look like they were generated by the orchestrator,
so I'm curious what the last actions it took were (and how long ago).
On Thu, May 4, 2023 at 10:35 AM Thomas Widhalm
wrote:
> To completely rule out hu
To completely rule out hung processes, I managed to get another short
shutdown.
Now I'm seeing lots of:
mgr.server handle_open ignoring open from mds.mds01.ceph01.usujbi
v2:192.168.23.61:6800/2922006253; not ready for session (expect reconnect)
mgr finish mon failed to return metadata for mds.
Hi,
What I'm seeing a lot is this: "[stats WARNING root] cmdtag not found
in client metadata". I can't make anything of it, but I guess it's not
showing the initial issue.
Now that I think of it - I started the cluster with 3 nodes which are
now only used for OSDs. Could it be there's something m
Thanks.
I've set the log level to debug; I'll try a few steps and then come back.
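For reference, this is what I'm using to watch the cephadm log channel now that the level is bumped (the --watch-debug flag is from memory):
ceph -W cephadm --watch-debug
ceph log last 200 debug cephadm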
On 04.05.23 14:48, Eugen Block wrote:
Hi,
try setting debug logs for the mgr:
ceph config set mgr mgr/cephadm/log_level debug
This should provide more details what the mgr is trying and where it's
failing, hopefully.
Thanks for the reply.
"Refreshed" is "3 weeks ago" on most lines. The running mds and
osd.cost_capacity are both "-" in this column.
I'm already done with "mgr fail"; that didn't do anything. And I even
tried a complete shutdown during a maintenance window that was not 3
weeks ago but last
First thing I always check when it seems like orchestrator commands aren't
doing anything is "ceph orch ps" and "ceph orch device ls" and check the
REFRESHED column. If it's well above 10 minutes for orch ps or 30 minutes
for orch device ls, then it means the orchestrator is most likely hanging
on
Hi,
try setting debug logs for the mgr:
ceph config set mgr mgr/cephadm/log_level debug
This should provide more details what the mgr is trying and where it's
failing, hopefully. Last week this helped me identify an issue on a
lower Pacific version.
Do you see anything in the ceph