[ceph-users] Re: Upgrade from 17.2.5 to 17.2.6 stuck at MDS

2023-04-10 Thread Adam King
You could try pausing the upgrade and manually "upgrading" the mds daemons
by redeploying them on the new image, with something like "ceph orch daemon
redeploy <mds daemon name> --image <17.2.6 image>" (daemon names should
match those in "ceph orch ps" output). If you do that for all of them and
then get them into an up state, you should be able to resume the upgrade and
have it complete.
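
Roughly, a sketch of that sequence (the image tag and the daemon name below
are just examples taken from your paste; substitute the ones from your own
"ceph orch ps" and "ceph orch upgrade status" output):

    # stop the orchestrator's own upgrade loop first
    ceph orch upgrade pause

    # redeploy each mds daemon on the 17.2.6 image, one by one
    ceph orch daemon redeploy mds.mds01.ceph04.hcmvae --image quay.io/ceph/ceph:v17.2.6

    # once the mds daemons are up again, let the upgrade continue
    ceph orch upgrade resume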

On Mon, Apr 10, 2023 at 3:25 PM Thomas Widhalm 
wrote:

> Hi,
>
> If you remember, I hit bug https://tracker.ceph.com/issues/58489 so I
> was very relieved when 17.2.6 was released and started to update
> immediately.
>
> But now I'm stuck again with my broken MDS. MDS won't get into up:active
> without the update but the update waits for them to get into up:active
> state. Seems like a deadlock / chicken-egg problem to me.
>
> Since I'm still relatively new to Ceph, could you help me?
>
> What I see when watching the update status:
>
> {
>     "target_image": "quay.io/ceph/ceph@sha256:1161e35e4e02cf377c93b913ce78773f8413f5a8d7c5eaee4b4773a4f9dd6635",
>     "in_progress": true,
>     "which": "Upgrading all daemon types on all hosts",
>     "services_complete": [
>         "crash",
>         "mgr",
>         "mon",
>         "osd"
>     ],
>     "progress": "18/40 daemons upgraded",
>     "message": "Error: UPGRADE_OFFLINE_HOST: Upgrade: Failed to connect to host ceph01 at addr (192.168.23.61)",
>     "is_paused": false
> }
>
> (The offline host was one host that broke during the upgrade. I fixed
> that in the meantime and the update went on.)
>
> And in the log:
>
> 2023-04-10T19:23:48.750129+ mgr.ceph04.qaexpv [INF] Upgrade: Waiting
> for mds.mds01.ceph04.hcmvae to be up:active (currently up:replay)
> 2023-04-10T19:23:58.758141+ mgr.ceph04.qaexpv [WRN] Upgrade: No mds
> is up; continuing upgrade procedure to poke things in the right direction
>
>
> Please give me a hint what I can do.
>
> Cheers,
> Thomas
> --
> http://www.widhalm.or.at
> GnuPG : 6265BAE6 , A84CB603
> Threema: H7AV7D33
> Telegram, Signal: widha...@widhalm.or.at
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Upgrade from 17.2.5 to 17.2.6 stuck at MDS

2023-04-10 Thread Adam King
Will also note that the normal upgrade process scales down the mds service
to only 1 mds per fs before upgrading it, so that may be something you want
to do as well if the upgrade didn't do it already. It does so by setting
max_mds to 1 for the fs.
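
For reference, that is roughly the following (substitute your filesystem
name, which "ceph fs ls" will show):

    # limit the filesystem to a single active MDS before its daemons are upgraded
    ceph fs set <fs name> max_mds 1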

On Mon, Apr 10, 2023 at 3:51 PM Adam King  wrote:

> You could try pausing the upgrade and manually "upgrading" the mds daemons
> by redeploying them on the new image. Something like "ceph orch daemon
> redeploy  --image <17.2.6 image>" (daemon names should
> match those in "ceph orch ps" output). If you do that for all of them and
> then get them into an up state you should be able to resume the upgrade and
> have it complete.
>
> On Mon, Apr 10, 2023 at 3:25 PM Thomas Widhalm 
> wrote:
>
>> Hi,
>>
>> If you remember, I hit bug https://tracker.ceph.com/issues/58489 so I
>> was very relieved when 17.2.6 was released and started to update
>> immediately.
>>
>> But now I'm stuck again with my broken MDS. MDS won't get into up:active
>> without the update but the update waits for them to get into up:active
>> state. Seems like a deadlock / chicken-egg problem to me.
>>
>> Since I'm still relatively new to Ceph, could you help me?
>>
>> What I see when watching the update status:
>>
>> {
>>  "target_image":
>> "
>> quay.io/ceph/ceph@sha256:1161e35e4e02cf377c93b913ce78773f8413f5a8d7c5eaee4b4773a4f9dd6635
>> ",
>>  "in_progress": true,
>>  "which": "Upgrading all daemon types on all hosts",
>>  "services_complete": [
>>  "crash",
>>  "mgr",
>> "mon",
>> "osd"
>>  ],
>>  "progress": "18/40 daemons upgraded",
>>  "message": "Error: UPGRADE_OFFLINE_HOST: Upgrade: Failed to connect
>> to host ceph01 at addr (192.168.23.61)",
>>  "is_paused": false
>> }
>>
>> (The offline host was one host that broke during the upgrade. I fixed
>> that in the meantime and the update went on.)
>>
>> And in the log:
>>
>> 2023-04-10T19:23:48.750129+ mgr.ceph04.qaexpv [INF] Upgrade: Waiting
>> for mds.mds01.ceph04.hcmvae to be up:active (currently up:replay)
>> 2023-04-10T19:23:58.758141+ mgr.ceph04.qaexpv [WRN] Upgrade: No mds
>> is up; continuing upgrade procedure to poke things in the right direction
>>
>>
>> Please give me a hint what I can do.
>>
>> Cheers,
>> Thomas
>> --
>> http://www.widhalm.or.at
>> GnuPG : 6265BAE6 , A84CB603
>> Threema: H7AV7D33
>> Telegram, Signal: widha...@widhalm.or.at
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Upgrade from 17.2.5 to 17.2.6 stuck at MDS

2023-04-10 Thread Thomas Widhalm

I did what you told me.

I also see in the log that the command went through:

2023-04-10T19:58:46.522477+ mgr.ceph04.qaexpv [INF] Schedule 
redeploy daemon mds.mds01.ceph06.rrxmks
2023-04-10T20:01:03.360559+ mgr.ceph04.qaexpv [INF] Schedule 
redeploy daemon mds.mds01.ceph05.pqxmvt
2023-04-10T20:01:21.787635+ mgr.ceph04.qaexpv [INF] Schedule 
redeploy daemon mds.mds01.ceph07.omdisd



But the MDS never start. They stay in error state. I tried to redeploy
and start them a few times, and even restarted one host where an MDS should run.


mds.mds01.ceph03.xqwdjy  ceph03  error  32m ago  2M   -  -
mds.mds01.ceph04.hcmvae  ceph04  error  31m ago  2h   -  -
mds.mds01.ceph05.pqxmvt  ceph05  error  32m ago  9M   -  -
mds.mds01.ceph06.rrxmks  ceph06  error  32m ago  10w  -  -
mds.mds01.ceph07.omdisd  ceph07  error  32m ago  2M   -  -
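
(Per daemon, what I ran was along these lines, using the image reference
from the upgrade status above:)

    ceph orch daemon redeploy mds.mds01.ceph03.xqwdjy --image quay.io/ceph/ceph@sha256:1161e35e4e02cf377c93b913ce78773f8413f5a8d7c5eaee4b4773a4f9dd6635
    ceph orch daemon start mds.mds01.ceph03.xqwdjy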



Any other ideas? Or am I missing something?

Cheers,
Thomas

On 10.04.23 21:53, Adam King wrote:
Will also note that the normal upgrade process scales down the mds 
service to have only 1 mds per fs before upgrading it, so maybe 
something you'd want to do as well if the upgrade didn't do it already. 
It does so by setting the max_mds to 1 for the fs.


On Mon, Apr 10, 2023 at 3:51 PM Adam King > wrote:


You could try pausing the upgrade and manually "upgrading" the mds
daemons by redeploying them on the new image. Something like "ceph
orch daemon redeploy  --image <17.2.6 image>"
(daemon names should match those in "ceph orch ps" output). If you
do that for all of them and then get them into an up state you
should be able to resume the upgrade and have it complete.

On Mon, Apr 10, 2023 at 3:25 PM Thomas Widhalm
mailto:widha...@widhalm.or.at>> wrote:

Hi,

If you remember, I hit bug https://tracker.ceph.com/issues/58489
 so I
was very relieved when 17.2.6 was released and started to update
immediately.

But now I'm stuck again with my broken MDS. MDS won't get into
up:active
without the update but the update waits for them to get into
up:active
state. Seems like a deadlock / chicken-egg problem to me.

Since I'm still relatively new to Ceph, could you help me?

What I see when watching the update status:

{
      "target_image":

"quay.io/ceph/ceph@sha256:1161e35e4e02cf377c93b913ce78773f8413f5a8d7c5eaee4b4773a4f9dd6635 
",
      "in_progress": true,
      "which": "Upgrading all daemon types on all hosts",
      "services_complete": [
          "crash",
          "mgr",
         "mon",
         "osd"
      ],
      "progress": "18/40 daemons upgraded",
      "message": "Error: UPGRADE_OFFLINE_HOST: Upgrade: Failed
to connect
to host ceph01 at addr (192.168.23.61)",
      "is_paused": false
}

(The offline host was one host that broke during the upgrade. I
fixed
that in the meantime and the update went on.)

And in the log:

2023-04-10T19:23:48.750129+ mgr.ceph04.qaexpv [INF] Upgrade:
Waiting
for mds.mds01.ceph04.hcmvae to be up:active (currently up:replay)
2023-04-10T19:23:58.758141+ mgr.ceph04.qaexpv [WRN] Upgrade:
No mds
is up; continuing upgrade procedure to poke things in the right
direction


Please give me a hint what I can do.

Cheers,
Thomas
-- 
http://www.widhalm.or.at 

GnuPG : 6265BAE6 , A84CB603
Threema: H7AV7D33
Telegram, Signal: widha...@widhalm.or.at

___
ceph-users mailing list -- ceph-users@ceph.io

To unsubscribe send an email to ceph-users-le...@ceph.io




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Upgrade from 17.2.5 to 17.2.6 stuck at MDS

2023-04-10 Thread Adam King
It seems like it maybe didn't actually do the redeploy, as it should log
something saying it's actually doing it on top of the line saying it
scheduled it. To confirm, is the upgrade paused ("ceph orch upgrade status"
reports is_paused as true)? If so, maybe try doing a mgr failover ("ceph
mgr fail") and then check "ceph orch ps" and "ceph orch device ls" a few
minutes later and look at the REFRESHED column. If any of those show times
farther back than when you did the failover, there's probably something
going on on the host(s) that haven't refreshed recently that's holding
things up (you'd have to go on that host and look for hanging cephadm
commands).

Lastly, you could look at the /var/lib/ceph/<fsid>/<daemon name>/unit.run
file on the hosts where the mds daemons are deployed. The (very long) last
podman/docker run line in that file should have the name of the image the
daemon is being deployed with, so you could use that to confirm whether
cephadm ever actually tried a redeploy of the mds with the new image.

You could also check the journal logs for the mds. Cephadm reports the
systemd unit name for the daemon as part of "cephadm ls" output, so if you
put a copy of the cephadm binary on the host, run "cephadm ls" with it, and
grab the systemd unit name for the mds daemon from that output, you can use
that to check the journal logs, which should tell you the last restart time
and why it's gone down.
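
In concrete terms, those checks could look something like this (a sketch;
<fsid> and the mds daemon name are placeholders to fill in from your
cluster, and on cephadm clusters the systemd units are usually named
ceph-<fsid>@<daemon name>, but confirm that against "cephadm ls"):

    # is the upgrade actually paused, and is cephadm refreshing host state?
    ceph orch upgrade status
    ceph mgr fail
    ceph orch ps
    ceph orch device ls

    # on an mds host: the last podman/docker run line shows the image in use
    cat /var/lib/ceph/<fsid>/mds.mds01.ceph04.hcmvae/unit.run

    # journal logs for the daemon's systemd unit
    journalctl -u ceph-<fsid>@mds.mds01.ceph04.hcmvae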

On Mon, Apr 10, 2023 at 4:25 PM Thomas Widhalm 
wrote:

> I did what you told me.
>
> I also see in the log, that the command went through:
>
> 2023-04-10T19:58:46.522477+ mgr.ceph04.qaexpv [INF] Schedule
> redeploy daemon mds.mds01.ceph06.rrxmks
> 2023-04-10T20:01:03.360559+ mgr.ceph04.qaexpv [INF] Schedule
> redeploy daemon mds.mds01.ceph05.pqxmvt
> 2023-04-10T20:01:21.787635+ mgr.ceph04.qaexpv [INF] Schedule
> redeploy daemon mds.mds01.ceph07.omdisd
>
>
> But the MDS never start. They stay in error state. I tried to redeploy
> and start them a few times. Even restarted one host where a MDS should run.
>
> mds.mds01.ceph03.xqwdjy  ceph03   error   32m ago
> 2M-- 
> mds.mds01.ceph04.hcmvae  ceph04   error   31m ago
> 2h-- 
> mds.mds01.ceph05.pqxmvt  ceph05   error   32m ago
> 9M-- 
> mds.mds01.ceph06.rrxmks  ceph06   error   32m ago
> 10w-- 
> mds.mds01.ceph07.omdisd  ceph07   error   32m ago
> 2M-- 
>
>
> And other ideas? Or am I missing something.
>
> Cheers,
> Thomas
>
> On 10.04.23 21:53, Adam King wrote:
> > Will also note that the normal upgrade process scales down the mds
> > service to have only 1 mds per fs before upgrading it, so maybe
> > something you'd want to do as well if the upgrade didn't do it already.
> > It does so by setting the max_mds to 1 for the fs.
> >
> > On Mon, Apr 10, 2023 at 3:51 PM Adam King  > > wrote:
> >
> > You could try pausing the upgrade and manually "upgrading" the mds
> > daemons by redeploying them on the new image. Something like "ceph
> > orch daemon redeploy  --image <17.2.6 image>"
> > (daemon names should match those in "ceph orch ps" output). If you
> > do that for all of them and then get them into an up state you
> > should be able to resume the upgrade and have it complete.
> >
> > On Mon, Apr 10, 2023 at 3:25 PM Thomas Widhalm
> > mailto:widha...@widhalm.or.at>> wrote:
> >
> > Hi,
> >
> > If you remember, I hit bug https://tracker.ceph.com/issues/58489
> >  so I
> > was very relieved when 17.2.6 was released and started to update
> > immediately.
> >
> > But now I'm stuck again with my broken MDS. MDS won't get into
> > up:active
> > without the update but the update waits for them to get into
> > up:active
> > state. Seems like a deadlock / chicken-egg problem to me.
> >
> > Since I'm still relatively new to Ceph, could you help me?
> >
> > What I see when watching the update status:
> >
> > {
> >   "target_image":
> > "
> quay.io/ceph/ceph@sha256:1161e35e4e02cf377c93b913ce78773f8413f5a8d7c5eaee4b4773a4f9dd6635
> <
> http://quay.io/ceph/ceph@sha256:1161e35e4e02cf377c93b913ce78773f8413f5a8d7c5eaee4b4773a4f9dd6635
> >",
> >   "in_progress": true,
> >   "which": "Upgrading all daemon types on all hosts",
> >   "services_complete": [
> >   "crash",
> >   "mgr",
> >  "mon",
> >  "osd"
> >   ],
> >   "progress": "18/40 daemons upgraded",
> >   "message": "Error: UPGRADE_OFFLINE_HOST: Upgrade: Failed
> > to connect
> > to host ceph01 at addr (192.168.23.61)",
> >   "is_

[ceph-users] Re: Upgrade from 17.2.5 to 17.2.6 stuck at MDS

2023-04-11 Thread Xiubo Li


On 4/11/23 03:24, Thomas Widhalm wrote:

Hi,

If you remember, I hit bug https://tracker.ceph.com/issues/58489 so I 
was very relieved when 17.2.6 was released and started to update 
immediately.



Please note that this fix is not yet in v17.2.6 in the upstream code.

Thanks

- Xiubo


But now I'm stuck again with my broken MDS. MDS won't get into 
up:active without the update but the update waits for them to get into 
up:active state. Seems like a deadlock / chicken-egg problem to me.


Since I'm still relatively new to Ceph, could you help me?

What I see when watching the update status:

{
    "target_image": 
"quay.io/ceph/ceph@sha256:1161e35e4e02cf377c93b913ce78773f8413f5a8d7c5eaee4b4773a4f9dd6635",

    "in_progress": true,
    "which": "Upgrading all daemon types on all hosts",
    "services_complete": [
    "crash",
    "mgr",
"mon",
"osd"
    ],
    "progress": "18/40 daemons upgraded",
    "message": "Error: UPGRADE_OFFLINE_HOST: Upgrade: Failed to 
connect to host ceph01 at addr (192.168.23.61)",

    "is_paused": false
}

(The offline host was one host that broke during the upgrade. I fixed 
that in the meantime and the update went on.)


And in the log:

2023-04-10T19:23:48.750129+ mgr.ceph04.qaexpv [INF] Upgrade: 
Waiting for mds.mds01.ceph04.hcmvae to be up:active (currently up:replay)
2023-04-10T19:23:58.758141+ mgr.ceph04.qaexpv [WRN] Upgrade: No 
mds is up; continuing upgrade procedure to poke things in the right 
direction



Please give me a hint what I can do.

Cheers,
Thomas

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Upgrade from 17.2.5 to 17.2.6 stuck at MDS

2023-04-11 Thread Thomas Widhalm



On 11.04.23 09:16, Xiubo Li wrote:


On 4/11/23 03:24, Thomas Widhalm wrote:

Hi,

If you remember, I hit bug https://tracker.ceph.com/issues/58489 so I 
was very relieved when 17.2.6 was released and started to update 
immediately.



Please note, this fix is not in the v17.2.6 yet in upstream code.



Thanks for the information. I misread what was in the tracker. Do you have 
an estimated schedule for the backport? Or should I go for a specific 
pre-release? I don't want to take chances, but I'm desperate because my 
production system is affected and has been offline for several weeks now.


Thanks,
Thomas


Thanks

- Xiubo


But now I'm stuck again with my broken MDS. MDS won't get into 
up:active without the update but the update waits for them to get into 
up:active state. Seems like a deadlock / chicken-egg problem to me.


Since I'm still relatively new to Ceph, could you help me?

What I see when watching the update status:

{
    "target_image": 
"quay.io/ceph/ceph@sha256:1161e35e4e02cf377c93b913ce78773f8413f5a8d7c5eaee4b4773a4f9dd6635",

    "in_progress": true,
    "which": "Upgrading all daemon types on all hosts",
    "services_complete": [
    "crash",
    "mgr",
"mon",
"osd"
    ],
    "progress": "18/40 daemons upgraded",
    "message": "Error: UPGRADE_OFFLINE_HOST: Upgrade: Failed to 
connect to host ceph01 at addr (192.168.23.61)",

    "is_paused": false
}

(The offline host was one host that broke during the upgrade. I fixed 
that in the meantime and the update went on.)


And in the log:

2023-04-10T19:23:48.750129+ mgr.ceph04.qaexpv [INF] Upgrade: 
Waiting for mds.mds01.ceph04.hcmvae to be up:active (currently up:replay)
2023-04-10T19:23:58.758141+ mgr.ceph04.qaexpv [WRN] Upgrade: No 
mds is up; continuing upgrade procedure to poke things in the right 
direction



Please give me a hint what I can do.

Cheers,
Thomas

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Upgrade from 17.2.5 to 17.2.6 stuck at MDS

2023-04-11 Thread Xiubo Li


On 4/11/23 15:59, Thomas Widhalm wrote:



On 11.04.23 09:16, Xiubo Li wrote:


On 4/11/23 03:24, Thomas Widhalm wrote:

Hi,

If you remember, I hit bug https://tracker.ceph.com/issues/58489 so 
I was very relieved when 17.2.6 was released and started to update 
immediately.



Please note, this fix is not in the v17.2.6 yet in upstream code.



Thanks for the information. I misread the information in the tracker. 
Do you have a predicted schedule for the backport? Or should I go for 
a specific pre-release? I don't want to take chances but I'm desperate 
because my production system is affected and offline for several weeks 
now.


The backport is already queued for review and testing, but I am not sure 
when it will get merged. I am also not sure it will 100% resolve your 
issue in case you have other corruption in your production system.


Thanks


Thanks,
Thomas


Thanks

- Xiubo


But now I'm stuck again with my broken MDS. MDS won't get into 
up:active without the update but the update waits for them to get 
into up:active state. Seems like a deadlock / chicken-egg problem to 
me.


Since I'm still relatively new to Ceph, could you help me?

What I see when watching the update status:

{
    "target_image": 
"quay.io/ceph/ceph@sha256:1161e35e4e02cf377c93b913ce78773f8413f5a8d7c5eaee4b4773a4f9dd6635",

    "in_progress": true,
    "which": "Upgrading all daemon types on all hosts",
    "services_complete": [
    "crash",
    "mgr",
"mon",
"osd"
    ],
    "progress": "18/40 daemons upgraded",
    "message": "Error: UPGRADE_OFFLINE_HOST: Upgrade: Failed to 
connect to host ceph01 at addr (192.168.23.61)",

    "is_paused": false
}

(The offline host was one host that broke during the upgrade. I 
fixed that in the meantime and the update went on.)


And in the log:

2023-04-10T19:23:48.750129+ mgr.ceph04.qaexpv [INF] Upgrade: 
Waiting for mds.mds01.ceph04.hcmvae to be up:active (currently 
up:replay)
2023-04-10T19:23:58.758141+ mgr.ceph04.qaexpv [WRN] Upgrade: No 
mds is up; continuing upgrade procedure to poke things in the right 
direction



Please give me a hint what I can do.

Cheers,
Thomas

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Upgrade from 17.2.5 to 17.2.6 stuck at MDS

2023-04-12 Thread Thomas Widhalm

Thanks for your detailed explanations! That helped a lot.

All MDS are still in error state. "ceph orch device ls" showed that some 
hosts seem not to have enough space on their devices. I wonder why I 
didn't see that in monitoring. Anyway, I'll fix that and then try to 
proceed.


When the backport is available I'll try to upgrade as fast as possible. 
Hopefully that will suffice to restore the system.
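
(Presumably that will again just be a matter of pointing cephadm at the new
image once it exists, something like the following, where the version tag is
a placeholder for whichever release ends up carrying the backport:)

    ceph orch upgrade start --image quay.io/ceph/ceph:v17.2.x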


Thank you all so far for your great work!

On 4/10/23 23:02, Adam King wrote:
It seems like it maybe didn't actually do the redeploy as it should log 
something saying it's actually doing it on top of the line saying it 
scheduled it. To confirm, the upgrade is paused ("ceph orch upgrade 
status" reports is_paused as false)? If so, maybe try doing a mgr 
failover ("ceph mgr fail") and then check "ceph orch ps"  and "ceph orch 
device ls" a few minutes later and look at the REFRESHED column. If any 
of those are giving amounts of time farther back then when you did the 
failover, there's probably something going on on the host(s) where it 
says it hasn't refreshed recently that's sticking things up (you'd have 
to go on that host and look for hanging cephadm commands). Lastly, you 
could look at the /var/lib/ceph///unit.run file 
on the hosts where the mds daemons are deployed. The (very long) last 
podman/docker run line in that file should have the image name of the 
image the daemon is being deployed with. So you could use that to 
confirm if cephadm ever actually tried a redeploy of the mds with the 
new image. You could also check the journal logs for the mds. 
Cephadm reports the sytemd unit name for the daemon as part of "cephadm 
ls" output if you put a copy of the cephadm binary, un "cephadm ls" with 
it, grab the systemd unit name for the mds daemon form that output, you 
could use that to check the journal logs which should tell the last 
restart time and why it's gone down.


On Mon, Apr 10, 2023 at 4:25 PM Thomas Widhalm > wrote:


I did what you told me.

I also see in the log, that the command went through:

2023-04-10T19:58:46.522477+ mgr.ceph04.qaexpv [INF] Schedule
redeploy daemon mds.mds01.ceph06.rrxmks
2023-04-10T20:01:03.360559+ mgr.ceph04.qaexpv [INF] Schedule
redeploy daemon mds.mds01.ceph05.pqxmvt
2023-04-10T20:01:21.787635+ mgr.ceph04.qaexpv [INF] Schedule
redeploy daemon mds.mds01.ceph07.omdisd


But the MDS never start. They stay in error state. I tried to redeploy
and start them a few times. Even restarted one host where a MDS
should run.

mds.mds01.ceph03.xqwdjy  ceph03               error           32m ago
2M        -        -         
mds.mds01.ceph04.hcmvae  ceph04               error           31m ago
2h        -        -         
mds.mds01.ceph05.pqxmvt  ceph05               error           32m ago
9M        -        -         
mds.mds01.ceph06.rrxmks  ceph06               error           32m ago
10w        -        -         
mds.mds01.ceph07.omdisd  ceph07               error           32m ago
2M        -        -         


And other ideas? Or am I missing something.

Cheers,
Thomas

On 10.04.23 21:53, Adam King wrote:
 > Will also note that the normal upgrade process scales down the mds
 > service to have only 1 mds per fs before upgrading it, so maybe
 > something you'd want to do as well if the upgrade didn't do it
already.
 > It does so by setting the max_mds to 1 for the fs.
 >
 > On Mon, Apr 10, 2023 at 3:51 PM Adam King mailto:adk...@redhat.com>
 > >> wrote:
 >
 >     You could try pausing the upgrade and manually "upgrading"
the mds
 >     daemons by redeploying them on the new image. Something like
"ceph
 >     orch daemon redeploy  --image <17.2.6 image>"
 >     (daemon names should match those in "ceph orch ps" output).
If you
 >     do that for all of them and then get them into an up state you
 >     should be able to resume the upgrade and have it complete.
 >
 >     On Mon, Apr 10, 2023 at 3:25 PM Thomas Widhalm
 >     mailto:widha...@widhalm.or.at>
>> wrote:
 >
 >         Hi,
 >
 >         If you remember, I hit bug
https://tracker.ceph.com/issues/58489

 >         > so I
 >         was very relieved when 17.2.6 was released and started to
update
 >         immediately.
 >
 >         But now I'm stuck again with my broken MDS. MDS won't get
into
 >         up:active
 >         without the update but the update waits for them to get into
 >         up:active
 >         state. Seems like a deadlock / chicken-egg problem to me.
 >
 

[ceph-users] Re: Upgrade from 17.2.5 to 17.2.6 stuck at MDS

2023-04-12 Thread Thomas Widhalm
Sorry - the info about the insufficient space seems like it referred to 
why the devices are not available. So that's just as it should be.


All MDS are still in error state and were last refreshed 2d ago, even right 
after a mgr failover. So it seems there's something else going on.


One thing that might be essential to debugging is that I have one MDS 
service (mds.mds01) but two CephFS filesystems (cephfs and cephfs-insecure). 
I can't remember why I only set up the one and not two. I used the dashboard 
for setting it up and maybe I didn't realize that I needed two. Or do I? I'm 
just reading through the documentation and other sources.
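
In case it helps, this is how I'm looking at the mapping (standard CLI as
far as I understand it; the apply line at the end is only what I'd guess a
per-filesystem layout would look like, not something I've run yet):

    # filesystems and their MDS ranks / standbys
    ceph fs ls
    ceph fs status

    # the cephadm mds service(s) and their daemons
    ceph orch ls mds
    ceph orch ps | grep mds

    # presumably a dedicated service per filesystem would be something like
    ceph orch apply mds cephfs-insecure --placement="2"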


On 4/10/23 23:02, Adam King wrote:
It seems like it maybe didn't actually do the redeploy as it should log 
something saying it's actually doing it on top of the line saying it 
scheduled it. To confirm, the upgrade is paused ("ceph orch upgrade 
status" reports is_paused as false)? If so, maybe try doing a mgr 
failover ("ceph mgr fail") and then check "ceph orch ps"  and "ceph orch 
device ls" a few minutes later and look at the REFRESHED column. If any 
of those are giving amounts of time farther back then when you did the 
failover, there's probably something going on on the host(s) where it 
says it hasn't refreshed recently that's sticking things up (you'd have 
to go on that host and look for hanging cephadm commands). Lastly, you 
could look at the /var/lib/ceph///unit.run file 
on the hosts where the mds daemons are deployed. The (very long) last 
podman/docker run line in that file should have the image name of the 
image the daemon is being deployed with. So you could use that to 
confirm if cephadm ever actually tried a redeploy of the mds with the 
new image. You could also check the journal logs for the mds. 
Cephadm reports the sytemd unit name for the daemon as part of "cephadm 
ls" output if you put a copy of the cephadm binary, un "cephadm ls" with 
it, grab the systemd unit name for the mds daemon form that output, you 
could use that to check the journal logs which should tell the last 
restart time and why it's gone down.


On Mon, Apr 10, 2023 at 4:25 PM Thomas Widhalm > wrote:


I did what you told me.

I also see in the log, that the command went through:

2023-04-10T19:58:46.522477+ mgr.ceph04.qaexpv [INF] Schedule
redeploy daemon mds.mds01.ceph06.rrxmks
2023-04-10T20:01:03.360559+ mgr.ceph04.qaexpv [INF] Schedule
redeploy daemon mds.mds01.ceph05.pqxmvt
2023-04-10T20:01:21.787635+ mgr.ceph04.qaexpv [INF] Schedule
redeploy daemon mds.mds01.ceph07.omdisd


But the MDS never start. They stay in error state. I tried to redeploy
and start them a few times. Even restarted one host where a MDS
should run.

mds.mds01.ceph03.xqwdjy  ceph03               error           32m ago
2M        -        -         
mds.mds01.ceph04.hcmvae  ceph04               error           31m ago
2h        -        -         
mds.mds01.ceph05.pqxmvt  ceph05               error           32m ago
9M        -        -         
mds.mds01.ceph06.rrxmks  ceph06               error           32m ago
10w        -        -         
mds.mds01.ceph07.omdisd  ceph07               error           32m ago
2M        -        -         


And other ideas? Or am I missing something.

Cheers,
Thomas

On 10.04.23 21:53, Adam King wrote:
 > Will also note that the normal upgrade process scales down the mds
 > service to have only 1 mds per fs before upgrading it, so maybe
 > something you'd want to do as well if the upgrade didn't do it
already.
 > It does so by setting the max_mds to 1 for the fs.
 >
 > On Mon, Apr 10, 2023 at 3:51 PM Adam King mailto:adk...@redhat.com>
 > >> wrote:
 >
 >     You could try pausing the upgrade and manually "upgrading"
the mds
 >     daemons by redeploying them on the new image. Something like
"ceph
 >     orch daemon redeploy  --image <17.2.6 image>"
 >     (daemon names should match those in "ceph orch ps" output).
If you
 >     do that for all of them and then get them into an up state you
 >     should be able to resume the upgrade and have it complete.
 >
 >     On Mon, Apr 10, 2023 at 3:25 PM Thomas Widhalm
 >     mailto:widha...@widhalm.or.at>
>> wrote:
 >
 >         Hi,
 >
 >         If you remember, I hit bug
https://tracker.ceph.com/issues/58489

 >         > so I
 >         was very relieved when 17.2.6 was released and started to
update
 >         immediately.
 >
 >         But now I'm stuck again with my broken MDS. MDS won't get
into
 >         up:active