[ceph-users] Re: Upgrade from v15.2.16 to v16.2.7 not starting

2022-05-19 Thread Lo Re Giuseppe
Hi Eugen,

After the reboot of the two mgr servers, the mgr service is back to normal 
(no longer restarting every 3 minutes).
I noticed that the trash purge activity was also stuck; once the mgr service 
became stable, the purge operations resumed as well.
Now I'll try the upgrade procedure with cephadm again and see whether it 
starts this time...

Giuseppe

On 18.05.22, 14:19, "Eugen Block"  wrote:

Do you see anything suspicious in /var/log/ceph/cephadm.log? Also  
check the mgr logs for any hints.


Quoting Lo Re Giuseppe:

> Hi,
>
> We have happily tested the upgrade from v15.2.16 to v16.2.7 with  
> cephadm on a test cluster made of 3 nodes and everything went  
> smoothly.
> Today we started the very same operation on the production one (20  
> OSD servers, 720 HDDs) and the upgrade process doesn’t do anything  
> at all…
>
> To be more specific, we have issued the command
>
> ceph orch upgrade start --image quay.io/ceph/ceph:v16.2.7
>
> and soon after “ceph -s” reports
>
> Upgrade to quay.io/ceph/ceph:v16.2.7 (0s)
>   []
>
> But only for a few seconds; after that:
>
> [root@naret-monitor01 ~]# ceph -s
>   cluster:
> id: 63334166-d991-11eb-99de-40a6b72108d0
> health: HEALTH_OK
>
>   services:
> mon: 3 daemons, quorum  
> naret-monitor01,naret-monitor02,naret-monitor03 (age 7d)
> mgr: naret-monitor01.tvddjv(active, since 60s), standbys:  
> naret-monitor02.btynnb
> mds: cephfs:1 {0=cephfs.naret-monitor01.uvevbf=up:active} 2 up:standby
> osd: 760 osds: 760 up (since 6d), 760 in (since 2w)
> rgw: 3 daemons active (cscs-realm.naret-zone.naret-rgw01.qvhhbi,  
> cscs-realm.naret-zone.naret-rgw02.pduagk,  
> cscs-realm.naret-zone.naret-rgw03.aqdkkb)
>
>   task status:
>
>   data:
> pools:   30 pools, 16497 pgs
> objects: 833.14M objects, 3.1 PiB
> usage:   5.0 PiB used, 5.9 PiB / 11 PiB avail
> pgs: 16460 active+clean
>  37    active+clean+scrubbing+deep
>
>   io:
> client:   4.7 MiB/s rd, 4.0 MiB/s wr, 122 op/s rd, 47 op/s wr
>
>   progress:
> Removing image fulen-hdd/c991f6fdf41964 from trash (53s)
>   [] (remaining: 81m)
>
>
>
> The command “ceph orch upgrade status” says:
>
> {
> "target_image": "quay.io/ceph/ceph:v16.2.7",
> "in_progress": true,
> "services_complete": [],
> "message": ""
> }
>
> It doesn’t even pull the container image.
> I have verified that the podman pull command works; I was able to pull  
> quay.io/ceph/ceph:v16.2.7.
>
> “ceph -w” and “ceph -W cephadm” don’t report any activity related to  
> the upgrade.
>
>
> Has anyone seen anything similar?
> Do you have advice on how to find out what is preventing the upgrade  
> process from actually starting?
>
> Thanks in advance,
>
> Giuseppe
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
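The `ceph orch upgrade status` JSON quoted above can be sanity-checked programmatically. A minimal sketch in Python, with the JSON pasted from this thread; the "stalled" heuristic (in progress, but no services complete and no status message) is my own assumption, not a condition defined by cephadm:

```python
# Parse `ceph orch upgrade status` output and flag a possibly stalled
# upgrade. JSON copied from the thread; the heuristic is an assumption.
import json

status_json = """
{
    "target_image": "quay.io/ceph/ceph:v16.2.7",
    "in_progress": true,
    "services_complete": [],
    "message": ""
}
"""

status = json.loads(status_json)
stalled = (status["in_progress"]
           and not status["services_complete"]
           and not status["message"])
print("possibly stalled" if stalled else "progressing")  # prints "possibly stalled"
```

With the values reported in this thread the heuristic fires, which matches the observed behaviour (no image pulled, no daemons upgraded).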





[ceph-users] Re: Upgrade from v15.2.16 to v16.2.7 not starting

2022-05-19 Thread Lo Re Giuseppe

Hi,

I didn’t notice anything suspicious in the mgr logs, nor in cephadm.log 
(attaching an extract of the latest).
What I have noticed is that the active mgr container gets restarted about 
every 3 minutes (as reported by ceph -w):
"""
2022-05-18T15:30:49.883238+0200 mon.naret-monitor01 [INF] Active manager daemon 
naret-monitor01.tvddjv restarted
2022-05-18T15:30:49.889294+0200 mon.naret-monitor01 [INF] Activating manager 
daemon naret-monitor01.tvddjv
2022-05-18T15:30:50.832200+0200 mon.naret-monitor01 [INF] Manager daemon 
naret-monitor01.tvddjv is now available
2022-05-18T15:34:16.979735+0200 mon.naret-monitor01 [INF] Active manager daemon 
naret-monitor01.tvddjv restarted
2022-05-18T15:34:16.985531+0200 mon.naret-monitor01 [INF] Activating manager 
daemon naret-monitor01.tvddjv
2022-05-18T15:34:18.246784+0200 mon.naret-monitor01 [INF] Manager daemon 
naret-monitor01.tvddjv is now available
2022-05-18T15:37:34.576159+0200 mon.naret-monitor01 [INF] Active manager daemon 
naret-monitor01.tvddjv restarted
2022-05-18T15:37:34.582935+0200 mon.naret-monitor01 [INF] Activating manager 
daemon naret-monitor01.tvddjv
2022-05-18T15:37:35.821200+0200 mon.naret-monitor01 [INF] Manager daemon 
naret-monitor01.tvddjv is now available
2022-05-18T15:40:00.000148+0200 mon.naret-monitor01 [INF] overall HEALTH_OK
2022-05-18T15:40:52.456182+0200 mon.naret-monitor01 [INF] Active manager daemon 
naret-monitor01.tvddjv restarted
2022-05-18T15:40:52.461826+0200 mon.naret-monitor01 [INF] Activating manager 
daemon naret-monitor01.tvddjv
2022-05-18T15:40:53.787353+0200 mon.naret-monitor01 [INF] Manager daemon 
naret-monitor01.tvddjv is now available
"""
Attaching also the active mgr process logs.
The cluster is working fine, but I wonder whether this mgr/cephadm restart 
behaviour is itself wrong and might be causing the upgrade to stall.
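The ~3-minute cadence can be read straight off the `ceph -w` lines quoted above. A small sketch that parses the timestamps of the "Active manager daemon ... restarted" events (copied from the log extract in this mail) and prints the gaps between them:

```python
# Compute the intervals between mgr restart events; timestamps are taken
# from the `ceph -w` extract quoted above. Illustrative only.
from datetime import datetime

restart_times = [
    "2022-05-18T15:30:49.883238+0200",
    "2022-05-18T15:34:16.979735+0200",
    "2022-05-18T15:37:34.576159+0200",
    "2022-05-18T15:40:52.456182+0200",
]

stamps = [datetime.strptime(t, "%Y-%m-%dT%H:%M:%S.%f%z")
          for t in restart_times]
gaps = [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]
print([round(g) for g in gaps])  # prints [207, 198, 198]
```

The gaps are consistently just over three minutes, which supports the idea that something is killing or crashing the active mgr on a fixed cycle rather than at random.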

Thanks,

Giuseppe 
 



[ceph-users] Re: Upgrade from v15.2.16 to v16.2.7 not starting

2022-05-18 Thread Eugen Block
Do you see anything suspicious in /var/log/ceph/cephadm.log? Also  
check the mgr logs for any hints.
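A minimal way to skim such a log extract for obvious problems is to flag lines containing common error markers. The sample lines and the marker list below are invented for illustration; this is just a sketch, not a cephadm facility:

```python
# Flag suspicious lines in a cephadm.log / mgr log extract.
# Sample lines are made up; the marker list is an assumption.
sample_log = """\
2022-05-18 15:30:49,000 DEBUG cephadm ['ls']
2022-05-18 15:30:50,123 ERROR cephadm exited with an error code: 1
Traceback (most recent call last):
2022-05-18 15:31:00,456 INFO cephadm ['check-host']
"""

markers = ("ERROR", "Traceback", "exited with an error")
suspicious = [line for line in sample_log.splitlines()
              if any(m in line for m in markers)]
for line in suspicious:
    print(line)
```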


