[ceph-users] Re: Upgrade from v15.2.16 to v16.2.7 not starting
Hi Eugen,

After the reboot of the two mgr servers, the mgr service is back to normal activity (no longer restarting every 3 minutes). I noticed that the trash purge activity was also stuck; once the mgr service became stable again, the purge operations resumed as well. Now I guess I'll have to retry the upgrade procedure with cephadm and see whether it starts this time...

Giuseppe

On 18.05.22, 14:19, "Eugen Block" wrote:

Do you see anything suspicious in /var/log/ceph/cephadm.log? Also check the mgr logs for any hints.

Zitat von Lo Re Giuseppe:

> Hi,
>
> We have happily tested the upgrade from v15.2.16 to v16.2.7 with
> cephadm on a test cluster made of 3 nodes, and everything went smoothly.
> Today we started the very same operation on the production cluster
> (20 OSD servers, 720 HDDs) and the upgrade process doesn't do
> anything at all...
>
> To be more specific, we issued the command
>
> ceph orch upgrade start --image quay.io/ceph/ceph:v16.2.7
>
> and soon after "ceph -s" reports
>
> Upgrade to quay.io/ceph/ceph:v16.2.7 (0s)
>     []
>
> but only for a few seconds; after that:
>
> [root@naret-monitor01 ~]# ceph -s
>   cluster:
>     id:     63334166-d991-11eb-99de-40a6b72108d0
>     health: HEALTH_OK
>
>   services:
>     mon: 3 daemons, quorum naret-monitor01,naret-monitor02,naret-monitor03 (age 7d)
>     mgr: naret-monitor01.tvddjv(active, since 60s), standbys: naret-monitor02.btynnb
>     mds: cephfs:1 {0=cephfs.naret-monitor01.uvevbf=up:active} 2 up:standby
>     osd: 760 osds: 760 up (since 6d), 760 in (since 2w)
>     rgw: 3 daemons active (cscs-realm.naret-zone.naret-rgw01.qvhhbi,
>          cscs-realm.naret-zone.naret-rgw02.pduagk,
>          cscs-realm.naret-zone.naret-rgw03.aqdkkb)
>
>   task status:
>
>   data:
>     pools:   30 pools, 16497 pgs
>     objects: 833.14M objects, 3.1 PiB
>     usage:   5.0 PiB used, 5.9 PiB / 11 PiB avail
>     pgs:     16460 active+clean
>              37    active+clean+scrubbing+deep
>
>   io:
>     client: 4.7 MiB/s rd, 4.0 MiB/s wr, 122 op/s rd, 47 op/s wr
>
>   progress:
>     Removing image fulen-hdd/c991f6fdf41964 from trash (53s)
>       [] (remaining: 81m)
>
> The command "ceph orch upgrade status" says:
>
> {
>     "target_image": "quay.io/ceph/ceph:v16.2.7",
>     "in_progress": true,
>     "services_complete": [],
>     "message": ""
> }
>
> It doesn't even pull the container image.
> I have tested that the podman pull command works; I was able to pull
> quay.io/ceph/ceph:v16.2.7.
>
> "ceph -w" and "ceph -W cephadm" don't report any activity related to
> the upgrade.
>
> Has anyone seen anything similar?
> Do you have advice on how to work out what's keeping the upgrade
> process from actually starting?
>
> Thanks in advance,
>
> Giuseppe
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
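For what it's worth, the stalled state described above is visible mechanically in the "ceph orch upgrade status" output: "in_progress" is true while "services_complete" and "message" stay empty. A minimal Python sketch of that check (the stuck-detection heuristic is my own assumption, not an official cephadm test; the sample JSON is the exact output quoted in the message):

```python
import json

def upgrade_looks_stuck(status_json: str) -> bool:
    """Heuristic check on 'ceph orch upgrade status' output: an upgrade
    that claims to be in progress but reports no completed services and
    no status message may be stalled, as in the thread above."""
    status = json.loads(status_json)
    return (
        status.get("in_progress", False)
        and not status.get("services_complete")
        and not status.get("message")
    )

# The exact JSON reported in the message above:
report = """
{
    "target_image": "quay.io/ceph/ceph:v16.2.7",
    "in_progress": true,
    "services_complete": [],
    "message": ""
}
"""
print(upgrade_looks_stuck(report))  # True for the output shown above
```

A healthy upgrade normally populates "services_complete" (e.g. with "mgr") or at least a progress "message" fairly quickly, so several minutes in this state is worth investigating.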
[ceph-users] Re: Upgrade from v15.2.16 to v16.2.7 not starting
Hi,

I didn't notice anything suspicious in the mgr logs, nor in cephadm.log (I'm attaching an extract of the latest entries). What I have noticed is that the active mgr container gets restarted about every 3 minutes (as reported by "ceph -w"):

"""
2022-05-18T15:30:49.883238+0200 mon.naret-monitor01 [INF] Active manager daemon naret-monitor01.tvddjv restarted
2022-05-18T15:30:49.889294+0200 mon.naret-monitor01 [INF] Activating manager daemon naret-monitor01.tvddjv
2022-05-18T15:30:50.832200+0200 mon.naret-monitor01 [INF] Manager daemon naret-monitor01.tvddjv is now available
2022-05-18T15:34:16.979735+0200 mon.naret-monitor01 [INF] Active manager daemon naret-monitor01.tvddjv restarted
2022-05-18T15:34:16.985531+0200 mon.naret-monitor01 [INF] Activating manager daemon naret-monitor01.tvddjv
2022-05-18T15:34:18.246784+0200 mon.naret-monitor01 [INF] Manager daemon naret-monitor01.tvddjv is now available
2022-05-18T15:37:34.576159+0200 mon.naret-monitor01 [INF] Active manager daemon naret-monitor01.tvddjv restarted
2022-05-18T15:37:34.582935+0200 mon.naret-monitor01 [INF] Activating manager daemon naret-monitor01.tvddjv
2022-05-18T15:37:35.821200+0200 mon.naret-monitor01 [INF] Manager daemon naret-monitor01.tvddjv is now available
2022-05-18T15:40:00.000148+0200 mon.naret-monitor01 [INF] overall HEALTH_OK
2022-05-18T15:40:52.456182+0200 mon.naret-monitor01 [INF] Active manager daemon naret-monitor01.tvddjv restarted
2022-05-18T15:40:52.461826+0200 mon.naret-monitor01 [INF] Activating manager daemon naret-monitor01.tvddjv
2022-05-18T15:40:53.787353+0200 mon.naret-monitor01 [INF] Manager daemon naret-monitor01.tvddjv is now available
"""

I'm also attaching the active mgr process logs.

The cluster is working fine, but I wonder whether this behaviour of mgr/cephadm is itself wrong and might be causing the upgrade to stall.

Thanks,

Giuseppe

On 18.05.22, 14:19, "Eugen Block" wrote:

Do you see anything suspicious in /var/log/ceph/cephadm.log? Also check the mgr logs for any hints.
Zitat von Lo Re Giuseppe:

> Hi,
>
> We have happily tested the upgrade from v15.2.16 to v16.2.7 with
> cephadm on a test cluster made of 3 nodes, and everything went smoothly.
> Today we started the very same operation on the production cluster
> (20 OSD servers, 720 HDDs) and the upgrade process doesn't do
> anything at all...
>
> To be more specific, we issued the command
>
> ceph orch upgrade start --image quay.io/ceph/ceph:v16.2.7
>
> and soon after "ceph -s" reports
>
> Upgrade to quay.io/ceph/ceph:v16.2.7 (0s)
>     []
>
> but only for a few seconds; after that:
>
> [root@naret-monitor01 ~]# ceph -s
>   cluster:
>     id:     63334166-d991-11eb-99de-40a6b72108d0
>     health: HEALTH_OK
>
>   services:
>     mon: 3 daemons, quorum naret-monitor01,naret-monitor02,naret-monitor03 (age 7d)
>     mgr: naret-monitor01.tvddjv(active, since 60s), standbys: naret-monitor02.btynnb
>     mds: cephfs:1 {0=cephfs.naret-monitor01.uvevbf=up:active} 2 up:standby
>     osd: 760 osds: 760 up (since 6d), 760 in (since 2w)
>     rgw: 3 daemons active (cscs-realm.naret-zone.naret-rgw01.qvhhbi,
>          cscs-realm.naret-zone.naret-rgw02.pduagk,
>          cscs-realm.naret-zone.naret-rgw03.aqdkkb)
>
>   task status:
>
>   data:
>     pools:   30 pools, 16497 pgs
>     objects: 833.14M objects, 3.1 PiB
>     usage:   5.0 PiB used, 5.9 PiB / 11 PiB avail
>     pgs:     16460 active+clean
>              37    active+clean+scrubbing+deep
>
>   io:
>     client: 4.7 MiB/s rd, 4.0 MiB/s wr, 122 op/s rd, 47 op/s wr
>
>   progress:
>     Removing image fulen-hdd/c991f6fdf41964 from trash (53s)
>       [] (remaining: 81m)
>
> The command "ceph orch upgrade status" says:
>
> {
>     "target_image": "quay.io/ceph/ceph:v16.2.7",
>     "in_progress": true,
>     "services_complete": [],
>     "message": ""
> }
>
> It doesn't even pull the container image.
> I have tested that the podman pull command works; I was able to pull
> quay.io/ceph/ceph:v16.2.7.
>
> "ceph -w" and "ceph -W cephadm" don't report any activity related to
> the upgrade.
>
> Has anyone seen anything similar?
> Do you have advice on how to work out what's keeping the upgrade
> process from actually starting?
>
> Thanks in advance,
>
> Giuseppe
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
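The mon log excerpt earlier in this message shows the active mgr bouncing on a suspiciously regular cadence. A quick sanity check is to compute the gap between consecutive "restarted" events; a minimal Python sketch (the log lines are copied verbatim from the excerpt; the timestamp-parsing approach is my own, not part of any Ceph tooling):

```python
from datetime import datetime

# The "Active manager daemon ... restarted" events from the mon log
# excerpt quoted above.
restart_lines = [
    "2022-05-18T15:30:49.883238+0200 mon.naret-monitor01 [INF] Active manager daemon naret-monitor01.tvddjv restarted",
    "2022-05-18T15:34:16.979735+0200 mon.naret-monitor01 [INF] Active manager daemon naret-monitor01.tvddjv restarted",
    "2022-05-18T15:37:34.576159+0200 mon.naret-monitor01 [INF] Active manager daemon naret-monitor01.tvddjv restarted",
    "2022-05-18T15:40:52.456182+0200 mon.naret-monitor01 [INF] Active manager daemon naret-monitor01.tvddjv restarted",
]

def restart_intervals(lines):
    """Seconds between consecutive restart events, parsed from the
    leading ISO-8601 timestamp on each log line."""
    fmt = "%Y-%m-%dT%H:%M:%S.%f%z"
    stamps = [datetime.strptime(line.split()[0], fmt) for line in lines]
    return [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]

print(restart_intervals(restart_lines))  # roughly 200 s between restarts
```

The intervals come out at just over three minutes each, which is consistent with a recurring crash-and-respawn rather than random failures and would explain why the cephadm upgrade loop never gets going.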
[ceph-users] Re: Upgrade from v15.2.16 to v16.2.7 not starting
Do you see anything suspicious in /var/log/ceph/cephadm.log? Also check the mgr logs for any hints.

Zitat von Lo Re Giuseppe:

Hi,

We have happily tested the upgrade from v15.2.16 to v16.2.7 with cephadm on a test cluster made of 3 nodes, and everything went smoothly. Today we started the very same operation on the production cluster (20 OSD servers, 720 HDDs) and the upgrade process doesn't do anything at all...

To be more specific, we issued the command

ceph orch upgrade start --image quay.io/ceph/ceph:v16.2.7

and soon after "ceph -s" reports

Upgrade to quay.io/ceph/ceph:v16.2.7 (0s)
    []

but only for a few seconds; after that:

[root@naret-monitor01 ~]# ceph -s
  cluster:
    id:     63334166-d991-11eb-99de-40a6b72108d0
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum naret-monitor01,naret-monitor02,naret-monitor03 (age 7d)
    mgr: naret-monitor01.tvddjv(active, since 60s), standbys: naret-monitor02.btynnb
    mds: cephfs:1 {0=cephfs.naret-monitor01.uvevbf=up:active} 2 up:standby
    osd: 760 osds: 760 up (since 6d), 760 in (since 2w)
    rgw: 3 daemons active (cscs-realm.naret-zone.naret-rgw01.qvhhbi,
         cscs-realm.naret-zone.naret-rgw02.pduagk,
         cscs-realm.naret-zone.naret-rgw03.aqdkkb)

  task status:

  data:
    pools:   30 pools, 16497 pgs
    objects: 833.14M objects, 3.1 PiB
    usage:   5.0 PiB used, 5.9 PiB / 11 PiB avail
    pgs:     16460 active+clean
             37    active+clean+scrubbing+deep

  io:
    client: 4.7 MiB/s rd, 4.0 MiB/s wr, 122 op/s rd, 47 op/s wr

  progress:
    Removing image fulen-hdd/c991f6fdf41964 from trash (53s)
      [] (remaining: 81m)

The command "ceph orch upgrade status" says:

{
    "target_image": "quay.io/ceph/ceph:v16.2.7",
    "in_progress": true,
    "services_complete": [],
    "message": ""
}

It doesn't even pull the container image.
I have tested that the podman pull command works; I was able to pull quay.io/ceph/ceph:v16.2.7.

"ceph -w" and "ceph -W cephadm" don't report any activity related to the upgrade.

Has anyone seen anything similar?
Do you have advice on how to work out what's keeping the upgrade process from actually starting?

Thanks in advance,

Giuseppe

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
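To follow the suggestion above on a large log, a grep-style filter over /var/log/ceph/cephadm.log can narrow things down; a minimal Python sketch (the keyword list is my own assumption about what counts as "suspicious", not an exhaustive definition):

```python
# Scan log text for lines hinting at failures, e.g. Python tracebacks
# from the cephadm mgr module or failed host/daemon operations.
SUSPICIOUS = ("ERROR", "Traceback", "Failed", "exception")

def suspicious_lines(log_text: str):
    """Yield (line_number, line) pairs containing one of the keywords."""
    for n, line in enumerate(log_text.splitlines(), start=1):
        if any(keyword in line for keyword in SUSPICIOUS):
            yield n, line

# Usage (path as mentioned in the message above):
# with open("/var/log/ceph/cephadm.log") as f:
#     for n, line in suspicious_lines(f.read()):
#         print(f"{n}: {line}")
```

A repeating traceback shortly before each mgr restart would be a strong hint at what is killing the daemon and, with it, the upgrade loop.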