After a test in a non-production environment, we decided to upgrade our
running cluster to jewel 10.2.3. Our cluster has 3 monitors and 8 OSD nodes
with 20 disks each. The cluster is running hammer 0.94.5 with tunables set
to "bobtail".
As the cluster is in production and it wasn't possible to upgrade the ceph
clients at the same time, we decided to keep the tunables at bobtail.
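
For reference, this is roughly how we checked and pinned the tunables (a
sketch from memory, the exact show-tunables output may differ):

  ceph osd crush show-tunables      # show the profile currently in effect
  ceph osd crush tunables bobtail   # keep bobtail so the older clients can still connect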
First step was to upgrade the three monitors: no problem.
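To verify the monitors after the restart, we used something like this on
each mon host (<id> being the local monitor name):

  ceph --version                 # version of the installed binaries
  ceph daemon mon.<id> version   # version the running monitor actually reports
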
Second step: set noout on the cluster and then upgrade the first node. As
soon as we stopped the OSDs on that first node, the cluster went into an
error state with a lot of PGs peering. We lost a lot of disks on the VMs
hosted by the ceph clients, and a lot of OSDs kept flapping (down then up)
for hours.
So we decided to stop all the VMs, and thus all I/O on the cluster, to let
it stabilize; that took about 3 hours. With no I/O on the cluster, we
managed to upgrade 4 nodes (out of 8).
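
For what it's worth, the per-node procedure was roughly this (a sketch of
the commands around the package upgrade itself):

  ceph osd set noout       # keep data from rebalancing while the node's OSDs are down
  # ... upgrade the ceph packages on the node and restart its OSDs ...
  ceph -s                  # watch peering / recovery
  ceph health detail       # lists the PGs stuck peering
  ceph osd unset noout     # only once the cluster is back to active+clean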

At this point we have pools that are spread only across these 4 nodes,
which are now on jewel. But even now, if we stop an OSD on one of these 4
nodes, PGs go back to peering and the cluster ends up in an error status,
which is not acceptable for production.
Is this behaviour caused by the mix of nodes on jewel and on hammer?
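To confirm the mix, this is roughly what we run to see which version each
OSD actually reports and which PGs are stuck (osd.0 is just an example id):

  ceph tell osd.0 version        # ask one OSD which version it runs
  ceph tell osd.* version        # ask all of them (slow on a big cluster)
  ceph pg dump_stuck inactive    # PGs stuck peering / inactive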

We will upgrade the last 4 nodes next weekend, so all the OSD nodes will be
on jewel. Do we have to wait for the ceph clients to be upgraded to jewel
before we get a stable cluster back? Do we have to wait until the tunables
are set to optimal?
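My understanding is that switching the tunables would be a single command,
but that it triggers a large data movement and needs recent enough clients
(kernel / librbd), so I'd rather be sure before doing it:

  ceph osd crush show-tunables     # profile currently in effect
  ceph osd crush tunables optimal  # switch; expect a large rebalance afterwards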

I saw in the release notes that an upgrade from hammer to jewel could be
done without downtime... I know there is no guarantee, but for now we still
have an unstable cluster and are praying not to lose an OSD before the last
upgrade operation.

If you have some advice, I'll take it :)

Vincent