Hi,
I am aware of the problems that Frank has raised. However, it should also be
mentioned that many critical bugs have been fixed in the major versions.
We are working on the fixes ourselves.
Over the last 10 years, we and others have written a lot of tooling for
ourselves to improve migration, update, and upgrade paths and strategies.
We also test each new version for up to 6 months before putting it into
production.
However, our goal is always to run Ceph versions that still receive backports
and, at the same time, to use only the features we really need.
Our developers also always aim to get bug fixes upstream and into the
supported versions.
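As a practical aside, checking whether every daemon is already on a release
that still receives backports is easy to script. A minimal sketch around the
ceph versions command, assuming only that the ceph CLI is reachable from an
admin node (Python here as an example of the kind of in-house tooling
mentioned above):

    #!/usr/bin/env python3
    # Minimal sketch: summarize which Ceph releases the daemons in a
    # cluster are actually running, e.g. to spot components that were
    # left behind during an upgrade. Assumes the "ceph" CLI works.
    import json
    import subprocess

    out = subprocess.check_output(["ceph", "versions", "--format", "json"])
    for daemon_type, by_version in json.loads(out).items():
        if daemon_type == "overall":
            continue
        for version_string, count in by_version.items():
            print(f"{daemon_type}: {count} daemon(s) on {version_string}")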
By the way, regarding performance, I recommend the Cephalocon presentations
by Adam and Mark. There you can learn what efforts are being made to improve
Ceph performance in current and future versions.
Regards, Joachim
___________________________________
ceph ambassador DACH
ceph consultant since 2012
Clyso GmbH - Premier Ceph Foundation Member
https://www.clyso.com/
On 15.05.23 at 12:11, Frank Schilder wrote:
What are the main reasons for not upgrading to the latest and greatest?
Because more often than not it isn't.
I guess when you write "latest and greatest" you are talking about features. When we admins talk
about "latest and greatest", we talk about stability. The days when one could jump with a production
system onto a "stable" release ending in .2 are long gone. Anyone who becomes an early
adopter is ever more likely to experience serious issues. Which leads to more admins holding off on
upgrades. Which in turn leads to more bugs being discovered only in late releases. Which again makes
more admins postpone an upgrade. A vicious cycle.
A long time ago there was a discussion about exactly this problem, and the
admins were pretty much in favor of lengthening the release cycle to at least
4 years, if not longer. It's simply too many releases with too many serious
bugs that go unfixed, lately not even during their official lifetime. Octopus
still has serious bugs but is EOL.
I'm not surprised that admins give up on upgrading entirely and stay on a
version until their system dies.
To give you one example from my own experience: upgrading from latest mimic to latest octopus.
This experience almost certainly applies to every upgrade that involves an OSD format
change (the infamous "quick fix" that could take several days per OSD and crash
entire clusters).
There is an OSD conversion involved in this upgrade, and we found out that of
the 2 possible upgrade paths, one leads to a heavily performance-degraded
cluster with no way to recover other than redeploying all OSDs step by
step. Funnily enough, the problematic procedure is the one described in the
documentation; to this day it hasn't been updated, despite users still getting
caught in this trap.
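For readers who run into the same trap: the implicit conversion on OSD restart
is controlled by the bluestore_fsck_quick_fix_on_mount option, and the
mitigation usually discussed on this list is to disable it before restarting
OSDs on the new release and to trigger the conversion deliberately, one OSD at
a time, afterwards. A rough sketch of that idea only; the option name is real,
the helper and the restart handling are illustrative and not the exact
procedure we validated:

    #!/usr/bin/env python3
    # Rough sketch: keep the omap conversion from running implicitly
    # on OSD restart during the upgrade, then run it deliberately per
    # OSD later. bluestore_fsck_quick_fix_on_mount is a real option;
    # everything else here is illustrative.
    import subprocess

    def ceph(*args):
        subprocess.check_call(["ceph", *args])

    # Before restarting any OSD on the new release:
    ceph("config", "set", "osd",
         "bluestore_fsck_quick_fix_on_mount", "false")

    def convert_osd(osd_id: int):
        # Enable the conversion for this one OSD only ...
        ceph("config", "set", f"osd.{osd_id}",
             "bluestore_fsck_quick_fix_on_mount", "true")
        # ... restart the OSD out of band (systemd/orchestrator) and
        # wait for it to rejoin healthy, then switch it off again.
        ceph("config", "set", f"osd.{osd_id}",
             "bluestore_fsck_quick_fix_on_mount", "false")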
To give you an idea of how much work is now involved in an attempt to avoid
such pitfalls, here is the path we took:
We set up a test cluster with a script producing a realistic workload and
started testing the upgrade under load. It took about a month (repeating the
upgrade each time with a cluster deployed on mimic and populated from scratch)
to confirm that we had found a robust path around a number of pitfalls, mainly
the serious performance degradation due to the OSD conversion, but also an
issue with stray entries, plus noise. A month! Once we were convinced it would
work, meaning we had run it a couple of times without discovering any further
issues, we started upgrading our production cluster.
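For the curious: the load script itself does not need to be fancy. A trivial
sketch of the idea, using a hypothetical throwaway pool and the stock rados
bench tool (our actual script also exercised CephFS and snapshots):

    #!/usr/bin/env python3
    # Trivial sketch of a load generator for upgrade testing. The
    # pool name is hypothetical; rados bench is the stock tool.
    import subprocess
    import time

    POOL = "upgradetest"  # hypothetical throwaway pool

    while True:
        # Sustained writes, then random reads of the written objects,
        # so the upgrade is always observed under client I/O.
        subprocess.run(["rados", "bench", "-p", POOL, "60", "write",
                        "--no-cleanup"], check=True)
        subprocess.run(["rados", "bench", "-p", POOL, "60", "rand"],
                       check=True)
        subprocess.run(["rados", "-p", POOL, "cleanup"], check=True)
        time.sleep(5)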
It went smoothly until we started the OSD conversion of our FS meta-data OSDs.
They had a special performance-optimized deployment resulting in a large
number of 100G OSDs at about 30-40% utilization. These OSDs started crashing
with some weird corruption. It turns out (thanks, Igor!) that while spill-over
from the fast to the slow drive was handled, the other direction was not. Our
OSDs crashed because octopus apparently required substantially more space on
the slow device and couldn't use the plentiful fast space that was actually
available.
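At least the direction that is handled shows up as a health check, so
something like the following can be polled during a conversion to pause a
rollout early. BLUEFS_SPILLOVER is the real health-check code; the reverse
problem we hit produced outright crashes instead of a warning:

    #!/usr/bin/env python3
    # Minimal sketch: check for BlueFS spillover during an OSD
    # conversion. BLUEFS_SPILLOVER is the real health check code; the
    # JSON layout is the one emitted by "ceph health detail -f json".
    import json
    import subprocess

    out = subprocess.check_output(
        ["ceph", "health", "detail", "--format", "json"])
    check = json.loads(out).get("checks", {}).get("BLUEFS_SPILLOVER")
    if check:
        print("BlueFS spillover detected:")
        for entry in check.get("detail", []):
            print(" ", entry.get("message"))
    else:
        print("no BlueFS spillover reported")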
The whole thing ended in 3 days of complete downtime and me working 12-hour
days over the weekend. We managed to recover from this only because we had a
larger delivery of hardware already on-site and I could scavenge parts from
there.
So the story is: after 1 month of testing, we still ran into 3 days of
downtime, because there was yet another unannounced change that broke a config
that had been working fine for years on mimic.
To say the same thing in different words: major version upgrades have become
very disruptive and require a lot of effort to get halfway right. And I'm not
talking about the deployment system here.
Add to this list the still-open cases discussed on the list: MDS dentry
corruption, snapshots disappearing or corrupting combined with a lack of good
built-in tools for detection and repair, performance degradation, etc., all
not even addressed in pacific. In this state the devs are pushing for pacific
to go EOL while, at the same time, the admins become ever more reluctant to
upgrade.
In my specific case, I had planned to upgrade at least to pacific this year,
but my time budget simply doesn't allow for verifying the procedure and
checking that all bugs relevant to us have been addressed. I gave up. Maybe
next year. Maybe by then it's even a bit closer to rock solid.
So, to get back to my starting point: we admins actually value rock solid over features. I know that
this is boring for devs, but nothing is worse than nobody using your latest and greatest, which
probably was the motivation for your question. If the upgrade paths were more solid, and if questions
like "why does an OSD conversion not lead to an OSD that is identical to a freshly deployed one?"
or "where does the performance go?" were actually tracked down,
we would be much less reluctant to upgrade.
And then, but only then, would the latest and greatest features be of interest.
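To make the first of those questions concrete: even a crude before/after
comparison would be a start. A sketch of the kind of check I mean, diffing
per-OSD latencies around a conversion; it parses the plain-text ceph osd perf
table, whose column layout (osd, commit_latency(ms), apply_latency(ms)) is
assumed here, not guaranteed:

    #!/usr/bin/env python3
    # Sketch: snapshot per-OSD commit latencies before and after a
    # conversion and diff them. The "ceph osd perf" column layout is
    # assumed, not guaranteed.
    import subprocess

    def osd_latencies():
        out = subprocess.check_output(["ceph", "osd", "perf"], text=True)
        stats = {}
        for line in out.splitlines()[1:]:  # skip the header line
            parts = line.split()
            if len(parts) >= 3 and parts[0].isdigit():
                stats[int(parts[0])] = (float(parts[1]), float(parts[2]))
        return stats

    before = osd_latencies()
    input("run the OSD conversion, then press enter...")
    after = osd_latencies()
    for osd, (commit_before, _) in sorted(before.items()):
        commit_after = after.get(osd, (float("nan"), 0.0))[0]
        print(f"osd.{osd}: commit {commit_before} ms -> {commit_after} ms")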
I will bring it up here again: with the complexity the code base has reached
by now, the 2-year release cadence is way too fast; it doesn't give releases
enough maturity for admins to upgrade at the same pace. More and more admins
will be several cycles behind, and we are reaching the point where major bugs
are discovered only in so-called EOL versions, because versions go EOL before
large clusters have even reached them. Which might become a fundamental
blocker to upgrades entirely.
An alternative to lengthening the release cycle would be to keep more releases
in the supported lifetime window instead of only the last 2 major releases.
4 years really is nothing when it comes to storage.
Hope this is helpful and sheds some light on the mystery of why admins don't
want to move.
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Konstantin Shalygin <k0...@k0ste.ru>
Sent: Monday, May 15, 2023 10:43 AM
To: Tino Todino
Cc: ceph-users@ceph.io
Subject: [ceph-users] Re: CEPH Version choice
Hi,
On 15 May 2023, at 11:37, Tino Todino <ti...@marlan-tech.co.uk> wrote:
What are the main reasons for not upgrading to the latest and greatest?
One of the main reasons is "just can't": your Ceph-based products will get
worse at real (not benchmark) performance; see [1]
[1]
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/2E67NW6BEAVITL4WTAAU3DFLW7LJX477/
k
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io