Hi,

I am aware of the problems Frank has raised. However, it should also be mentioned that many critical bugs have been fixed in the major versions, and we are working on fixes ourselves.

Over the last 10 years, we and others have written a lot of tooling for ourselves to improve migration, update and upgrade paths and strategies.

We also test each new version for up to 6 months before putting it into production.

That said, our goal is always to run Ceph versions that still receive backports and, on the other hand, to use only the features we really need. Our developers also aim to get bug fixes upstream and into the supported releases.

By the way, regarding performance I recommend the Cephalocon presentations by Adam and Mark. There you can see what effort is going into improving Ceph performance in current and future versions.

Regards, Joachim


___________________________________
ceph ambassador DACH
ceph consultant since 2012

Clyso GmbH - Premier Ceph Foundation Member

https://www.clyso.com/

On 15.05.23 at 12:11, Frank Schilder wrote:
What are the main reasons for not upgrading to the latest and greatest?
Because more often than not it isn't.

I guess when you write "latest and greatest" you are talking about features. When we admins talk about 
"latest and greatest" we talk about stability. The times when one could jump with a production 
system onto a "stable" release ending in .2 are long gone. Anyone who becomes an early 
adopter is more and more likely to run into serious issues. Which leads to more admins holding off on 
upgrades. Which in turn leads to more bugs being discovered only in late point releases. Which again 
makes more admins postpone an upgrade. A vicious cycle.

A long time ago there was a discussion about exactly this problem, and the 
admins were pretty much in favor of extending the release cycle to at least 
4 years, if not longer. It's simply too many releases with too many serious bugs 
left unfixed, lately not even during their official lifetime. Octopus still has 
serious bugs but is EOL.

I'm not surprised that admins give up on upgrading entirely and stay on a 
version until their system dies.

To give you one example from my own experience: upgrading from latest Mimic to latest Octopus. 
This experience almost certainly applies to every upgrade that involves an OSD format 
change (the infamous "quick fix" that could take several days per OSD and crash 
entire clusters).

There is an OSD conversion involved in this upgrade, and we found out that out 
of the 2 possible upgrade paths, one leads to a heavily performance-degraded 
cluster with no way to recover other than redeploying all OSDs step by 
step. Funnily enough, the problematic procedure is the one described in the 
documentation - which has not been updated to this day, despite users still getting 
caught in this trap.
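
To make this concrete with a minimal sketch (not the exact procedure we used; the 
option name and CLI calls below are assumptions you should verify against the 
release notes of your target version): one way to keep a restarted OSD from 
silently kicking off the on-mount omap conversion is to pin the corresponding 
option off before the upgrade and run the conversion deliberately afterwards, 
one OSD at a time.

    #!/usr/bin/env python3
    # Sketch only: pin the automatic on-mount omap "quick fix" off so that
    # restarting an OSD during the upgrade does not start a conversion that
    # can run for hours or days. Assumes the `ceph` CLI is available and that
    # `bluestore_fsck_quick_fix_on_mount` is the option governing this
    # behaviour on your release (an assumption - verify it).
    import subprocess

    OPTION = "bluestore_fsck_quick_fix_on_mount"

    def ceph(*args: str) -> str:
        """Run a ceph CLI command and return its stripped stdout."""
        return subprocess.check_output(["ceph", *args], text=True).strip()

    if __name__ == "__main__":
        # Disable the automatic conversion for all OSDs before restarting any.
        ceph("config", "set", "osd", OPTION, "false")
        # Read the value back so the change is confirmed before proceeding.
        print(f"{OPTION} = {ceph('config', 'get', 'osd', OPTION)}")

The conversion itself can then be run explicitly on a stopped OSD (as far as I 
know, ceph-bluestore-tool repair against the OSD's data path is the offline 
route), which also gives you a chance to time it on a test cluster first. None 
of that tells you which of the two paths leaves you with healthy OSDs - that is 
exactly what you have to find out on a test cluster.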

To give you an idea of the amount of work that is now involved in trying to 
avoid such pitfalls, here is our path:

We set up a test cluster with a script producing a realistic workload and started 
testing the upgrade under load. It took about a month (meaning repeating the 
upgrade each time with a cluster freshly deployed on Mimic and populated from scratch) 
to confirm that we had found a robust path avoiding a number of 
pitfalls along the way - mainly the serious performance degradation due to the OSD 
conversion, but also an issue with stray entries, plus noise. A month! Once we 
were convinced that it would work - meaning we had run it a couple of times 
without any further issues being discovered - we started upgrading our 
production cluster.

It went smoothly until we started the OSD conversion of our FS metadata OSDs. They 
had a special, performance-optimized deployment resulting in a large number of 
100G OSDs at about 30-40% utilization. These OSDs started crashing with some 
weird corruption. It turned out - thanks, Igor! - that while spill-over from the fast to the 
slow device was handled, the other direction was not. Our OSDs crashed because 
Octopus apparently required substantially more space on the slow device and 
could not use the plenty of fast space that was actually available.
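
For anyone with a similar device layout: how much space BlueFS actually consumes 
on each device is worth checking per OSD before (and during) such a conversion. 
Here is a minimal sketch of that check - assuming the `ceph` CLI is on the path, 
that `ceph tell osd.N perf dump` works on your release (otherwise use 
`ceph daemon` on the OSD host), and that the usual `bluefs` counters such as 
`db_used_bytes` and `slow_used_bytes` are exposed; all of these are assumptions 
to verify.

    #!/usr/bin/env python3
    # Sketch only: report how much BlueFS space each OSD uses on the DB device
    # and how much has spilled onto the slow device. Assumptions (verify for
    # your release): the `ceph` CLI is present, `ceph tell osd.N perf dump` is
    # supported, and the usual `bluefs` perf counters are exposed.
    import json
    import subprocess

    def ceph(*args: str) -> str:
        """Run a ceph CLI command and return its raw stdout."""
        return subprocess.check_output(["ceph", *args], text=True)

    def bluefs_usage(osd_id: int) -> dict:
        # perf dump output is JSON; the bluefs section tracks per-device usage.
        bluefs = json.loads(ceph("tell", f"osd.{osd_id}", "perf", "dump")).get("bluefs", {})
        return {k: int(bluefs.get(k, 0)) for k in ("db_used_bytes", "slow_used_bytes")}

    if __name__ == "__main__":
        for osd_id in json.loads(ceph("osd", "ls", "--format", "json")):
            usage = bluefs_usage(osd_id)
            print(f"osd.{osd_id}: db {usage['db_used_bytes'] / 2**30:.1f} GiB, "
                  f"slow {usage['slow_used_bytes'] / 2**30:.1f} GiB spilled")

It won't predict how much extra space a conversion will need, but at least it 
tells you where each OSD stands before you pull the trigger.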

The whole thing ended in 3 days of complete downtime and me working 12-hour 
days over the weekend. We managed to recover only because we had a 
larger delivery of hardware already on-site and I could scavenge parts from 
there.

So, the story is that even after 1 month of testing we still ran into 3 days of 
downtime, because there was yet another unannounced change that broke a configuration 
that had been working fine for years on Mimic.

To say the same thing in different words: major version upgrades have become 
very disruptive and require a lot of effort to get halfway right. And I'm not 
talking about the deployment system here.

Add to this the still-open cases discussed on this list - MDS dentry 
corruption, snapshots disappearing or getting corrupted combined with a lack of good 
built-in tools for detection and repair, performance degradation, etc. - all not 
even addressed in Pacific. In this state the devs are pushing for Pacific 
to go EOL while at the same time the admins become ever more reluctant to 
upgrade.

In my specific case, I had planned to upgrade at least to Pacific this year, but my 
time budget simply doesn't allow for verifying the procedure and 
checking that all bugs relevant to us have been addressed. I gave up. Maybe 
next year. Maybe by then it's even a bit closer to rock solid.

So, to get back to my starting point: we admins actually value rock solid over features. I know that 
this is boring for devs, but nothing is worse than nobody using your latest and greatest - which 
probably was the motivation for your question. If the upgrade paths were more solid, and questions like 
"why does an OSD conversion not lead to an OSD that is identical to one deployed 
freshly" or "where does the performance go" were actually tracked down, 
we would be much less reluctant to upgrade.

And then, but only then, would the latest and greatest features be of interest.

I will bring it up here again: with the complexity the code base has reached by 
now, the 2-year release cadence is way too fast; it doesn't leave releases 
enough time to mature, nor us enough time to upgrade at the same pace. More and 
more admins will be several cycles behind, and we are reaching the point where 
major bugs in so-called EOL versions are discovered only after they have gone 
EOL, because large clusters haven't even reached that version by then. That 
might become a fundamental blocker to upgrades entirely.

An alternative to lengthening the release cycle would be to keep more releases 
in the supported lifetime window instead of only the last 2 majors. 4 years really 
is nothing when it comes to storage.

Hope this is helpful and sheds some light on the mystery of why admins don't want 
to move.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Konstantin Shalygin <k0...@k0ste.ru>
Sent: Monday, May 15, 2023 10:43 AM
To: Tino Todino
Cc: ceph-users@ceph.io
Subject: [ceph-users] Re: CEPH Version choice

Hi,

On 15 May 2023, at 11:37, Tino Todino <ti...@marlan-tech.co.uk> wrote:

What are the main reasons for not upgrading to the latest and greatest?
One of the main reasons is "we just can't", because your Ceph-based products will 
get worse in real (not benchmark) performance, see [1]


[1] 
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/2E67NW6BEAVITL4WTAAU3DFLW7LJX477/


k
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
