Some interesting existing work on this subject is "Understanding and Detecting Software Upgrade Failures in Distributed Systems" - https://dl.acm.org/doi/10.1145/3477132.3483577, also summarized by Andrey Satarin here: https://asatarin.github.io/talks/2022-09-upgrade-failures-in-distributed-systems/
They specifically tested Cassandra upgrades and have a solid list of the defects they found. They also describe their testing mechanism, DUPTester, which includes a component that confirms that the next version can start up on the state left over by the previous version. The paper highlights a wider scope of upgrade defects, beyond SSTable version support.

I believe the project would benefit from expanding our test suite similarly, by parametrizing more tests on upgrade version pairs.

Also, per Benedict's comment:

> It’s a commitment, and it requires every contributor to consider it as part
> of work they produce.

But it shouldn't be a burden. The ability to downgrade is a testable problem, so I see this work as a function of the suite of tests the project is willing to agree on supporting.

Specifically, I agree with Scott's proposal to emulate the HDFS upgrade-then-finalize approach. I would also support automatic finalization based on a time threshold or similar, to balance the priorities of safe and straightforward upgrades. Users need to be aware of the range of SSTable formats supported by a given version, and of how to handle SSTables that an upcoming upgrade would not support.

-- Abe