Some interesting existing work on this subject is "Understanding and Detecting 
Software Upgrade Failures in Distributed Systems" - 
https://dl.acm.org/doi/10.1145/3477132.3483577, also summarized by Andrey 
Satarin here: 
https://asatarin.github.io/talks/2022-09-upgrade-failures-in-distributed-systems/

They specifically tested Cassandra upgrades, and have a solid list of defects 
that they found. They also describe their testing mechanism DUPTester, which 
includes a component that confirms that the next version can start up on the 
state left over from the previous version. The paper highlights a wider scope 
of upgrade defects, beyond SSTable version support.

I believe the project would benefit from expanding our test suite similarly, by 
parametrizing more tests on upgrade version pairs.
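
As a rough sketch of what parametrizing on version pairs could look like (the 
version list and helper names below are hypothetical, not existing test code):

```python
from itertools import combinations

# Hypothetical release lines; the real list would come from whatever
# versions the project agrees to support.
SUPPORTED_VERSIONS = ["3.11", "4.0", "4.1", "5.0"]


def version_key(version):
    """Turn '4.1' into (4, 1) so versions sort numerically, not as strings."""
    return tuple(int(part) for part in version.split("."))


def upgrade_pairs(versions):
    """Return every (old, new) pair with old < new, including skip-version upgrades."""
    ordered = sorted(versions, key=version_key)
    return list(combinations(ordered, 2))


def downgrade_pairs(versions):
    """Return every (new, old) pair, i.e. the reverse of each upgrade pair."""
    return [(new, old) for old, new in upgrade_pairs(versions)]
```

With pytest, a test could then be decorated with 
`@pytest.mark.parametrize("old,new", upgrade_pairs(SUPPORTED_VERSIONS))`, and 
the same test body would run once per pair. Whether to cover all pairs or only 
adjacent ones is a cost trade-off the project would need to decide.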

Also, per Benedict's comment:

> It’s a commitment, and it requires every contributor to consider it as part 
> of work they produce.

But it shouldn't be a burden. The ability to downgrade is a testable problem, 
so I see this work as a function of the suite of tests the project is willing 
to agree on supporting.

Specifically, I agree with Scott's proposal to emulate the HDFS 
upgrade-then-finalize approach. I would also support automatic finalization 
based on a time threshold or similar, to balance the priorities of safe and 
straightforward upgrades. Users need to be aware of the range of SSTable 
formats supported by a given version, and of what to do when their SSTables 
would not be supported by an upcoming upgrade.
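
To make the idea concrete, here is a minimal sketch of upgrade-then-finalize 
with an optional time threshold, plus a check for SSTables outside a supported 
format range. All names are hypothetical, and it assumes format versions are 
short strings that order lexicographically (as identifiers like "mc" or "nb" 
do), which is a simplification:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class PendingUpgrade:
    """Tracks an upgrade that can still be rolled back until it is finalized."""
    started_at: datetime
    auto_finalize_after: Optional[timedelta] = None  # None means manual finalize only
    finalized: bool = False

    def finalize(self):
        self.finalized = True

    def tick(self, now):
        """Finalize automatically once the configured threshold has elapsed."""
        if (not self.finalized
                and self.auto_finalize_after is not None
                and now - self.started_at >= self.auto_finalize_after):
            self.finalize()

    def can_downgrade(self):
        return not self.finalized


def unsupported_sstables(sstable_formats, oldest_supported, newest_supported):
    """Return the distinct formats that fall outside the supported range."""
    return sorted(
        fmt for fmt in set(sstable_formats)
        if not (oldest_supported <= fmt <= newest_supported)
    )
```

An operator-facing pre-upgrade check could call something like 
`unsupported_sstables(...)` and refuse to proceed (or point at the rewrite 
step) when the result is non-empty.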

--
Abe
