Cassandra project status update 2022-08-03

Josh McKenzie Wed, 03 Aug 2022 10:17:00 -0700

Greetings everyone! Let's check in on 4.1, see how we're doing:

https://butler.cassandra.apache.org/#/
We had 4 failures on our last run. We've gone back and forth a bit on the 
CASTest failure: the test was introduced back in CASSANDRA-12126 and 
@Ignore'd, but it showed some legitimate failures that should be addressed by 
Paxos V2. If anyone from the discussion (or someone familiar with the area) 
has the cycles to take assignee on the test failure ticket (17461) and 
responsibility for driving it to resolution, that would help clarify our 
efforts there. 
(https://issues.apache.org/jira/browse/CASSANDRA-17461)


Along with that, we saw failures in 
TopPartitionsTest.testServiceTopPartitionsSingleTable (cdc) and 
TestBootstrap.test_simultaneous_bootstrap (offheap). Given both are specific 
configurations of tests that ran successfully to completion in other 
configurations, there's a reasonable chance they're flaky, be it from the 
logic of the test or the CI environment in which they're executing. Neither 
failure appears to have an active JIRA ticket associated with it in butler or 
on the kanban board 
(https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=496&quickFilter=2252),
so we could use a volunteer here to both create those tickets and drive them.

We're close enough that it's time to revisit how we want to treat the 
requirement for no flaky failures before we cut beta 
(https://cwiki.apache.org/confluence/display/CASSANDRA/Release+Lifecycle, "No 
flaky tests - All tests (Unit Tests and DTests) should pass consistently"). 
After seeing a couple of releases with this requirement (4.0 and now 4.1), I'm 
inclined to agree with the comment from Dinesh that we should revise this 
requirement formally if we're going to effectively release with flaky tests 
anyway; best to be honest with ourselves and acknowledge it's not proving to be 
a forcing function for changing behavior. If this email doesn't see much 
traction on this topic I'll hit up the dev list with a DISCUSS thread on it.

The kanban for 4.1 blockers shows us 13 tickets: 
https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=484&quickFilter=2455.
Most of them are assigned and many are in progress; however, we have 3 
unassigned if anyone wants to pick those up: 
https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=484&quickFilter=2455&quickFilter=2160


[New Contributors Getting Started]
One of the three unassigned issues on the 4.1 blocker list, or either of the 2 
failing tests listed above, would be a great place to focus your attention!

Nuts and bolts / env / etc: here's an explanation of various types of 
contribution: https://cassandra.apache.org/_/community.html#how-to-contribute
An overview of the C* architecture: 
https://cassandra.apache.org/doc/latest/cassandra/architecture/overview.html
And here's our getting started contributing guide: 
https://cassandra.apache.org/_/development/index.html
We hang out in #cassandra-dev on https://the-asf.slack.com, and you can ping 
the @cassandra_mentors alias to reach 13 of us who have volunteered to mentor 
new contributors on the project. Looking forward to seeing you there.


[Dev list Digest]
https://lists.apache.org/[email protected]:lte=2w:

The challenge of our eclectic usage of NULL strikes again with CEP-15. Avi 
opened up a ticket about this with 
https://issues.apache.org/jira/browse/CASSANDRA-17762. Caleb's working on the 
CQL support for multi-partition transactions on 
https://issues.apache.org/jira/browse/CASSANDRA-17719 where the general 
sentiment seems to be "let's go with a SQL-congruent syntax".

Discussion about the potential benefits and downsides of a multi-threaded 
flushing CommitLog continues: 
https://lists.apache.org/thread/5j8ljtpdw3g0gyrx6m31gh1gjdkztclg. As this 
project is quite complex and has very different performance characteristics 
over time (in-memory initially only vs. long-term flushed to disk maintaining 
LSM trees), benchmarking features like this has proven difficult. If you have a 
perspective on the costs and benefits, or are interested in balancing that 
complexity against functionality, feel free to chime in.

An interesting question about inclusivity or exclusivity of token ranges and 
API consistency came up thanks to 
https://issues.apache.org/jira/browse/CASSANDRA-17575. 
https://lists.apache.org/thread/4tm626ffnqlvt4cbmopdfpd8w6fpqscd. This link 
doesn't capture the entire thread for some reason; the most clarifying 
observation to me comes from Jeremiah about the current usage of tokens in the 
tool: "Reading the responses here and taking a step back, I think the current 
behavior of nodetool compact is probably the correct behavior. The main use 
case I can see for using nodetool compact is someone wants to take some sstable 
and compact it with all the overlapping sstables"

And last but not least, Claude Warren is looking for a reviewer on 
https://issues.apache.org/jira/browse/CASSANDRA-14218. Looks like Dinesh was 
flagged on that as reviewer a while ago.

[CI Trends]
https://butler.cassandra.apache.org/#/

The last three weeks show us ticking up but the reason is not too surprising:

3.0: 10 -> 14
3.11: 15 -> 17
4.0: 1 -> 6
4.1: 5 -> 4
trunk: 5 -> 7
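
As a quick sanity check on these numbers, the per-branch change can be tallied with a short script. This is only an illustrative sketch: the counts are copied from the list above, and the variable names are made up for the example rather than taken from butler or any CI tooling.

```python
# Failure counts per branch, copied from the list above: (earlier, latest).
# The dict and variable names are hypothetical, chosen for illustration.
ci_failures = {
    "3.0": (10, 14),
    "3.11": (15, 17),
    "4.0": (1, 6),
    "4.1": (5, 4),
    "trunk": (5, 7),
}

# Change in failing tests per branch (positive means more failures).
deltas = {
    branch: latest - earlier
    for branch, (earlier, latest) in ci_failures.items()
}

# Net change summed across all branches.
net_change = sum(deltas.values())

for branch, delta in deltas.items():
    print(f"{branch}: {delta:+d}")
print(f"net: {net_change:+d}")
```

Only 4.1 moves in the right direction here; 4.0 accounts for the largest single increase.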

On the 3.0-4.0 branches, this looks to be due to TestRepair failing 
(https://issues.apache.org/jira/browse/CASSANDRA-17701 and 
https://issues.apache.org/jira/browse/CASSANDRA-17702). Neither of those 
tickets has an assignee yet, so if anyone has the cycles or context to look 
into them that'd be great.

4.1 failures are slowly but surely contracting.


[Release progress]
https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=484&quickFilter=2175

4.1 beta:
We closed out 8 issues in the past couple of weeks: some test fixes, restarting 
on gossip-only nodes (CASSANDRA-17752), adding validation that the new config 
params are structured as we expect in 4.1 for JMX (CASSANDRA-17738), and 
cleaning up a straightforward doubling of the writePreparedStatement call in 
CASSANDRA-17764.

4.1 rc:
Test fix (CASSANDRA-17769)

Been a pretty quiet week on our older branches.

So to sum it up:
- CASTest failures blocking 4.1: 
https://issues.apache.org/jira/browse/CASSANDRA-17461, needs assignee
- Regression on some TestRepair: 
https://issues.apache.org/jira/browse/CASSANDRA-17701 and 
https://issues.apache.org/jira/browse/CASSANDRA-17702, needs assignee
- We should discuss whether we want to cut 4.1 w/known flaky tests in ASF CI or 
if we need to introduce more formal metrics around what "having no flakes" 
means (3, 5, 10 clean runs? Something else?)
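
To make the "N clean runs" idea concrete, here is one possible shape for such a check. It's a minimal sketch, assuming we only have a per-run count of failing tests with the newest run last; the function name and parameters are hypothetical and don't correspond to anything that exists in butler or ASF CI today.

```python
from typing import Sequence


def has_n_clean_runs(failures_per_run: Sequence[int],
                     required_clean_runs: int) -> bool:
    """Return True if the most recent `required_clean_runs` runs all had
    zero failures.

    `failures_per_run` is ordered oldest to newest. Both names are
    hypothetical, used only for this illustration.
    """
    if required_clean_runs <= 0:
        raise ValueError("required_clean_runs must be positive")
    # Not enough history to satisfy the requirement at all.
    if len(failures_per_run) < required_clean_runs:
        return False
    recent = failures_per_run[-required_clean_runs:]
    return all(count == 0 for count in recent)


# Example: two failures four runs ago, then three clean runs in a row.
print(has_n_clean_runs([2, 0, 0, 0], required_clean_runs=3))
# Example: the latest run had a failure, so the streak is broken.
print(has_n_clean_runs([0, 0, 1], required_clean_runs=3))
```

Whether the threshold should be 3, 5, 10, or something else is exactly the open question; the sketch just shows that the criterion itself is cheap to state once we agree on a number.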

Thanks as always everyone; see you on slack.

~Josh
