[ https://issues.apache.org/jira/browse/CASSANDRA-17519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17525860#comment-17525860 ]
Jakub Zytka commented on CASSANDRA-17519: ----------------------------------------- Hi, I've never submitted a patch to Cassandra before, so please bear with me. I attached 3 files: # test that exposes the described problem for `trunk`: [^CASSANDRA-17519-4.1-test-exposing-the-problem.txt] # the actual fix for `trunk`: [^CASSANDRA-17519-4.1-fix.txt] # the test and fix squashed, for cassandra-4.0 (there are slight differences due to resource leak handling): [^CASSANDRA-17519-4.0.txt] I took the liberty of putting comments liberally around the changed code. I think it's a good idea especially due to previous unsuccessful attempts to fix the code. One thing that I did not do, but I think is worth considering is to run the obsoletion code *before* removing the relevant entry from the lookup table. It looks more natural and removes the potential for yet another race condition (currently the obsoletion code must not assume that no other obsoletion code for the same descriptor is running). I understand that this is hardly possible, but I think that in general, it is safer to use the postulated order of execution - first obsoletion, and only then the removal from lookup. [~benedict] you might be interested in doing the review, as you changed the GlobalTidy code recently. (also, [~samt] , who was the reviewer). > races/leaks in SSTableReader::GlobalTidy > ---------------------------------------- > > Key: CASSANDRA-17519 > URL: https://issues.apache.org/jira/browse/CASSANDRA-17519 > Project: Cassandra > Issue Type: Bug > Reporter: Jakub Zytka > Assignee: Jakub Zytka > Priority: Normal > Attachments: CASSANDRA-17519-4.0.txt, CASSANDRA-17519-4.1-fix.txt, > CASSANDRA-17519-4.1-test-exposing-the-problem.txt > > > In Cassandra 4.0/3.11 there are at least two races in > SSTableReader::GlobalTidy > One is a get/get race, explicitly handled as an assertion in: > [http://github.com/apache/cassandra/blob/c22accc46458d0a583afcf6a980f731cdcc94465/src/java/org/apache/cassandra/io/sstable/format/SSTableReader.java#L2199-L2204] > and it looks like "ok, it's a problem, but let's just not fix it" > The other one is get/tidy race between > [http://github.com/apache/cassandra/blob/c22accc46458d0a583afcf6a980f731cdcc94465/src/java/org/apache/cassandra/io/sstable/format/SSTableReader.java#L2194-L2196] > and > [http://github.com/apache/cassandra/blob/c22accc46458d0a583afcf6a980f731cdcc94465/src/java/org/apache/cassandra/io/sstable/format/SSTableReader.java#L2174-L2175] > > The second one can be easily hit by adding a small delay at the beginning of > `tidy()` method (say, 20ms) and running `LongStreamingTest` (and actually > such failure is what prompted the investigation of GlobalTidy correctness) > There was an attempt on `trunk` to fix these two races. > The details are not clear to me, and it all looks quite weird. I might be > mistaken, but as far as I can see the relevant changes were introduced in: > [https://github.com/apache/cassandra/commit/31bea0b0d41e4e81095f0d088094f03db14af490] > that is piggybacked on a huge change in CASSANDRA-17008, without a separate > ticket or any sort of qa. > As far as I can see this attempt changes the first race into a leak, and the > second race to another race, this time allowing to have multiple GlobalTidy > objects for the same sstable. > I'll follow up with PRs for relevant branches etc etc -- This message was sent by Atlassian Jira (v8.20.7#820007) --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org