[ https://issues.apache.org/jira/browse/CASSANDRA-19703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17952487#comment-17952487 ]
Andy Tolbert edited comment on CASSANDRA-19703 at 5/19/25 2:42 AM:
-------------------------------------------------------------------

I've created branches from 4.0 through trunk and run our internal CI against them. I also ran {{PstmtPersistenceTest}} repeatedly in free-tier pipelines (small resources) to ensure I didn't introduce any flakes:

||PR Branch||CI||JDK 1 repeat||JDK 2 repeat||
|[CASSANDRA-19703-4.0|https://github.com/apache/cassandra/pull/4164]|[^CASSANDRA-19703-4.0-518.html]|[j8|https://app.circleci.com/pipelines/github/tolbertam/cassandra/207/workflows/9bda5ee1-5b80-42a0-915f-ddee69dfb7d5/jobs/18291]|[j11|https://app.circleci.com/pipelines/github/tolbertam/cassandra/207/workflows/897ba6c1-3bd2-48c2-8faf-4d53e952b649/jobs/18287]|
|[CASSANDRA-19703-4.1|https://github.com/apache/cassandra/pull/3917]|[^CASSANDRA-19703-4.1-518.html]|[j8|https://app.circleci.com/pipelines/github/tolbertam/cassandra/208/workflows/dc556ec3-9c69-45af-9da0-a3fdc38238a9/jobs/18292]|[j11|https://app.circleci.com/pipelines/github/tolbertam/cassandra/208/workflows/48d90512-0181-405e-a4f7-d24450839aa8/jobs/18288]|
|[CASSANDRA-19703-5.0|https://github.com/apache/cassandra/pull/4161]|ci report pending|[j11|https://app.circleci.com/pipelines/github/tolbertam/cassandra/209/workflows/4bcda1d2-03a8-442a-a576-958922f07ff5/jobs/18290]|[j17|https://app.circleci.com/pipelines/github/tolbertam/cassandra/209/workflows/ee06b743-778d-4bdf-9ae0-9b70d7ca1d9c/jobs/18289]|
|[CASSANDRA-19703-trunk|https://github.com/apache/cassandra/pull/4160]|[^CASSANDRA-19703-trunk-518.html]|[j11|https://app.circleci.com/pipelines/github/tolbertam/cassandra/206/workflows/7d3b0879-f056-4a35-b846-2c33b5aab03d/jobs/18281]|[j17|https://app.circleci.com/pipelines/github/tolbertam/cassandra/206/workflows/85529fee-9946-4ef7-b824-4a6418699404/jobs/18278]|

CI ran pretty well except for several flakes; none of them look to be caused by the changes made here:
* {{io.sstable.SSTableReaderTest.testSpannedIndexPositions}} failing on 5.0 and trunk (CASSANDRA-20636)
* {{NetstatsBootstrapWithEntireSSTablesCompressionStreamingTest.testWithStreamingEntireSSTablesWithoutCompression}} failing on 4.1. CASSANDRA-17345 was closed as cannot reproduce; perhaps it's not reproducible on 5.0, but I have seen this fail several times in CI.
* 4.0: {{distributed.upgrade.MixedModeAvailabilityV22Test}} failing during shutdown, which doesn't look concerning
* 5.0: gossip_test.TestGossip, which has a history of flakiness (CASSANDRA-19261, CASSANDRA-17366)
* 5.0: TestCqlshCopy.test_round_trip_with_rate_file (CASSANDRA-17322)
* 5.0: TestLargeColumn.test_cleanup (CASSANDRA-20509 is recent but only addressed on trunk)
* Various timeouts in upgrade dtests

Will attach the 5.0 report as soon as it finishes.
> Newly inserted prepared statements got evicted too early from cache that leads to race condition
> ------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-19703
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19703
>             Project: Apache Cassandra
>          Issue Type: Bug
>          Components: Local/Startup and Shutdown
>            Reporter: Yuqi Yan
>            Assignee: Andy Tolbert
>            Priority: Normal
>             Fix For: 4.0.x, 4.1.x, 5.0.x, 6.x
>
>         Attachments: CASSANDRA-19703-4.0-518.html, CASSANDRA-19703-4.1-511-0_ci_summary.html, CASSANDRA-19703-4.1-518.html, CASSANDRA-19703-trunk-518.html, ci_summary.html
>
>          Time Spent: 16.5h
>  Remaining Estimate: 0h
>
> We're upgrading from Cassandra 4.0 to Cassandra 4.1.3, and the system.prepared_statements table started growing to GB size after the upgrade.
> This slows down node startup significantly while it runs preloadPreparedStatements.
> I can't share the exact log, but it's a race condition like this:
> # [Thread 1] Receives a prepared request for S1. Attempts to get S1 in cache
> # [Thread 1] Cache miss, put this S1 into cache
> # [Thread 1] Attempts to write S1 into local table
> # [Thread 2] Receives a prepared request for S2. Attempts to get S2 in cache
> # [Thread 2] Cache miss, put this S2 into cache
> # [Thread 2] Cache is full, evicting S1 from cache
> # [Thread 2] Attempts to delete S1 from local table
> # [Thread 2] Tombstone inserted for S1, delete finished
> # [Thread 1] Record inserted for S1, write finished
> Thread 2 inserted a tombstone for S1 before Thread 1 was able to insert the record into the table. Hence the data will not be removed, because the later insert has a newer write time than the tombstone.
> Whether this happens depends on how the cache decides which entry to evict next when it's full. We noticed that in 4.1.3 Caffeine was upgraded to 2.9.2 (CASSANDRA-15153).
>
> I did some research into the Caffeine commits.
> It seems this commit caused the entry to be evicted too early: "Eagerly evict an entry if it too large to fit in the cache" (Feb 2021), available after 2.9.0:
> [https://github.com/ben-manes/caffeine/commit/464bc1914368c47a0203517fda2151fbedaf568b]
> It was later fixed in: "Improve eviction when overflow or the weight is oversized" (Aug 2022), available after 3.1.2:
> [https://github.com/ben-manes/caffeine/commit/25b7d17b1a246a63e4991d4902a2ecf24e86d234]
> {quote}Previously an attempt to centralize evictions into one code path led to a suboptimal approach ([{{464bc19}}|https://github.com/ben-manes/caffeine/commit/464bc1914368c47a0203517fda2151fbedaf568b]). This tried to move those entries into the LRU position for early eviction, but was confusing and could too aggressively evict something that is desirable to keep.
> {quote}
>
> I upgraded Caffeine to 3.1.8 (same as 5.0 trunk) and the issue is gone. But I think this version is not compatible with Java 8.
> I'm not 100% sure whether this is the root cause or what the correct fix is here. Would appreciate it if anyone could have a look, thanks.
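
For illustration only, the interleaving from the numbered steps in the issue description can be replayed deterministically in a few lines of Python. This is a toy model, not Cassandra or Caffeine code: {{LastWriteWinsTable}}, {{replay_race}}, the single-entry cache, and the integer timestamps are all hypothetical names and values made up for the sketch. The only behavior taken from the issue is that a write with a newer timestamp wins over an older tombstone.

```python
from collections import OrderedDict
from dataclasses import dataclass
from typing import Dict, Optional, Set, Tuple


@dataclass
class Cell:
    timestamp: int          # write time in arbitrary ticks (hypothetical unit)
    value: Optional[str]    # None marks a tombstone


class LastWriteWinsTable:
    """Toy stand-in for the local table: the newest timestamp wins."""

    def __init__(self) -> None:
        self._cells: Dict[str, Cell] = {}

    def _apply(self, key: str, cell: Cell) -> None:
        current = self._cells.get(key)
        # Keep the incoming cell only if it is strictly newer (ties are not modelled).
        if current is None or cell.timestamp > current.timestamp:
            self._cells[key] = cell

    def insert(self, key: str, value: str, timestamp: int) -> None:
        self._apply(key, Cell(timestamp, value))

    def delete(self, key: str, timestamp: int) -> None:
        self._apply(key, Cell(timestamp, None))

    def live_keys(self) -> Set[str]:
        return {k for k, c in self._cells.items() if c.value is not None}


def replay_race() -> Tuple[Set[str], Set[str]]:
    """Replay the reported interleaving; return (cached keys, persisted keys)."""
    cache: "OrderedDict[str, str]" = OrderedDict()
    capacity = 1  # assumed tiny capacity so the second put forces an eviction
    table = LastWriteWinsTable()

    # [Thread 1] Cache miss for S1: put it into the cache (table write still pending).
    cache["S1"] = "stmt-1"

    # [Thread 2] Cache miss for S2: put it into the cache, which is now over capacity.
    cache["S2"] = "stmt-2"
    evicted = None
    if len(cache) > capacity:
        evicted, _ = cache.popitem(last=False)  # evicts S1, the oldest entry

    # [Thread 2] The eviction deletes S1 from the table: tombstone at t=1.
    if evicted is not None:
        table.delete(evicted, timestamp=1)

    # [Thread 1] The delayed insert for S1 lands at t=2 and shadows the tombstone.
    table.insert("S1", "stmt-1", timestamp=2)

    # [Thread 2] S2 is persisted as usual.
    table.insert("S2", "stmt-2", timestamp=3)

    return set(cache), table.live_keys()


cached, persisted = replay_race()
leaked = persisted - cached  # rows left in the table with no cache entry to evict them
print(sorted(leaked))
```

In this replay S1 stays live in the table although it is no longer in the cache, so no later eviction will ever delete it. That matches the unbounded growth of system.prepared_statements described in the issue, under the assumptions stated above.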