[jira] [Comment Edited] (CASSANDRA-19703) Newly inserted prepared statements got evicted too early from cache that leads to race condition

Andy Tolbert (Jira) Sun, 18 May 2025 19:43:34 -0700


    [ 
https://issues.apache.org/jira/browse/CASSANDRA-19703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17952487#comment-17952487
 ]


Andy Tolbert edited comment on CASSANDRA-19703 at 5/19/25 2:42 AM:
-------------------------------------------------------------------

I've created branches from 4.0 through trunk and run our internal CI against 
them. I also ran {{PstmtPersistenceTest}} repeatedly in free-tier pipelines 
(small resources) to ensure I didn't introduce any flakes:
||PR Branch||CI||JDK 1 repeat||JDK 2 repeat||
|[CASSANDRA-19703-4.0|https://github.com/apache/cassandra/pull/4164]|[^CASSANDRA-19703-4.0-518.html]|[j8|https://app.circleci.com/pipelines/github/tolbertam/cassandra/207/workflows/9bda5ee1-5b80-42a0-915f-ddee69dfb7d5/jobs/18291]|[j11|https://app.circleci.com/pipelines/github/tolbertam/cassandra/207/workflows/897ba6c1-3bd2-48c2-8faf-4d53e952b649/jobs/18287]|
|[CASSANDRA-19703-4.1|https://github.com/apache/cassandra/pull/3917]|[^CASSANDRA-19703-4.1-518.html]|[j8|https://app.circleci.com/pipelines/github/tolbertam/cassandra/208/workflows/dc556ec3-9c69-45af-9da0-a3fdc38238a9/jobs/18292]|[j11|https://app.circleci.com/pipelines/github/tolbertam/cassandra/208/workflows/48d90512-0181-405e-a4f7-d24450839aa8/jobs/18288]|
|[CASSANDRA-19703-5.0|https://github.com/apache/cassandra/pull/4161]|ci report pending|[j11|https://app.circleci.com/pipelines/github/tolbertam/cassandra/209/workflows/4bcda1d2-03a8-442a-a576-958922f07ff5/jobs/18290]|[j17|https://app.circleci.com/pipelines/github/tolbertam/cassandra/209/workflows/ee06b743-778d-4bdf-9ae0-9b70d7ca1d9c/jobs/18289]|
|[CASSANDRA-19703-trunk|https://github.com/apache/cassandra/pull/4160]|[^CASSANDRA-19703-trunk-518.html]|[j11|https://app.circleci.com/pipelines/github/tolbertam/cassandra/206/workflows/7d3b0879-f056-4a35-b846-2c33b5aab03d/jobs/18281]|[j17|https://app.circleci.com/pipelines/github/tolbertam/cassandra/206/workflows/85529fee-9946-4ef7-b824-4a6418699404/jobs/18278]|

CI ran pretty well except for several flakes. None of them look like something 
the changes made would be contributing to.

 * {{io.sstable.SSTableReaderTest.testSpannedIndexPositions}} failing on 5.0 
and trunk (CASSANDRA-20636)
 * 
{{NetstatsBootstrapWithEntireSSTablesCompressionStreamingTest.testWithStreamingEntireSSTablesWithoutCompression}}
 failing on 4.1. CASSANDRA-17345 was closed as Cannot Reproduce; perhaps it's 
not reproducible on 5.0, but I have seen this fail several times in CI.
 * {{distributed.upgrade.MixedModeAvailabilityV22Test}} failing on 4.0 during 
shutdown; doesn't look concerning
 * 5.0: gossip_test.TestGossip which has a history of flakiness 
(CASSANDRA-19261, CASSANDRA-17366)
 * 5.0: TestCqlshCopy.test_round_trip_with_rate_file (CASSANDRA-17322)
 * 5.0: TestLargeColumn.test_cleanup (CASSANDRA-20509 is recent but only 
addressed on trunk)
 * Various timeouts in upgrade dtests

Will attach the 5.0 report as soon as it finishes.


> Newly inserted prepared statements got evicted too early from cache that 
> leads to race condition
> ------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-19703
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19703
>             Project: Apache Cassandra
>          Issue Type: Bug
>          Components: Local/Startup and Shutdown
>            Reporter: Yuqi Yan
>            Assignee: Andy Tolbert
>            Priority: Normal
>             Fix For: 4.0.x, 4.1.x, 5.0.x, 6.x
>
>         Attachments: CASSANDRA-19703-4.0-518.html, 
> CASSANDRA-19703-4.1-511-0_ci_summary.html, CASSANDRA-19703-4.1-518.html, 
> CASSANDRA-19703-trunk-518.html, ci_summary.html
>
>          Time Spent: 16.5h
>  Remaining Estimate: 0h
>
> We're upgrading from Cassandra 4.0 to Cassandra 4.1.3, and the 
> system.prepared_statements table started growing to GB size after the upgrade. 
> This slows down node startup significantly while it's running 
> preloadPreparedStatements.
> I can't share the exact log, but it's a race condition like this:
>  # [Thread 1] Receives a prepared request for S1. Attempts to get S1 in cache
>  # [Thread 1] Cache miss, put this S1 into cache
>  # [Thread 1] Attempts to write S1 into local table
>  # [Thread 2] Receives a prepared request for S2. Attempts to get S2 in cache
>  # [Thread 2] Cache miss, put this S2 into cache
>  # [Thread 2] Cache is full, evicting S1 from cache
>  # [Thread 2] Attempts to delete S1 from local table
>  # [Thread 2] Tombstone inserted for S1, delete finished
>  # [Thread 1] Record inserted for S1, write finished
> Thread 2 inserted a tombstone for S1 before Thread 1 was able to insert the 
> record in the table. Hence the data will not be removed, because the later 
> insert has a newer write time than the tombstone.
> Whether this happens depends on how the cache decides which entry to evict 
> next when it’s full. We noticed that in 4.1.3 Caffeine was upgraded to 2.9.2 
> (CASSANDRA-15153).
>  
> I did a bit of research in the Caffeine commits. It seems this commit was 
> causing the entry to be evicted too early: "Eagerly evict an entry if it too 
> large to fit in the cache" (Feb 2021), available after 2.9.0: 
> [https://github.com/ben-manes/caffeine/commit/464bc1914368c47a0203517fda2151fbedaf568b]
> It was later fixed in: "Improve eviction when overflow or the weight is 
> oversized" (Aug 2022), available after 3.1.2: 
> [https://github.com/ben-manes/caffeine/commit/25b7d17b1a246a63e4991d4902a2ecf24e86d234]
> {quote}Previously an attempt to centralize evictions into one code path led 
> to a suboptimal approach 
> ([{{464bc19}}|https://github.com/ben-manes/caffeine/commit/464bc1914368c47a0203517fda2151fbedaf568b]
> ). This tried to move those entries into the LRU position for early eviction, 
> but was confusing and could too aggressively evict something that is 
> desirable to keep.
> {quote}
>  
> I upgraded Caffeine to 3.1.8 (same as 5.0 trunk) and this issue is gone. 
> But I think this version is not compatible with Java 8.
> I'm not 100% sure whether this is the root cause or what the correct fix is. 
> I would appreciate it if anyone could have a look, thanks.
>  
>  
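
The interleaving in steps 1 to 9 of the quoted report can be replayed with a toy model. The sketch below is not Cassandra code and not the fix from the PR branches: the class names, the fixed timestamps (100 and 101), and the plain map standing in for the statement cache are all assumptions made for illustration. It only shows why a tombstone written before a delayed insert leaves the row visible under last-write-wins.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model (hypothetical names): a cache in front of a last-write-wins table. */
final class PreparedStatementRaceSketch {

    /** Tracks, per key, the newest insert timestamp and the newest tombstone timestamp. */
    static final class LwwTable {
        private final Map<String, Long> writeTs = new HashMap<>();
        private final Map<String, Long> tombstoneTs = new HashMap<>();

        void insert(String key, long ts) { writeTs.merge(key, ts, Long::max); }

        void delete(String key, long ts) { tombstoneTs.merge(key, ts, Long::max); }

        /** A row is visible when its newest insert is newer than its newest tombstone. */
        boolean isLive(String key) {
            Long w = writeTs.get(key);
            if (w == null) return false;
            Long t = tombstoneTs.get(key);
            return t == null || w > t;
        }
    }

    /** Replays the reported steps in order, with made-up timestamps, and returns the table. */
    static LwwTable replay(Map<String, String> cache) {
        LwwTable table = new LwwTable();
        cache.put("S1", "stmt1");   // steps 1-2: thread 1 misses, then caches S1
                                    // step 3: thread 1 starts its table write, which is delayed
        cache.put("S2", "stmt2");   // steps 4-5: thread 2 misses, then caches S2
        cache.remove("S1");         // step 6: cache is full, S1 is evicted
        table.delete("S1", 100L);   // steps 7-8: tombstone for S1 lands first
        table.insert("S1", 101L);   // step 9: thread 1's delayed insert lands afterwards
        return table;
    }

    public static void main(String[] args) {
        Map<String, String> cache = new HashMap<>();
        LwwTable table = replay(cache);
        System.out.println("S1 in cache: " + cache.containsKey("S1"));
        System.out.println("S1 live in table: " + table.isLive("S1"));
    }
}
```

In this model S1 ends up absent from the cache but still live in the table, so nothing will ever delete it again; repeated across many statements, that would match the unbounded growth described in the report.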



