[
https://issues.apache.org/jira/browse/CASSANDRA-21410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
David Capwell updated CASSANDRA-21410:
--------------------------------------
Reviewers: Benedict Elliott Smith (was: Benedict Elliott Smith, David
Capwell)
> ShardDurability.markDefunct() called O(N²) times across topology updates,
> causing log spam and OOM in tests
> -----------------------------------------------------------------------------------------------------------
>
> Key: CASSANDRA-21410
> URL: https://issues.apache.org/jira/browse/CASSANDRA-21410
> Project: Apache Cassandra
> Issue Type: Bug
> Components: Accord
> Reporter: David Capwell
> Assignee: David Capwell
> Priority: Normal
> Fix For: 6.0.x
>
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> ShardDurability.updateTopology() has a bug where defunct schedulers
> accumulate in the shardSchedulers map and are re-marked defunct on every
> subsequent topology change, producing O(N²) log messages.
> The issue is in updateTopology():
> {code}
> // puts defunct schedulers back into the map
> shardSchedulers.putAll(prev);
> // marks them defunct (again)
> prev.forEach((r, s) -> s.markDefunct());
> {code}
> When a topology change removes a shard range, its scheduler is marked defunct
> but kept in shardSchedulers (via putAll) so it can finish in-flight work
> before self-removing. However, on the next topology change, these
> already-defunct schedulers are copied into the new prev map, survive the
> removal loop (their range doesn't exist in the new topology), and get
> markDefunct() called again. Every subsequent topology change re-processes all
> previously-defunct schedulers that haven't yet self-removed.
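
One way to make the repeated calls harmless is to make the defunct transition idempotent, so only the first call has any side effect. The following is a minimal sketch of that idea and not the actual patch. The class `IdempotentDefunctSketch`, its nested `Scheduler`, and the `logLines` counter (a stand-in for the INFO log statement) are hypothetical names invented for illustration; they are not the Accord `ShardDurability` API.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

public class IdempotentDefunctSketch {
    /** Hypothetical stand-in for a shard scheduler; not the Accord class. */
    static final class Scheduler {
        private final AtomicBoolean defunct = new AtomicBoolean(false);
        // Stand-in for the log statement: counts how many lines would be emitted.
        final AtomicInteger logLines = new AtomicInteger();

        /** Returns true only for the call that performs the transition. */
        boolean markDefunct() {
            // compareAndSet succeeds once; every later call is a no-op.
            if (!defunct.compareAndSet(false, true))
                return false;
            logLines.incrementAndGet();
            return true;
        }
    }

    public static void main(String[] args) {
        Scheduler s = new Scheduler();
        s.markDefunct();
        s.markDefunct();
        s.markDefunct();
        System.out.println(s.logLines.get()); // prints 1
    }
}
```

With a guard like this, re-processing an already-defunct scheduler on a later topology change costs one failed compare-and-set instead of another log line. An alternative with the same effect would be to skip already-defunct entries in the loop that calls markDefunct().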
> With N topology changes, each retiring one scheduler and none self-removing
> in between, markDefunct() is called 1 + 2 + 3 + ... + N = N*(N+1)/2 times
> in total.
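
The count can be checked with a toy model of the pattern described in the ticket. This is a sketch under stated assumptions, not the real code: `DefunctCountModel`, its `Scheduler`, and the string range names are hypothetical, each topology change is assumed to replace the single live range with a new one, and no defunct scheduler ever self-removes.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class DefunctCountModel {
    /** Toy scheduler: only counts how often markDefunct() is invoked. */
    static final class Scheduler {
        int markCalls;
        void markDefunct() { markCalls++; }
    }

    private final Map<String, Scheduler> shardSchedulers = new HashMap<>();
    private long totalMarkCalls;

    /** Models the update pattern from the ticket, including the re-marking. */
    void updateTopology(Set<String> ranges) {
        Map<String, Scheduler> prev = new HashMap<>(shardSchedulers);
        shardSchedulers.clear();
        // Removal loop: ranges still present keep (or get) a live scheduler.
        for (String range : ranges) {
            Scheduler existing = prev.remove(range);
            shardSchedulers.put(range, existing != null ? existing : new Scheduler());
        }
        // Leftovers go back into the map and are all marked defunct,
        // including those already marked by earlier topology changes.
        shardSchedulers.putAll(prev);
        prev.forEach((r, s) -> { s.markDefunct(); totalMarkCalls++; });
    }

    /** Total markDefunct() calls after an initial topology plus `changes` updates. */
    static long simulate(int changes) {
        DefunctCountModel model = new DefunctCountModel();
        model.updateTopology(Set.of("range-0"));
        for (int i = 1; i <= changes; i++)
            model.updateTopology(Set.of("range-" + i));
        return model.totalMarkCalls;
    }

    public static void main(String[] args) {
        int n = 360;
        // prints "64980 vs 64980"
        System.out.println(simulate(n) + " vs " + (long) n * (n + 1) / 2);
    }
}
```

Under these assumptions the k-th change re-marks k schedulers, so the total matches N*(N+1)/2. Real counts depend on how many ranges each change retires and on how quickly defunct schedulers remove themselves.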
> This was observed in CI running ShortReadProtectionTest, which is
> parameterized with 24 combinations x 15 test methods = 360 iterations, each
> creating a new table (and thus a new topology epoch). With
> accord.shard_durability_target_splits=4, ShardDurability.java:173 produced
> 173,534 INFO-level log lines across an 11-minute test run. The JUnit test
> formatter buffers all stdout in a ByteArrayOutputStream with no size cap, and
> the accumulated ~155 MiB of log output exhausted the 1 GiB test JVM heap,
> causing an OOM.
> This ticket / patch was generated by Opus 4.6