[
https://issues.apache.org/jira/browse/CASSANDRA-21189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18061011#comment-18061011
]
Sam Lightfoot edited comment on CASSANDRA-21189 at 2/25/26 5:22 PM:
--------------------------------------------------------------------
The triggering error that causes a chain of port errors is from a Paxos commit
that times out:
{code:java}
Caused an ERROR
[2026-02-25T09:27:27.026Z] [junit-timeout]
java.util.concurrent.ExecutionException: java.lang.IllegalStateException: Can not commit transformation: "SERVER_ERROR"(Could not perform commit; policy Retry{remainingMs=0, attempts=2} gave up).
{code}
The request_timeout is set to 1 second on the cluster builder, overriding the
10s default:
{code:java}
try (Cluster cluster = builder().withNodes(3)
                                .appendConfig(cfg ->
                                    cfg.set("progress_barrier_timeout", "5000ms")
                                       .set("request_timeout", "1000ms")
                                       .set("progress_barrier_backoff", "100ms")
{
{code}
The request_timeout effectively becomes the ceiling for the entire Paxos
commit. Because the failure is returned as a successfully delivered error
response, the commit is not retried within the significantly larger
cms_await_timeout budget.
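To illustrate how a short per-request budget can produce a state like Retry{remainingMs=0, attempts=2}, here is a minimal simulation. It is only a sketch: the class, the method and the 600ms attempt cost are hypothetical and are not taken from Cassandra's retry implementation.

```java
/** Hypothetical sketch of a retry loop capped by a single time budget (not Cassandra code). */
final class RetryBudgetSketch {
    /** Outcome of the simulated commit. */
    static final class Result {
        final boolean succeeded;
        final int attempts;
        final long remainingMs;

        Result(boolean succeeded, int attempts, long remainingMs) {
            this.succeeded = succeeded;
            this.attempts = attempts;
            this.remainingMs = remainingMs;
        }
    }

    /**
     * Simulates a commit that needs attemptsNeeded attempts, each costing attemptMs,
     * with the whole loop bounded by budgetMs. Time is simulated, not measured.
     */
    static Result commit(long budgetMs, long attemptMs, int attemptsNeeded) {
        long remainingMs = budgetMs;
        int attempts = 0;
        while (remainingMs > 0) {
            attempts++;
            if (attemptMs > remainingMs) {
                // The deadline cuts this attempt short: the policy gives up with nothing left.
                return new Result(false, attempts, 0);
            }
            remainingMs -= attemptMs;
            if (attempts >= attemptsNeeded) {
                return new Result(true, attempts, remainingMs);
            }
        }
        return new Result(false, attempts, 0);
    }

    public static void main(String[] args) {
        Result tight = commit(1_000, 600, 3);    // 1s budget, as in the tests
        Result relaxed = commit(10_000, 600, 3); // 10s budget, as with the default
        System.out.println("1s budget:  succeeded=" + tight.succeeded + ", attempts=" + tight.attempts);
        System.out.println("10s budget: succeeded=" + relaxed.succeeded + ", attempts=" + relaxed.attempts);
    }
}
```

Under these assumed numbers the 1s budget is exhausted during the second attempt, while the 10s budget leaves room for the third attempt to succeed.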
I think a fairly safe option is to increase the 1000ms request_timeout in the
three tests where it is set, or to remove it completely, given the resource
constraints of CI.
The subsequent port-related failures appear to come from missing cleanup when a
cluster is unable to start: once the initial test fails with SERVER_ERROR, every
test that follows fails with port binding errors.
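The cascading effect can be reproduced in isolation with plain sockets. The sketch below is not taken from the dtest framework and its class and method names are hypothetical. It only shows that a listening socket that has not been closed blocks the next bind on the same port, and that closing it frees the port again.

```java
import java.io.IOException;
import java.net.BindException;
import java.net.InetAddress;
import java.net.ServerSocket;

/** Hypothetical sketch: an unreleased listening socket causes "address in use" on the next bind. */
final class PortReuseSketch {
    private static final InetAddress LOOPBACK = InetAddress.getLoopbackAddress();

    /** Returns true if a second bind fails while the first socket is still open. */
    static boolean secondBindFailsWhileOpen() throws IOException {
        try (ServerSocket first = new ServerSocket(0, 50, LOOPBACK)) { // port 0 = any free port
            try (ServerSocket second = new ServerSocket(first.getLocalPort(), 50, LOOPBACK)) {
                return false; // unexpected: both sockets bound the same port
            } catch (BindException expected) {
                return true;  // the situation the later tests run into
            }
        }
    }

    /** Returns true if the port can be bound again once the first socket has been closed. */
    static boolean rebindSucceedsAfterClose() throws IOException {
        int port;
        try (ServerSocket first = new ServerSocket(0, 50, LOOPBACK)) {
            port = first.getLocalPort();
        } // first is closed here, releasing the port
        try (ServerSocket second = new ServerSocket(port, 50, LOOPBACK)) {
            return second.isBound();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("second bind fails while open: " + secondBindFailsWhileOpen());
        System.out.println("rebind succeeds after close:  " + rebindSucceedsAfterClose());
    }
}
```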
> Fix flaky DTest: InProgressSequenceCoordinationTest
> ---------------------------------------------------
>
> Key: CASSANDRA-21189
> URL: https://issues.apache.org/jira/browse/CASSANDRA-21189
> Project: Apache Cassandra
> Issue Type: Bug
> Components: Test/dtest/java
> Reporter: Sam Lightfoot
> Assignee: Sam Lightfoot
> Priority: Normal
> Fix For: 5.1
>
>
> There's a race condition between cluster closing and startup between test
> scenarios due to lack of thread lifecycle handling. The spawned thread should
> be joined before the test finishes to prevent the 'in-use port' errors.
> Affects
> * bootstrapProgressTest
> * decommissionProgressTest
> * replacementProgressTest
> Adopt the same pattern as GossipTest with try-finally thread joining.
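The try-finally thread joining mentioned in the description could look roughly like the following. This is a sketch with a stand-in cluster class, not the actual GossipTest code: FakeCluster, the 50ms worker delay and the 5s join timeout are all assumptions made for illustration.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical sketch of joining a spawned thread before the cluster is closed. */
final class JoinBeforeCloseSketch {
    /** Stand-in for a test cluster; records whether the worker had finished when close() ran. */
    static final class FakeCluster implements AutoCloseable {
        private final AtomicBoolean workerDone;
        private final AtomicBoolean closedAfterWorker;

        FakeCluster(AtomicBoolean workerDone, AtomicBoolean closedAfterWorker) {
            this.workerDone = workerDone;
            this.closedAfterWorker = closedAfterWorker;
        }

        @Override
        public void close() {
            // A real cluster would release its ports here.
            closedAfterWorker.set(workerDone.get());
        }
    }

    /** Returns true if close() only ran after the spawned thread had finished. */
    static boolean runScenario() throws InterruptedException {
        AtomicBoolean workerDone = new AtomicBoolean(false);
        AtomicBoolean closedAfterWorker = new AtomicBoolean(false);
        try (FakeCluster cluster = new FakeCluster(workerDone, closedAfterWorker)) {
            Thread worker = new Thread(() -> {
                try {
                    Thread.sleep(50); // simulated background work against the cluster
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                workerDone.set(true);
            });
            worker.start();
            try {
                // Test assertions would go here.
            } finally {
                // Join inside the try-with-resources body so the thread ends before close().
                worker.join(5_000);
            }
        }
        return closedAfterWorker.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("closed after worker finished: " + runScenario());
    }
}
```

Because the join sits in a finally block inside the try-with-resources body, it runs even when an assertion throws, and the resource is closed only afterwards.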