[ https://issues.apache.org/jira/browse/CASSANDRA-17566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17550509#comment-17550509 ]

Brandon Williams commented on CASSANDRA-17566:
----------------------------------------------

I think you're right about the poll interval, but I don't think any JMX 
messages should be lost here.  That said, I can no longer reproduce that 
failure; I can, however, now reproduce the original one, which may be related.  
The crux of the original issue is that the test can sometimes win the race 
against the failure detector and force the repair before node2 has been marked 
down, and the repair then correctly fails.  I have a branch that ensures node2 
is marked down before proceeding:

||Branch||CI||
|[4.1|https://github.com/driftx/cassandra/tree/CASSANDRA-17655-4.1]|[j8|https://app.circleci.com/pipelines/github/driftx/cassandra/510/workflows/89d3d809-3f27-4085-9064-661ba1af16e2], [j11|https://app.circleci.com/pipelines/github/driftx/cassandra/510/workflows/fbec3999-1ca2-4ec3-b691-caf7a834fffd], [+500|https://app.circleci.com/pipelines/github/driftx/cassandra/510/workflows/fbec3999-1ca2-4ec3-b691-caf7a834fffd/jobs/5917]|
|[trunk|https://github.com/driftx/cassandra/tree/CASSANDRA-17655-trunk]|[j8|https://app.circleci.com/pipelines/github/driftx/cassandra/512/workflows/eb4f1761-5777-4136-a21b-86fa38671aec], [j11|https://app.circleci.com/pipelines/github/driftx/cassandra/512/workflows/a72d1336-ad51-46d2-937c-0f7874e533bd], [+500|https://app.circleci.com/pipelines/github/driftx/cassandra/512/workflows/a72d1336-ad51-46d2-937c-0f7874e533bd/jobs/5909]|

And an unpatched [500 run|https://app.circleci.com/pipelines/github/driftx/cassandra/511/workflows/9c5061c9-46ac-4fb6-82ed-d6116140a97f/jobs/5927] for comparison.
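
To illustrate the race, here is a minimal, self-contained sketch of the "wait 
until node2 is marked down, then force the repair" idea. It is not the code in 
the branch and uses no Cassandra APIs: the class AwaitNodeDown, the helper 
awaitTrue, the node2MarkedDown flag, and the timings are all hypothetical 
stand-ins, with a background thread playing the role of the failure detector.

```java
import java.time.Duration;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

public class AwaitNodeDown {
    /**
     * Polls the condition until it holds or the timeout elapses.
     * Returns true if the condition became true, false on timeout or interrupt.
     */
    public static boolean awaitTrue(BooleanSupplier condition, Duration timeout, Duration pollInterval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (!condition.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // timed out before the condition held
            }
            try {
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for "the failure detector has marked node2 down".
        AtomicBoolean node2MarkedDown = new AtomicBoolean(false);

        // Simulated failure detector: notices the shutdown only after a delay.
        Thread failureDetector = new Thread(() -> {
            try {
                Thread.sleep(200);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            node2MarkedDown.set(true);
        });
        failureDetector.start();

        // Without this wait, the forced repair could start while node2 still
        // looks alive, include it as a participant, and fail waiting for replies.
        boolean down = awaitTrue(node2MarkedDown::get, Duration.ofSeconds(5), Duration.ofMillis(20));
        if (!down) {
            throw new IllegalStateException("node2 was not marked down in time");
        }
        System.out.println("node2 marked down; safe to run repair --force");
    }
}
```

The point of the sketch is only the ordering: the forced repair is issued after 
the "down" state is observed rather than immediately after stopping the node.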

> Fix flaky test - 
> org.apache.cassandra.distributed.test.repair.ForceRepairTest.force
> -----------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-17566
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-17566
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Test/dtest/java
>            Reporter: Brandon Williams
>            Assignee: Brandon Williams
>            Priority: Normal
>             Fix For: 4.1-beta, 4.x
>
>
> Seen on jenkins here: 
> [https://ci-cassandra.apache.org/job/Cassandra-trunk/1083/testReport/org.apache.cassandra.distributed.test.repair/ForceRepairTest/force_2/]
>  
> and circle here:
> https://app.circleci.com/pipelines/github/driftx/cassandra/440/workflows/42f936c7-2ede-4fbf-957c-5fb4e461dd90/jobs/5160/tests#failed-test-1
> {noformat}
> junit.framework.AssertionFailedError: nodetool command [repair, distributed_test_keyspace, --force, --full] was not successful
> stdout:
> [2022-04-20 15:11:01,402] Starting repair command #2 (1701a090-c0bc-11ec-9898-07c796ce6a49), repairing keyspace distributed_test_keyspace with repair options (parallelism: parallel, primary range: false, incremental: false, job threads: 1, ColumnFamilies: [], dataCenters: [], hosts: [], previewKind: NONE, # of ranges: 3, pull repair: false, force repair: true, optimise streams: false, ignore unreplicated keyspaces: false, repairPaxos: true, paxosOnly: false)
> [2022-04-20 15:11:11,406] Repair command #2 failed with error Did not get replies from all endpoints.
> [2022-04-20 15:11:11,408] Repair command #2 finished with error
> stderr:
> error: Repair job has failed with the error message: Repair command #2 failed with error Did not get replies from all endpoints.. Check the logs on the repair participants for further details
> -- StackTrace --
> java.lang.RuntimeException: Repair job has failed with the error message: Repair command #2 failed with error Did not get replies from all endpoints.. Check the logs on the repair participants for further details
>       at org.apache.cassandra.tools.RepairRunner.progress(RepairRunner.java:137)
>       at org.apache.cassandra.utils.progress.jmx.JMXNotificationProgressListener.handleNotification(JMXNotificationProgressListener.java:77)
>       at javax.management.NotificationBroadcasterSupport.handleNotification(NotificationBroadcasterSupport.java:275)
>       at javax.management.NotificationBroadcasterSupport$SendNotifJob.run(NotificationBroadcasterSupport.java:352)
>       at org.apache.cassandra.concurrent.ExecutionFailure$1.run(ExecutionFailure.java:124)
>       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>       at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>       at java.lang.Thread.run(Thread.java:748)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
