[ https://issues.apache.org/jira/browse/CASSANDRA-11824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15291052#comment-15291052 ]
Marcus Eriksson commented on CASSANDRA-11824:
---------------------------------------------

The problem occurs when the repair coordinator dies: the repairing nodes then never clear out their ParentRepairSessions.

My approach is to have ActiveRepairService start listening for endpoint changes and failure detector events. For example:
* in a cluster with nodes A, B and C, we trigger a repair against A
* during the repair, A dies
* B and C get notified about this and mark the ParentRepairSession as failed

It gets a bit tricky because node A might not have realized it was down and may just continue with its repair, so we keep a 'failed' version of the parent repair session around for 24h on B and C. If anyone then tries to get that session (say node A keeps sending validation requests), we throw an exception, which fails the repair on node A as well (see the sketch at the end of this comment).

A dtest to reproduce the error: https://github.com/krummas/cassandra-dtest/commits/marcuse/11824

||branch||testall||dtest||
|[marcuse/11824|https://github.com/krummas/cassandra/tree/marcuse/11824]|[testall|http://cassci.datastax.com/view/Dev/view/krummas/job/krummas-marcuse-11824-testall]|[dtest|http://cassci.datastax.com/view/Dev/view/krummas/job/krummas-marcuse-11824-dtest]|
|[marcuse/11824-2.2|https://github.com/krummas/cassandra/tree/marcuse/11824-2.2]|[testall|http://cassci.datastax.com/view/Dev/view/krummas/job/krummas-marcuse-11824-2.2-testall]|[dtest|http://cassci.datastax.com/view/Dev/view/krummas/job/krummas-marcuse-11824-2.2-dtest]|
|[marcuse/11824-3.0|https://github.com/krummas/cassandra/tree/marcuse/11824-3.0]|[testall|http://cassci.datastax.com/view/Dev/view/krummas/job/krummas-marcuse-11824-3.0-testall]|[dtest|http://cassci.datastax.com/view/Dev/view/krummas/job/krummas-marcuse-11824-3.0-dtest]|
|[marcuse/11824-3.7|https://github.com/krummas/cassandra/tree/marcuse/11824-3.7]|[testall|http://cassci.datastax.com/view/Dev/view/krummas/job/krummas-marcuse-11824-3.7-testall]|[dtest|http://cassci.datastax.com/view/Dev/view/krummas/job/krummas-marcuse-11824-3.7-dtest]|
|[marcuse/11824-trunk|https://github.com/krummas/cassandra/tree/marcuse/11824-trunk]|[testall|http://cassci.datastax.com/view/Dev/view/krummas/job/krummas-marcuse-11824-trunk-testall]|[dtest|http://cassci.datastax.com/view/Dev/view/krummas/job/krummas-marcuse-11824-trunk-dtest]|

I should also note that this does not seem to fix CASSANDRA-11728.

Could you review, [~yukim]?
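To make the bookkeeping concrete, here is a minimal, self-contained sketch of the idea; names like {{RepairSessionTracker}}, {{onNodeFailure}} and {{get}} are illustrative only, not the actual ActiveRepairService internals in the patch:

{code:java}
import java.net.InetAddress;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;

// Illustrative sketch only -- not the actual ActiveRepairService code.
public class RepairSessionTracker
{
    private static final long RETENTION_MILLIS = TimeUnit.HOURS.toMillis(24);

    // live parent repair sessions, keyed by parent session id
    private final Map<UUID, ParentRepairSession> sessions = new ConcurrentHashMap<>();
    // sessions whose coordinator was convicted as down, kept around for 24h
    private final Map<UUID, Long> failedSessions = new ConcurrentHashMap<>();

    public void register(UUID id, ParentRepairSession session)
    {
        sessions.put(id, session);
    }

    // called from a failure detector / endpoint state callback when a node goes down
    public void onNodeFailure(InetAddress downNode)
    {
        for (Map.Entry<UUID, ParentRepairSession> e : sessions.entrySet())
        {
            if (e.getValue().coordinator.equals(downNode))
            {
                // remember when the session failed so the marker can be purged later
                failedSessions.put(e.getKey(), System.currentTimeMillis());
                sessions.remove(e.getKey());
            }
        }
    }

    // any lookup of a failed session (e.g. a validation request from a coordinator
    // that never noticed it was marked down) throws, failing the repair there too
    public ParentRepairSession get(UUID id)
    {
        Long failedAt = failedSessions.get(id);
        if (failedAt != null && System.currentTimeMillis() - failedAt <= RETENTION_MILLIS)
            throw new IllegalStateException("Parent repair session " + id + " failed (coordinator was marked down)");
        if (failedAt != null)
            failedSessions.remove(id); // retention expired, forget the failure

        ParentRepairSession session = sessions.get(id);
        if (session == null)
            throw new IllegalStateException("Unknown parent repair session " + id);
        return session;
    }

    // minimal stand-in for the real ParentRepairSession
    static class ParentRepairSession
    {
        final InetAddress coordinator;
        ParentRepairSession(InetAddress coordinator) { this.coordinator = coordinator; }
    }
}
{code}

The 24h retention is a tradeoff: long enough that a coordinator which never noticed it was marked down will still hit the failed marker and abort, short enough that the map of failed sessions cannot grow without bound.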
> If repair fails no way to run repair again
> -------------------------------------------
>
>                 Key: CASSANDRA-11824
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11824
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: T Jake Luciani
>            Assignee: Marcus Eriksson
>              Labels: fallout
>             Fix For: 3.0.x
>
> I have a test that disables gossip and runs repair at the same time.
> {quote}
> WARN  [RMI TCP Connection(15)-54.67.121.105] 2016-05-17 16:57:21,775 StorageService.java:384 - Stopping gossip by operator request
> INFO  [RMI TCP Connection(15)-54.67.121.105] 2016-05-17 16:57:21,775 Gossiper.java:1463 - Announcing shutdown
> INFO  [RMI TCP Connection(15)-54.67.121.105] 2016-05-17 16:57:21,776 StorageService.java:1999 - Node /172.31.31.1 state jump to shutdown
> INFO  [HANDSHAKE-/172.31.17.32] 2016-05-17 16:57:21,895 OutboundTcpConnection.java:514 - Handshaking version with /172.31.17.32
> INFO  [HANDSHAKE-/172.31.24.76] 2016-05-17 16:57:21,895 OutboundTcpConnection.java:514 - Handshaking version with /172.31.24.76
> INFO  [Thread-25] 2016-05-17 16:57:21,925 RepairRunnable.java:125 - Starting repair command #1, repairing keyspace keyspace1 with repair options (parallelism: parallel, primary range: false, incremental: true, job threads: 1, ColumnFamilies: [], dataCenters: [], hosts: [], # of ranges: 3)
> INFO  [Thread-26] 2016-05-17 16:57:21,953 RepairRunnable.java:125 - Starting repair command #2, repairing keyspace stresscql with repair options (parallelism: parallel, primary range: false, incremental: true, job threads: 1, ColumnFamilies: [], dataCenters: [], hosts: [], # of ranges: 3)
> INFO  [Thread-27] 2016-05-17 16:57:21,967 RepairRunnable.java:125 - Starting repair command #3, repairing keyspace system_traces with repair options (parallelism: parallel, primary range: false, incremental: true, job threads: 1, ColumnFamilies: [], dataCenters: [], hosts: [], # of ranges: 2)
> {quote}
> This ends up failing:
> {quote}
> 16:54:44.844 INFO  serverGroup-node-1-574 - STDOUT: [2016-05-17 16:57:21,933] Starting repair command #1, repairing keyspace keyspace1 with repair options (parallelism: parallel, primary range: false, incremental: true, job threads: 1, ColumnFamilies: [], dataCenters: [], hosts: [], # of ranges: 3)
> [2016-05-17 16:57:21,943] Did not get positive replies from all endpoints. List of failed endpoint(s): [172.31.24.76, 172.31.17.32]
> [2016-05-17 16:57:21,945] null
> {quote}
> Subsequent calls to repair with all nodes up still fail:
> {quote}
> ERROR [ValidationExecutor:3] 2016-05-17 18:58:53,460 CompactionManager.java:1193 - Cannot start multiple repair sessions over the same sstables
> ERROR [ValidationExecutor:3] 2016-05-17 18:58:53,460 Validator.java:261 - Failed creating a merkle tree for [repair #66425f10-1c61-11e6-83b2-0b1fff7a067d on keyspace1/standard1,
> {quote}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)