[ https://issues.apache.org/jira/browse/CASSANDRA-13480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16054911#comment-16054911 ]
ASF GitHub Bot commented on CASSANDRA-13480:
--------------------------------------------

GitHub user Jollyplum opened a pull request:

    https://github.com/apache/cassandra/pull/122

    When lost notifications occur and periodically, check for the parent …

    …repair status and exit if we've completed/failed

    patch by Matt Byrd, reviewed by Chris Lohfink for CASSANDRA-13480

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/Jollyplum/cassandra 13480

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/cassandra/pull/122.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #122

----
commit b4c0a5a65ef94b1013793adb088ca11f563ff14b
Author: Matt Byrd <matthew.l.b...@gmail.com>
Date:   2017-05-27T00:03:17Z

    When lost notifications occur and periodically, check for the parent repair status and exit if we've completed/failed

    patch by Matt Byrd, reviewed by Chris Lohfink for CASSANDRA-13480

----

> nodetool repair can hang forever if we lose the notification for the repair completing/failing
> ----------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-13480
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13480
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Tools
>            Reporter: Matt Byrd
>            Assignee: Matt Byrd
>            Priority: Minor
>             Fix For: 4.x
>
>
> When a JMX lost notification occurs, sometimes the lost notification in
> question is the notification which lets RepairRunner know that the repair is
> finished (ProgressEventType.COMPLETE or even ERROR for that matter).
> This results in the nodetool process running the repair hanging forever.
> I have a test which reproduces the issue here:
> https://github.com/Jollyplum/cassandra-dtest/tree/repair_hang_test
> To fix this, on receiving a notification that notifications have been lost
> (JMXConnectionNotification.NOTIFS_LOST), we instead query a new endpoint via
> JMX to retrieve all the relevant notifications we're interested in, so that
> we can replay those we missed and avoid this scenario.
> It's also possible that the JMXConnectionNotification.NOTIFS_LOST itself
> might be lost, so for good measure I have made RepairRunner poll
> periodically to see if there were any notifications that had been sent but
> that we didn't receive (scoped just to the particular tag for the given repair).
> Users who don't use nodetool but go via JMX directly can still use this new
> endpoint and implement similar behaviour in their clients as desired.
> I'm also expiring the notifications which have been kept on the server side.
> Please let me know if you've any questions or can think of a different
> approach. I also tried setting:
> JVM_OPTS="$JVM_OPTS -Djmx.remote.x.notification.buffer.size=5000"
> but this didn't fix the test. I suppose it might help under certain scenarios,
> but in this test we don't even send that many notifications, so I'm not
> surprised it doesn't fix it.
> It seems like getting lost notifications is always a potential problem with
> JMX as far as I can tell.
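
For readers who talk to JMX directly, the client-side pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code from the pull request: the class name RepairWatcher, the Status enum, and the Supplier used as a stand-in for the new JMX endpoint are all hypothetical, since the message does not name the endpoint. Only JMXConnectionNotification.NOTIFS_LOST and the javax.management listener types are real JDK API. The sketch checks the parent repair status when a lost-notifications event arrives, and also on a timer in case that event is itself lost.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;
import javax.management.Notification;
import javax.management.NotificationListener;
import javax.management.remote.JMXConnectionNotification;

/** Hypothetical sketch of a client that cannot hang on a lost completion event. */
class RepairWatcher implements NotificationListener {
    /** Hypothetical parent repair states; the actual patch may model these differently. */
    public enum Status { IN_PROGRESS, COMPLETED, FAILED }

    // Stand-in for a call to the new JMX endpoint (assumption: it can report
    // whether the parent repair has completed or failed).
    private final Supplier<Status> statusQuery;
    private final CountDownLatch done = new CountDownLatch(1);
    private volatile Status finalStatus = Status.IN_PROGRESS;

    RepairWatcher(Supplier<Status> statusQuery) {
        this.statusQuery = statusQuery;
    }

    /** Reacts to NOTIFS_LOST by asking the server for the current status. */
    @Override
    public void handleNotification(Notification notification, Object handback) {
        if (JMXConnectionNotification.NOTIFS_LOST.equals(notification.getType())) {
            checkStatus();
        }
    }

    /** Queries the status once and releases waiters if the repair has ended. */
    public void checkStatus() {
        Status status = statusQuery.get();
        if (status != Status.IN_PROGRESS) {
            finalStatus = status;
            done.countDown();
        }
    }

    /**
     * Polls periodically, covering the case where NOTIFS_LOST is also lost.
     * The caller is responsible for shutting the returned executor down.
     */
    public ScheduledExecutorService startPolling(long periodSeconds) {
        ScheduledExecutorService poller = Executors.newSingleThreadScheduledExecutor();
        poller.scheduleAtFixedRate(this::checkStatus, periodSeconds, periodSeconds, TimeUnit.SECONDS);
        return poller;
    }

    /** Blocks until the repair has ended or the timeout elapses. */
    public boolean await(long timeout, TimeUnit unit) throws InterruptedException {
        return done.await(timeout, unit);
    }

    public boolean isDone() {
        return done.getCount() == 0;
    }

    public Status finalStatus() {
        return finalStatus;
    }
}
```

A client would register the watcher as a connection notification listener on its JMXConnector, call startPolling with a period of its choosing, and then block in await. Note that this sketch only checks the final status; replaying the individual missed notifications, as the description also mentions, would need the endpoint to return them.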