[ 
https://issues.apache.org/jira/browse/CASSANDRA-13480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16054911#comment-16054911
 ] 

ASF GitHub Bot commented on CASSANDRA-13480:
--------------------------------------------

GitHub user Jollyplum opened a pull request:

    https://github.com/apache/cassandra/pull/122

    When lost notifications occur and periodically, check for the parent …

    …repair status and exit if we've completed/failed
    
    patch by Matt Byrd, reviewed by Chris Lohfink for CASSANDRA-13480

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/Jollyplum/cassandra 13480

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/cassandra/pull/122.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #122
    
----
commit b4c0a5a65ef94b1013793adb088ca11f563ff14b
Author: Matt Byrd <matthew.l.b...@gmail.com>
Date:   2017-05-27T00:03:17Z

    When lost notifications occur and periodically, check for the parent
    repair status and exit if we've completed/failed
    patch by Matt Byrd, reviewed by Chris Lohfink for CASSANDRA-13480

----


> nodetool repair can hang forever if we lose the notification for the repair 
> completing/failing
> ----------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-13480
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13480
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Tools
>            Reporter: Matt Byrd
>            Assignee: Matt Byrd
>            Priority: Minor
>             Fix For: 4.x
>
>
> When JMX notifications are lost, the lost notification is sometimes the one
> that lets RepairRunner know that the repair has finished
> (ProgressEventType.COMPLETE, or ERROR for that matter).
> This leaves the nodetool process that is running the repair hanging forever.
> I have a test which reproduces the issue here:
> https://github.com/Jollyplum/cassandra-dtest/tree/repair_hang_test
> To fix this, on receiving a notification that notifications have been lost
> (JMXConnectionNotification.NOTIFS_LOST), we query a new endpoint via JMX to
> retrieve all the notifications we're interested in, replay those we missed,
> and so avoid this scenario.
> It's also possible that the JMXConnectionNotification.NOTIFS_LOST
> notification itself is lost, so for good measure I have made RepairRunner
> poll periodically for any notifications that were sent but that we didn't
> receive (scoped just to the particular tag for the given repair).
> Users who don't use nodetool but go via JMX directly can still use this new
> endpoint and implement similar behaviour in their clients as desired.
> I'm also expiring the notifications which have been kept on the server side.
> Please let me know if you have any questions or can think of a different
> approach. I also tried setting:
>  JVM_OPTS="$JVM_OPTS -Djmx.remote.x.notification.buffer.size=5000"
> but this didn't fix the test. I suppose it might help in certain scenarios,
> but in this test we don't even send that many notifications, so I'm not
> surprised it doesn't fix it.
> As far as I can tell, lost notifications are always a potential problem with
> JMX.
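
For readers following the approach described above, here is a minimal Java
sketch of the client-side idea: react to JMXConnectionNotification.NOTIFS_LOST
by asking the server for the current state, and also poll on a timer in case
NOTIFS_LOST itself never arrives. This is not the code from the patch. The
class name RepairWatcher, the Status enum and the statusQuery supplier are all
hypothetical; the supplier stands in for whatever JMX call the real endpoint
exposes.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;
import javax.management.Notification;
import javax.management.NotificationListener;
import javax.management.remote.JMXConnectionNotification;

// Hypothetical sketch; not the RepairRunner code from the patch.
final class RepairWatcher implements NotificationListener {
    // Hypothetical states; COMPLETE and ERROR mirror the event types named in the ticket.
    enum Status { RUNNING, COMPLETE, ERROR }

    // Assumption: stands in for a JMX call that asks the server for the repair's status.
    private final Supplier<Status> statusQuery;
    private final CountDownLatch finished = new CountDownLatch(1);
    private volatile Status status = Status.RUNNING;

    RepairWatcher(Supplier<Status> statusQuery) {
        this.statusQuery = statusQuery;
    }

    @Override
    public void handleNotification(Notification notification, Object handback) {
        // The JMX connector uses this type to report dropped notifications.
        if (JMXConnectionNotification.NOTIFS_LOST.equals(notification.getType())) {
            checkStatus();
        }
    }

    // Ask the server directly instead of trusting that we saw every event.
    void checkStatus() {
        Status current = statusQuery.get();
        if (current != Status.RUNNING) {
            status = current;
            finished.countDown();
        }
    }

    Status status() {
        return status;
    }

    // Block until the repair ends, re-checking every pollMillis in case
    // NOTIFS_LOST itself was never delivered.
    Status await(long pollMillis) {
        try {
            while (!finished.await(pollMillis, TimeUnit.MILLISECONDS)) {
                checkStatus();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        }
        return status;
    }
}
```

In a real client such a listener would presumably be registered through
JMXConnector.addConnectionNotificationListener, since that is where the
connector delivers NOTIFS_LOST. The sketch only checks a terminal status; the
ticket describes replaying the missed notifications, which would need the new
server-side endpoint.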



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org
