[jira] [Commented] (CASSANDRA-13480) nodetool repair can hang forever if we lose the notification for the repair completing/failing

2017-10-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16202716#comment-16202716
 ] 

ASF GitHub Bot commented on CASSANDRA-13480:


Github user Jollyplum closed the pull request at:

https://github.com/apache/cassandra/pull/122


> nodetool repair can hang forever if we lose the notification for the repair 
> completing/failing
> --
>
> Key: CASSANDRA-13480
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13480
> Project: Cassandra
>  Issue Type: Bug
>  Components: Tools
>Reporter: Matt Byrd
>Assignee: Matt Byrd
>Priority: Minor
>  Labels: repair
> Fix For: 4.0
>
>
> When a Jmx lost notification occurs, sometimes the lost notification in 
> question is the notification which let's RepairRunner know that the repair is 
> finished (ProgressEventType.COMPLETE or even ERROR for that matter).
> This results in nodetool process running the repair hanging forever. 
> I have a test which reproduces the issue here:
> https://github.com/Jollyplum/cassandra-dtest/tree/repair_hang_test
> To fix this, If on receiving a notification that notifications have been lost 
> (JMXConnectionNotification.NOTIFS_LOST), we instead query a new endpoint via 
> Jmx to receive all the relevant notifications we're interested in, we can 
> replay those we missed and avoid this scenario.
> It's possible also that the JMXConnectionNotification.NOTIFS_LOST itself 
> might be lost and so for good measure I have made RepairRunner poll 
> periodically to see if there were any notifications that had been sent but we 
> didn't receive (scoped just to the particular tag for the given repair).
> Users who don't use nodetool but go via jmx directly, can still use this new 
> endpoint and implement similar behaviour in their clients as desired.
> I'm also expiring the notifications which have been kept on the server side.
> Please let me know if you've any questions or can think of a different 
> approach, I also tried setting:
>  JVM_OPTS="$JVM_OPTS -Djmx.remote.x.notification.buffer.size=5000"
> but this didn't fix the test. I suppose it might help under certain scenarios 
> but in this test we don't even send that many notifications so I'm not 
> surprised it doesn't fix it.
> It seems like getting lost notifications is always a potential problem with 
> jmx as far as I can tell.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-13480) nodetool repair can hang forever if we lose the notification for the repair completing/failing

2017-04-27 Thread Chris Lohfink (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15988094#comment-15988094
 ] 

Chris Lohfink commented on CASSANDRA-13480:
---

how many notifications you see doesnt impact the notification buffer. JMX will 
create a buffer of notifications and cycle through them indexing new events as 
they are created. The JMX client will request events with the last index it has 
seen. Since the server does not store the state of the clients or know what 
they are listening for, ALL events regardless of listening state are appended 
into buffer. Even if nothing is listening to them all the storage 
notifications, the streaming notifications, the jvm hotspot notifications are 
being pushed onto that buffer. If your client takes too long between polling it 
will get lost notifications (and it will tell you how many it lost). 5000 still 
may not be nearly enough, but its gonna cost the heap dearly to make that value 
too large.

Nodetool actually used to just shut down on lost notifications, but in some 
clusters/workloads its almost impossible for a client to keep up. In 
CASSANDRA-7909 it was just starting to be logged. Querying a different endpoint 
wouldnt really help, only the repair coordinator has the events and it doesnt 
keep it around (and its cycled outta buffer). We could in theory pull expose a 
JMX operation that checks the repair_history table to determine if the repair 
has been completed or errored out, and on lost notifications call it to make 
sure we did not miss a complete event.

> nodetool repair can hang forever if we lose the notification for the repair 
> completing/failing
> --
>
> Key: CASSANDRA-13480
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13480
> Project: Cassandra
>  Issue Type: Bug
>  Components: Tools
>Reporter: Matt Byrd
>Assignee: Matt Byrd
>Priority: Minor
> Fix For: 4.x
>
>
> When a Jmx lost notification occurs, sometimes the lost notification in 
> question is the notification which let's RepairRunner know that the repair is 
> finished (ProgressEventType.COMPLETE or even ERROR for that matter).
> This results in nodetool process running the repair hanging forever. 
> I have a test which reproduces the issue here:
> https://github.com/Jollyplum/cassandra-dtest/tree/repair_hang_test
> To fix this, If on receiving a notification that notifications have been lost 
> (JMXConnectionNotification.NOTIFS_LOST), we instead query a new endpoint via 
> Jmx to receive all the relevant notifications we're interested in, we can 
> replay those we missed and avoid this scenario.
> It's possible also that the JMXConnectionNotification.NOTIFS_LOST itself 
> might be lost and so for good measure I have made RepairRunner poll 
> periodically to see if there were any notifications that had been sent but we 
> didn't receive (scoped just to the particular tag for the given repair).
> Users who don't use nodetool but go via jmx directly, can still use this new 
> endpoint and implement similar behaviour in their clients as desired.
> I'm also expiring the notifications which have been kept on the server side.
> Please let me know if you've any questions or can think of a different 
> approach, I also tried setting:
>  JVM_OPTS="$JVM_OPTS -Djmx.remote.x.notification.buffer.size=5000"
> but this didn't fix the test. I suppose it might help under certain scenarios 
> but in this test we don't even send that many notifications so I'm not 
> surprised it doesn't fix it.
> It seems like getting lost notifications is always a potential problem with 
> jmx as far as I can tell.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-13480) nodetool repair can hang forever if we lose the notification for the repair completing/failing

2017-04-28 Thread Matt Byrd (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15989556#comment-15989556
 ] 

Matt Byrd commented on CASSANDRA-13480:
---

So the patch I have currently also caches the notifications for repairs for a 
limited time on the co-ordinator, it was initially targeting a release where we 
didn't yet have the repair history tables.
I suppose there is a concern that caching these notifications could under some 
circumstances cause unwanted extra heap usage. 
(Similarly to the notifications buffer, although at least here we're only 
caching a subset that we care more about)
So using the repair history tables instead and exposing this information by imx 
seems like a reasonable alternative.
There are perhaps a couple of kinks to work out, but I'll have a go at adapting 
the patch that I have to work in this way.
For one we only have the cmd id int sent back to the nodetool process (rather 
than the parent session id which the internal table is partition keyed off)
We could either keep track of the cmd id int -> parent session uuid in the 
co-ordinator, either in memory cached to expire or in another internal table,
or we could parse the uuid out of the notification sent for the start of the 
parent repair.
Parsing the message is a bit brittle though and not full proof in theory (we 
could miss that notification also).
Ideally I suppose running a repair could return and communicate on the basis of 
the parent session uuid rather than the int cmd id, but this is a pretty major 
overhaul and has all sorts of compatibility questions.

> nodetool repair can hang forever if we lose the notification for the repair 
> completing/failing
> --
>
> Key: CASSANDRA-13480
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13480
> Project: Cassandra
>  Issue Type: Bug
>  Components: Tools
>Reporter: Matt Byrd
>Assignee: Matt Byrd
>Priority: Minor
> Fix For: 4.x
>
>
> When a Jmx lost notification occurs, sometimes the lost notification in 
> question is the notification which let's RepairRunner know that the repair is 
> finished (ProgressEventType.COMPLETE or even ERROR for that matter).
> This results in nodetool process running the repair hanging forever. 
> I have a test which reproduces the issue here:
> https://github.com/Jollyplum/cassandra-dtest/tree/repair_hang_test
> To fix this, If on receiving a notification that notifications have been lost 
> (JMXConnectionNotification.NOTIFS_LOST), we instead query a new endpoint via 
> Jmx to receive all the relevant notifications we're interested in, we can 
> replay those we missed and avoid this scenario.
> It's possible also that the JMXConnectionNotification.NOTIFS_LOST itself 
> might be lost and so for good measure I have made RepairRunner poll 
> periodically to see if there were any notifications that had been sent but we 
> didn't receive (scoped just to the particular tag for the given repair).
> Users who don't use nodetool but go via jmx directly, can still use this new 
> endpoint and implement similar behaviour in their clients as desired.
> I'm also expiring the notifications which have been kept on the server side.
> Please let me know if you've any questions or can think of a different 
> approach, I also tried setting:
>  JVM_OPTS="$JVM_OPTS -Djmx.remote.x.notification.buffer.size=5000"
> but this didn't fix the test. I suppose it might help under certain scenarios 
> but in this test we don't even send that many notifications so I'm not 
> surprised it doesn't fix it.
> It seems like getting lost notifications is always a potential problem with 
> jmx as far as I can tell.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-13480) nodetool repair can hang forever if we lose the notification for the repair completing/failing

2018-05-17 Thread mck (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16479873#comment-16479873
 ] 

mck commented on CASSANDRA-13480:
-

I can see immediate benefit to users of Reaper with this, so would like to see 
it back ported to 3.11.x

[~jjirsa], what are your thoughts?

Regarding the patch:
 - the only public signature added are:
 -- the method {{List StorageServiceMBean.getParentRepairStatus(int 
cmd)}}
 -- the enum {{ActiveRepairService.ParentRepairStatus}}
 -- the method {{ActiveRepairService.recordRepairStatus(…)}}
 - the addition of the 100k entry cache for ParentRepairStatus events

While it seems to be reasonably safe, if it is back ported I wonder if the 
{{getParentRepairStatus(..)}} method should be marked as {{@Beta}}?

> nodetool repair can hang forever if we lose the notification for the repair 
> completing/failing
> --
>
> Key: CASSANDRA-13480
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13480
> Project: Cassandra
>  Issue Type: Bug
>  Components: Tools
>Reporter: Matt Byrd
>Assignee: Matt Byrd
>Priority: Minor
>  Labels: repair
> Fix For: 4.0
>
>
> When a Jmx lost notification occurs, sometimes the lost notification in 
> question is the notification which let's RepairRunner know that the repair is 
> finished (ProgressEventType.COMPLETE or even ERROR for that matter).
> This results in nodetool process running the repair hanging forever. 
> I have a test which reproduces the issue here:
> https://github.com/Jollyplum/cassandra-dtest/tree/repair_hang_test
> To fix this, If on receiving a notification that notifications have been lost 
> (JMXConnectionNotification.NOTIFS_LOST), we instead query a new endpoint via 
> Jmx to receive all the relevant notifications we're interested in, we can 
> replay those we missed and avoid this scenario.
> It's possible also that the JMXConnectionNotification.NOTIFS_LOST itself 
> might be lost and so for good measure I have made RepairRunner poll 
> periodically to see if there were any notifications that had been sent but we 
> didn't receive (scoped just to the particular tag for the given repair).
> Users who don't use nodetool but go via jmx directly, can still use this new 
> endpoint and implement similar behaviour in their clients as desired.
> I'm also expiring the notifications which have been kept on the server side.
> Please let me know if you've any questions or can think of a different 
> approach, I also tried setting:
>  JVM_OPTS="$JVM_OPTS -Djmx.remote.x.notification.buffer.size=5000"
> but this didn't fix the test. I suppose it might help under certain scenarios 
> but in this test we don't even send that many notifications so I'm not 
> surprised it doesn't fix it.
> It seems like getting lost notifications is always a potential problem with 
> jmx as far as I can tell.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-13480) nodetool repair can hang forever if we lose the notification for the repair completing/failing

2018-05-18 Thread Jon Haddad (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16481212#comment-16481212
 ] 

Jon Haddad commented on CASSANDRA-13480:


Assuming it back ports cleanly I'm +1 on bringing it to 3.11.

> nodetool repair can hang forever if we lose the notification for the repair 
> completing/failing
> --
>
> Key: CASSANDRA-13480
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13480
> Project: Cassandra
>  Issue Type: Bug
>  Components: Tools
>Reporter: Matt Byrd
>Assignee: Matt Byrd
>Priority: Minor
>  Labels: repair
> Fix For: 4.0
>
>
> When a Jmx lost notification occurs, sometimes the lost notification in 
> question is the notification which let's RepairRunner know that the repair is 
> finished (ProgressEventType.COMPLETE or even ERROR for that matter).
> This results in nodetool process running the repair hanging forever. 
> I have a test which reproduces the issue here:
> https://github.com/Jollyplum/cassandra-dtest/tree/repair_hang_test
> To fix this, If on receiving a notification that notifications have been lost 
> (JMXConnectionNotification.NOTIFS_LOST), we instead query a new endpoint via 
> Jmx to receive all the relevant notifications we're interested in, we can 
> replay those we missed and avoid this scenario.
> It's possible also that the JMXConnectionNotification.NOTIFS_LOST itself 
> might be lost and so for good measure I have made RepairRunner poll 
> periodically to see if there were any notifications that had been sent but we 
> didn't receive (scoped just to the particular tag for the given repair).
> Users who don't use nodetool but go via jmx directly, can still use this new 
> endpoint and implement similar behaviour in their clients as desired.
> I'm also expiring the notifications which have been kept on the server side.
> Please let me know if you've any questions or can think of a different 
> approach, I also tried setting:
>  JVM_OPTS="$JVM_OPTS -Djmx.remote.x.notification.buffer.size=5000"
> but this didn't fix the test. I suppose it might help under certain scenarios 
> but in this test we don't even send that many notifications so I'm not 
> surprised it doesn't fix it.
> It seems like getting lost notifications is always a potential problem with 
> jmx as far as I can tell.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-13480) nodetool repair can hang forever if we lose the notification for the repair completing/failing

2017-06-19 Thread Matt Byrd (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16054909#comment-16054909
 ] 

Matt Byrd commented on CASSANDRA-13480:
---


||Trunk|||
|[branch|https://github.com/Jollyplum/cassandra/tree/13480]|[testall|https://circleci.com/gh/Jollyplum/cassandra/14]|[dtests|https://builds.apache.org/view/A-D/view/Cassandra/job/Cassandra-devbranch-dtest/98/]|

> nodetool repair can hang forever if we lose the notification for the repair 
> completing/failing
> --
>
> Key: CASSANDRA-13480
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13480
> Project: Cassandra
>  Issue Type: Bug
>  Components: Tools
>Reporter: Matt Byrd
>Assignee: Matt Byrd
>Priority: Minor
> Fix For: 4.x
>
>
> When a Jmx lost notification occurs, sometimes the lost notification in 
> question is the notification which let's RepairRunner know that the repair is 
> finished (ProgressEventType.COMPLETE or even ERROR for that matter).
> This results in nodetool process running the repair hanging forever. 
> I have a test which reproduces the issue here:
> https://github.com/Jollyplum/cassandra-dtest/tree/repair_hang_test
> To fix this, If on receiving a notification that notifications have been lost 
> (JMXConnectionNotification.NOTIFS_LOST), we instead query a new endpoint via 
> Jmx to receive all the relevant notifications we're interested in, we can 
> replay those we missed and avoid this scenario.
> It's possible also that the JMXConnectionNotification.NOTIFS_LOST itself 
> might be lost and so for good measure I have made RepairRunner poll 
> periodically to see if there were any notifications that had been sent but we 
> didn't receive (scoped just to the particular tag for the given repair).
> Users who don't use nodetool but go via jmx directly, can still use this new 
> endpoint and implement similar behaviour in their clients as desired.
> I'm also expiring the notifications which have been kept on the server side.
> Please let me know if you've any questions or can think of a different 
> approach, I also tried setting:
>  JVM_OPTS="$JVM_OPTS -Djmx.remote.x.notification.buffer.size=5000"
> but this didn't fix the test. I suppose it might help under certain scenarios 
> but in this test we don't even send that many notifications so I'm not 
> surprised it doesn't fix it.
> It seems like getting lost notifications is always a potential problem with 
> jmx as far as I can tell.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-13480) nodetool repair can hang forever if we lose the notification for the repair completing/failing

2017-06-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16054911#comment-16054911
 ] 

ASF GitHub Bot commented on CASSANDRA-13480:


GitHub user Jollyplum opened a pull request:

https://github.com/apache/cassandra/pull/122

When lost notifications occur and periodically, check for the parent …

…repair status and exit if we've completed/failed

patch by Matt Byrd, reviewed by Chris Lohfink for CASSANDRA-13480

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/Jollyplum/cassandra 13480

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/cassandra/pull/122.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #122


commit b4c0a5a65ef94b1013793adb088ca11f563ff14b
Author: Matt Byrd 
Date:   2017-05-27T00:03:17Z

When lost notifications occur and periodically, check for the parent repair 
status and exit if we've completed/failed
patch by Matt Byrd, reviewed by Chris Lohfink for CASSANDRA-13480




> nodetool repair can hang forever if we lose the notification for the repair 
> completing/failing
> --
>
> Key: CASSANDRA-13480
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13480
> Project: Cassandra
>  Issue Type: Bug
>  Components: Tools
>Reporter: Matt Byrd
>Assignee: Matt Byrd
>Priority: Minor
> Fix For: 4.x
>
>
> When a Jmx lost notification occurs, sometimes the lost notification in 
> question is the notification which let's RepairRunner know that the repair is 
> finished (ProgressEventType.COMPLETE or even ERROR for that matter).
> This results in nodetool process running the repair hanging forever. 
> I have a test which reproduces the issue here:
> https://github.com/Jollyplum/cassandra-dtest/tree/repair_hang_test
> To fix this, If on receiving a notification that notifications have been lost 
> (JMXConnectionNotification.NOTIFS_LOST), we instead query a new endpoint via 
> Jmx to receive all the relevant notifications we're interested in, we can 
> replay those we missed and avoid this scenario.
> It's possible also that the JMXConnectionNotification.NOTIFS_LOST itself 
> might be lost and so for good measure I have made RepairRunner poll 
> periodically to see if there were any notifications that had been sent but we 
> didn't receive (scoped just to the particular tag for the given repair).
> Users who don't use nodetool but go via jmx directly, can still use this new 
> endpoint and implement similar behaviour in their clients as desired.
> I'm also expiring the notifications which have been kept on the server side.
> Please let me know if you've any questions or can think of a different 
> approach, I also tried setting:
>  JVM_OPTS="$JVM_OPTS -Djmx.remote.x.notification.buffer.size=5000"
> but this didn't fix the test. I suppose it might help under certain scenarios 
> but in this test we don't even send that many notifications so I'm not 
> surprised it doesn't fix it.
> It seems like getting lost notifications is always a potential problem with 
> jmx as far as I can tell.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-13480) nodetool repair can hang forever if we lose the notification for the repair completing/failing

2017-06-29 Thread Chris Lohfink (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16068439#comment-16068439
 ] 

Chris Lohfink commented on CASSANDRA-13480:
---

+1

> nodetool repair can hang forever if we lose the notification for the repair 
> completing/failing
> --
>
> Key: CASSANDRA-13480
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13480
> Project: Cassandra
>  Issue Type: Bug
>  Components: Tools
>Reporter: Matt Byrd
>Assignee: Matt Byrd
>Priority: Minor
>  Labels: repair
> Fix For: 4.x
>
>
> When a Jmx lost notification occurs, sometimes the lost notification in 
> question is the notification which let's RepairRunner know that the repair is 
> finished (ProgressEventType.COMPLETE or even ERROR for that matter).
> This results in nodetool process running the repair hanging forever. 
> I have a test which reproduces the issue here:
> https://github.com/Jollyplum/cassandra-dtest/tree/repair_hang_test
> To fix this, If on receiving a notification that notifications have been lost 
> (JMXConnectionNotification.NOTIFS_LOST), we instead query a new endpoint via 
> Jmx to receive all the relevant notifications we're interested in, we can 
> replay those we missed and avoid this scenario.
> It's possible also that the JMXConnectionNotification.NOTIFS_LOST itself 
> might be lost and so for good measure I have made RepairRunner poll 
> periodically to see if there were any notifications that had been sent but we 
> didn't receive (scoped just to the particular tag for the given repair).
> Users who don't use nodetool but go via jmx directly, can still use this new 
> endpoint and implement similar behaviour in their clients as desired.
> I'm also expiring the notifications which have been kept on the server side.
> Please let me know if you've any questions or can think of a different 
> approach, I also tried setting:
>  JVM_OPTS="$JVM_OPTS -Djmx.remote.x.notification.buffer.size=5000"
> but this didn't fix the test. I suppose it might help under certain scenarios 
> but in this test we don't even send that many notifications so I'm not 
> surprised it doesn't fix it.
> It seems like getting lost notifications is always a potential problem with 
> jmx as far as I can tell.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-13480) nodetool repair can hang forever if we lose the notification for the repair completing/failing

2018-02-27 Thread Tania S Engel (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16378919#comment-16378919
 ] 

Tania S Engel commented on CASSANDRA-13480:
---

[~mbyrd] : I have reason to believe I just hit this in 3.11.1, I at the very 
least ran into a repair which has never completed on an 11 node cluster. Is 
there a way to get this fix in 3.11?

> nodetool repair can hang forever if we lose the notification for the repair 
> completing/failing
> --
>
> Key: CASSANDRA-13480
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13480
> Project: Cassandra
>  Issue Type: Bug
>  Components: Tools
>Reporter: Matt Byrd
>Assignee: Matt Byrd
>Priority: Minor
>  Labels: repair
> Fix For: 4.0
>
>
> When a Jmx lost notification occurs, sometimes the lost notification in 
> question is the notification which let's RepairRunner know that the repair is 
> finished (ProgressEventType.COMPLETE or even ERROR for that matter).
> This results in nodetool process running the repair hanging forever. 
> I have a test which reproduces the issue here:
> https://github.com/Jollyplum/cassandra-dtest/tree/repair_hang_test
> To fix this, If on receiving a notification that notifications have been lost 
> (JMXConnectionNotification.NOTIFS_LOST), we instead query a new endpoint via 
> Jmx to receive all the relevant notifications we're interested in, we can 
> replay those we missed and avoid this scenario.
> It's possible also that the JMXConnectionNotification.NOTIFS_LOST itself 
> might be lost and so for good measure I have made RepairRunner poll 
> periodically to see if there were any notifications that had been sent but we 
> didn't receive (scoped just to the particular tag for the given repair).
> Users who don't use nodetool but go via jmx directly, can still use this new 
> endpoint and implement similar behaviour in their clients as desired.
> I'm also expiring the notifications which have been kept on the server side.
> Please let me know if you've any questions or can think of a different 
> approach, I also tried setting:
>  JVM_OPTS="$JVM_OPTS -Djmx.remote.x.notification.buffer.size=5000"
> but this didn't fix the test. I suppose it might help under certain scenarios 
> but in this test we don't even send that many notifications so I'm not 
> surprised it doesn't fix it.
> It seems like getting lost notifications is always a potential problem with 
> jmx as far as I can tell.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org