[ 
https://issues.apache.org/jira/browse/CASSANDRA-8260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14221835#comment-14221835
 ] 

Jason Brown edited comment on CASSANDRA-8260 at 11/22/14 1:50 PM:
------------------------------------------------------------------

+1 on the patch, with one small nit: rename the second parameter on the 
overloaded quarantineEndpoint() from quarantineExpiration to quarantineStart 
(or something similar), because the timestamp indicates when the endpoint is 
put into quarantine, not when it should expire.
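
To illustrate the naming point, here is a minimal sketch of what such an 
overload could look like. It is not the code from the attached patch: the 
class name, the map, the isQuarantined() helper, the String endpoint type and 
the 60s constant are all assumptions made for illustration. The only point is 
that the stored value is the quarantine start, and expiry is derived from it 
at check time.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the gossip component that tracks quarantined endpoints.
final class QuarantineSketch {
    // Assumed quarantine length (RING_DELAY * 2 with a 30s RING_DELAY).
    static final long QUARANTINE_DELAY_MS = 60_000L;

    // endpoint -> time (ms) at which it was put into quarantine
    private final Map<String, Long> quarantineStarts = new ConcurrentHashMap<>();

    // Convenience overload: quarantine starts now.
    void quarantineEndpoint(String endpoint) {
        quarantineEndpoint(endpoint, System.currentTimeMillis());
    }

    // The second parameter is when the quarantine begins, not when it ends,
    // which is why quarantineStart describes it better than quarantineExpiration.
    void quarantineEndpoint(String endpoint, long quarantineStart) {
        quarantineStarts.put(endpoint, quarantineStart);
    }

    // Expiry is computed from the stored start time when the check runs.
    boolean isQuarantined(String endpoint, long nowMs) {
        Long start = quarantineStarts.get(endpoint);
        return start != null && (nowMs - start) <= QUARANTINE_DELAY_MS;
    }
}
```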

This is a reasonable fix to resolve this timing issue, but I'll add some 
thoughts to CASSANDRA-8304 about cleaning up the peers.



was (Author: jasobrown):
+1 on the patch, with one small nit: rename the second parameter on the 
overloaded quarantineEndpoint() from quarantineExpiration to quarantineStart 
(or something similar), because the timestamp indicates when the endpoint is 
put into quarantine, not when it should expire.

This is a reasonable fix to resolve this timing issue, but I'll add some 
thoughts to #8304 about cleaning up the peers.


> Replacing a node can leave the old node in system.peers on the replacement
> --------------------------------------------------------------------------
>
>                 Key: CASSANDRA-8260
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8260
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Brandon Williams
>            Assignee: Brandon Williams
>             Fix For: 2.0.12
>
>         Attachments: 8260.txt
>
>
> Here's what happens:
> Nodes: X, Y, Z. Z replaces Y which is dead.
> 0. Replacement finishes
> 1. Z removes Y, quarantines and evicts (that is, removes the state)
> 2. X sees the replacement, quarantines, but keeps state
> 3. 60s elapses
> 4. quarantine on Z expires
> 5. X sends syn to Z, repopulates Y endpoint state and persists to 
> system.peers, but Z sees the conflict and does not update tMD for Y. 
> 6. FatClient timer on Z starts counting.
> 7. quarantine on X expires, fat client has been idle, evicts and 
> re-quarantines
> 8. 30s elapses
> 9. Fat client timeout occurs on Z, evicts and re-quarantines
> 10. 30s elapses
> 11. quarantine on X expires, so it never gets repopulated with Y since Z 
> already removed it
> It's important to note here that there is a small but relevant gap between 
> steps 1 and 2, which then correlates to steps 4 and 5, and step 5 is where 
> the problem occurs. This also explains why it looks related to RING_DELAY, 
> since the quarantine is RING_DELAY * 2, but Y never quarantines and the fat 
> client timeout is RING_DELAY, effectively making the discrepancy near equal 
> to RING_DELAY in the end.
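
The timing in the steps above can be worked through with a little arithmetic. 
The sketch below is not Cassandra code. It assumes a RING_DELAY of 30s 
(inferred from the 60s and 30s intervals in the description), a hypothetical 
500 ms gap between steps 1 and 2, and that the syn in step 5 arrives right as 
the quarantine on Z expires. All class, method and constant names are made up 
for illustration.

```java
// Hypothetical timeline calculator for the replacement scenario (X, Y, Z).
final class ReplacementTimeline {
    static final long RING_DELAY_MS = 30_000L;               // assumed value
    static final long QUARANTINE_MS = RING_DELAY_MS * 2;     // quarantine is RING_DELAY * 2
    static final long FAT_CLIENT_TIMEOUT_MS = RING_DELAY_MS; // fat client timeout is RING_DELAY

    // Time at which a quarantine that began at startMs expires.
    static long quarantineExpiry(long startMs) {
        return startMs + QUARANTINE_MS;
    }

    // Length of the window in which Z's quarantine has lapsed while X's has not,
    // i.e. the period in which X can still gossip Y's state back to Z (step 5).
    static long exposureWindowMs(long zQuarantineStartMs, long xQuarantineStartMs) {
        return Math.max(0L, quarantineExpiry(xQuarantineStartMs) - quarantineExpiry(zQuarantineStartMs));
    }

    public static void main(String[] args) {
        long gapMs = 500L;               // hypothetical gap between steps 1 and 2
        long zStart = 0L;                // step 1: Z quarantines Y
        long xStart = zStart + gapMs;    // step 2: X quarantines Y slightly later

        long zExpiry = quarantineExpiry(zStart);               // step 4
        long xExpiry = quarantineExpiry(xStart);               // step 7, X re-quarantines
        long zFatClientEvict = zExpiry + FAT_CLIENT_TIMEOUT_MS; // step 9, Z evicts Y again
        long xSecondExpiry = quarantineExpiry(xExpiry);        // step 11

        System.out.println("Z quarantine expires at (ms): " + zExpiry);
        System.out.println("X quarantine expires at (ms): " + xExpiry);
        System.out.println("Exposure window (ms): " + exposureWindowMs(zStart, xStart));
        System.out.println("Z fat client eviction at (ms): " + zFatClientEvict);
        System.out.println("X second quarantine expires at (ms): " + xSecondExpiry);
    }
}
```

Under these assumptions the exposure window is exactly as long as the gap 
between steps 1 and 2, which is why even a small gap is enough for X to 
reintroduce Y on Z.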



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
