[ https://issues.apache.org/jira/browse/CASSANDRA-6961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13960455#comment-13960455 ]
Tyler Hobbs edited comment on CASSANDRA-6961 at 4/4/14 9:49 PM:
----------------------------------------------------------------

I'm seeing some issues with repair while one node is running with join_ring=false. Here's what I did:

* Start a three-node ccm cluster
* Start a stress write with RF=3
* Stop node3
* Start node3 with join_ring=false
* Run a repair against node3

It looks like the repair finishes all of its diffing and streaming, but the repair command hangs, and netstats shows continuously increasing completed Command/Response counts.

was (Author: thobbs):
I'm seeing some issues with repair while one node is running with join_ring=false. Here's what I did:

* Start a three-node ccm cluster
* Start a stress write with RF=3
* Stop node3
* Start node3
* Run a repair against node3

It looks like the repair finishes all of its diffing and streaming, but the repair command hangs, and netstats shows continuously increasing completed Command/Response counts.

> nodes should go into hibernate when join_ring is false
> ------------------------------------------------------
>
>                 Key: CASSANDRA-6961
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6961
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Brandon Williams
>            Assignee: Brandon Williams
>             Fix For: 2.0.7
>
>         Attachments: 6961.txt
>
>
> The impetus here is this: a node that was down for some period and comes back
> can serve stale information. We know from CASSANDRA-768 that we can't just
> wait for hints, and we know that the tangentially related CASSANDRA-3569
> prevents us from having the node in a down (from the FD's POV) state handle
> streaming. We can *almost* set join_ring to false, then repair, and then join
> the ring to narrow the window (actually, you can do this and everything
> succeeds because the node doesn't know it's a member yet, which is probably a
> bit of a bug.)
> If instead we modified this to put the node in hibernate, like
> replace_address does, it could work almost like replace, except you could run
> a repair (manually) while in the hibernate state, and then flip to normal
> when it's done.
> This won't prevent staleness 100%, but it will greatly reduce the chance if
> the node has been down for a significant amount of time.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
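For reference, the repro steps from the comment above can be sketched as a ccm session. This is a minimal sketch, not the reporter's exact commands: the cluster name, Cassandra version, stress invocation, and the `--jvm_arg` way of passing `-Dcassandra.join_ring=false` are all assumptions based on typical ccm usage.

```shell
# Create and start a three-node ccm cluster (name and version are assumptions)
ccm create repro-6961 -v 2.0.6 -n 3 -s

# Kick off a stress write with RF=3 (old cassandra-stress syntax; -l sets the
# replication factor, -n the operation count -- both illustrative values)
ccm stress -- -d 127.0.0.1 -l 3 -n 100000

# Stop node3, then restart it without joining the ring
ccm node3 stop
ccm node3 start --jvm_arg="-Dcassandra.join_ring=false"

# Run a repair against node3; watch netstats for the hanging
# Command/Response counters described in the comment
ccm node3 nodetool repair
ccm node3 nodetool netstats
```

With a build containing the attached 6961.txt patch, the expectation is that node3 sits in hibernate while the repair runs, rather than appearing as a normal ring member to its peers.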