[ https://issues.apache.org/jira/browse/CASSANDRA-13327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15943870#comment-15943870 ]
Ariel Weisberg commented on CASSANDRA-13327:
--------------------------------------------

OK. So for the node to be removed from {{TokenMetadata.pendingEndpointsFor}}, it must appear as a natural endpoint from {{StorageService.getNaturalEndpoints()}}; otherwise writes would never make it to that participant at all. I guess I need to look up when that transition occurs now and whether doing that transition sooner makes sense.

> Pending endpoints size check for CAS doesn't play nicely with writes-on-replacement
> -----------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-13327
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13327
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Coordination
>            Reporter: Ariel Weisberg
>            Assignee: Ariel Weisberg
>
> Consider this ring:
> 127.0.0.1  MR  UP    JOINING  -7301836195843364181
> 127.0.0.2  MR  UP    NORMAL   -7263405479023135948
> 127.0.0.3  MR  UP    NORMAL   -7205759403792793599
> 127.0.0.4  MR  DOWN  NORMAL   -7148113328562451251
> where 127.0.0.1 was bootstrapping for cluster expansion. Note that, due to
> the failure of 127.0.0.4, 127.0.0.1 was stuck trying to stream from it and
> making no progress.
> Then the down node was replaced so we had:
> 127.0.0.1  MR  UP    JOINING  -7301836195843364181
> 127.0.0.2  MR  UP    NORMAL   -7263405479023135948
> 127.0.0.3  MR  UP    NORMAL   -7205759403792793599
> 127.0.0.5  MR  UP    JOINING  -7148113328562451251
> It’s confusing in the ring: the first JOINING is a genuine bootstrap, the
> second is a replacement. We now had CAS unavailables (but no non-CAS
> unavailables). I think it’s because the pending endpoints check thinks that
> 127.0.0.5 is gaining a range when it’s just replacing.
> The workaround is to kill the stuck JOINING node, but Cassandra shouldn’t
> unnecessarily fail these requests.
> It also appears that required participants is bumped by 1 during a host
> replacement, so if the replacing host fails you will get unavailables and
> timeouts.
> This is related to the check added in CASSANDRA-8346
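
To make the two symptoms in the description concrete, here is a minimal sketch of the kind of participant check being discussed. It is a simplified model, not the Cassandra source. It assumes that required participants is a majority of natural plus pending endpoints, and that CAS is rejected outright when more than one endpoint is pending. The class name, method names, and the replication factor of 3 are all hypothetical and chosen only for illustration.

```java
import java.util.List;

// Hypothetical, simplified model of a CAS participant check (not Cassandra code).
final class PaxosParticipantSketch {

    // Assumption: a majority of natural + pending endpoints must respond.
    static int requiredParticipants(List<String> natural, List<String> pending) {
        return (natural.size() + pending.size()) / 2 + 1;
    }

    // Assumption: CAS is refused when more than one endpoint is pending for the
    // token, regardless of whether an endpoint is bootstrapping or replacing.
    static boolean rejectsCas(List<String> pending) {
        return pending.size() > 1;
    }

    public static void main(String[] args) {
        // Assumed RF=3: natural replicas for some token in the example ring.
        List<String> natural = List.of("127.0.0.2", "127.0.0.3", "127.0.0.4");

        // Steady state: no pending endpoints.
        System.out.println(requiredParticipants(natural, List.of()));

        // Replacement only: the replacing node counts as pending, so the
        // required count is computed over one extra endpoint.
        System.out.println(requiredParticipants(natural, List.of("127.0.0.5")));

        // Stuck bootstrap plus replacement: two pending endpoints.
        System.out.println(rejectsCas(List.of("127.0.0.1", "127.0.0.5")));
    }
}
```

Under these assumptions, a replacement alone raises the required count from 2 to 3, and the stuck bootstrap plus the replacement gives two pending endpoints, which is enough for the check to refuse CAS even though the replacing node is not gaining a new range.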