[ 
https://issues.apache.org/jira/browse/CASSANDRA-16381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17286125#comment-17286125
 ] 

Adam Holmberg commented on CASSANDRA-16381:
-------------------------------------------

I've been looking at this, but don't let that deter anyone else who wants to 
have a go. I have no prior knowledge of these areas of the code.

bq. This will cause streaming errors and then the hosts involved in streaming 
will also lose network connectivity to each other and begin dropping gossip syn 
messages.

That doesn't match exactly what I've seen here on trunk. We should see if we 
can corroborate. I'll describe my observations here:

When running Brandon's simple test, intermittently (but consistently) we see 
removenode hang. When in this state, the remaining servers are still responsive 
to client traffic, but the nodetool command will hang forever (until the test 
times out). The remaining nodes continue attempting to reconnect to the node 
intended for removal. Meanwhile a previously queued SYN message to that node 
expires. I think the reconnect and message expiry are non-issues, assuming they 
would stop once the node removal completes. I will come back to that after 
figuring out the primary issue. On to that...

{{removenode}} is hanging forever in [this 
loop|https://github.com/apache/cassandra/blob/e8a9d4203c81e622fc2418d2faf2593e2123161e/src/java/org/apache/cassandra/service/StorageService.java#L4732-L4735]
 on the coordinating node1 because it is never receiving the final 
{{REPLICATION_DONE_REQ}} message sent by node2. Both nodes appear to complete 
their streaming sessions without error and send that message. We never see the 
one from node2 arrive at node1. At the same time, I'm seeing gossip diverge 
between node1 and node2. They eventually convict each other. The node2 streaming 
session complete message [times 
out|https://github.com/apache/cassandra/blob/e8a9d4203c81e622fc2418d2faf2593e2123161e/src/java/org/apache/cassandra/service/StorageService.java#L3027-L3028]
 having never seen a response. The loop is exited because node1 is not "alive" 
immediately after the conviction.
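
To make the hang concrete, here is a toy model in Python of a coordinator polling until every replicating node has confirmed. It is only a sketch and not the actual StorageService code: the function name, its parameters, and the {{max_wait}} bound are hypothetical and exist so the example terminates. The loop linked above has no such bound, which is why a single lost confirmation leaves removenode waiting indefinitely.

```python
import time


def wait_for_confirmations(pending, delivered, poll_interval=0.01, max_wait=0.1):
    """Poll until every node in `pending` has confirmed, or `max_wait` elapses.

    `delivered` lists the nodes whose confirmation actually arrived at the
    coordinator. Returns the set of nodes still unconfirmed; an empty set
    means the wait completed normally.
    """
    pending = set(pending)
    arrivals = list(delivered)
    deadline = time.monotonic() + max_wait
    while pending:
        if arrivals:
            # A confirmation arrived: stop waiting on that node.
            pending.discard(arrivals.pop(0))
            continue
        if time.monotonic() >= deadline:
            # Illustrative escape hatch only (an assumption of this sketch).
            break
        time.sleep(poll_interval)
    return pending


# Both confirmations arrive: the wait finishes.
print(wait_for_confirmations({"node1", "node2"}, ["node1", "node2"]))
# The confirmation from node2 is lost: node2 stays pending.
print(wait_for_confirmations({"node1", "node2"}, ["node1"]))
```

In the second call only the artificial deadline ends the wait, which mirrors the observation that the message from node2 never arrives at node1.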

Current avenues of investigation:
1.) Look into why the replication done request is not being sent/received. 
Presently I think I see it being enqueued, but never serialized and sent.
2.) Understand gossip diverging at the end of the streaming session.

> nodetool removenode error “Conflicting replica added”
> -----------------------------------------------------
>
>                 Key: CASSANDRA-16381
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-16381
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Consistency/Bootstrap and Decommission
>            Reporter: vincent royer
>            Assignee: Adam Holmberg
>            Priority: Normal
>             Fix For: 4.0-beta
>
>         Attachments: dtest.tar.bz2, node1.tar.bz2, node2.tar.bz2, 
> node3.tar.bz2
>
>
> When testing elassandra on C* 4.0, integration tests with ccm systematically 
> failed on removing a node with the following error “Conflicting replica 
> added” . [This integration test 
> |https://github.com/strapdata/elassandra/blob/v6.8.4-strapdata/integ-test/test-cleanup-repair.sh#L289]
>  was ok with Elassandra based on Cassandra 3.11, and there are no changes in 
> that test. Moreover, it seems there is no cassandra-test (dtest) for removing 
> a node (there is only one removenode test for transient replication). The 
> topology_test.py removes a node from the CCM cluster, but it does not call 
> nodetool removenode.
> I wonder if we have an untested regression here in C* 4.0?
> ++ ccm node1 nodetool status
> ++ awk ‘/127.0.0.3/ \{ print $7 }’
> + HOST_ID3=6d2e858f-dacc-4c7c-a626-14b45f6b3b94
> + ccm node3 stop
> + ccm node1 nodetool removenode 6d2e858f-dacc-4c7c-a626-14b45f6b3b94
> Traceback (most recent call last):
>   File “/usr/local/bin/ccm”, line 4, in <module>
>     __import__(‘pkg_resources’).run_script(‘ccm==3.1.6’, ‘ccm’)
>   File 
> “/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/pkg_resources/__init__.py”,
>  line 742, in run_script
>     self.require(requires)[0].run_script(script_name, ns)
>   File 
> “/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/pkg_resources/__init__.py”,
>  line 1674, in run_script
>     exec(script_code, namespace, namespace)
>   File 
> “/Library/Python/2.7/site-packages/ccm-3.1.6-py2.7.egg/EGG-INFO/scripts/ccm”, 
> line 112, in <module>
>   File “build/bdist.macosx-10.14-intel/egg/ccmlib/cmds/node_cmds.py”, line 
> 233, in run
>   File “build/bdist.macosx-10.14-intel/egg/ccmlib/node.py”, line 848, in 
> nodetool
>   File “build/bdist.macosx-10.14-intel/egg/ccmlib/node.py”, line 2131, in 
> handle_external_tool_process
> ccmlib.node.ToolError: Subprocess [‘nodetool’, ‘-h’, ‘localhost’, ‘-p’, 
> ‘7100’, ‘removenode’, ‘6d2e858f-dacc-4c7c-a626-14b45f6b3b94’] exited with 
> non-zero status; exit status: 1;
> stdout: nodetool: Conflicting replica added (expected unique ranges): 
> Full(/127.0.0.1:7000,(4949329179655327935,6135417578204142297]); existing: 
> Full(/127.0.0.1:7000,(4949329179655327935,6135417578204142297])
> See ‘nodetool help’ or ‘nodetool help <command>’.
> ++ finish
> ++ echo ‘ERROR occurs, test failed’
> ERROR occurs, test failed
> ++ ‘[’ ‘!’ -z ‘’ ‘]’
> ++ exit 1


