[
https://issues.apache.org/jira/browse/IGNITE-882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14611911#comment-14611911
]
Denis Magda commented on IGNITE-882:
------------------------------------
The cause of the issue has been found.
We have a half-joined node problem here.
1) Node_A wants to join a cluster and sends a join request;
2) Node_B processes this join request and responds to Node_A;
3) Node_A receives the response from Node_B, and after that the "node add
finished" message is propagated to the cluster;
4) Node_A receives "node add finished", but because of thread scheduling the
message is not fully processed by SocketReader before netTimeout expires, so
Node_A's thread that is responsible for joining sends one more join request;
5) In parallel with 4), it is decided that Node_A has left the ring (because
there was no response from it within some timeout) and Node_A is removed from
the ring;
6) Node_B/C/whatever receives the second join request from Node_A, and this
attempt succeeds: Node_A is added to the topology, but with the same ID as
before.
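To make the sequence above easier to follow, here is a minimal toy model of it.
This is only a sketch, not Ignite code: the class HalfJoinedNodeSketch, the Ring
type and its methods are hypothetical names invented for illustration, and the
"v" counter is a simplified stand-in for the topology order seen in the logs.
The toy ring only checks whether an ID is currently a member, so a retried join
request that arrives after the node was dropped is accepted with the same ID.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

public class HalfJoinedNodeSketch {
    /** Toy view of the ring kept by the remaining nodes (hypothetical, not TcpDiscoverySpi). */
    static final class Ring {
        private final List<UUID> members = new ArrayList<>();
        private final List<String> log = new ArrayList<>();
        private long version; // simplified counter, bumped on every topology change

        /** Handles a join request; only checks current membership (assumption of this sketch). */
        void onJoinRequest(UUID id) {
            if (!members.contains(id)) {
                members.add(id);
                log.add("ADDED " + id + " v" + (++version));
            }
        }

        /** Drops a node that did not respond within the timeout. */
        void onNoResponse(UUID id) {
            if (members.remove(id))
                log.add("FAILED " + id + " v" + (++version));
        }

        List<String> log() {
            return log;
        }
    }

    /** Replays steps 1-6 for a single joining node and returns the resulting event log. */
    static List<String> replay(UUID nodeA) {
        Ring ring = new Ring();
        ring.onJoinRequest(nodeA); // steps 1-3: first join request is accepted
        ring.onNoResponse(nodeA);  // step 5: Node_A looks unresponsive and is removed
        ring.onJoinRequest(nodeA); // steps 4 and 6: retried request is accepted again
        return ring.log();
    }

    public static void main(String[] args) {
        UUID nodeA = UUID.fromString("10cf7906-50af-4f46-9c31-baf419539001");
        replay(nodeA).forEach(System.out::println);
    }
}
```

Running main prints an ADDED, a FAILED and a second ADDED entry for the same ID,
which mirrors the order of events in the test log quoted below.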
Thinking over a fix.
> Node can join twice with the same ID
> ------------------------------------
>
> Key: IGNITE-882
> URL: https://issues.apache.org/jira/browse/IGNITE-882
> Project: Ignite
> Issue Type: Bug
> Components: general
> Reporter: Semen Boikov
> Assignee: Denis Magda
> Priority: Critical
> Fix For: sprint-7
>
>
> Observed in the test
> 'GridCacheColocatedFailoverSelfTest.testOptimisticRepeatableReadTxConstantTopologyChange':
> Node joined:
> {noformat}
> [15:53:24,163][INFO
> ][disco-event-worker-#121%dht.GridCacheColocatedFailoverSelfTest0%][GridDiscoveryManager]
> Added new node to topology: TcpDiscoveryNode
> [id=10cf7906-50af-4f46-9c31-baf419539001, addrs=[127.0.0.1],
> sockAddrs=[/127.0.0.1:47525], discPort=47525, order=400, intOrder=202,
> loc=false, ver=1.0.3#19700101-sha1:00000000, isClient=false]
> {noformat}
> Node failed:
> {noformat}
> [15:53:24,171][WARN
> ][disco-event-worker-#121%dht.GridCacheColocatedFailoverSelfTest0%][GridDiscoveryManager]
> Node FAILED: TcpDiscoveryNode [id=10cf7906-50af-4f46-9c31-baf419539001,
> addrs=[127.0.0.1], sockAddrs=[/127.0.0.1:47525], discPort=47525, order=400,
> intOrder=202, loc=false, ver=1.0.3#19700101-sha1:00000000, isClient=false]
> {noformat}
> Then I see this message from the thread starting the new node:
> {noformat}
> [15:53:29,047][WARN ][topology-change-thread-1][TcpDiscoverySpi] Node has not
> been connected to topology and will repeat join process. Check remote nodes
> logs for possible error messages. Note that large topology may require
> significant time to start. Increase 'TcpDiscoverySpi.networkTimeout'
> configuration property if getting this message on the starting nodes
> [networkTimeout=5000]
> {noformat}
> Node joined again with the same ID:
> {noformat}
> [15:53:29,212][INFO
> ][disco-event-worker-#121%dht.GridCacheColocatedFailoverSelfTest0%][GridDiscoveryManager]
> Added new node to topology: TcpDiscoveryNode
> [id=10cf7906-50af-4f46-9c31-baf419539001, addrs=[127.0.0.1],
> sockAddrs=[/127.0.0.1:47525], discPort=47525, order=404, intOrder=205,
> loc=false, ver=1.0.3#19700101-sha1:00000000, isClient=false]
> {noformat}
> Then the test hangs (in the log I see that a future mapped on the node
> '10cf7906-50af-4f46-9c31-baf419539001' did not finish).
> The same issue is observed in tests extending
> GridCacheAbstractNodeRestartSelfTest.