[ https://issues.apache.org/jira/browse/IGNITE-10933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16749985#comment-16749985 ]
Ignite TC Bot commented on IGNITE-10933:
----------------------------------------

{panel:title=--> Run :: All: Possible Blockers|borderStyle=dashed|borderColor=#ccc|titleBGColor=#F7D6C1}
{color:#d04437}Cache 6{color} [[tests 0 TIMEOUT , Exit Code |https://ci.ignite.apache.org/viewLog.html?buildId=2870837]]
* GridCachePartitionEvictionDuringReadThroughSelfTest.testPartitionRent (last started)

{color:#d04437}MVCC Queries{color} [[tests 3|https://ci.ignite.apache.org/viewLog.html?buildId=2879340]]
* IgniteCacheMvccSqlTestSuite: CacheMvccReplicatedSqlTxQueriesTest.testAccountsTxDmlSql_SingleNode_Persistence - 0,0% fails in last 421 master runs.
* IgniteCacheMvccSqlTestSuite: CacheMvccPartitionedSqlTxQueriesTest.testAccountsTxDmlSql_WithRemoves_SingleNode_Persistence - 0,0% fails in last 421 master runs.

{panel}
[TeamCity *--> Run :: All* Results|https://ci.ignite.apache.org/viewLog.html?buildId=2870883&buildTypeId=IgniteTests24Java8_RunAll]

> Node may hang on join to topology and not move forward
> ------------------------------------------------------
>
>                 Key: IGNITE-10933
>                 URL: https://issues.apache.org/jira/browse/IGNITE-10933
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Vladislav Pyatkov
>            Assignee: Alexei Scherbakov
>            Priority: Major
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Several nodes join the topology simultaneously and hang for a long time.
> This can happen on the first start of all cluster nodes or when nodes join
> an already formed topology.
> The logs of the affected nodes contain messages such as:
> {noformat}
> 2019-01-11 18:37:39.296 [WARN ][Thread-56][o.a.i.s.d.tcp.TcpDiscoverySpi]
> Node has not been connected to topology and will repeat join process. Check
> remote nodes logs for possible error messages. Note that large topology may
> require significant time to start. Increase 'TcpDiscoverySpi.networkTimeout'
> configuration property if getting this message on the starting nodes
> [networkTimeout=5000]
> 2019-01-11 18:43:09.374 [WARN ][Thread-56][o.a.i.s.d.tcp.TcpDiscoverySpi]
> Node has not been connected to topology and will repeat join process. Check
> remote nodes logs for possible error messages. Note that large topology may
> require significant time to start. Increase 'TcpDiscoverySpi.networkTimeout'
> configuration property if getting this message on the starting nodes
> [networkTimeout=5000]
> ...
> {noformat}
> and so on for a long time, with no other messages.
> UPDATE: this behavior is caused by transferring a
> TcpDiscoveryClientReconnectMessage stored in the pending objects collection
> to the joining node, which invalidates the socket connection to the joining
> node and marks it as failed.
> Reproduced by the following scenario:
> 1. Create a topology in a specific order: srv1 srv2 client srv3 srv4
> 2. Delay client reconnect.
> 3. Trigger a topology change by restarting srv2 (will trigger reconnect to
> next node), srv3, srv4
> 4. Resume reconnect to a node with an empty EnsuredMessageHistory (triggering
> a discovery message of type TcpDiscoveryClientReconnectMessage) and wait for
> completion.
> 5. Add a new node to the topology.
> The new node will either fail with an assertion or get stuck on join forever,
> depending on timing.
> The same scenario could probably be triggered by a temporary connection loss
> to the joining node.
> [~v.pyatkov], thanks for help with the investigation.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)