[jira] [Commented] (IGNITE-10933) Node may hang on join to topology and not move forward

2019-01-23 Thread Ignite TC Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-10933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16750123#comment-16750123
 ] 

Ignite TC Bot commented on IGNITE-10933:


{panel:title=--> Run :: All: No blockers found!|borderStyle=dashed|borderColor=#ccc|titleBGColor=#D6F7C1}{panel}
[TeamCity *--> Run :: All* Results|https://ci.ignite.apache.org/viewLog.html?buildId=2870883&buildTypeId=IgniteTests24Java8_RunAll]

> Node may hang on join to topology and not move forward
> --
>
> Key: IGNITE-10933
> URL: https://issues.apache.org/jira/browse/IGNITE-10933
> Project: Ignite
> Issue Type: Bug
> Reporter: Vladislav Pyatkov
> Assignee: Alexei Scherbakov
> Priority: Major
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> When several nodes join the topology simultaneously, they may hang for a
> long time. This can happen either on the first start of all cluster nodes
> or when nodes join an already formed topology.
> The logs of the affected nodes contain messages such as:
> {noformat}
> 2019-01-11 18:37:39.296 [WARN ][Thread-56][o.a.i.s.d.tcp.TcpDiscoverySpi] Node has not been connected to topology and will repeat join process. Check remote nodes logs for possible error messages. Note that large topology may require significant time to start. Increase 'TcpDiscoverySpi.networkTimeout' configuration property if getting this message on the starting nodes [networkTimeout=5000]
> 2019-01-11 18:43:09.374 [WARN ][Thread-56][o.a.i.s.d.tcp.TcpDiscoverySpi] Node has not been connected to topology and will repeat join process. Check remote nodes logs for possible error messages. Note that large topology may require significant time to start. Increase 'TcpDiscoverySpi.networkTimeout' configuration property if getting this message on the starting nodes [networkTimeout=5000]
> ...
> {noformat}
> and so on for a long time, with no other messages in between.
> UPDATE: this behavior is caused by a TcpDiscoveryClientReconnectMessage
> stored in the pending objects collection being transferred to the joining
> node, which invalidates the socket connection to the joining node and
> marks it as failed.
> Reproduced by the following scenario:
> 1. Create a topology in a specific order: srv1 srv2 client srv3 srv4
> 2. Delay the client reconnect.
> 3. Trigger a topology change by restarting srv2 (which triggers a reconnect
> to the next node), srv3, and srv4.
> 4. Resume the reconnect to a node with an empty EnsuredMessageHistory
> (which triggers a discovery message of type
> TcpDiscoveryClientReconnectMessage) and wait for completion.
> 5. Add a new node to the topology.
> The new node will either fail with an assertion or get stuck on join
> forever, depending on timing.
> The same scenario could probably be triggered by a temporary loss of
> connection to the joining node.
> [~v.pyatkov], thanks for the help with the investigation.
>  
>  
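
One way to check whether a node is showing the symptom described in the issue is to scan its log for the repeated join warning and measure the time between occurrences. The following is a minimal Python sketch, not part of the issue or of Ignite itself. It assumes the log layout shown in the quoted excerpt (timestamp, then "[WARN ]"); the function names and the sample list are hypothetical helpers introduced only for illustration.

```python
import re
from datetime import datetime

# Marker text copied from the WARN message quoted in the issue description.
JOIN_RETRY_MARKER = (
    "Node has not been connected to topology and will repeat join process"
)

# Assumed line layout, taken from the quoted excerpt:
# "2019-01-11 18:37:39.296 [WARN ][Thread-56][...] message"
TIMESTAMP_RE = re.compile(
    r"^\s*(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) \[WARN ?\]"
)


def join_retry_times(lines):
    """Return the timestamps of all join-retry warnings in the given lines."""
    times = []
    for line in lines:
        if JOIN_RETRY_MARKER not in line:
            continue
        match = TIMESTAMP_RE.match(line)
        if match:
            times.append(
                datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")
            )
    return times


def retry_gaps_seconds(times):
    """Return the gaps, in seconds, between consecutive warnings."""
    return [
        (later - earlier).total_seconds()
        for earlier, later in zip(times, times[1:])
    ]


# The two warnings from the issue description, shortened after the marker text.
SAMPLE_LOG = [
    "2019-01-11 18:37:39.296 [WARN ][Thread-56][o.a.i.s.d.tcp.TcpDiscoverySpi] "
    + JOIN_RETRY_MARKER + ". [networkTimeout=5000]",
    "2019-01-11 18:43:09.374 [WARN ][Thread-56][o.a.i.s.d.tcp.TcpDiscoverySpi] "
    + JOIN_RETRY_MARKER + ". [networkTimeout=5000]",
]

if __name__ == "__main__":
    found = join_retry_times(SAMPLE_LOG)
    print(len(found), "join-retry warnings, gaps (s):", retry_gaps_seconds(found))
```

For the two quoted lines, the gap between warnings is about 330 seconds. A long run of such warnings with nothing else logged in between matches the symptom described above, although the warning alone does not confirm this particular root cause.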



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (IGNITE-10933) Node may hang on join to topology and not move forward

2019-01-23 Thread Ignite TC Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-10933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16750122#comment-16750122
 ] 

Ignite TC Bot commented on IGNITE-10933:


{panel:title=--> Run :: All: No blockers found!|borderStyle=dashed|borderColor=#ccc|titleBGColor=#D6F7C1}{panel}
[TeamCity *--> Run :: All* Results|https://ci.ignite.apache.org/viewLog.html?buildId=2870883&buildTypeId=IgniteTests24Java8_RunAll]



[jira] [Commented] (IGNITE-10933) Node may hang on join to topology and not move forward

2019-01-23 Thread Ignite TC Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-10933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16749985#comment-16749985
 ] 

Ignite TC Bot commented on IGNITE-10933:


{panel:title=--> Run :: All: Possible Blockers|borderStyle=dashed|borderColor=#ccc|titleBGColor=#F7D6C1}
{color:#d04437}Cache 6{color} [[tests 0 TIMEOUT , Exit Code |https://ci.ignite.apache.org/viewLog.html?buildId=2870837]]
* GridCachePartitionEvictionDuringReadThroughSelfTest.testPartitionRent (last started)

{color:#d04437}MVCC Queries{color} [[tests 3|https://ci.ignite.apache.org/viewLog.html?buildId=2879340]]
* IgniteCacheMvccSqlTestSuite: CacheMvccReplicatedSqlTxQueriesTest.testAccountsTxDmlSql_SingleNode_Persistence - 0,0% fails in last 421 master runs.
* IgniteCacheMvccSqlTestSuite: CacheMvccPartitionedSqlTxQueriesTest.testAccountsTxDmlSql_WithRemoves_SingleNode_Persistence - 0,0% fails in last 421 master runs.

{panel}
[TeamCity *--> Run :: All* Results|https://ci.ignite.apache.org/viewLog.html?buildId=2870883&buildTypeId=IgniteTests24Java8_RunAll]



[jira] [Commented] (IGNITE-10933) Node may hang on join to topology and not move forward

2019-01-23 Thread Alexei Scherbakov (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-10933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16749936#comment-16749936
 ] 

Alexei Scherbakov commented on IGNITE-10933:


The TC run looks OK (the failing tests don't look related to the fix); ready to merge.



[jira] [Commented] (IGNITE-10933) Node may hang on join to topology and not move forward

2019-01-22 Thread Yakov Zhdanov (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-10933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16748797#comment-16748797
 ] 

Yakov Zhdanov commented on IGNITE-10933:


Changes look good to me



[jira] [Commented] (IGNITE-10933) Node may hang on join to topology and not move forward

2019-01-18 Thread Alexei Scherbakov (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-10933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16746068#comment-16746068
 ] 

Alexei Scherbakov commented on IGNITE-10933:


[~agoncharuk], please review.



[jira] [Commented] (IGNITE-10933) Node may hang on join to topology and not move forward

2019-01-18 Thread Ignite TC Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-10933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16746045#comment-16746045
 ] 

Ignite TC Bot commented on IGNITE-10933:


{panel:title=--> Run :: All: No blockers found!|borderStyle=dashed|borderColor=#ccc|titleBGColor=#D6F7C1}{panel}
[TeamCity *--> Run :: All* Results|https://ci.ignite.apache.org/viewLog.html?buildId=2828472&buildTypeId=IgniteTests24Java8_RunAll]
