OK, I did the following PoC real quick.

1.       Started two nodes and let them join. Topology snapshot servers=2.

2.       On one node, I blocked the Ignite ports (47500, 47100, etc.).

3.       After failureDetectionTimeout, each node logged NODE_FAILED and 
Topology snapshot servers=1.

4.       After 10-15 seconds, I unblocked those ports.

5.       A few seconds later, both nodes logged "Node joined" and Topology 
snapshot servers=2.

So it’s the same node, with the same ID, because the JVM is still up and 
running. And it looks like it doesn’t forget.
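
One way to confirm that the rejoining node really keeps the same ID is to log discovery events on each node. The following is only a minimal sketch, assuming the Ignite 2.x Java API is on the classpath; the class name TopologyLogger is hypothetical and the default discovery configuration is assumed.

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.events.DiscoveryEvent;
import org.apache.ignite.events.EventType;
import org.apache.ignite.lang.IgnitePredicate;

// Hypothetical helper class, not part of the original PoC.
public class TopologyLogger {
    public static void main(String[] args) {
        // Discovery event types to watch during the port-blocking test.
        int[] types = {
            EventType.EVT_NODE_JOINED,
            EventType.EVT_NODE_LEFT,
            EventType.EVT_NODE_FAILED,
            EventType.EVT_NODE_SEGMENTED
        };

        IgniteConfiguration cfg = new IgniteConfiguration();
        // Listed explicitly so the sketch does not rely on default event settings.
        cfg.setIncludeEventTypes(types);

        Ignite ignite = Ignition.start(cfg);
        System.out.println("Local node ID: " + ignite.cluster().localNode().id());

        // Print the event name, the ID of the node the event is about,
        // and the topology version at which the event happened.
        IgnitePredicate<DiscoveryEvent> lsnr = evt -> {
            System.out.println(evt.name()
                + ": node=" + evt.eventNode().id()
                + ", topVer=" + evt.topologyVersion());
            return true; // returning true keeps the listener registered
        };

        ignite.events().localListen(lsnr, types);
    }
}
```

Comparing the node ID printed with NODE_FAILED against the one printed with the later NODE_JOINED would show whether the same ID came back.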

Can this “10-15 seconds” be any length of time? Even if the node comes back 
after 1-2 hours, can it rejoin?




From: Evgenii Zhuravlev [mailto:e.zhuravlev...@gmail.com]
Sent: Monday, July 02, 2018 1:25 PM
To: user@ignite.apache.org
Subject: Re: How long Ignite retries upon NODE_FAILED events

If the cluster has already decided that the node failed, the node will be 
stopped when it tries to reconnect to the cluster with the same ID.

2018-07-02 18:37 GMT+03:00 HEWA WIDANA GAMAGE, SUBASH 
<subash.hewawidanagam...@fmr.com>:
Yes, failureDetectionTimeout determines how long it waits before marking a node 
as failed. But my question is: after such a node failure has happened, what 
happens when that failed node becomes reachable on the network again (less than 
failureDetectionTimeout)?

From: Evgenii Zhuravlev [mailto:e.zhuravlev...@gmail.com]
Sent: Monday, July 02, 2018 11:05 AM
To: user@ignite.apache.org
Subject: Re: How long Ignite retries upon NODE_FAILED events

Hi,

By default, Ignite uses a failure detection mechanism that can be configured 
using failureDetectionTimeout: 
https://apacheignite.readme.io/v2.5/docs/tcpip-discovery#section-failure-detection-timeout
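
As a rough illustration of the setting mentioned above, the timeout can be set programmatically on IgniteConfiguration. This is a minimal sketch assuming the Ignite 2.x Java API; the class name FailureDetectionConfig is hypothetical and the 10-second value is only an example, not a recommendation.

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

// Hypothetical class name, used only for this illustration.
public class FailureDetectionConfig {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // Failure detection timeout in milliseconds (illustrative value).
        cfg.setFailureDetectionTimeout(10_000L);

        // Start a node with this configuration and print the effective value.
        Ignite ignite = Ignition.start(cfg);
        System.out.println("failureDetectionTimeout (ms): "
            + ignite.configuration().getFailureDetectionTimeout());
    }
}
```

The same property can also be set in a Spring XML configuration; the linked page describes the details.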

Evgenii

2018-07-02 16:40 GMT+03:00 HEWA WIDANA GAMAGE, SUBASH 
<subash.hewawidanagam...@fmr.com>:
Hi team,

For example, let’s say one of the nodes is not down (the JVM is up), but the 
network is not reachable from/to it. The rest of the nodes will then see 
NODE_FAILED and start working as normal with a reduced cluster size. Suppose 
the network from/to that failed node becomes normal again after X minutes. Then:
- Will the other nodes discover it, or will that node be able to figure it out?
- How long can X be at most? Is there a max retry count or a timeout? (I have 
seen the joinTimeout param in discovery, but that seems applicable only at 
startup, i.e. how long it should pause while starting the node to let it join 
the others.)

