OK, I did the following PoC real quick.
1. Two nodes started and joined. Topology snapshot: servers=2.
2. On one node, I blocked the Ignite ports (47500, 47100, etc.).
3. After failureDetectionTimeout, each node logged NODE_FAILED and Topology snapshot servers=1.
4. After 10-15 seconds, I unblocked those ports.
5. A few seconds later, both nodes logged "Node joined" and Topology snapshot servers=2.

So it's the same node, with the same ID, because the JVM is still up and running. And it looks like the cluster doesn't forget it. Can this "10-15 seconds" be any length of time? Even if the node comes back after 1-2 hours, can it rejoin?

From: Evgenii Zhuravlev [mailto:e.zhuravlev...@gmail.com]
Sent: Monday, July 02, 2018 1:25 PM
To: user@ignite.apache.org
Subject: Re: How long Ignite retries upon NODE_FAILED events

If the cluster has already decided that the node failed, the node will be stopped after it tries to reconnect to the cluster with the same ID.

2018-07-02 18:37 GMT+03:00 HEWA WIDANA GAMAGE, SUBASH <subash.hewawidanagam...@fmr.com>:

Yes, failureDetectionTimeout determines how long Ignite waits before marking a node as failed. But my question is: after such a node failure has happened, what happens when the failed node becomes reachable on the network again (less than failureDetectionTimeout)?

From: Evgenii Zhuravlev [mailto:e.zhuravlev...@gmail.com]
Sent: Monday, July 02, 2018 11:05 AM
To: user@ignite.apache.org
Subject: Re: How long Ignite retries upon NODE_FAILED events

Hi,

By default, Ignite uses a mechanism that can be configured using failureDetectionTimeout: https://apacheignite.readme.io/v2.5/docs/tcpip-discovery#section-failure-detection-timeout

Evgenii

2018-07-02 16:40 GMT+03:00 HEWA WIDANA GAMAGE, SUBASH <subash.hewawidanagam...@fmr.com>:

Hi team,

For example, let's say one of the nodes is not down (the JVM is up), but the network is not reachable from/to it.
Then the rest of the nodes will see NODE_FAILED and start working as normal with a reduced cluster size. Suppose the network from/to that failed node becomes normal again after X minutes. Then:
- Will the other nodes discover it, or will that node be able to figure it out?
- How long can X be at most? Is there a max retry count or timeout? (I have seen the joinTimeout param in discovery, but that seems applicable only at startup, i.e. how long it should pause while starting the node to let it join others.)
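
For anyone who wants to repeat the port-blocking PoC above, here is a minimal sketch of a node that sets failureDetectionTimeout and prints discovery events as the topology changes. It is not taken from the thread: the class name TopologyWatch and the 10-second timeout are assumptions picked for illustration, and it needs ignite-core on the classpath. It only observes what happens; it does not answer how long a failed node can stay away and still rejoin.

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.events.DiscoveryEvent;
import org.apache.ignite.events.EventType;
import org.apache.ignite.lang.IgnitePredicate;

// Hypothetical helper class, not part of Ignite or of the original thread.
public class TopologyWatch {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // Assumed value for illustration only: 10 seconds.
        cfg.setFailureDetectionTimeout(10_000);

        // Ask for the discovery events we want to observe.
        cfg.setIncludeEventTypes(
            EventType.EVT_NODE_JOINED,
            EventType.EVT_NODE_LEFT,
            EventType.EVT_NODE_FAILED,
            EventType.EVT_NODE_SEGMENTED);

        Ignite ignite = Ignition.start(cfg);

        // Print each discovery event seen by this node, with the affected
        // node's ID and the topology version at the time of the event.
        IgnitePredicate<DiscoveryEvent> lsnr = evt -> {
            System.out.println(evt.name()
                + ": node=" + evt.eventNode().id()
                + ", topVer=" + evt.topologyVersion());
            return true; // keep the listener registered
        };

        ignite.events().localListen(lsnr,
            EventType.EVT_NODE_JOINED,
            EventType.EVT_NODE_LEFT,
            EventType.EVT_NODE_FAILED,
            EventType.EVT_NODE_SEGMENTED);
    }
}
```

Run this on both machines, block and unblock the ports as in the steps above, and compare the node ID printed with the NODE_FAILED event against the one printed with the later join event. Watching for EVT_NODE_SEGMENTED on the isolated node may also show how that node itself sees the outage.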