Hi Naresh,

Actually, any JVM process hang can lead to segmentation. If a node is
unresponsive for longer than failureDetectionTimeout, it is kicked out of
the cluster to prevent grid-wide performance degradation.

It works as follows. Let's say we have 3 nodes in a ring: n1 -> n2 -> n3.
Discovery messages, along with metrics and connection checks, travel around
the ring at a predefined interval. Node 2 starts experiencing issues, such
as a GC pause or an OS failure, that force the process to stop. During that
time n1 is unable to send messages to n2 (it doesn't receive an ack). n1
waits for failureDetectionTimeout and then establishes a connection to n3:
n1 -> n3, so n2 is no longer part of the ring.

The cluster now treats n2 as failed. When n2 comes back, it tries to connect
to n3 and send a message across the ring, and in response it receives a
message saying that it is out of the grid. For n2 that means it was
segmented, and the best it can do is stop.
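
To make the scenario above more concrete, here is a minimal sketch in plain
Java. It is not Ignite's discovery code, only a toy model: the class
RingFailureSketch, the Node type, the silentForMs field and the 10 second
timeout are all made up for illustration. It only shows how n1 ends up
talking to n3 once n2 has been silent for longer than the timeout.

```java
import java.util.List;

class RingFailureSketch {
    /** Hypothetical model of a ring member; not an Ignite class. */
    static final class Node {
        final String name;
        final long silentForMs; // how long the node has gone without acking

        Node(String name, long silentForMs) {
            this.name = name;
            this.silentForMs = silentForMs;
        }
    }

    /** A node counts as failed once its silence exceeds the timeout. */
    static boolean isFailed(Node n, long failureDetectionTimeoutMs) {
        return n.silentForMs > failureDetectionTimeoutMs;
    }

    /**
     * Walks the ring forward from position {@code from} and returns the name
     * of the first node that is not considered failed. Returns the sender's
     * own name if every other node has failed.
     */
    static String nextHop(List<Node> ring, int from, long failureDetectionTimeoutMs) {
        for (int step = 1; step < ring.size(); step++) {
            Node candidate = ring.get((from + step) % ring.size());
            if (!isFailed(candidate, failureDetectionTimeoutMs)) {
                return candidate.name;
            }
        }
        return ring.get(from).name;
    }

    public static void main(String[] args) {
        long timeoutMs = 10_000; // assumed value, for illustration only
        List<Node> ring = List.of(
            new Node("n1", 0),
            new Node("n2", 15_000), // n2 is stuck in a 15 s pause
            new Node("n3", 0));

        // n2 has been silent for longer than the timeout, so n1 skips it.
        System.out.println("n1 sends to: " + nextHop(ring, 0, timeoutMs));
    }
}
```

With a timeout larger than the pause (say 20 seconds) the same call would
still pick n2, which is the idea behind the workaround of raising the
timeout.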

To check whether there were long JVM or system pauses, you can enable GC
logs. If a pause is longer than failureDetectionTimeout, the node will be
segmented.
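
GC logging is usually switched on with JVM options such as
-XX:+PrintGCDetails -Xloggc:<file> on JDK 8, or -Xlog:gc*:file=<file> on
JDK 9 and later. Once you have the pause durations, the check itself is a
simple comparison. A rough sketch, assuming you have already extracted the
pauses in milliseconds by hand (the class name, the sample values and the
10 second timeout are made up; nothing here parses a real GC log):

```java
import java.util.List;
import java.util.stream.Collectors;

class PauseCheckSketch {
    /** Returns the pauses that lasted longer than the given timeout. */
    static List<Long> pausesOverTimeout(List<Long> pausesMs, long failureDetectionTimeoutMs) {
        return pausesMs.stream()
            .filter(p -> p > failureDetectionTimeoutMs)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical pause durations (ms) copied out of a GC log.
        List<Long> pausesMs = List.of(120L, 850L, 12_400L, 300L);
        long timeoutMs = 10_000; // assumed value, check your own configuration

        System.out.println("Pauses over timeout (ms): "
            + pausesOverTimeout(pausesMs, timeoutMs));
    }
}
```

Any pause that shows up in the result is long enough for the rest of the
ring to give up on the node.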

The best way would be to eliminate the pauses, but as a workaround you can
increase the timeout.
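
For the workaround, the timeout can be set on IgniteConfiguration before
the node starts. A minimal sketch, assuming Ignite is on the classpath; the
30 second value and the class name are only an illustration, not a
recommendation, so pick a value based on the pauses you actually see:

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

class TimeoutConfigSketch {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // Illustrative value in milliseconds; should exceed the longest
        // pause you expect the node to survive.
        cfg.setFailureDetectionTimeout(30_000);

        Ignite ignite = Ignition.start(cfg);
        System.out.println("Node started: " + ignite.name());
    }
}
```

Keep in mind that a larger timeout also means the cluster takes longer to
notice a node that has really failed.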

Thanks!
-Dmitry



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/
