Hi all,

I just ran into a situation in my k8s cluster where I'm running a 3-node Ignite setup with 2 client nodes. The server nodes have 8GB of off-heap memory per node and an 8GB JVM heap (with G1GC), leaving 4GB of memory for the OS, and persistence is disabled. I'm using Ignite 2.7.
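For reference, the memory setup above corresponds to roughly this data region configuration (a paraphrased sketch in Java; the region name and the Java form are just for illustration, but the numbers match my setup):

import org.apache.ignite.configuration.DataRegionConfiguration;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class ServerNodeMemoryConfig {
    public static IgniteConfiguration create() {
        // 8GB off-heap data region per server node, no native persistence.
        DataRegionConfiguration region = new DataRegionConfiguration();
        region.setName("Default_Region");           // illustrative name
        region.setMaxSize(8L * 1024 * 1024 * 1024); // 8GB off-heap per node
        region.setPersistenceEnabled(false);        // persistence is off

        DataStorageConfiguration storage = new DataStorageConfiguration();
        storage.setDefaultDataRegionConfiguration(region);

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setDataStorageConfiguration(storage);
        return cfg;
    }
}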
One of the Ignite nodes got killed due to some issue in the cluster. I believe this was the sequence of events:

-> Data eviction spikes on two nodes in the cluster (NODE A & B); then, 15 minutes later...
-> NODE C goes down.
-> NODE D comes up (to replace NODE C).
--> NODE D attempts a PME (partition map exchange).
--> NODE B log: "Local node has detected failed nodes and started cluster-wide procedure"
--> During the PME, the Ignite JVM on NODE D is restarted: the exchange was taking too long and the pod was killed by a k8s liveness probe.
--> NODE D comes back up and attempts another PME.
---> Note: I see these messages from all the nodes: "First 10 pending exchange futures [total=2]". The total keeps climbing; the highest number I see is total=14.
---> NODE D log: "Failed to wait for initial partition map exchange. Possible reasons are: ..."
---> NODE B log: "Possible starvation in striped pool. queue=[], deadlock=false, Completed: 991189487 ..."
---> NODE A log: "Client node considered as unreachable and will be dropped from cluster, because no metrics update messages received in interval: TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by network problems or long GC pause on client node, try to increase this parameter. [nodeId=c5a92006-c29a-4a37-b149-7ec7855dc401, clientFailureDetectionTimeout=30000]"

Note that NODE D kept restarting due to the k8s liveness probe. I think I'm going to remove the probe or make it much more relaxed, and probably raise clientFailureDetectionTimeout as well (rough sketch in the P.S. below).

During this time the Ignite cluster was completely frozen. Restarting NODE D and replacing it with NODE E did not solve the issue. The only way I could solve the problem was to restart NODE B.

Any idea why this could have occurred, or what I can do to prevent it in the future?

I do see this from the failure handler: "FailureContext [type=CRITICAL_ERROR, err=class org.apache.ignite.IgniteException: Failed to create string representation of binary object.]", but I'm not sure if this is something that would have caused the cluster to seize up. Overall, nodes go down in this environment and come back all the time without issues, but I've seen this problem occur twice in the last few months.

I have logs & thread dumps for all the nodes in the system, so if you want me to check anything in particular, let me know.

thanks,
Scott
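P.S. In case it helps the discussion, this is roughly the change I have in mind on the config side (a sketch only; the 60s timeout is a guess on my part, not something I've validated):

import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.failure.StopNodeOrHaltFailureHandler;

public class TimeoutTweaks {
    public static IgniteConfiguration configure(IgniteConfiguration cfg) {
        // Raise the client failure detection timeout above the 30000 ms
        // default seen in the NODE A log line; 60000 is just my guess.
        cfg.setClientFailureDetectionTimeout(60_000);
        // I believe this handler is already the default in 2.7; setting it
        // explicitly so a CRITICAL_ERROR stops/halts the node rather than
        // leaving it half-alive in the cluster.
        cfg.setFailureHandler(new StopNodeOrHaltFailureHandler());
        return cfg;
    }
}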