Hi, we are running a cluster of two Ignite 1.9 servers on EC2. The instances are r4.large, i.e. 16GB of memory each. We use Amazon S3-based discovery for both servers and clients.
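For reference, our discovery setup looks roughly like the following Spring XML fragment (a sketch: the bucket name is a placeholder and `awsCredentials` is assumed to be a bean defined elsewhere; the 30-second socket timeout is the EC2-recommended value we mention below):

```xml
<!-- Sketch of the discovery SPI used on both servers and clients. -->
<property name="discoverySpi">
  <bean class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
    <!-- 30s socket timeout, as recommended for EC2 deployments. -->
    <property name="socketTimeout" value="30000"/>
    <property name="ipFinder">
      <bean class="org.apache.ignite.spi.discovery.tcp.ipfinder.s3.TcpDiscoveryS3IpFinder">
        <property name="awsCredentials" ref="awsCredentials"/>
        <!-- placeholder bucket name -->
        <property name="bucketName" value="my-ignite-discovery-bucket"/>
      </bean>
    </property>
  </bean>
</property>
```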
We have another EC2 instance (r4.large, 16GB) where our app service runs and where the Ignite clients live. Five Ignite clients run there, one per Docker container (we run the app in Docker using `network_mode: host`), so five Docker instances of the app are running. We also set the `TcpDiscoverySpi` socket timeout to 30 seconds, the value recommended for EC2, on both servers and clients. The problem is that after some period of time we get a `Local node failed` error, and the cluster then seems to become unstable: it reports a new, constantly increasing topology version in a loop, i.e. a cascading failure.

```
2017-10-08 17:04:14.520 WARN 6 --- [tcp-client-disco-msg-worker-#4%st%] o.a.i.spi.discovery.tcp.TcpDiscoverySpi : Local node was dropped from cluster due to network problems, will try to reconnect with new id after 10000ms (reconnect delay can be changed using IGNITE_DISCO_FAILED_CLIENT_RECONNECT_DELAY system property) [newId=85e37c0f-fd44-430f-9247-06f783589523, prevId=48e71e9f-7548-460b-9320-2155be8a30a4, locNode=TcpDiscoveryNode [id=48e71e9f-7548-460b-9320-2155be8a30a4, addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 172.17.0.1, 172.31.29.171], sockAddrs=[ip-172-17-0-1.us-west-2.compute.internal/172.17.0.1:0, /0:0:0:0:0:0:0:1%lo:0, /127.0.0.1:0, ip-172-31-29-171.us-west-2.compute.internal/172.31.29.171:0], discPort=0, order=138, intOrder=0, lastExchangeTime=1507193821071, loc=true, ver=1.9.0#20170302-sha1:a8169d0a, isClient=true], nodeInitiatedFail=e5897e87-65e8-4bf8-947e-7b3f244c3458, msg=TcpCommunicationSpi failed to establish connection to node [rmtNode=TcpDiscoveryNode [id=48e71e9f-7548-460b-9320-2155be8a30a4, addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 172.17.0.1, 172.31.29.171], sockAddrs=[ip-172-17-0-1.us-west-2.compute.internal/172.17.0.1:0, /0:0:0:0:0:0:0:1%lo:0, /127.0.0.1:0, ip-172-31-29-171.us-west-2.compute.internal/172.31.29.171:0], discPort=0, order=138, intOrder=74, lastExchangeTime=1507392564555, loc=false, 
ver=1.9.0#20170302-sha1:a8169d0a, isClient=true], errs=class o.a.i.IgniteCheckedException: Failed to connect to node (is node still alive?). Make sure that each ComputeTask and cache Transaction has a timeout set in order to prevent parties from waiting forever in case of network issues [nodeId=48e71e9f-7548-460b-9320-2155be8a30a4, addrs=[ip-172-17-0-1.us-west-2.compute.internal/172.17.0.1:47103, ip-172-31-29-171.us-west-2.compute.internal/172.31.29.171:47103, /0:0:0:0:0:0:0:1%lo:47103, /127.0.0.1:47103]], connectErrs=[class o.a.i.IgniteCheckedException: Failed to connect to address: ip-172-17-0-1.us-west-2.compute.internal/172.17.0.1:47103, class o.a.i.IgniteCheckedException: Failed to connect to address: ip-172-31-29-171.us-west-2.compute.internal/172.31.29.171:47103, class o.a.i.IgniteCheckedException: Failed to connect to address: /0:0:0:0:0:0:0:1%lo:47103, class o.a.i.IgniteCheckedException: Failed to connect to address: /127.0.0.1:47103]]]
2017-10-08 17:04:24.888 WARN 6 --- [tcp-client-disco-msg-worker-#4%st%] o.a.i.spi.discovery.tcp.TcpDiscoverySpi : Client node was reconnected after it was already considered failed by the server topology (this could happen after all servers restarted or due to a long network outage between the client and servers). All continuous queries and remote event listeners created by this client will be unsubscribed, consider listening to EVT_CLIENT_NODE_RECONNECTED event to restore them.
2017-10-08 17:04:24.981 INFO 6 --- [disco-event-worker-#23%st%] o.a.i.i.m.d.GridDiscoveryManager : Client node reconnected to topology: TcpDiscoveryNode [id=85e37c0f-fd44-430f-9247-06f783589523, addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 172.17.0.1, 172.31.29.171], sockAddrs=[ip-172-17-0-1.us-west-2.compute.internal/172.17.0.1:0, /0:0:0:0:0:0:0:1%lo:0, /127.0.0.1:0, ip-172-31-29-171.us-west-2.compute.internal/172.31.29.171:0], discPort=0, order=188, intOrder=0, lastExchangeTime=1507193821071, loc=true, ver=1.9.0#20170302-sha1:a8169d0a, isClient=true]
2017-10-08 17:04:24.988 INFO 6 --- [disco-event-worker-#23%st%] o.a.i.i.m.d.GridDiscoveryManager : Topology snapshot [ver=188, servers=2, clients=8, CPUs=12, heap=17.0GB]
2017-10-08 17:04:47.264 WARN 6 --- [tcp-client-disco-msg-worker-#4%st%] o.a.i.spi.discovery.tcp.TcpDiscoverySpi : Received EVT_NODE_FAILED event with warning [nodeInitiatedEvt=TcpDiscoveryNode [id=28db9f51-f3a3-42d2-b241-520de1124d77, addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 172.17.0.1, 172.31.22.48], sockAddrs=[ip-172-31-22-48.us-west-2.compute.internal/172.31.22.48:47500, ip-172-17-0-1.us-west-2.compute.internal/172.17.0.1:47500, /0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500], discPort=47500, order=1, intOrder=1, lastExchangeTime=1507482264715, loc=false, ver=1.9.0#20170302-sha1:a8169d0a, isClient=false], msg=TcpCommunicationSpi failed to establish connection to node [rmtNode=TcpDiscoveryNode [id=691db97e-1bb0-49d9-aa8c-a5c6114e4842, addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 172.17.0.1, 172.31.29.171], sockAddrs=[ip-172-17-0-1.us-west-2.compute.internal/172.17.0.1:0, /0:0:0:0:0:0:0:1%lo:0, /127.0.0.1:0, ip-172-31-29-171.us-west-2.compute.internal/172.31.29.171:0], discPort=0, order=186, intOrder=98, lastExchangeTime=1507466513487, loc=false, ver=1.9.0#20170302-sha1:a8169d0a, isClient=true], errs=class o.a.i.IgniteCheckedException: Failed to connect to node (is node still alive?). 
Make sure that each ComputeTask and cache Transaction has a timeout set in order to prevent parties from waiting forever in case of network issues [nodeId=691db97e-1bb0-49d9-aa8c-a5c6114e4842, addrs=[ip-172-17-0-1.us-west-2.compute.internal/172.17.0.1:47104, ip-172-31-29-171.us-west-2.compute.internal/172.31.29.171:47104, /0:0:0:0:0:0:0:1%lo:47104, /127.0.0.1:47104]], connectErrs=[class o.a.i.IgniteCheckedException: Failed to connect to address: ip-172-17-0-1.us-west-2.compute.internal/172.17.0.1:47104, class o.a.i.IgniteCheckedException: Failed to connect to address: ip-172-31-29-171.us-west-2.compute.internal/172.31.29.171:47104, class o.a.i.IgniteCheckedException: Failed to connect to address: /0:0:0:0:0:0:0:1%lo:47104, class o.a.i.IgniteCheckedException: Failed to connect to address: /127.0.0.1:47104]]]
2017-10-08 17:04:47.274 WARN 6 --- [disco-event-worker-#23%st%] o.a.i.i.m.d.GridDiscoveryManager : Node FAILED: TcpDiscoveryNode [id=691db97e-1bb0-49d9-aa8c-a5c6114e4842, addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 172.17.0.1, 172.31.29.171], sockAddrs=[ip-172-17-0-1.us-west-2.compute.internal/172.17.0.1:0, /0:0:0:0:0:0:0:1%lo:0, /127.0.0.1:0, ip-172-31-29-171.us-west-2.compute.internal/172.31.29.171:0], discPort=0, order=186, intOrder=98, lastExchangeTime=1507482264827, loc=false, ver=1.9.0#20170302-sha1:a8169d0a, isClient=true]
2017-10-08 17:04:47.278 INFO 6 --- [disco-event-worker-#23%st%] o.a.i.i.m.d.GridDiscoveryManager : Topology snapshot [ver=189, servers=2, clients=7, CPUs=12, heap=17.0GB]
...
```

What could be the cause of this "Local node was dropped from cluster due to network problems" error (and why does the cluster seem unstable after it happens), and what are the strategies to resolve it? One thing we plan is to create another EC2 instance and split the Ignite clients between two EC2 instances, but it would be good to know the root cause of the problem anyway, as this split will not necessarily help.

-- Sent from: http://apache-ignite-users.70518.x6.nabble.com/
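P.S. Independently of the root cause, the reconnect warning in the log suggests listening for EVT_CLIENT_NODE_RECONNECTED to re-register continuous queries after a reconnect. In our case that would look roughly like the sketch below (`restoreListeners` is a hypothetical application-specific helper; this needs a running cluster, so it is untested here):

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.events.EventType;

public class ClientReconnectHandler {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setClientMode(true);
        // Events generally must be enabled in the configuration
        // before local listeners receive them.
        cfg.setIncludeEventTypes(EventType.EVT_CLIENT_NODE_RECONNECTED);

        Ignite ignite = Ignition.start(cfg);

        // The server topology unsubscribes this client's continuous queries
        // and remote listeners when it considers the client failed, so we
        // re-register them whenever the client reconnects.
        ignite.events().localListen(evt -> {
            restoreListeners(ignite);
            return true; // keep this listener subscribed
        }, EventType.EVT_CLIENT_NODE_RECONNECTED);
    }

    // Hypothetical helper: application-specific re-registration of
    // continuous queries and remote event listeners.
    private static void restoreListeners(Ignite ignite) {
        // ...
    }
}
```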
