First, some background: Ignite 2.8.1 with a 3-node cluster, two webserver thick client nodes, and one batch-processing thick client node that comes and goes.
The two webserver thick client nodes and the one batch-processing thick client node have the following configuration values:

* IgniteConfiguration.setNetworkTimeout(60000)
* IgniteConfiguration.setFailureDetectionTimeout(120000)
* TcpDiscoverySpi.setJoinTimeout(60000)
* TcpCommunicationSpi.setIdleConnectionTimeout(Long.MAX_VALUE)

The server nodes do not have any timeouts set and are currently using all defaults. My understanding is that means they are using:

* failureDetectionTimeout 10000
* clientFailureDetectionTimeout 30000

Every so often the batch-processing client node fails to connect to the cluster. We point the batch-processing client node at a single node in the cluster using:

TcpDiscoverySpi.setIpFinder(new TcpDiscoveryVmIpFinder().setAddresses(Collections.singletonList(single node ip)))

I see the following stream of logs on the server node the client connects to, and I am hoping you can shed some light on which timeout values I have set incorrectly and what values I should set instead. In these logs I have obfuscated the client IP as 10.1.2.xxx and the server IP as 10.1.10.xxx.

----

On the server node that the client tries to connect to, I see the following sequence of messages:

[20:21:28,092][INFO][exchange-worker-#42][GridCachePartitionExchangeManager] Skipping rebalancing (nothing scheduled) [top=AffinityTopologyVersion [topVer=4146, minorTopVer=0], force=false, evt=NODE_JOINED, node=1b91b2a5-05ac-4809-8a3d-c1c2efb6a3e3]

So the client joined the cluster at almost exactly the same time it tried to join, which seems good so far.
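For reference, here is a minimal sketch of the client-side configuration described above. The class name, the server address, and the port are placeholders, not our real values; everything else is the settings listed:

```java
import java.util.Collections;

import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi;
import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;
import org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder;

public class BatchClientConfigSketch {
    public static IgniteConfiguration create() {
        TcpDiscoverySpi discoverySpi = new TcpDiscoverySpi()
            .setJoinTimeout(60_000)
            // Address is a placeholder; 47500 is the default discovery port.
            .setIpFinder(new TcpDiscoveryVmIpFinder()
                .setAddresses(Collections.singletonList("10.1.10.1:47500")));

        TcpCommunicationSpi commSpi = new TcpCommunicationSpi();
        commSpi.setIdleConnectionTimeout(Long.MAX_VALUE);

        return new IgniteConfiguration()
            .setClientMode(true)
            .setNetworkTimeout(60_000)
            .setFailureDetectionTimeout(120_000)
            .setDiscoverySpi(discoverySpi)
            .setCommunicationSpi(commSpi);
    }
}
```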
Then I see:

[20:21:54,726][INFO][db-checkpoint-thread-#56][GridCacheDatabaseSharedManager] Skipping checkpoint (no pages were modified) [checkpointBeforeLockTime=6ms, checkpointLockWait=0ms, checkpointListenersExecuteTime=6ms, checkpointLockHoldTime=8ms, reason='timeout']

[20:21:58,044][INFO][tcp-disco-sock-reader-[1b91b2a5 10.1.2.xxx:47585 client]-#4176][TcpDiscoverySpi] Finished serving remote node connection [rmtAddr=/10.1.2.xxx:47585, rmtPort=47585]

[20:21:58,045][WARNING][grid-timeout-worker-#23][TcpDiscoverySpi] Socket write has timed out (consider increasing 'IgniteConfiguration.failureDetectionTimeout' configuration property) [failureDetectionTimeout=10000, rmtAddr=/10.1.2.xxx:47585, rmtPort=47585, sockTimeout=5000]

I don't understand this socket-write timeout line: the remote address is the client's address, so I don't know why the server was writing to it here, and the reported timeout is the server's failureDetectionTimeout (10000) rather than the clientFailureDetectionTimeout, which I also don't get.

The server then seems to establish a fresh discovery connection with the client just fine:

[20:22:10,170][INFO][tcp-disco-srvr-[:47500]-#3][TcpDiscoverySpi] TCP discovery accepted incoming connection [rmtAddr=/10.1.2.xxx, rmtPort=56921]
[20:22:10,170][INFO][tcp-disco-srvr-[:47500]-#3][TcpDiscoverySpi] TCP discovery spawning a new thread for connection [rmtAddr=/10.1.2.xxx, rmtPort=56921]
[20:22:10,171][INFO][tcp-disco-sock-reader-[]-#4178][TcpDiscoverySpi] Started serving remote node connection [rmtAddr=/10.1.2.xxx:56921, rmtPort=56921]
[20:22:10,175][INFO][tcp-disco-sock-reader-[1b91b2a5 10.1.2.xxx:56921 client]-#4178][TcpDiscoverySpi] Initialized connection with remote client node [nodeId=1b91b2a5-05ac-4809-8a3d-c1c2efb6a3e3, rmtAddr=/10.1.2.xxx:56921]
[20:22:27,870][INFO][tcp-disco-sock-reader-[1b91b2a5 10.1.2.xxx:56921 client]-#4178][TcpDiscoverySpi] Finished serving remote node connection [rmtAddr=/10.1.2.xxx:56921, rmtPort=56921]

The client hits its timeout at 20:22:28, which is the 60-second timeout we give it counted from 20:21:28, so this
finished message lands almost exactly at the timeout threshold.

Given the socket-write timeout above, is the second chunk of logs from 20:22:10 to 20:22:27 a client discovery retry? And did the client exit at 20:22:28 because of its 60-second timeout, probably without receiving the above discovery response in time?

This server node then notices the client didn't respond within 30 seconds, from 20:22:27 to 20:22:57 (and since the client timed out at 20:22:28 and exited, that generally seems to fit):

[20:22:57,811][WARNING][tcp-disco-msg-worker-[21ddf49c 10.1.10.xxx:47500 crd]-#2][TcpDiscoverySpi] Failing client node due to not receiving metrics updates from client node within 'IgniteConfiguration.clientFailureDetectionTimeout' (consider increasing configuration property) [timeout=30000, node=TcpDiscoveryNode [id=1b91b2a5-05ac-4809-8a3d-c1c2efb6a3e3, consistentId=1b91b2a5-05ac-4809-8a3d-c1c2efb6a3e3, addrs=ArrayList [127.0.0.1, 172.17.0.3], sockAddrs=HashSet [/127.0.0.1:0, /172.17.0.3:0], discPort=0, order=4146, intOrder=2076, lastExchangeTime=1604089287814, loc=false, ver=2.8.1#20200521-sha1:86422096, isClient=true]]

[20:22:57,812][WARNING][disco-event-worker-#41][GridDiscoveryManager] Node FAILED: TcpDiscoveryNode [id=1b91b2a5-05ac-4809-8a3d-c1c2efb6a3e3, consistentId=1b91b2a5-05ac-4809-8a3d-c1c2efb6a3e3, addrs=ArrayList [127.0.0.1, 172.17.0.3], sockAddrs=HashSet [/127.0.0.1:0, /172.17.0.3:0], discPort=0, order=4146, intOrder=2076, lastExchangeTime=1604089287814, loc=false, ver=2.8.1#20200521-sha1:86422096, isClient=true]

[20:22:57,813][INFO][disco-event-worker-#41][GridDiscoveryManager] Topology snapshot [ver=4147, locNode=7e275973, servers=3, clients=2, state=ACTIVE, CPUs=8, offheap=3.0GB, heap=3.7GB]
[20:22:57,813][INFO][disco-event-worker-#41][GridDiscoveryManager] ^-- Baseline [id=2, size=3, online=3, offline=0]

Nothing interesting shows up on the other Ignite nodes; they see the client joining at [20:21:28,091] and failing at [20:22:57,813].

----

Please help me make sense of
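For completeness, here is a sketch (assumed, since our servers actually run with all defaults) of how the two server-side timeouts that appear in the warnings above would look if set explicitly; the values shown are the 2.8.1 defaults:

```java
import org.apache.ignite.configuration.IgniteConfiguration;

public class ServerTimeoutsSketch {
    public static IgniteConfiguration create() {
        return new IgniteConfiguration()
            // Governs server-to-server failure detection; this is the value
            // named in the "Socket write has timed out" warning.
            .setFailureDetectionTimeout(10_000)
            // Governs how long servers wait for client metrics updates; this
            // is the value named in the "Failing client node" warning.
            .setClientFailureDetectionTimeout(30_000);
    }
}
```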
these logs, and please advise which timeouts were the problematic ones in this sequence, so that I can hopefully increase them to avoid this transient connection problem.

--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/