First, some background: Ignite 2.8.1 with a 3-node cluster, two webserver thick client nodes, and one batch-processing thick client node that comes and goes.
The two webserver thick client nodes and the one batch-processing thick client node have the following configuration values:

* IgniteConfiguration.setNetworkTimeout(60000)
* IgniteConfiguration.setFailureDetectionTimeout(120000)
* TcpDiscoverySpi.setJoinTimeout(60000)
* TcpCommunicationSpi.setIdleConnectionTimeout(Long.MAX_VALUE)

The server nodes do not have any timeouts set and are currently using all defaults. My understanding is that means they are using:

* failureDetectionTimeout 10000
* clientFailureDetectionTimeout 30000

Every so often the batch-processing client node fails to connect to the cluster. We point the batch-processing client node at a single node in the cluster using:

TcpDiscoverySpi.setIpFinder(new TcpDiscoveryVmIpFinder().setAddresses(Collections.singletonList(single node ip)))

I see the following stream of logs on the server node the client connects to, and I am hoping you can shed some light on which timeout values I have set incorrectly and what values I should set instead. In these logs I have obfuscated the client IP as 10.1.2.xxx and the server IP as 10.1.10.xxx.

----

On the server node that the client tries to connect to, I see the following sequence of messages:

[20:21:28,092][INFO][exchange-worker-#42][GridCachePartitionExchangeManager] Skipping rebalancing (nothing scheduled) [top=AffinityTopologyVersion [topVer=4146, minorTopVer=0], force=false, evt=NODE_JOINED, node=1b91b2a5-05ac-4809-8a3d-c1c2efb6a3e3]

So the client joined the cluster at almost exactly the same time it tried to join, which seems good so far.
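For reference, here is a minimal sketch of the client-side configuration described above. The class name, the server address, and the port are placeholders, not our real values; everything else is the settings listed:

```java
import java.util.Collections;

import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi;
import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;
import org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder;

public class BatchClientConfigSketch {
    public static IgniteConfiguration create() {
        TcpDiscoverySpi discoverySpi = new TcpDiscoverySpi()
            .setJoinTimeout(60_000)
            // Address is a placeholder; 47500 is the default discovery port.
            .setIpFinder(new TcpDiscoveryVmIpFinder()
                .setAddresses(Collections.singletonList("10.1.10.1:47500")));

        TcpCommunicationSpi commSpi = new TcpCommunicationSpi();
        commSpi.setIdleConnectionTimeout(Long.MAX_VALUE);

        return new IgniteConfiguration()
            .setClientMode(true)
            .setNetworkTimeout(60_000)
            .setFailureDetectionTimeout(120_000)
            .setDiscoverySpi(discoverySpi)
            .setCommunicationSpi(commSpi);
    }
}
```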
Then I see:

[20:21:54,726][INFO][db-checkpoint-thread-#56][GridCacheDatabaseSharedManager] Skipping checkpoint (no pages were modified) [checkpointBeforeLockTime=6ms, checkpointLockWait=0ms, checkpointListenersExecuteTime=6ms, checkpointLockHoldTime=8ms, reason='timeout']

[20:21:58,044][INFO][tcp-disco-sock-reader-[1b91b2a5 10.1.2.xxx:47585 client]-#4176][TcpDiscoverySpi] Finished serving remote node connection [rmtAddr=/10.1.2.xxx:47585, rmtPort=47585]

[20:21:58,045][WARNING][grid-timeout-worker-#23][TcpDiscoverySpi] Socket write has timed out (consider increasing 'IgniteConfiguration.failureDetectionTimeout' configuration property) [failureDetectionTimeout=10000, rmtAddr=/10.1.2.xxx:47585, rmtPort=47585, sockTimeout=5000]

I don't understand this socket-write timeout line: the remote address is the client's address, so I don't know why the server was writing to it here, and the reported timeout is the server's failureDetectionTimeout (10000) rather than the clientFailureDetectionTimeout, which I also don't get.

The server then seems to establish a fresh discovery connection with the client just fine:

[20:22:10,170][INFO][tcp-disco-srvr-[:47500]-#3][TcpDiscoverySpi] TCP discovery accepted incoming connection [rmtAddr=/10.1.2.xxx, rmtPort=56921]
[20:22:10,170][INFO][tcp-disco-srvr-[:47500]-#3][TcpDiscoverySpi] TCP discovery spawning a new thread for connection [rmtAddr=/10.1.2.xxx, rmtPort=56921]
[20:22:10,171][INFO][tcp-disco-sock-reader-[]-#4178][TcpDiscoverySpi] Started serving remote node connection [rmtAddr=/10.1.2.xxx:56921, rmtPort=56921]
[20:22:10,175][INFO][tcp-disco-sock-reader-[1b91b2a5 10.1.2.xxx:56921 client]-#4178][TcpDiscoverySpi] Initialized connection with remote client node [nodeId=1b91b2a5-05ac-4809-8a3d-c1c2efb6a3e3, rmtAddr=/10.1.2.xxx:56921]
[20:22:27,870][INFO][tcp-disco-sock-reader-[1b91b2a5 10.1.2.xxx:56921 client]-#4178][TcpDiscoverySpi] Finished serving remote node connection [rmtAddr=/10.1.2.xxx:56921, rmtPort=56921]

The client hits its timeout at 20:22:28, which is the 60-second timeout we give it counted from 20:21:28, so this
finished message lands almost exactly at the timeout threshold.

Given the socket-write timeout above, is the second chunk of logs from 20:22:10 to 20:22:27 a client discovery retry? And did the client exit at 20:22:28 because of its 60-second timeout, probably without receiving the above discovery response in time?

This server node then notices the client didn't respond within 30 seconds, from 20:22:27 to 20:22:57 (and since the client timed out at 20:22:28 and exited, that generally seems to fit):

[20:22:57,811][WARNING][tcp-disco-msg-worker-[21ddf49c 10.1.10.xxx:47500 crd]-#2][TcpDiscoverySpi] Failing client node due to not receiving metrics updates from client node within 'IgniteConfiguration.clientFailureDetectionTimeout' (consider increasing configuration property) [timeout=30000, node=TcpDiscoveryNode [id=1b91b2a5-05ac-4809-8a3d-c1c2efb6a3e3, consistentId=1b91b2a5-05ac-4809-8a3d-c1c2efb6a3e3, addrs=ArrayList [127.0.0.1, 172.17.0.3], sockAddrs=HashSet [/127.0.0.1:0, /172.17.0.3:0], discPort=0, order=4146, intOrder=2076, lastExchangeTime=1604089287814, loc=false, ver=2.8.1#20200521-sha1:86422096, isClient=true]]

[20:22:57,812][WARNING][disco-event-worker-#41][GridDiscoveryManager] Node FAILED: TcpDiscoveryNode [id=1b91b2a5-05ac-4809-8a3d-c1c2efb6a3e3, consistentId=1b91b2a5-05ac-4809-8a3d-c1c2efb6a3e3, addrs=ArrayList [127.0.0.1, 172.17.0.3], sockAddrs=HashSet [/127.0.0.1:0, /172.17.0.3:0], discPort=0, order=4146, intOrder=2076, lastExchangeTime=1604089287814, loc=false, ver=2.8.1#20200521-sha1:86422096, isClient=true]

[20:22:57,813][INFO][disco-event-worker-#41][GridDiscoveryManager] Topology snapshot [ver=4147, locNode=7e275973, servers=3, clients=2, state=ACTIVE, CPUs=8, offheap=3.0GB, heap=3.7GB]
[20:22:57,813][INFO][disco-event-worker-#41][GridDiscoveryManager] ^-- Baseline [id=2, size=3, online=3, offline=0]

Nothing interesting shows up on the other Ignite nodes; they see the client joining at [20:21:28,091] and failing at [20:22:57,813].

----

Please help me make sense of
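For completeness, here is a sketch (assumed, since our servers actually run with all defaults) of how the two server-side timeouts that appear in the warnings above would look if set explicitly; the values shown are the 2.8.1 defaults:

```java
import org.apache.ignite.configuration.IgniteConfiguration;

public class ServerTimeoutsSketch {
    public static IgniteConfiguration create() {
        return new IgniteConfiguration()
            // Governs server-to-server failure detection; this is the value
            // named in the "Socket write has timed out" warning.
            .setFailureDetectionTimeout(10_000)
            // Governs how long servers wait for client metrics updates; this
            // is the value named in the "Failing client node" warning.
            .setClientFailureDetectionTimeout(30_000);
    }
}
```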
these logs, and please advise which timeouts were the problematic ones in this sequence, so that I can hopefully increase them to avoid this transient connection problem.

--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/