TCP behaves really poorly in the face of significant packet loss. You can look
into tcp_retries1 and tcp_retries2 [1] for some explanations and tuning.
Eventually, TCP will give up attempting to deliver a packet but this may take
up to 30min depending on configuration. IIRC, it’s only at that point that the
socket signals an error to the JVM. On top of TCP, you can layer application
protocols for liveness including timeouts, request/reply semantics, and
periodic messaging.
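As a rough illustration of the application-level timeouts mentioned above, here is a minimal Java sketch using only standard `java.net` calls. The class name `LivenessTimeouts` and the helper `peerAnswersWithin` are made up for this example and are not part of Geode. It bounds both the connect and the read, so a silent peer is noticed within the chosen timeouts instead of waiting for TCP to give up.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class LivenessTimeouts {

    /**
     * Hypothetical helper: connects with a bounded connect timeout, then waits
     * for a single byte with a bounded read timeout. Returns false if either
     * step times out; other I/O errors (e.g. connection refused) propagate.
     */
    public static boolean peerAnswersWithin(InetSocketAddress peer,
                                            int connectTimeoutMs,
                                            int readTimeoutMs) throws IOException {
        try (Socket socket = new Socket()) {
            // Without the timeout argument, connect() blocks until the OS gives up.
            socket.connect(peer, connectTimeoutMs);
            // Bound every subsequent blocking read on this socket.
            socket.setSoTimeout(readTimeoutMs);
            return socket.getInputStream().read() != -1;
        } catch (SocketTimeoutException e) {
            // Nothing arrived in time: treat the peer as not live.
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        // A local listener that completes the TCP handshake but never replies.
        try (ServerSocket silent = new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
            InetSocketAddress peer =
                new InetSocketAddress(silent.getInetAddress(), silent.getLocalPort());
            boolean alive = peerAnswersWithin(peer, 1000, 500);
            System.out.println("peer answered in time: " + alive);
        }
    }
}
```

The timeout values here are arbitrary; in practice they would be chosen relative to the expected round-trip time and the loss rate of the link.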
I would expect that Geode should:
1) Not lose any batch events even if packets get dropped
2) Recover quickly when the network becomes stable again
When a batch is sent to a remote site, it is not dequeued from the sender until
the destination site sends a response that the batch was delivered without
error [2].
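To make the "not dequeued until acknowledged" idea concrete, here is a small Java sketch of the general pattern. It is not Geode's actual gateway sender code: the class `AckedBatchQueue`, its methods, and the `transport` predicate (which stands in for "send the batch and wait for the remote site's reply") are all assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Predicate;

public class AckedBatchQueue {
    private final Deque<String> events = new ArrayDeque<>();
    private final int batchSize;

    public AckedBatchQueue(int batchSize) {
        this.batchSize = batchSize;
    }

    public void enqueue(String event) {
        events.addLast(event);
    }

    public int size() {
        return events.size();
    }

    /** Copies up to batchSize events from the head without removing them. */
    public List<String> peekBatch() {
        List<String> batch = new ArrayList<>();
        for (String event : events) {
            if (batch.size() == batchSize) {
                break;
            }
            batch.add(event);
        }
        return batch;
    }

    /**
     * Sends the head batch through the transport. The events are removed only
     * if the transport reports an acknowledgement; otherwise they stay queued
     * and the same batch is sent again on the next attempt.
     */
    public boolean dispatchOnce(Predicate<List<String>> transport) {
        List<String> batch = peekBatch();
        if (batch.isEmpty() || !transport.test(batch)) {
            return false;
        }
        for (int i = 0; i < batch.size(); i++) {
            events.removeFirst();
        }
        return true;
    }

    public static void main(String[] args) {
        AckedBatchQueue queue = new AckedBatchQueue(2);
        queue.enqueue("e1");
        queue.enqueue("e2");
        queue.enqueue("e3");

        queue.dispatchOnce(batch -> false); // simulated lost reply: nothing removed
        System.out.println(queue.size());   // 3

        queue.dispatchOnce(batch -> true);  // simulated ack: first batch removed
        System.out.println(queue.size());   // 1
    }
}
```

One consequence of this pattern is that a batch may be delivered more than once if the reply is lost after the remote site applied it, so the receiving side has to tolerate duplicates.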
Note also that the log message below does not strictly indicate a hang; the
thread could just be making progress slowly.
HTH and looking forward to the results of your investigations.
Anthony
[1] https://linux.die.net/man/7/tcp
[2] There is a corner case if the destination is over the critical threshold
On Apr 28, 2020, at 6:25 AM, Mario Kevo <mario.k...@est.tech> wrote:
Hi geode-dev,
I have a question about how Geode behaves when some packets from a batch are
dropped.
I created a Geode WAN setup with two sites and established replication between
them. I also modified iptables to drop all packets that come to the receiver
port.
In that case, some threads get stuck. It seems the gateway sender never
receives any response back.
[warn 2020/04/27 13:19:04.667 CEST tid=0x11] Thread 128 (0x80) is stuck
[warn 2020/04/27 13:19:04.669 CEST tid=0x11] Thread <128> (0x80) that was executed at <27 Apr 2020 13:18:13 CEST> has been stuck for <50.997 seconds> and number of thread monitor iteration <1>
Thread Name state
Executor Group
Monitored metric
Thread stack:
java.net.PlainSocketImpl.socketConnect(Native Method)
java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
java.net.Socket.connect(Socket.java:607)
org.apache.geode.distributed.internal.tcpserver.AdvancedSocketCreatorImpl.connect(AdvancedSocketCreatorImpl.java:102)
org.apache.geode.internal.net.SCAdvancedSocketCreator.connect(SCAdvancedSocketCreator.java:51)
org.apache.geode.distributed.internal.tcpserver.TcpSocketCreatorImpl.connect(TcpSocketCreatorImpl.java:59)
org.apache.geode.distributed.internal.tcpserver.ClientSocketCreatorImpl.connect(ClientSocketCreatorImpl.java:54)
org.apache.geode.cache.client.internal.ConnectionImpl.connect(ConnectionImpl.java:94)
org.apache.geode.cache.client.internal.ConnectionConnector.connectClientToServer(ConnectionConnector.java:75)
org.apache.geode.cache.client.internal.ConnectionFactoryImpl.createClientToServerConnection(ConnectionFactoryImpl.java:118)
org.apache.geode.cache.client.internal.pooling.ConnectionManagerImpl.createPooledConnection(ConnectionManagerImpl.java:206)
org.apache.geode.cache.client.internal.pooling.ConnectionManagerImpl.forceCreateConnection(ConnectionManagerImpl.java:216)
org.apache.geode.cache.client.internal.pooling.ConnectionManagerImpl.borrowConnection(ConnectionManagerImpl.java:326)
org.apache.geode.cache.client.internal.OpExecutorImpl.executeOnServer(OpExecutorImpl.java:329)
org.apache.geode.cache.client.internal.OpExecutorImpl.executeOn(OpExecutorImpl.java:303)
org.apache.geode.cache.client.internal.PoolImpl.executeOn(PoolImpl.java:839)
org.apache.geode.cache.client.internal.PingOp.execute(PingOp.java:36)
org.apache.geode.cache.client.internal.LiveServerPinger$PingTask.run2(LiveServerPinger.java:90)
org.apache.geode.cache.client.internal.PoolImpl$PoolTask.run(PoolImpl.java:1329)
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
org.apache.geode.internal.ScheduledThreadPoolExecutorWithKeepAlive$DelegatingScheduledFuture.run(ScheduledThreadPoolExecutorWithKeepAlive.java:276)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:748)
I also ran the same test with 200K entries while dropping 70% of packets. The
same exception appears again, and it takes approx. 40 min to transmit all
entries to the other site.
How does Geode handle dropped packets from a batch? Has anyone tested this
behavior?
Thanks,
Mario