Sergey Korotkov created IGNITE-16843: ----------------------------------------
Summary: Timeout while thin client connection Key: IGNITE-16843 URL: https://issues.apache.org/jira/browse/IGNITE-16843 Project: Ignite Issue Type: Bug Reporter: Sergey Korotkov In usecases with several active thin clients producing noticable load to cluster new thin clients can fail to connect with the *"ClientConnectionException: Channel is closed"* error in the *TcpClientChannel::handshake()* method. On server side warning *"Unable to perform handshake within timeout [timeout=10000"* is logged. The problem can be easily reproduced by several large putAlls invoked in parallel from several or single thin client. Espesially for the TRANSACTIONAL caches. But ATOMIC caches are also affected - the only difference is that for ATOMIC caches more parallelizm factor and larger batches for putAlls are needed. ---- The reason of the problem is a fact that a single queue is used in the ignite node to serve all thin client related requests (queue in the {_}GridThinClientExecutor{_}). Both working requests like _putAll_ and control ones like _handshake_ which are used for connection establishment. Working requests can live indefenitely in a queue (or at least longer then {_}handshake{_}). On the other hand a special watchdog timer is scheduled in the ignite node to check if a _handshake_ request is processed in a timely manner ({_}ClientListenerNioListener::scheduleHandshakeTimeout{_}). By default the 10 seconds timeout is used ({_}ClientConnectorConfiguration::handshakeTimeout{_}). If timeout expires the client session is closed forcibly. So, if one or several thin clients fill queue with long operations new clients can not connect. ---- The real usecase reveals the problem is as follows. * 4 nodes cluster, 64 cpu, 32Gb heap, 512Gb off-heap each * One TRANSACTIONAL cache with backups=1 * About 30Gb of data on each node * Several (upto 70-100 at the same time) thin clients loading data using putAlls in 50000 records batches. Client connects, loads 3 batches and disconnects (spark jobs in fact). * Default handshakeTimeout (10 secs) and threadPoolSize(8) in ClientConnectorConfiguration ---- Two ducktests were created to reproduce and isolate the problem. *{color:#FF0000}Note that asserts in testcases done in a way that test PASS means that problem IS reproduced.{color}* Tests check that at least one thin client fails with the "Channel is closed" error and that server node log contains warning about the handshake timeout. *ThinClientConcurrencyTest::test_many_thin_clients* Mimics the above real life usecase. Several thin client processes invoke putAlls in several threads. There are two sets of parameters - one for TRANSACTIONAL and one for ATOMIC cache. *ThinClientConcurrencyTest::test_one_greedy_thin_client* Minimal test shows that a single thin client can produce such a load that another one can not connect. *** The following full stack trace is logged on the client side: {noformat} org.apache.ignite.client.ClientConnectionException: Channel is closed at org.apache.ignite.internal.client.thin.TcpClientChannel.handshake(TcpClientChannel.java:595) ~[classes/:?] at org.apache.ignite.internal.client.thin.TcpClientChannel.<init>(TcpClientChannel.java:180) ~[classes/:?] at org.apache.ignite.internal.client.thin.ReliableChannel$ClientChannelHolder.getOrCreateChannel(ReliableChannel.java:917) ~[classes/:?] at org.apache.ignite.internal.client.thin.ReliableChannel$ClientChannelHolder.getOrCreateChannel(ReliableChannel.java:898) ~[classes/:?] at org.apache.ignite.internal.client.thin.ReliableChannel$ClientChannelHolder.access$200(ReliableChannel.java:847) ~[classes/:?] at org.apache.ignite.internal.client.thin.ReliableChannel.applyOnDefaultChannel(ReliableChannel.java:759) ~[classes/:?] at org.apache.ignite.internal.client.thin.ReliableChannel.applyOnDefaultChannel(ReliableChannel.java:731) ~[classes/:?] at org.apache.ignite.internal.client.thin.ReliableChannel.channelsInit(ReliableChannel.java:702) ~[classes/:?] at org.apache.ignite.internal.client.thin.TcpIgniteClient.<init>(TcpIgniteClient.java:126) ~[classes/:?] at org.apache.ignite.internal.client.thin.TcpIgniteClient.<init>(TcpIgniteClient.java:102) ~[classes/:?] at org.apache.ignite.internal.client.thin.TcpIgniteClient.start(TcpIgniteClient.java:339) ~[classes/:?] at org.apache.ignite.Ignition.startClient(Ignition.java:615) ~[classes/:?] at org.apache.ignite.internal.ducktest.tests.thin_client_test.ThinClientDataGenerationApplication$PutJob.getClient(ThinClientDataGenerationApplication.java:181) ~[ignite-ducktests-2. 14.0-SNAPSHOT.jar:2.14.0-SNAPSHOT] at org.apache.ignite.internal.ducktest.tests.thin_client_test.ThinClientDataGenerationApplication$PutJob.call(ThinClientDataGenerationApplication.java:138) [ignite-ducktests-2.14.0-S NAPSHOT.jar:2.14.0-SNAPSHOT] at org.apache.ignite.internal.ducktest.tests.thin_client_test.ThinClientDataGenerationApplication$PutJob.call(ThinClientDataGenerationApplication.java:111) [ignite-ducktests-2.14.0-S NAPSHOT.jar:2.14.0-SNAPSHOT] at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_322] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_322] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_322] at java.lang.Thread.run(Thread.java:750) [?:1.8.0_322] Caused by: org.apache.ignite.IgniteCheckedException: Channel is closed at org.apache.ignite.internal.util.IgniteUtils.cast(IgniteUtils.java:7863) ~[classes/:?] at org.apache.ignite.internal.util.future.GridFutureAdapter.resolve(GridFutureAdapter.java:260) ~[classes/:?] at org.apache.ignite.internal.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:191) ~[classes/:?] at org.apache.ignite.internal.util.future.GridFutureAdapter.get(GridFutureAdapter.java:141) ~[classes/:?] at org.apache.ignite.internal.client.thin.TcpClientChannel.handshake(TcpClientChannel.java:591) ~[classes/:?] ... 18 more Caused by: org.apache.ignite.client.ClientConnectionException: Channel is closed at org.apache.ignite.internal.client.thin.TcpClientChannel.close(TcpClientChannel.java:213) ~[classes/:?] at org.apache.ignite.internal.client.thin.TcpClientChannel.onDisconnected(TcpClientChannel.java:199) ~[classes/:?] at org.apache.ignite.internal.client.thin.io.gridnioserver.GridNioClientConnection.onDisconnected(GridNioClientConnection.java:95) ~[classes/:?] at org.apache.ignite.internal.client.thin.io.gridnioserver.GridNioClientListener.onDisconnected(GridNioClientListener.java:42) ~[classes/:?] at org.apache.ignite.internal.util.nio.GridNioFilterChain$TailFilter.onSessionClosed(GridNioFilterChain.java:256) ~[classes/:?] at org.apache.ignite.internal.util.nio.GridNioFilterAdapter.proceedSessionClosed(GridNioFilterAdapter.java:95) ~[classes/:?] at org.apache.ignite.internal.util.nio.GridNioCodecFilter.onSessionClosed(GridNioCodecFilter.java:71) ~[classes/:?] at org.apache.ignite.internal.util.nio.GridNioFilterAdapter.proceedSessionClosed(GridNioFilterAdapter.java:95) ~[classes/:?] at org.apache.ignite.internal.util.nio.GridNioServer$HeadFilter.onSessionClosed(GridNioServer.java:3690) ~[classes/:?] at org.apache.ignite.internal.util.nio.GridNioFilterChain.onSessionClosed(GridNioFilterChain.java:149) ~[classes/:?] at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.close(GridNioServer.java:2877) ~[classes/:?] at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.close(GridNioServer.java:2794) ~[classes/:?] at org.apache.ignite.internal.util.nio.GridNioServer$ByteBufferNioClientWorker.processRead(GridNioServer.java:1189) ~[classes/:?] at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.processSelectedKeysOptimized(GridNioServer.java:2508) ~[classes/:?] at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.bodyInternal(GridNioServer.java:2273) ~[classes/:?] at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.body(GridNioServer.java:1910) ~[classes/:?] at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:125) ~[classes/:?] ... 1 more {noformat} -- This message was sent by Atlassian Jira (v8.20.1#820001)