[ 
https://issues.apache.org/jira/browse/IGNITE-16843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ilya Shishkov updated IGNITE-16843:
-----------------------------------
    Labels: ducktests ise thin  (was: ducktests thin)

> Timeout while thin client connection
> ------------------------------------
>
>                 Key: IGNITE-16843
>                 URL: https://issues.apache.org/jira/browse/IGNITE-16843
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Sergey Korotkov
>            Priority: Minor
>              Labels: ducktests, ise, thin
>         Attachments: test_one_greedy_thin_client.png
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> In usecases with several active thin clients producing noticable load to 
> cluster new thin clients can fail to connect with the 
> *"ClientConnectionException: Channel is closed"* error in the 
> *TcpClientChannel::handshake()* method.
> On server side warning *"Unable to perform handshake within timeout 
> [timeout=10000"* is logged.
> The problem can be easily reproduced by several large putAlls invoked in 
> parallel from several or single thin client.  Espesially for the 
> TRANSACTIONAL caches.  But ATOMIC caches are also affected - the only 
> difference is that for ATOMIC caches more parallelizm factor and larger 
> batches for putAlls are needed.
> ----
> The reason of the problem is a fact that a single queue is used in the ignite 
> node to serve all thin client related requests (queue in the 
> {_}GridThinClientExecutor{_}). Both working requests like _putAll_ and 
> control ones like _handshake_ which are used for connection establishment. 
> Working requests can live indefenitely in a queue (or at least longer then 
> {_}handshake{_}).  On the other hand a special watchdog timer is scheduled in 
> the ignite node to check if a _handshake_ request is processed in a timely 
> manner ({_}ClientListenerNioListener::scheduleHandshakeTimeout{_}).  By 
> default the 10 seconds timeout is used 
> ({_}ClientConnectorConfiguration::handshakeTimeout{_}). If timeout expires 
> the client session is closed forcibly.
> So, if one or several thin clients fill queue with long operations new 
> clients can not connect. 
> ----
> The real usecase reveals the problem is as follows.
>  * 4 nodes cluster, 64 cpu, 32Gb heap, 512Gb off-heap  each
>  * One TRANSACTIONAL cache with backups=1
>  * About 30Gb of data on each node
>  * Several (upto 75 at the same time) thin clients loading data using putAlls 
> in 50000 records batches. Client connects, loads 3 batches and disconnects 
> (spark jobs in fact). In other words clients connect and disconnect 
> constantly.
>  * Default handshakeTimeout (10 secs) and threadPoolSize(8) in 
> ClientConnectorConfiguration
> ----
> Two ducktests were created to reproduce and isolate  the problem (see the 
> {*}ignitetest/tests/thin_client_concurrency{*} in the attached pull request).
> *{color:#ff0000}Note that asserts in testcases done in a way that test PASS 
> means that problem IS reproduced.{color}*  Tests check that at least one thin 
> client fails with the "Channel is closed" error and that server node log 
> contains warning about the handshake timeout.
> *ThinClientConcurrencyTest::test_many_thin_clients*
> Mimics the above real life usecase. Several thin client processes invoke 
> putAlls in several threads. There are two sets of parameters - one for 
> TRANSACTIONAL and one for ATOMIC cache.
> *ThinClientConcurrencyTest::test_one_greedy_thin_client*
> Minimal test shows that a single thin client can produce such a load that 
> another one can not connect.
> On the attached metrics screenshot the behaviour of the test is shown in 
> details:
> 1. The first thin client invoked and started data load with putAlls in 
> several threads
> 2. The second thin client is invoked once the queue is filled
> 3. After 10 seconds the session of the second client is closed
> 4. Executor takes the handshake request from queue and (erroneously?) 
> increases the _client_connector_AcceptedSessions_ metric (note that 
> _ActiveSessions _ wasn't increased).
> [^test_one_greedy_thin_client.png]
>  
> ----
> The following full stack trace is logged on the client side:
> {noformat}
> org.apache.ignite.client.ClientConnectionException: Channel is closed
>     at 
> org.apache.ignite.internal.client.thin.TcpClientChannel.handshake(TcpClientChannel.java:595)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.client.thin.TcpClientChannel.<init>(TcpClientChannel.java:180)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.client.thin.ReliableChannel$ClientChannelHolder.getOrCreateChannel(ReliableChannel.java:917)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.client.thin.ReliableChannel$ClientChannelHolder.getOrCreateChannel(ReliableChannel.java:898)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.client.thin.ReliableChannel$ClientChannelHolder.access$200(ReliableChannel.java:847)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.client.thin.ReliableChannel.applyOnDefaultChannel(ReliableChannel.java:759)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.client.thin.ReliableChannel.applyOnDefaultChannel(ReliableChannel.java:731)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.client.thin.ReliableChannel.channelsInit(ReliableChannel.java:702)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.client.thin.TcpIgniteClient.<init>(TcpIgniteClient.java:126)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.client.thin.TcpIgniteClient.<init>(TcpIgniteClient.java:102)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.client.thin.TcpIgniteClient.start(TcpIgniteClient.java:339)
>  ~[classes/:?]
>     at org.apache.ignite.Ignition.startClient(Ignition.java:615) ~[classes/:?]
>     at 
> org.apache.ignite.internal.ducktest.tests.thin_client_test.ThinClientDataGenerationApplication$PutJob.getClient(ThinClientDataGenerationApplication.java:181)
>  ~[ignite-ducktests-2.
> 14.0-SNAPSHOT.jar:2.14.0-SNAPSHOT]
>     at 
> org.apache.ignite.internal.ducktest.tests.thin_client_test.ThinClientDataGenerationApplication$PutJob.call(ThinClientDataGenerationApplication.java:138)
>  [ignite-ducktests-2.14.0-S
> NAPSHOT.jar:2.14.0-SNAPSHOT]
>     at 
> org.apache.ignite.internal.ducktest.tests.thin_client_test.ThinClientDataGenerationApplication$PutJob.call(ThinClientDataGenerationApplication.java:111)
>  [ignite-ducktests-2.14.0-S
> NAPSHOT.jar:2.14.0-SNAPSHOT]
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_322]
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  [?:1.8.0_322]
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  [?:1.8.0_322]
>     at java.lang.Thread.run(Thread.java:750) [?:1.8.0_322]
> Caused by: org.apache.ignite.IgniteCheckedException: Channel is closed
>     at 
> org.apache.ignite.internal.util.IgniteUtils.cast(IgniteUtils.java:7863) 
> ~[classes/:?]
>     at 
> org.apache.ignite.internal.util.future.GridFutureAdapter.resolve(GridFutureAdapter.java:260)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:191)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.util.future.GridFutureAdapter.get(GridFutureAdapter.java:141)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.client.thin.TcpClientChannel.handshake(TcpClientChannel.java:591)
>  ~[classes/:?]
>     ... 18 more
> Caused by: org.apache.ignite.client.ClientConnectionException: Channel is 
> closed
>     at 
> org.apache.ignite.internal.client.thin.TcpClientChannel.close(TcpClientChannel.java:213)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.client.thin.TcpClientChannel.onDisconnected(TcpClientChannel.java:199)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.client.thin.io.gridnioserver.GridNioClientConnection.onDisconnected(GridNioClientConnection.java:95)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.client.thin.io.gridnioserver.GridNioClientListener.onDisconnected(GridNioClientListener.java:42)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.util.nio.GridNioFilterChain$TailFilter.onSessionClosed(GridNioFilterChain.java:256)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.util.nio.GridNioFilterAdapter.proceedSessionClosed(GridNioFilterAdapter.java:95)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.util.nio.GridNioCodecFilter.onSessionClosed(GridNioCodecFilter.java:71)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.util.nio.GridNioFilterAdapter.proceedSessionClosed(GridNioFilterAdapter.java:95)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.util.nio.GridNioServer$HeadFilter.onSessionClosed(GridNioServer.java:3690)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.util.nio.GridNioFilterChain.onSessionClosed(GridNioFilterChain.java:149)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.close(GridNioServer.java:2877)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.close(GridNioServer.java:2794)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.util.nio.GridNioServer$ByteBufferNioClientWorker.processRead(GridNioServer.java:1189)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.processSelectedKeysOptimized(GridNioServer.java:2508)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.bodyInternal(GridNioServer.java:2273)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.body(GridNioServer.java:1910)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:125) 
> ~[classes/:?]
>     ... 1 more
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to