[jira] [Updated] (IGNITE-16843) Timeout while thin client connection

2023-06-28 Thread Ilya Shishkov (Jira)


 [ https://issues.apache.org/jira/browse/IGNITE-16843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ilya Shishkov updated IGNITE-16843:
---
Labels: ducktests ise thin  (was: ducktests thin)

> Timeout while thin client connection
> 
>
> Key: IGNITE-16843
> URL: https://issues.apache.org/jira/browse/IGNITE-16843
> Project: Ignite
>  Issue Type: Bug
>Reporter: Sergey Korotkov
>Priority: Minor
>  Labels: ducktests, ise, thin
> Attachments: test_one_greedy_thin_client.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In use cases with several active thin clients producing noticeable load on the 
> cluster, new thin clients can fail to connect with the 
> *"ClientConnectionException: Channel is closed"* error in the 
> *TcpClientChannel::handshake()* method.
> On the server side the warning *"Unable to perform handshake within timeout 
> [timeout=1"* is logged.
> The problem is easily reproduced by several large putAlls invoked in parallel 
> from one or several thin clients, especially for TRANSACTIONAL caches. ATOMIC 
> caches are also affected; the only difference is that ATOMIC caches need a 
> higher parallelism factor and larger putAll batches.
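> 
> As an illustration, here is a minimal sketch of such a load from the thin 
> client side (the address, cache name, thread count and batch sizes are made 
> up for the example and are not taken from the test code):
> {noformat}
> import java.util.HashMap;
> import java.util.Map;
> import java.util.concurrent.ExecutorService;
> import java.util.concurrent.Executors;
> import java.util.concurrent.TimeUnit;
> 
> import org.apache.ignite.Ignition;
> import org.apache.ignite.client.ClientCache;
> import org.apache.ignite.client.IgniteClient;
> import org.apache.ignite.configuration.ClientConfiguration;
> 
> public class GreedyThinClient {
>     public static void main(String[] args) throws Exception {
>         ClientConfiguration cfg = new ClientConfiguration().setAddresses("127.0.0.1:10800");
> 
>         try (IgniteClient client = Ignition.startClient(cfg)) {
>             ClientCache<Long, byte[]> cache = client.getOrCreateCache("tx-cache");
> 
>             ExecutorService pool = Executors.newFixedThreadPool(16);
> 
>             for (int t = 0; t < 16; t++) {
>                 long base = t * 1_000_000L;
> 
>                 // Each putAll is large enough to keep a server worker busy for a while.
>                 pool.submit(() -> {
>                     Map<Long, byte[]> batch = new HashMap<>();
> 
>                     for (long i = 0; i < 10_000; i++)
>                         batch.put(base + i, new byte[1024]);
> 
>                     cache.putAll(batch);
>                 });
>             }
> 
>             pool.shutdown();
>             pool.awaitTermination(10, TimeUnit.MINUTES);
>         }
>     }
> }
> {noformat}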
> 
> The root cause is that a single queue in the Ignite node serves all thin 
> client related requests (the queue in the {_}GridThinClientExecutor{_}): both 
> working requests such as _putAll_ and control requests such as _handshake_, 
> which are used for connection establishment. 
> Working requests can stay in the queue indefinitely (or at least longer than a 
> {_}handshake{_}).  On the other hand, a watchdog timer is scheduled in the 
> Ignite node to check that a _handshake_ request is processed in a timely 
> manner ({_}ClientListenerNioListener::scheduleHandshakeTimeout{_}).  By 
> default a 10-second timeout is used 
> ({_}ClientConnectorConfiguration::handshakeTimeout{_}). If the timeout 
> expires, the client session is closed forcibly.
> So, if one or several thin clients fill the queue with long operations, new 
> clients cannot connect. 
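> 
> For reference, both of the settings mentioned above are exposed on the server 
> node via ClientConnectorConfiguration; a sketch with the default values 
> discussed in this issue:
> {noformat}
> import org.apache.ignite.Ignition;
> import org.apache.ignite.configuration.ClientConnectorConfiguration;
> import org.apache.ignite.configuration.IgniteConfiguration;
> 
> public class ServerNode {
>     public static void main(String[] args) {
>         ClientConnectorConfiguration connCfg = new ClientConnectorConfiguration()
>             // Watchdog deadline for serving a handshake request (10 seconds by default).
>             .setHandshakeTimeout(10_000)
>             // Size of the thread pool whose single queue serves all thin client requests.
>             .setThreadPoolSize(8);
> 
>         Ignition.start(new IgniteConfiguration().setClientConnectorConfiguration(connCfg));
>     }
> }
> {noformat}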
> 
> The real use case that revealed the problem is as follows:
>  * 4-node cluster; 64 CPUs, 32 GB heap, 512 GB off-heap each
>  * One TRANSACTIONAL cache with backups=1
>  * About 30 GB of data on each node
>  * Several (up to 75 at the same time) thin clients loading data with putAlls 
> in batches of 5 records. A client connects, loads 3 batches and disconnects 
> (Spark jobs, in fact). In other words, clients connect and disconnect 
> constantly.
>  * Default handshakeTimeout (10 secs) and threadPoolSize (8) in 
> ClientConnectorConfiguration
> 
> Two ducktests were created to reproduce and isolate the problem (see 
> {*}ignitetest/tests/thin_client_concurrency{*} in the attached pull request).
> *{color:#ff}Note that the asserts in the test cases are written so that a test 
> PASS means the problem IS reproduced.{color}*  The tests check that at least 
> one thin client fails with the "Channel is closed" error and that the server 
> node log contains a warning about the handshake timeout.
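> 
> From the failing client's side, the behaviour the tests assert looks roughly 
> like the following (a sketch, not the actual test code):
> {noformat}
> import org.apache.ignite.Ignition;
> import org.apache.ignite.client.ClientConnectionException;
> import org.apache.ignite.client.IgniteClient;
> import org.apache.ignite.configuration.ClientConfiguration;
> 
> public class SecondClient {
>     public static void main(String[] args) {
>         ClientConfiguration cfg = new ClientConfiguration().setAddresses("127.0.0.1:10800");
> 
>         try (IgniteClient client = Ignition.startClient(cfg)) {
>             System.out.println("Connected, caches: " + client.cacheNames());
>         }
>         catch (ClientConnectionException e) {
>             // While the queue is full of putAlls, the handshake is not served within
>             // handshakeTimeout and the server closes the channel.
>             System.err.println("Failed to connect: " + e.getMessage());
>         }
>     }
> }
> {noformat}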
> *ThinClientConcurrencyTest::test_many_thin_clients*
> Mimics the real-life use case above. Several thin client processes invoke 
> putAlls in several threads. There are two sets of parameters: one for a 
> TRANSACTIONAL and one for an ATOMIC cache.
> *ThinClientConcurrencyTest::test_one_greedy_thin_client*
> A minimal test showing that a single thin client can produce enough load that 
> another one cannot connect.
> The attached metrics screenshot shows the behaviour of the test in detail:
> 1. The first thin client is started and begins loading data with putAlls in 
> several threads.
> 2. The second thin client is started once the queue is filled.
> 3. After 10 seconds the session of the second client is closed.
> 4. The executor takes the handshake request from the queue and (erroneously?) 
> increases the _client_connector_AcceptedSessions_ metric (note that 
> _ActiveSessions_ was not increased).
> [^test_one_greedy_thin_client.png]
>  
> 
> The following full stack trace is logged on the client side:
> {noformat}
> org.apache.ignite.client.ClientConnectionException: Channel is closed
> at org.apache.ignite.internal.client.thin.TcpClientChannel.handshake(TcpClientChannel.java:595) ~[classes/:?]
> at org.apache.ignite.internal.client.thin.TcpClientChannel.<init>(TcpClientChannel.java:180) ~[classes/:?]
> at org.apache.ignite.internal.client.thin.ReliableChannel$ClientChannelHolder.getOrCreateChannel(ReliableChannel.java:917) ~[classes/:?]
> at org.apache.ignite.internal.client.thin.ReliableChannel$ClientChannelHolder.getOrCreateChannel(ReliableChannel.java:898) ~[classes/:?]
> at 
> 
