Ignite TC Bot created IGNITE-28685:
--------------------------------------

             Summary: C++ thin client test IgniteClientUserThreadPoolSize is 
flaky
                 Key: IGNITE-28685
                 URL: https://issues.apache.org/jira/browse/IGNITE-28685
             Project: Ignite
          Issue Type: Bug
            Reporter: Ignite TC Bot


h2. Problem

TeamCity intermittently fails the C++ thin client test:
* Suite: Platform C++ CMake (Linux)
* Build: https://ci2.ignite.apache.org/viewLog.html?buildId=9060050
* Test: IgniteThinClientTest: IgniteClientTestSuite: 
IgniteClientUserThreadPoolSize
* Observed duration: about 14 ms

The failure was seen on PR 13121, but that PR changes only notice/license 
files, so the failure is very unlikely to be caused by its changes.

h2. Investigation

The test is implemented in:
* modules/platforms/cpp/thin-client-test/src/ignite_client_test.cpp
* method: IgniteClientTestSuiteFixture::CheckThreadsNum
* test case: IgniteClientUserThreadPoolSize

The test starts a fake thin-client server on the fixed port 11110, then 
measures the total process thread count through /proc/<pid>/task before and 
immediately after IgniteClient::Start. It expects the exact delta to be 
userThreadPoolSize + 1 on Linux.
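
For illustration, a whole-process thread count of the kind described above can be sketched as follows. This is a minimal sketch, not the actual GetThreadsCount implementation: the helper name CountProcessThreads is hypothetical, and it reads /proc/self/task (Linux only) instead of /proc/<pid>/task.

```cpp
#include <dirent.h>

#include <cstddef>
#include <cstring>

// Hypothetical helper (not the Ignite GetThreadsCount implementation):
// counts the threads of the current process by listing /proc/self/task,
// which holds one directory entry per thread. Linux only.
// Returns 0 if the directory cannot be opened.
std::size_t CountProcessThreads()
{
    DIR* dir = opendir("/proc/self/task");
    if (!dir)
        return 0;

    std::size_t count = 0;
    while (const dirent* entry = readdir(dir))
    {
        // Skip the "." and ".." pseudo-entries; every other entry is a thread id.
        if (std::strcmp(entry->d_name, ".") != 0 && std::strcmp(entry->d_name, "..") != 0)
            ++count;
    }

    closedir(dir);
    return count;
}
```

A count obtained this way includes every thread in the process, so any unrelated thread that starts or has not yet finished exiting between two samples changes the delta.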

This is brittle because:
* the assertion uses the whole process thread count, not only the client/user 
pool threads;
* startup/shutdown of native worker threads is asynchronous from the test point 
of view;
* port 11110 is reused by many other thin-client tests and configs in the same 
suite;
* base-branch history shows this test is flaky with a high failure rate.

Relevant code paths:
* modules/platforms/cpp/thin-client-test/src/ignite_client_test.cpp: 
CheckThreadsNum / IgniteClientUserThreadPoolSize
* modules/platforms/cpp/thin-client-test/include/test_server.h: TestServer
* modules/platforms/cpp/thin-client/src/impl/data_router.cpp: 
DataRouter::Connect / DataRouter::Close
* modules/platforms/cpp/common/src/common/thread_pool.cpp: ThreadPool::Start / 
ThreadPool::Stop
* modules/platforms/cpp/common/os/linux/src/common/concurrent_os.cpp: 
GetThreadsCount

h2. Proposed fix

Stabilize the test without changing production behavior:
# Start TestServer on an ephemeral port by constructing it with port 0.
# Expose TestServer::GetPort() and set IgniteClientConfiguration endpoints to 
the actual bound port.
# Replace single immediate thread-count samples with bounded waits for the 
expected count after client start and after client destruction.
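
The ephemeral-port part of the steps above can be sketched with plain POSIX sockets. This is only an assumption-laden illustration: ListenOnEphemeralPort is a hypothetical helper, and the sketch makes no claim about how TestServer is actually implemented.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#include <cstdint>

// Hypothetical helper (not part of TestServer): opens a TCP listening socket
// on 127.0.0.1 with port 0 and reports the port the kernel actually assigned.
// Returns the socket descriptor, or -1 on failure.
int ListenOnEphemeralPort(std::uint16_t& boundPort)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(0); // Port 0 asks the kernel for any free port.

    socklen_t len = sizeof(addr);
    if (bind(fd, reinterpret_cast<sockaddr*>(&addr), len) != 0 ||
        listen(fd, 1) != 0 ||
        getsockname(fd, reinterpret_cast<sockaddr*>(&addr), &len) != 0)
    {
        close(fd);
        return -1;
    }

    // getsockname filled in the real port, in network byte order.
    boundPort = ntohs(addr.sin_port);
    return fd;
}
```

The client endpoint string would then be built from the reported port (for example "127.0.0.1:" followed by the port number) rather than from the hard-coded 11110.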

A local draft patch already takes this approach:
* add uint16_t TestServer::GetPort() const in 
modules/platforms/cpp/thin-client-test/include/test_server.h;
* in IgniteClientUserThreadPoolSize, use TestServer server(0), build endpoint 
from server.GetPort(), and wait up to 5 seconds for thread count to settle.
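
The bounded wait could look roughly like the following. This is a sketch under stated assumptions, not the draft patch itself: WaitFor and its parameters are hypothetical names, and the 10 ms poll interval is an arbitrary choice.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper: polls `condition` until it returns true or `timeout`
// elapses. The condition is always checked at least once.
// Returns true if the condition was met, false on timeout.
bool WaitFor(const std::function<bool()>& condition,
             std::chrono::milliseconds timeout,
             std::chrono::milliseconds pollInterval = std::chrono::milliseconds(10))
{
    const auto deadline = std::chrono::steady_clock::now() + timeout;

    while (true)
    {
        if (condition())
            return true;

        if (std::chrono::steady_clock::now() >= deadline)
            return false;

        std::this_thread::sleep_for(pollInterval);
    }
}
```

In the test, the condition would compare the current thread count with the expected value, with a 5 second timeout, once after client start and once after client destruction. The assertion then fails only if the count never settles within the timeout, not because a single sample was taken too early.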

h2. Expected result

The test should continue checking that the configured user thread pool size 
affects the number of client worker threads, while avoiding fixed-port 
collisions and scheduler-timing races.

h2. Validation

Run:
{code}
ctest -R IgniteThinClientTest --output-on-failure
{code}

or run the C++ thin-client test executable with:
{code}
ignite-thin-client-tests 
--run_test=IgniteClientTestSuite/IgniteClientUserThreadPoolSize 
--catch_system_errors=no --log_level=all
{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
