[
https://issues.apache.org/jira/browse/IGNITE-28337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Aleksandr Chesnokov updated IGNITE-28337:
-----------------------------------------
Description:
There is a race condition in TCP discovery when a server sends discovery
messages to Ignite client nodes.
In ServerImpl#sendMessageToClients, most discovery messages are serialized
before being enqueued to the ClientMessageWorker. However,
TcpDiscoveryNodeAddedMessage is handled differently: the message object itself
is placed into the queue, while msgBytes remains null. Later, in
ClientMessageWorker#writeToSocket, the worker detects msgBytes == null and
performs serialization in the client worker thread.
This approach is unsafe because TcpDiscoveryNodeAddedMessage is mutable and can
be modified concurrently by the ring message worker:
ServerImpl#prepareNodeAddedMessage edits fields such as topology, topology
history, and pending messages.
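The hazard can be illustrated with a minimal, self-contained sketch. It does not use Ignite code: NodeAddedMsg, marshal, unmarshal, lazyPath and eagerPath are hypothetical stand-ins, and plain Java serialization replaces the discovery marshaller. To keep the result deterministic, the mutation runs sequentially after enqueueing rather than in a second thread. In the real race the mutation can also land in the middle of serialization, which is a plausible way to end up with a corrupted stream rather than just stale data.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

public class EagerSerializationSketch {
    /** Hypothetical stand-in for a mutable discovery message. */
    static class NodeAddedMsg implements Serializable {
        private static final long serialVersionUID = 1L;

        List<String> topology = new ArrayList<>();
    }

    /** Serializes an object with plain Java serialization (assumption: stands in for the real marshaller). */
    static byte[] marshal(Object obj) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();

        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(obj);
        }

        return bos.toByteArray();
    }

    /** Restores the message on the "client" side. */
    static NodeAddedMsg unmarshal(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (NodeAddedMsg)ois.readObject();
        }
    }

    /** Lazy path: the object itself is queued and serialized only after the sender has edited it again. */
    static int lazyPath() throws Exception {
        NodeAddedMsg msg = new NodeAddedMsg();
        msg.topology.add("server-1");

        NodeAddedMsg queued = msg;  // Queue holds a reference to the same mutable object.

        msg.topology.clear();       // Sender keeps editing the message after enqueueing.

        return unmarshal(marshal(queued)).topology.size();
    }

    /** Eager path: bytes are produced before enqueueing, so later edits cannot leak into them. */
    static int eagerPath() throws Exception {
        NodeAddedMsg msg = new NodeAddedMsg();
        msg.topology.add("server-1");

        byte[] queued = marshal(msg);  // Snapshot taken in the sender thread.

        msg.topology.clear();          // Later edits do not affect the queued bytes.

        return unmarshal(queued).topology.size();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("lazy path, nodes seen by client: " + lazyPath());   // prints 0
        System.out.println("eager path, nodes seen by client: " + eagerPath()); // prints 1
    }
}
```

In this sketch the eager path keeps the topology entry that was present at enqueue time, while the lazy path sends whatever state the object has when the worker finally serializes it.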
As a result, TestMetricUpdateFailure#test is flaky and fails with errors such as:
* Invalid message type
* ClassCastException (e.g., TcpDiscoveryCheckFailedMessage cannot be cast to
DiscoveryDataPacket)
* Client join timeout
See
[https://ci2.ignite.apache.org/test/3305509330615033947?currentProjectId=IgniteTests24Java8&branch=&expandedTest=build%3A%28id%3A8949981%29%2Cid%3A2000000291]
The test reproduces the race because it starts one server node and 20 client
nodes concurrently, which is a good stress scenario for this part of the code.
was:
See ServerImpl.RingMessageWorker#sendMessageToClients
As a result, TestMetricUpdateFailure#test is flaky
See
[https://ci2.ignite.apache.org/test/3305509330615033947?currentProjectId=IgniteTests24Java8&branch=&expandedTest=build%3A%28id%3A8949981%29%2Cid%3A2000000291]
> TcpDiscoveryNodeAddedMessage may be serialized from mutated state in client
> message worker
> ------------------------------------------------------------------------------------------
>
> Key: IGNITE-28337
> URL: https://issues.apache.org/jira/browse/IGNITE-28337
> Project: Ignite
> Issue Type: Bug
> Reporter: Aleksandr Chesnokov
> Assignee: Aleksandr Chesnokov
> Priority: Major
> Labels: MakeTeamcityGreenAgain
> Time Spent: 1h
> Remaining Estimate: 0h
--
This message was sent by Atlassian Jira
(v8.20.10#820010)