[
https://issues.apache.org/jira/browse/IGNITE-28369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18079635#comment-18079635
]
Mikhail Petrov commented on IGNITE-28369:
-----------------------------------------
[~NSAmelchev] Thank you for the review.
> Ignite Service may not be redeployed if several nodes leave the cluster
> -----------------------------------------------------------------------
>
> Key: IGNITE-28369
> URL: https://issues.apache.org/jira/browse/IGNITE-28369
> Project: Ignite
> Issue Type: Bug
> Reporter: Mikhail Petrov
> Assignee: Mikhail Petrov
> Priority: Major
> Labels: ise
> Fix For: 2.19
>
> Time Spent: 2h 50m
> Remaining Estimate: 0h
>
> We need to fix the flaky
> org.apache.ignite.client.ReliabilityTest#testServiceProxyFailover test.
> See
> https://ci2.ignite.apache.org/project.html?projectId=IgniteTests24Java8&testNameId=4795807857625973920&tab=testDetails
> for more details.
> The org.apache.ignite.client.ReliabilityTest#testServiceProxyFailover test
> can be used as a reproducer for this problem.
> To increase the test failure rate, place U.sleep(10) in the
> GridNioServer.AbstractNioClientWorker#bodyInternal worker loop.
> Steps that lead to the described problem:
> 1. Consider a 3-node cluster with a singleton SERVICE deployed on node 1.
> 2. Node 1 leaves the cluster, triggering a distributed service redeployment
> process.
> 3. The service is reassigned to node 2.
> 4. While the coordinator waits for all nodes to reply with single messages,
> node 2 leaves the cluster.
> 5. The coordinator receives the event that node 2 has left the cluster and
> stops waiting for its single message.
> 6. The coordinator combines the received single messages into a full
> message that contains no information about the SERVICE or its topology, and
> sends it across the cluster.
> 7. The service topology is set to empty on all cluster nodes.
> 8. A second service redeployment process is triggered by node 2 leaving.
> However, at this point, we do not attempt to redeploy the SERVICE because
> node 2 is not part of the current service topology. Therefore, nothing
> happens, and the service becomes unavailable.
> Even if we fix step 8 and the service is eventually redeployed, there is a
> period of time during which the service topology is unknown. Currently, all
> calls during this period result in an error, which is unexpected for the
> user.
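
The sequence in the quoted description can be illustrated with a small toy
model. This is only a sketch and not Ignite code: the class
ServiceRedeploySketch, the methods buildFullMessage, needsRedeploy and
simulate, and the map-based message shapes are all hypothetical and exist
only to show why the merged full message ends up empty and why the second
redeployment is skipped.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy model of the race; every name here is hypothetical, none is an Ignite API. */
public class ServiceRedeploySketch {
    /** Merges the single messages received so far into a full message: service name -> hosting node ids. */
    static Map<String, Set<Integer>> buildFullMessage(Map<Integer, Set<String>> singleMsgs) {
        Map<String, Set<Integer>> full = new HashMap<>();

        for (Map.Entry<Integer, Set<String>> e : singleMsgs.entrySet()) {
            for (String svc : e.getValue())
                full.computeIfAbsent(svc, k -> new TreeSet<>()).add(e.getKey());
        }

        return full;
    }

    /** Assumed rule: a node leaving triggers redeployment only if it is in the known service topology. */
    static boolean needsRedeploy(Map<String, Set<Integer>> top, String svc, int leftNode) {
        return top.getOrDefault(svc, Collections.<Integer>emptySet()).contains(leftNode);
    }

    /** Replays the scenario and returns the service topology applied on the remaining nodes. */
    static Map<String, Set<Integer>> simulate() {
        // Node 1 has left and SERVICE was reassigned to node 2.
        // Node 3 acts as the coordinator and collects single messages.
        Map<Integer, Set<String>> singleMsgs = new HashMap<>();

        // Node 3 hosts no services, so its single message is empty.
        singleMsgs.put(3, Collections.<String>emptySet());

        // Node 2 leaves before its single message arrives; the coordinator
        // stops waiting for it and merges only what it has received.
        return buildFullMessage(singleMsgs);
    }

    public static void main(String[] args) {
        Map<String, Set<Integer>> top = simulate();

        System.out.println("Service topology: " + top); // prints "Service topology: {}"
        System.out.println("Redeploy after node 2 left: " + needsRedeploy(top, "SERVICE", 2)); // prints "... false"
    }
}
```

In this model the full message carries no entry for SERVICE, so the check
performed when node 2 leaves finds nothing to redeploy, which matches the
unavailability described above.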