[jira] [Updated] (IGNITE-14301) Authentication processor can hang all user management operation after server node reconnect

Mikhail Petrov (Jira) Tue, 06 Apr 2021 13:15:05 -0700


     [ 
https://issues.apache.org/jira/browse/IGNITE-14301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Mikhail Petrov updated IGNITE-14301:
------------------------------------
    Description: 
First for all look at the test - 
AuthenticationProcessorNodeRestartTest#testConcurrentAddUpdateRemoveNodeRestartServer
 which is flaky - [TC 
history|https://ci.ignite.apache.org/project.html?projectId=IgniteTests24Java8&testNameId=-8873434544416175780&tab=testDetails]

The first problem with this test is that user management 
operations(add/update/remove) create too many discovery messages. So discovery 
custom message history size is not enough to properly skip duplicated custom 
messages that can be sent across the ring during server node reconnect. It 
leads to mentioned test failures due to duplication of user management 
operations (see GridDiscoveryManager#discoCacheHist, 
IGNITE_DISCOVERY_HISTORY_SIZE system property, and 
ServerImpl.RingMessageWorker#sendMessageAcrossRing).

If the discovery history size will be increased significantly, the test stops 
failing and starts hanging. The steps that lead to this:
 1. Client node sent UserProposedMessage across the ring while one node is 
offline due to reconnect. 
 2. Alive server nodes update their local user lists and finish the operation. 
 3. Reconnected node joins the ring and receives an updated user list from the 
coordinator.
 4. Reconnected node receives duplicated UserProposedMessage that has been 
already handled by all nodes, handles it, and sents 
UserManagementOperationFinishedMessage to the coordinator and start to wait for 
the UserAcceptedMessage from it. But the coordinator has already finished this 
operation. So the thread that responsible for user management operation on the 
reconnected node becomes blocked (see 
IgniteAuthenticationProcessor.UserOperationWorker#body).
 5. Client node starts the next operation that needs all alive nodes to respond 
with UserManagementOperationFinishedMessage. But reconnected node 
authentication thread is blocked. So this operation can't be completed at all.

This issue causes all tests in the AuthenticationProcessorNodeRestartTest test 
class to be flaky.

UPD

The described above problem seems to be fixed by the  
https://issues.apache.org/jira/browse/IGNITE-14375 - the TC history confirms it.

But locally I still face the hanging of AuthenticationProcessorNodeRestartTest 
tests. 
It reproduces stably for me if U.sleep(500); is inserted after 
processOperationLocal(op); in the UserOperationWorker#body method.

 The main reason of this behavior - 
 # assumes that there are two nodes in topology that receives 
UserProposedMessage - indicates start of a user management operation
 # each of them handles UserProposedMessage and removes corresponding active 
operation locally (see IgniteAuthenticationProcessor#activeOps / addUserLocal / 
updateUserLocal / removeUserLocal)
 # before  UserManagementOperationFinishedMessage is sent to coordinator new 
node joins the cluster (collects grid discovery data and cluster nodes receive 
NODE_JOINED event)
 # new node receives no active operations (because of the clause 2) but new 
node id is registered in UserOperationFinishFuture and the coordinator starts 
waiting for the response from the new node that have no user management 
operations in progerss.

 So the initial operation hangs.

So we must not register new joined node in UserOperationFinishFuture if 
collection of the grid discovery data for the new node and NODE_JOINED event 
occur after the local user map is updated and before user management operation 
is completed.

 

  was:
First for all look at the test - 
AuthenticationProcessorNodeRestartTest#testConcurrentAddUpdateRemoveNodeRestartServer
 which is flaky - [TC 
history|https://ci.ignite.apache.org/project.html?projectId=IgniteTests24Java8&testNameId=-8873434544416175780&tab=testDetails]

The first problem with this test is that user management 
operations(add/update/remove) create too many discovery messages. So discovery 
custom message history size is not enough to properly skip duplicated custom 
messages that can be sent across the ring during server node reconnect. It 
leads to mentioned test failures due to duplication of user management 
operations (see GridDiscoveryManager#discoCacheHist, 
IGNITE_DISCOVERY_HISTORY_SIZE system property, and 
ServerImpl.RingMessageWorker#sendMessageAcrossRing).

If the discovery history size will be increased significantly, the test stops 
failing and starts hanging. The steps that lead to this:
 1. Client node sent UserProposedMessage across the ring while one node is 
offline due to reconnect. 
 2. Alive server nodes update their local user lists and finish the operation. 
 3. Reconnected node joins the ring and receives an updated user list from the 
coordinator.
 4. Reconnected node receives duplicated UserProposedMessage that has been 
already handled by all nodes, handles it, and sents 
UserManagementOperationFinishedMessage to the coordinator and start to wait for 
the UserAcceptedMessage from it. But the coordinator has already finished this 
operation. So the thread that responsible for user management operation on the 
reconnected node becomes blocked (see 
IgniteAuthenticationProcessor.UserOperationWorker#body).
 5. Client node starts the next operation that needs all alive nodes to respond 
with UserManagementOperationFinishedMessage. But reconnected node 
authentication thread is blocked. So this operation can't be completed at all.

This issue causes all tests in the AuthenticationProcessorNodeRestartTest test 
class to be flaky.

UPD

The described above problem seems to be fixed by the  
https://issues.apache.org/jira/browse/IGNITE-14375 - the TC history confirms it.

 


> Authentication processor can hang all user management operation after server 
> node reconnect
> -------------------------------------------------------------------------------------------
>
>                 Key: IGNITE-14301
>                 URL: https://issues.apache.org/jira/browse/IGNITE-14301
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Mikhail Petrov
>            Priority: Major
>
> First for all look at the test - 
> AuthenticationProcessorNodeRestartTest#testConcurrentAddUpdateRemoveNodeRestartServer
>  which is flaky - [TC 
> history|https://ci.ignite.apache.org/project.html?projectId=IgniteTests24Java8&testNameId=-8873434544416175780&tab=testDetails]
> The first problem with this test is that user management 
> operations(add/update/remove) create too many discovery messages. So 
> discovery custom message history size is not enough to properly skip 
> duplicated custom messages that can be sent across the ring during server 
> node reconnect. It leads to mentioned test failures due to duplication of 
> user management operations (see GridDiscoveryManager#discoCacheHist, 
> IGNITE_DISCOVERY_HISTORY_SIZE system property, and 
> ServerImpl.RingMessageWorker#sendMessageAcrossRing).
> If the discovery history size will be increased significantly, the test stops 
> failing and starts hanging. The steps that lead to this:
>  1. Client node sent UserProposedMessage across the ring while one node is 
> offline due to reconnect. 
>  2. Alive server nodes update their local user lists and finish the 
> operation. 
>  3. Reconnected node joins the ring and receives an updated user list from 
> the coordinator.
>  4. Reconnected node receives duplicated UserProposedMessage that has been 
> already handled by all nodes, handles it, and sents 
> UserManagementOperationFinishedMessage to the coordinator and start to wait 
> for the UserAcceptedMessage from it. But the coordinator has already finished 
> this operation. So the thread that responsible for user management operation 
> on the reconnected node becomes blocked (see 
> IgniteAuthenticationProcessor.UserOperationWorker#body).
>  5. Client node starts the next operation that needs all alive nodes to 
> respond with UserManagementOperationFinishedMessage. But reconnected node 
> authentication thread is blocked. So this operation can't be completed at all.
> This issue causes all tests in the AuthenticationProcessorNodeRestartTest 
> test class to be flaky.
> UPD
> The described above problem seems to be fixed by the  
> https://issues.apache.org/jira/browse/IGNITE-14375 - the TC history confirms 
> it.
> But locally I still face the hanging of 
> AuthenticationProcessorNodeRestartTest tests. 
> It reproduces stably for me if U.sleep(500); is inserted after 
> processOperationLocal(op); in the UserOperationWorker#body method.
>  The main reason of this behavior - 
>  # assumes that there are two nodes in topology that receives 
> UserProposedMessage - indicates start of a user management operation
>  # each of them handles UserProposedMessage and removes corresponding active 
> operation locally (see IgniteAuthenticationProcessor#activeOps / addUserLocal 
> / updateUserLocal / removeUserLocal)
>  # before  UserManagementOperationFinishedMessage is sent to coordinator new 
> node joins the cluster (collects grid discovery data and cluster nodes 
> receive NODE_JOINED event)
>  # new node receives no active operations (because of the clause 2) but new 
> node id is registered in UserOperationFinishFuture and the coordinator starts 
> waiting for the response from the new node that have no user management 
> operations in progerss.
>  So the initial operation hangs.
> So we must not register new joined node in UserOperationFinishFuture if 
> collection of the grid discovery data for the new node and NODE_JOINED event 
> occur after the local user map is updated and before user management 
> operation is completed.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Updated] (IGNITE-14301) Authentication processor can hang all user management operation after server node reconnect

Reply via email to