[jira] [Commented] (KARAF-7861) Configuration replication missed due to race condition in cellar

Jerome Blanchard (Jira) Thu, 26 Sep 2024 12:35:22 -0700


    [ 
https://issues.apache.org/jira/browse/KARAF-7861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17885075#comment-17885075
 ]


Jerome Blanchard commented on KARAF-7861:
-----------------------------------------

Yes, we're using Cellar 4.1.3.

We fixed the issue in our fork using a retry mechanism based on an integrity 
check between the event sent and the map. Before the event is sent, a hash of 
the map is generated and added to the event. When the event is received on 
other nodes, its integrity hash is compared to the one calculated from the 
local map. If they differ, we retry loading the map 3 times before giving up.

[https://github.com/Jahia/karaf-cellar/pull/15]
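
The integrity check and retry described above could look roughly like the 
following. This is only a minimal sketch of the idea, not the code from the 
pull request: the class name IntegrityRetry, the methods hash and 
awaitReplicated, the choice of SHA-256, and the delay between attempts are all 
assumptions made for illustration, and a plain Supplier stands in for reading 
the local copy of the replicated map.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Supplier;

/** Hypothetical helper; none of these names come from Cellar. */
final class IntegrityRetry {

    /** Order-independent SHA-256 of a configuration map (non-null values assumed). */
    static String hash(Map<String, String> config) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            // Sort by key so that two maps with the same content hash alike.
            for (Map.Entry<String, String> e : new TreeMap<String, String>(config).entrySet()) {
                digest.update(e.getKey().getBytes(StandardCharsets.UTF_8));
                digest.update((byte) 0); // separator between key and value
                digest.update(e.getValue().getBytes(StandardCharsets.UTF_8));
                digest.update((byte) 0); // separator between entries
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /**
     * Reads the local copy until its hash matches the hash carried by the
     * event. Returns the matching map, or null once maxAttempts reads have
     * failed or the thread is interrupted.
     */
    static Map<String, String> awaitReplicated(Supplier<Map<String, String>> localCopy,
                                               String expectedHash,
                                               int maxAttempts,
                                               long delayMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            Map<String, String> current = localCopy.get();
            if (current != null && expectedHash.equals(hash(current))) {
                return current;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(delayMillis); // give replication time to finish
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return null;
                }
            }
        }
        return null; // give up: the local map never matched the event
    }
}
```

On the sending side the event would carry hash(config); on the receiving side 
the handler would call awaitReplicated with that value and only apply the 
configuration when the result is not null.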

I started to work on another strategy that avoids using a double event and 
relies on the one triggered when the Hazelcast ReplicatedMap is modified. Each 
node receives an event locally when the map is replicated (so at the right 
time), but I haven't tested it yet; it is visible in a branch, in this commit:

[https://github.com/Jahia/karaf-cellar/commit/418295812c28357b98c213ae2090ef30af5318dc#diff-06d87c1f5d0407eed01e43e3145a8591fe6ceb399ad8ffc8b505f08f23a2fe21L15]
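
To illustrate the principle only, here is a small self-contained sketch in 
which the local configuration is applied from a map listener rather than from 
a separate cluster event. It is not the code from the commit and does not use 
Hazelcast: NotifyingMap is a hypothetical stand-in for a replicated map that 
notifies local listeners once an entry has been stored, and Node, addListener 
and localConfigs are likewise names invented for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;

/** Hypothetical sketch; these types are not part of Cellar or Hazelcast. */
final class ListenerDrivenSync {

    /** Stand-in for a replicated map that notifies listeners after each store. */
    static final class NotifyingMap {
        private final Map<String, Map<String, String>> entries = new HashMap<>();
        private final List<BiConsumer<String, Map<String, String>>> listeners = new ArrayList<>();

        void addListener(BiConsumer<String, Map<String, String>> listener) {
            listeners.add(listener);
        }

        /** Simulates replicated data arriving on this node. */
        void put(String pid, Map<String, String> config) {
            entries.put(pid, config);
            // The listener runs only after the entry is present locally,
            // so it can never observe a map that is not yet replicated.
            for (BiConsumer<String, Map<String, String>> listener : listeners) {
                listener.accept(pid, config);
            }
        }
    }

    /** A node that applies configurations as soon as the map notifies it. */
    static final class Node {
        final Map<String, Map<String, String>> localConfigs = new HashMap<>();

        Node(NotifyingMap clusterMap) {
            clusterMap.addListener((pid, config) -> localConfigs.put(pid, new HashMap<>(config)));
        }
    }
}
```

Because the notification is raised by the map itself, there is no second 
message whose arrival order relative to the replicated data has to be guarded.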

By the way, that option was only for testing: creating a special case just for 
configuration breaks the global architecture a little and mostly impacts the 
configuration, so I haven't gone further in that direction for now.

> Configuration replication missed due to race condition in cellar
> ----------------------------------------------------------------
>
>                 Key: KARAF-7861
>                 URL: https://issues.apache.org/jira/browse/KARAF-7861
>             Project: Karaf
>          Issue Type: Bug
>          Components: cellar
>         Environment: Karaf using cellar in a clustered environment to 
> replicate configuration updates.
>            Reporter: Jerome Blanchard
>            Assignee: Jean-Baptiste Onofré
>            Priority: Major
>
> In a Karaf cluster using cellar, and more specifically cellar-config, an 
> update of a configuration on one node is not replicated to another node.
> Investigation points to a race condition where one node receives the 
> ClusterConfigurationEvent before the ReplicatedMap is effectively replicated 
> on the impacted node. Thus, the node does not store the configuration and the 
> local version stays stale.
> The race condition starts here :
> [https://github.com/Jahia/karaf-cellar/blob/47b6984217953a5263f7e1e0da040f488cef3a3e/config/src/main/java/org/apache/karaf/cellar/config/LocalConfigurationListener.java#L119-L127]
> and continues on another node here :
> [https://github.com/Jahia/karaf-cellar/blob/cellar-4.1.3-jahia-fixes/config/src/main/java/org/apache/karaf/cellar/config/ConfigurationEventHandler.java]
> Cellar uses a ReplicatedMap (Hazelcast) to propagate configurations 
> across the cluster, and the replication operation is asynchronous. Thus, if 
> the ClusterConfigurationEvent is received before the replication finishes on 
> the target node, nothing happens: no error is detected and no retry is made.
> To reproduce the problem we can use breakpoints (thread ones):
>  * First one to simulate a long replicate operation, by adding a breakpoint 
> on the emitting node in 
> *com.hazelcast.replicatedmap.impl.operation.ReplicateUpdateOperation.run()*
>  * Second one in the cellar event listener that applies the replicated 
> configuration: 
> *org.apache.karaf.cellar.config.ConfigurationEventHandler.handle()* at line: 
> if (!equals(clusterDictionary, localDictionary) && 
> canDistributeConfig(localDictionary)) {
> Now update a configuration on the first node. On the target node, we can 
> see that the configuration is not updated when the event is received.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
