[
https://issues.apache.org/jira/browse/UNOMI-874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17923616#comment-17923616
]
Serge Huber commented on UNOMI-874:
-----------------------------------
Thanks for pointing that out [~jayblanc]. It is indeed a tricky problem that
needs fixing.
The more I think about it, the less I think we should keep the clustering
inside of Unomi. I propose that for Unomi V3 we remove the clustering-specific
code, since even in cluster deployments it is not used. Removing Karaf Cellar
and Hazelcast will also make it much easier to upgrade to newer versions of Karaf.
I already have a prototype of V3 without the clustering and it seems to work
fine.
This doesn't mean that you couldn't use Unomi in a cluster configuration,
just that the node-to-node sync would no longer be done, which is usually
something you don't want in a real production environment anyway.
Rolling deployments should be used instead.
> Cluster node config is empty
> ----------------------------
>
> Key: UNOMI-874
> URL: https://issues.apache.org/jira/browse/UNOMI-874
> Project: Apache Unomi
> Issue Type: Improvement
> Reporter: Jerome Blanchard
> Priority: Major
>
> We faced a recurring (but flaky) problem in the clustered version of UNOMI:
> sometimes one of the ClusterNodes contains a null configuration when queried
> through /cxs/cluster. As a result, publicHostAddress or internalHostAddress is
> null, and clients have to take that possibility into account when trying to
> reach a cluster node. Worse, such a node is not reachable at all because no
> address is exposed.
> It may be linked to a Cellar configuration replication bug that causes one of
> the nodes to end up with this configuration problem:
> [https://issues.apache.org/jira/projects/KARAF/issues/KARAF-7861?filter=allopenissues&orderby=created+DESC%2C+priority+DESC%2C+updated+DESC]
> I think the replication problem occurs in ClusterServiceImpl.init() :
> [https://github.com/apache/unomi/blob/81989bd816f49337d33171541a24daaef0856221/services/src/main/java/org/apache/unomi/services/impl/cluster/ClusterServiceImpl.java#L155]
> If any other node is running the same init() phase at the same time, the
> Cellar bug occurs and causes one of the configs to be overridden by the
> other, leaving a node that exists in the Karaf cluster but has no exposed
> config.
> When the nodes are then listed in getClusterNodes(), the global config for
> the publicURL (a combined string of all nodes' publicURLs separated by a ',')
> has no entry for that node:
> [https://github.com/apache/unomi/blob/81989bd816f49337d33171541a24daaef0856221/services/src/main/java/org/apache/unomi/services/impl/cluster/ClusterServiceImpl.java#L191]
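Illustratively, the failure mode could be sketched as follows (the "nodeId=url" layout of the combined string, the class name, and the method names are assumptions for this sketch, not Unomi's actual format or API):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration: a node whose config was lost during Cellar
// replication has no entry in the combined publicURL string, so the lookup
// yields null for that node -- the null address seen via /cxs/cluster.
public class PublicUrlLookup {

    // Parse a combined string like "node1=https://a:9443,node2=https://b:9443"
    // into a per-node map. (This layout is assumed for illustration only.)
    static Map<String, String> parseCombined(String combined) {
        Map<String, String> urls = new HashMap<>();
        for (String entry : combined.split(",")) {
            String[] parts = entry.split("=", 2);
            if (parts.length == 2) {
                urls.put(parts[0].trim(), parts[1].trim());
            }
        }
        return urls;
    }

    // Returns null when the combined config has no entry for the node.
    static String publicUrlFor(String nodeId, String combined) {
        return parseCombined(combined).get(nodeId);
    }
}
```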
> I proposed a patch for Karaf Cellar (in the Jahia fork), but it targets
> version 4.1.3 while UNOMI relies on Cellar 4.2.1:
> [https://github.com/Jahia/karaf-cellar/commit/76ecb6b1993bfa0e9124ac8437fcfdd87249d048]
> Maybe backporting the fix could be an option...
> At the very least, adding a healthcheck status that reflects an invalid
> cluster node configuration could help detect this case: I suggest adding a
> check for a null value of both publicHostAddress and internalHostAddress and
> flagging the cluster as not healthy.
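The check suggested in the issue could look roughly like this (a minimal sketch; the nested ClusterNode class is a simplified stand-in for org.apache.unomi.api.ClusterNode, and the isHealthy helper is hypothetical):

```java
import java.util.List;

// Minimal sketch of the proposed healthcheck: flag the cluster as unhealthy
// when any node exposes neither a public nor an internal host address.
public class ClusterHealthCheck {

    // Simplified stand-in for org.apache.unomi.api.ClusterNode.
    static final class ClusterNode {
        private final String publicHostAddress;
        private final String internalHostAddress;

        ClusterNode(String publicHostAddress, String internalHostAddress) {
            this.publicHostAddress = publicHostAddress;
            this.internalHostAddress = internalHostAddress;
        }

        String getPublicHostAddress() { return publicHostAddress; }
        String getInternalHostAddress() { return internalHostAddress; }
    }

    // Unhealthy if any node has both addresses null -- the symptom
    // described in this issue.
    static boolean isHealthy(List<ClusterNode> nodes) {
        return nodes.stream().noneMatch(node ->
                node.getPublicHostAddress() == null
                        && node.getInternalHostAddress() == null);
    }
}
```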
--
This message was sent by Atlassian Jira
(v8.20.10#820010)