Hi Akila, Thank you for bringing this up. +1 for this as this would lead to an unrecoverable state and we can not even find out where it went wrong.
I found another place in CloudControllerServiceImpl class which we need to fix. private void onClusterRemoval(final String clusterId) { ClusterContext ctxt = CloudControllerContext.getInstance().getClusterContext(clusterId); TopologyBuilder.handleClusterRemoved(ctxt); CloudControllerContext.getInstance().removeClusterContext(clusterId); CloudControllerContext.getInstance().removeMemberContextsOfCluster(clusterId); CloudControllerContext.getInstance().persist(); } As Imesh suggested it would be better to scan the complete code base to find out the places we need to fix. Thanks. On Tue, Sep 29, 2015 at 8:08 PM, Imesh Gunaratne <im...@apache.org> wrote: > Yes indeed! As I found the problem is in CloudControllerUtil class: > > public static void persistTopology(Topology topology) { > try { > > RegistryManager.getInstance().persist(CloudControllerConstants.TOPOLOGY_RESOURCE, > topology); > } catch (RegistryException e) { > String msg = "Failed to persist the Topology in registry. "; > log.fatal(msg, e); > } > } > > We might need to scan the entire codebase for such occurrences. > > Thanks > > > On Tue, Sep 29, 2015 at 4:26 PM, Reka Thirunavukkarasu <r...@wso2.com> > wrote: > >> +1 for handing these dependent operations as atomic in order to avoid >> inconsistency. It is a good thought. >> >> Better have the same approach for application model also as it has to >> update monitor hierarchy and application model at the same time. >> >> Thanks, >> Reka >> >> On Tue, Sep 29, 2015 at 1:57 AM, Akila Ravihansa Perera < >> raviha...@wso2.com> wrote: >> >>> Hi, >>> >>> This is to bring your attention to possible inconsistency that could >>> arise due to rare edge cases. For an eg: If you follow the piece of code at >>> [1], handleMemberTerminated can return without successfully removing a >>> member from topology but the next step will get executed which is to remove >>> the member context from CC's context, thus leading to inconsistencies. This >>> could result in permanent inconsistent state and system will never recover. >>> >>> I'm proposing the $subject to resolve this issue. I think simply >>> throwing an exception if an operation is not successful and handling those >>> exceptions gracefully should fix the problem rather than silently returning >>> from the method call. Above is only one example and we may have to find >>> such occurrences throughout the CC component since that is the only module >>> which writes/updates the topology. >>> >>> However, we need to check whether same problem can occur for Application >>> model maintained by AS and Tenant model maintained by SM. >>> @Reka: Any thoughts? >>> >>> I've created the JIRA [2] to track this issue. >>> >>> [1] >>> https://github.com/ravihansa3000/stratos/blob/stratos-4.1.x/components/org.apache.stratos.cloud.controller/src/main/java/org/apache/stratos/cloud/controller/services/impl/CloudControllerServiceUtil.java#L57 >>> [2] https://issues.apache.org/jira/browse/STRATOS-1578 >>> >>> Thanks. >>> >>> -- >>> Akila Ravihansa Perera >>> WSO2 Inc.; http://wso2.com/ >>> >>> Blog: http://ravihansa3000.blogspot.com >>> >> >> >> >> -- >> Reka Thirunavukkarasu >> Senior Software Engineer, >> WSO2, Inc.:http://wso2.com, >> Mobile: +94776442007 >> >> >> > > > -- > Imesh Gunaratne > > Senior Technical Lead, WSO2 > Committer & PMC Member, Apache Stratos > -- *Dinithi De Silva* Associate Software Engineer, WSO2 Inc. m:+94716667655 | e:dinit...@wso2.com | w: www.wso2.com | a: #20, Palm Grove, Colombo 03