Hi Akila,

Would [1] also be solved by the solution we discussed here?
Thanks.

[1] https://issues.apache.org/jira/browse/STRATOS-795

On Thu, Aug 28, 2014 at 12:48 PM, Akila Ravihansa Perera <[email protected]> wrote:
> Hi,
>
> Since we're using WSO2 CEP for monitoring faulty members, it would make sense to enhance the Faulty Member window processor [1] to recover from a core component failure. I have made some improvements to this window processor and committed them in [2].
>
> CEP will now have an additional dependency on the Stratos messaging component (applicable only when using a stand-alone CEP). Therefore it can now listen to the topology topic events published by CC. CEP will check the cartridge agent health stats published by instances against the member list published by CC in the complete topology event. Thus, even if the MemberFault event is lost due to an MB failure, Stratos can recover itself since it periodically checks against the member list published by CC. The code has been rigorously tested on EC2 and OpenStack.
>
> The other possible alternative (as opposed to a dependency on the messaging component) would be to create a new JMS input adaptor in CEP and listen to the topology topic. But with this approach we would have to duplicate the messaging component model (topology structure) in the CEP window processor. This is an unnecessary duplication IMHO.
>
> However, with this dependency on the messaging component in CEP, a user deploying Stratos with a stand-alone CEP will have to manually copy the messaging component artifacts to the CEP plugins directory.
>
> Would appreciate your thoughts on this.
>
> [1] https://github.com/apache/stratos/blob/master/extensions/cep/stratos-cep-extension/src/main/java/org/apache/stratos/cep/extension/FaultHandlingWindowProcessor.java
> [2] https://github.com/apache/stratos/commit/05e1ddc20a871b73b721487a13a2547cf9b8768d
>
> Thanks.
>
> On Wed, Jul 30, 2014 at 7:32 PM, Udara Liyanage <[email protected]> wrote:
> > Hi Imesh,
> >
> > Yes, no message will be communicated when the message broker is not available.
> >
> > On Wed, Jul 30, 2014 at 7:24 PM, Imesh Gunaratne <[email protected]> wrote:
> >> As I understood, it's not just the MemberFault event that is affected in this scenario; any event that CEP publishes to the message broker will encounter the same problem.
> >>
> >> On Wed, Jul 30, 2014 at 5:49 AM, Michiel Blokzijl (mblokzij) <[email protected]> wrote:
> >>> +1.
> >>>
> >>> If Stratos, or any component it relies on, fails and eventually returns to service, Stratos should "orchestrate" the cloud back to the desired state. If any cartridges went missing and after some time T (post failure) Stratos hasn't re-discovered them, they should be respawned.
> >>>
> >>> Best regards,
> >>>
> >>> Michiel
> >>>
> >>> On 30 Jul 2014, at 05:51, Isuru Haththotuwa <[email protected]> wrote:
> >>>
> >>> On Wed, Jul 30, 2014 at 9:45 AM, Akila Ravihansa Perera <[email protected]> wrote:
> >>>> Hi Devs,
> >>>>
> >>>> The current Stratos architecture relies heavily on high availability of the message broker. We faced a situation where, when the MB is down, some of the published messages are lost forever and the system state is never recovered.
> >>>>
> >>>> One such example: when a cartridge instance goes down, the CEP component will identify this event and publish a MemberFault event to the MB's summarized-health-stat topic. But the problem is that the CEP component creates its own list of cartridge instance members by looking at the health stats published to the MB - it does not consider the topology. Hence, when a cartridge instance goes down, the MemberFault event will get fired only once. If the MB is down at that time, the message is lost forever, resulting in an unstable system state in which Stratos thinks a member exists when in reality it does not.
> >>>>
> >>>> We can introduce a simple housekeeping task to check whether every member is alive. Ideally this should be the auto-scaler's responsibility. It would allow the system to recover itself from an unstable situation. I think this is a critical bug and should be given high priority.
> >>>>
> >>>> Please share your thoughts.
> >>>
> >>> +1. We would need to decide what the best method for this is, though. If we consider CEP the central point of decision making, another option is to make it listen to the topology and take the correct decision. Or else, we can use a health check mechanism for the MB which can detect if the MB is down and replay any of the messages. This IMO can be very useful since the primary communication mechanism in Stratos is the MB.
> >>>
> >>> One other important thing is to have fail-over/HA for the MB. There can be many other occasions where, if the MB is down, the system goes into an undefined state due to loss of messages.
> >>>>
> >>>> --
> >>>> Akila Ravihansa Perera
> >>>> Software Engineer
> >>>> WSO2 Inc.
> >>>> http://wso2.com
> >>>>
> >>>> Blog: http://ravihansa3000.blogspot.com
> >>>>
> >>>> --
> >>>> Thanks and Regards,
> >>>>
> >>>> Isuru H.
> >>>> +94 716 358 048
> >>>
> >>
> >> --
> >> Imesh Gunaratne
> >>
> >> Technical Lead, WSO2
> >> Committer & PPMC Member, Apache Stratos
> >
> >
> > --
> > Udara Liyanage
> > Software Engineer
> > WSO2, Inc.: http://wso2.com
> > lean. enterprise. middleware
> >
> > web: http://udaraliyanage.wordpress.com
> > phone: +94 71 443 6897
>
> --
> Akila Ravihansa Perera
> WSO2 Inc
>
> Blog: http://ravihansa3000.blogspot.com

--
Lahiru Sandaruwan
Committer and PMC member, Apache Stratos,
Senior Software Engineer,
WSO2 Inc., http://wso2.com
lean.enterprise.middleware

email: [email protected]
cell: (+94) 773 325 954
blog: http://lahiruwrites.blogspot.com/
twitter: http://twitter.com/lahirus
linked-in: http://lk.linkedin.com/pub/lahiru-sandaruwan/16/153/146
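To make the reconciliation idea in this thread concrete, below is a minimal, hypothetical Java sketch of the approach described above: keep the member list from the complete topology event published by CC, record the last health stat seen per member, and periodically flag members that have gone silent. The class and method names (MemberReconciliationTask, onCompleteTopology, onHealthStat, publishMemberFault) and the timeout values are assumptions for illustration only; this is not the actual FaultHandlingWindowProcessor or auto-scaler code.

import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical sketch of the reconciliation idea discussed in this thread:
 * periodically compare the member list published by CC (via the topology
 * topic) against the latest health stats received from cartridge agents,
 * and raise a fault for members that have gone silent, so a lost
 * MemberFault event does not leave the system in an unstable state.
 */
public class MemberReconciliationTask {

    // memberId -> timestamp (ms) of the last health stat received from that member
    private final Map<String, Long> lastHealthStat = new ConcurrentHashMap<>();

    // memberIds currently present in the topology, as published by CC
    private final Set<String> topologyMembers = ConcurrentHashMap.newKeySet();

    private final long faultTimeoutMs;
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    public MemberReconciliationTask(long faultTimeoutMs) {
        this.faultTimeoutMs = faultTimeoutMs;
    }

    // Called when a complete topology event arrives: refresh the known member list.
    public void onCompleteTopology(Set<String> memberIds) {
        topologyMembers.retainAll(memberIds);
        topologyMembers.addAll(memberIds);
        // Give newly seen members a grace period before they can be flagged.
        long now = System.currentTimeMillis();
        for (String id : memberIds) {
            lastHealthStat.putIfAbsent(id, now);
        }
    }

    // Called for every health stat published by a cartridge agent.
    public void onHealthStat(String memberId) {
        lastHealthStat.put(memberId, System.currentTimeMillis());
    }

    // Start the periodic reconciliation check.
    public void start(long periodMs) {
        scheduler.scheduleAtFixedRate(this::reconcile, periodMs, periodMs, TimeUnit.MILLISECONDS);
    }

    private void reconcile() {
        long now = System.currentTimeMillis();
        for (String memberId : topologyMembers) {
            Long lastSeen = lastHealthStat.get(memberId);
            if (lastSeen != null && now - lastSeen > faultTimeoutMs) {
                // A real implementation would de-duplicate and publish once per fault.
                publishMemberFault(memberId);
            }
        }
    }

    // Placeholder: in Stratos this would publish a MemberFault event to the MB.
    private void publishMemberFault(String memberId) {
        System.out.println("Member appears faulty, publishing MemberFault: " + memberId);
    }

    public static void main(String[] args) throws InterruptedException {
        // Tiny demo: neither member reports anything after t=0, so once the
        // 1s timeout expires both are reported as faulty on each check.
        MemberReconciliationTask task = new MemberReconciliationTask(1_000);
        task.onCompleteTopology(Set.of("member-1", "member-2"));
        task.onHealthStat("member-1");
        task.start(500);
        Thread.sleep(2_000);
        task.scheduler.shutdownNow();
    }
}

In Stratos terms, onCompleteTopology would be driven by the topology topic listener and publishMemberFault would publish the MemberFault event to the MB's summarized-health-stat topic; whether such a check belongs in the CEP window processor or in the auto-scaler is exactly the trade-off discussed in the thread above.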
