Hi Akila,

Would [1] also be solved by the solution we discussed here?
Thanks.

[1] https://issues.apache.org/jira/browse/STRATOS-795

On Thu, Aug 28, 2014 at 12:48 PM, Akila Ravihansa Perera <[email protected]> wrote:
> Hi,
>
> Since we're using WSO2 CEP for monitoring faulty members, it would make sense to enhance the Faulty Member window processor [1] to recover from a core component failure. I have made some improvements to this window processor and committed them in [2].
>
> CEP will now have an additional dependency on the Stratos messaging component (applicable only when using a stand-alone CEP). Therefore it can now listen to the topology topic events published by CC. CEP will check the cartridge agent health stats published by instances against the member list published by CC in the complete topology event. Thus, even if the MemberFault event is lost due to an MB failure, Stratos can recover itself since it periodically checks against the member list published by CC. The code has been rigorously tested on EC2 and OpenStack.
>
> The other possible alternative (as opposed to a dependency on the messaging component) would be to create a new JMS input adaptor in CEP and listen to the topology topic. But with this approach we would have to duplicate the messaging component model (topology structure) in the CEP window processor. This is an unnecessary duplication IMHO.
>
> However, with this dependency on the messaging component in CEP, a user deploying Stratos with a stand-alone CEP will have to manually copy the messaging component artifacts to the CEP plugins directory.
>
> Would appreciate your thoughts on this.
>
> [1] https://github.com/apache/stratos/blob/master/extensions/cep/stratos-cep-extension/src/main/java/org/apache/stratos/cep/extension/FaultHandlingWindowProcessor.java
> [2] https://github.com/apache/stratos/commit/05e1ddc20a871b73b721487a13a2547cf9b8768d
>
> Thanks.
>
> On Wed, Jul 30, 2014 at 7:32 PM, Udara Liyanage <[email protected]> wrote:
> > Hi Imesh,
> >
> > Yes, no message will be communicated when the message broker is not available.
> >
> > On Wed, Jul 30, 2014 at 7:24 PM, Imesh Gunaratne <[email protected]> wrote:
> >> As I understood, it's not just the MemberFault event that is affected in this scenario; any event that CEP publishes to the message broker will encounter the same problem.
> >>
> >> On Wed, Jul 30, 2014 at 5:49 AM, Michiel Blokzijl (mblokzij) <[email protected]> wrote:
> >>> +1.
> >>>
> >>> If Stratos, or any component it relies on, fails and eventually returns to service, Stratos should "orchestrate" the cloud back to the desired state. If any cartridges went missing and after some time T (post failure) Stratos hasn't re-discovered them, they should be respawned.
> >>>
> >>> Best regards,
> >>>
> >>> Michiel
> >>>
> >>> On 30 Jul 2014, at 05:51, Isuru Haththotuwa <[email protected]> wrote:
> >>>
> >>> On Wed, Jul 30, 2014 at 9:45 AM, Akila Ravihansa Perera <[email protected]> wrote:
> >>>> Hi Devs,
> >>>>
> >>>> The current Stratos architecture relies heavily on high availability of the message broker. We faced a situation where, when the MB is down, some of the published messages are lost forever and the system state is never recovered.
> >>>>
> >>>> One such example: when a cartridge instance goes down, the CEP component will identify this event and publish a MemberFault event to the MB's summarized-health-stat topic. But the problem is that the CEP component creates its own list of cartridge instance members by looking at the health stats published to the MB - it does not consider the topology. Hence, when a cartridge instance goes down, the MemberFault event will get fired only once. If the MB is down at that time, the message is lost forever, resulting in an unstable system state in which Stratos thinks a member exists when in reality it does not.
> >>>>
> >>>> We can introduce a simple housekeeping task to check whether every member is alive. Ideally this should be the auto-scaler's responsibility. It would allow the system to recover itself from an unstable situation. I think this is a critical bug and should be given high priority.
> >>>>
> >>>> Please share your thoughts.
> >>>
> >>> +1. We would need to decide what the best method for this is, though. If we consider CEP the central point of decision making, another option is to make it listen to the topology and take the correct decision. Or else, we can use a health check mechanism for the MB which can detect if the MB is down and replay any of the messages. This IMO can be very useful since the primary communication mechanism in Stratos is the MB.
> >>>
> >>> One other important thing is to have fail-over/HA for the MB. There can be many other occasions where, if the MB is down, the system goes into an undefined state due to loss of messages.
> >>>>
> >>>> --
> >>>> Akila Ravihansa Perera
> >>>> Software Engineer
> >>>> WSO2 Inc.
> >>>> http://wso2.com
> >>>>
> >>>> Blog: http://ravihansa3000.blogspot.com
> >>>>
> >>>> --
> >>>> Thanks and Regards,
> >>>>
> >>>> Isuru H.
> >>>> +94 716 358 048
> >>>
> >>
> >> --
> >> Imesh Gunaratne
> >>
> >> Technical Lead, WSO2
> >> Committer & PPMC Member, Apache Stratos
> >
> >
> > --
> > Udara Liyanage
> > Software Engineer
> > WSO2, Inc.: http://wso2.com
> > lean. enterprise. middleware
> >
> > web: http://udaraliyanage.wordpress.com
> > phone: +94 71 443 6897
>
> --
> Akila Ravihansa Perera
> WSO2 Inc
>
> Blog: http://ravihansa3000.blogspot.com

--
Lahiru Sandaruwan
Committer and PMC member, Apache Stratos,
Senior Software Engineer,
WSO2 Inc., http://wso2.com
lean.enterprise.middleware

email: [email protected]
cell: (+94) 773 325 954
blog: http://lahiruwrites.blogspot.com/
twitter: http://twitter.com/lahirus
linked-in: http://lk.linkedin.com/pub/lahiru-sandaruwan/16/153/146
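To make the reconciliation idea in this thread concrete, below is a minimal, hypothetical Java sketch of the approach described above: keep the member list from the complete topology event published by CC, record the last health stat seen per member, and periodically flag members that have gone silent. The class and method names (MemberReconciliationTask, onCompleteTopology, onHealthStat, publishMemberFault) and the timeout values are assumptions for illustration only; this is not the actual FaultHandlingWindowProcessor or auto-scaler code.

import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical sketch of the reconciliation idea discussed in this thread:
 * periodically compare the member list published by CC (via the topology
 * topic) against the latest health stats received from cartridge agents,
 * and raise a fault for members that have gone silent, so a lost
 * MemberFault event does not leave the system in an unstable state.
 */
public class MemberReconciliationTask {

    // memberId -> timestamp (ms) of the last health stat received from that member
    private final Map<String, Long> lastHealthStat = new ConcurrentHashMap<>();

    // memberIds currently present in the topology, as published by CC
    private final Set<String> topologyMembers = ConcurrentHashMap.newKeySet();

    private final long faultTimeoutMs;
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    public MemberReconciliationTask(long faultTimeoutMs) {
        this.faultTimeoutMs = faultTimeoutMs;
    }

    // Called when a complete topology event arrives: refresh the known member list.
    public void onCompleteTopology(Set<String> memberIds) {
        topologyMembers.retainAll(memberIds);
        topologyMembers.addAll(memberIds);
        // Give newly seen members a grace period before they can be flagged.
        long now = System.currentTimeMillis();
        for (String id : memberIds) {
            lastHealthStat.putIfAbsent(id, now);
        }
    }

    // Called for every health stat published by a cartridge agent.
    public void onHealthStat(String memberId) {
        lastHealthStat.put(memberId, System.currentTimeMillis());
    }

    // Start the periodic reconciliation check.
    public void start(long periodMs) {
        scheduler.scheduleAtFixedRate(this::reconcile, periodMs, periodMs, TimeUnit.MILLISECONDS);
    }

    private void reconcile() {
        long now = System.currentTimeMillis();
        for (String memberId : topologyMembers) {
            Long lastSeen = lastHealthStat.get(memberId);
            if (lastSeen != null && now - lastSeen > faultTimeoutMs) {
                // A real implementation would de-duplicate and publish once per fault.
                publishMemberFault(memberId);
            }
        }
    }

    // Placeholder: in Stratos this would publish a MemberFault event to the MB.
    private void publishMemberFault(String memberId) {
        System.out.println("Member appears faulty, publishing MemberFault: " + memberId);
    }

    public static void main(String[] args) throws InterruptedException {
        // Tiny demo: neither member reports anything after t=0, so once the
        // 1s timeout expires both are reported as faulty on each check.
        MemberReconciliationTask task = new MemberReconciliationTask(1_000);
        task.onCompleteTopology(Set.of("member-1", "member-2"));
        task.onHealthStat("member-1");
        task.start(500);
        Thread.sleep(2_000);
        task.scheduler.shutdownNow();
    }
}

In Stratos terms, onCompleteTopology would be driven by the topology topic listener and publishMemberFault would publish the MemberFault event to the MB's summarized-health-stat topic; whether such a check belongs in the CEP window processor or in the auto-scaler is exactly the trade-off discussed in the thread above.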
