Re: MemberFault event is lost forever when MB is down

Michiel Blokzijl (mblokzij) Thu, 18 Sep 2014 04:56:12 -0700

Hi,

I’m guessing the fix for [1] will be in 4.1.0, right? I’m glad you managed to 
resolve both issues with 1 fix!


Thanks and best regards,

Michiel


[1] https://issues.apache.org/jira/browse/STRATOS-795

On 12 Sep 2014, at 15:57, Lahiru Sandaruwan <[email protected]> wrote:

> OK, cool. Let's resolve the Jira.
> 
> On Fri, Sep 12, 2014 at 5:51 PM, Akila Ravihansa Perera <[email protected]> 
> wrote:
> Hi Lahiru,
> 
> Yes, this is resolved now. Stratos will now check health stats against
> the member list published by CC to the topology topic (CompleteTopology
> event). This will allow Stratos to recover from MB failures and also
> from server-unavailable situations.
> 
> Thanks.
> 
> On Fri, Sep 12, 2014 at 5:03 PM, Lahiru Sandaruwan <[email protected]> wrote:
> > Hi Akila,
> >
> > Would [1] also be solved by the solution we discussed here?
> >
> > Thanks.
> > [1] https://issues.apache.org/jira/browse/STRATOS-795
> >
> > On Thu, Aug 28, 2014 at 12:48 PM, Akila Ravihansa Perera
> > <[email protected]> wrote:
> >>
> >> Hi,
> >>
> >> Since we're using WSO2 CEP for monitoring faulty members, it would
> >> make sense to enhance the Faulty Member window processor [1] to
> >> recover from a core component failure. I have made some improvements
> >> to this window processor and committed them in [2].
> >>
> >> CEP will now have an additional dependency on the Stratos messaging
> >> component (applicable only when using a stand-alone CEP). Therefore it
> >> can now listen to the topology topic events published by CC. CEP will
> >> now check the cartridge agent health stats published by instances
> >> against the member list published by CC in the complete topology
> >> event. Thus, even if the MemberFault event is lost because of an MB
> >> failure, Stratos can recover by itself, since it will periodically
> >> check against the member list published by CC. The code has been
> >> rigorously tested on EC2 and OpenStack.
> >>
> >> The other possible alternative (as opposed to a dependency on the
> >> messaging component) would be to create a new JMS input adaptor in CEP
> >> and listen to the topology topic. But with this approach we would have
> >> to duplicate the messaging component model (topology structure) in the
> >> CEP window processor. This is an unnecessary duplication IMHO.
> >>
> >> However, with this dependency on the messaging component in CEP, a
> >> user deploying Stratos with a stand-alone CEP will have to manually
> >> copy the messaging component artifacts to the CEP plugins directory.
> >>
> >> Would appreciate your thoughts on this.
> >>
> >> [1]
> >> https://github.com/apache/stratos/blob/master/extensions/cep/stratos-cep-extension/src/main/java/org/apache/stratos/cep/extension/FaultHandlingWindowProcessor.java
> >> [2]
> >> https://github.com/apache/stratos/commit/05e1ddc20a871b73b721487a13a2547cf9b8768d
> >>
> >> Thanks.
> >>
> >> On Wed, Jul 30, 2014 at 7:32 PM, Udara Liyanage <[email protected]> wrote:
> >> > Hi Imesh,
> >> >
> >> > Yes, no message will be communicated while the message broker is
> >> > unavailable.
> >> >
> >> >
> >> > On Wed, Jul 30, 2014 at 7:24 PM, Imesh Gunaratne <[email protected]>
> >> > wrote:
> >> >>
> >> >> As I understand it, it's not just the Member Fault event that is
> >> >> affected in this scenario; any event that CEP publishes to the
> >> >> message broker will encounter the same problem.
> >> >>
> >> >>
> >> >> On Wed, Jul 30, 2014 at 5:49 AM, Michiel Blokzijl (mblokzij)
> >> >> <[email protected]> wrote:
> >> >>>
> >> >>> +1.
> >> >>>
> >> >>> If Stratos, or any component it relies on, fails and eventually
> >> >>> returns to service, Stratos should "orchestrate" the cloud back to
> >> >>> the desired state. If any cartridges went missing and after some
> >> >>> time T (post failure) Stratos hasn’t re-discovered them, they
> >> >>> should be respawned.
> >> >>>
> >> >>> Best regards,
> >> >>>
> >> >>> Michiel
> >> >>>
> >> >>>
> >> >>> On 30 Jul 2014, at 05:51, Isuru Haththotuwa <[email protected]> wrote:
> >> >>>
> >> >>>
> >> >>>
> >> >>>
> >> >>> On Wed, Jul 30, 2014 at 9:45 AM, Akila Ravihansa Perera
> >> >>> <[email protected]> wrote:
> >> >>>>
> >> >>>> Hi Devs,
> >> >>>>
> >> >>>> The current Stratos architecture relies heavily on high availability
> >> >>>> of the message broker. We faced a situation where, when the MB is
> >> >>>> down, some of the published messages are lost forever and the
> >> >>>> system state is never recovered.
> >> >>>>
> >> >>>> One such example: when a cartridge instance goes down, the CEP
> >> >>>> component will identify this event and publish a MemberFault event
> >> >>>> to the MB's summarized-health-stat topic. But the problem is that
> >> >>>> the CEP component creates its own list of cartridge instance
> >> >>>> members by looking at the health stats published to the MB - it
> >> >>>> does not consider the topology. Hence, when a cartridge instance
> >> >>>> goes down, the MemberFault event will get fired only once. If the
> >> >>>> MB is down at that time, the message will be lost forever,
> >> >>>> resulting in an unstable system state in which Stratos thinks a
> >> >>>> member exists when in reality it does not.
> >> >>>>
> >> >>>> We can introduce a simple housekeeping task to check whether every
> >> >>>> member is alive. Ideally this should be the auto-scaler's
> >> >>>> responsibility. It will allow the system to recover by itself from
> >> >>>> an unstable situation. I think this is a critical bug and should be
> >> >>>> given high priority.
> >> >>>>
> >> >>>> Please share your thoughts.
> >> >>>
> >> >>> +1. We would need to decide what the best method for this is,
> >> >>> though. If we consider CEP the central point of decision making,
> >> >>> another option is to make it listen to the topology and make the
> >> >>> correct decision. Or else, we can use a health check mechanism for
> >> >>> the MB which can detect if the MB is down and replay any of the
> >> >>> messages. This IMO can be very useful since the primary
> >> >>> communication mechanism in Stratos is the MB.
> >> >>>
> >> >>> One other important thing is to have fail-over/HA for the MB. There
> >> >>> can be many other occasions where, if the MB is down, the system
> >> >>> goes into an undefined state due to loss of messages.
> >> >>>>
> >> >>>>
> >> >>>> --
> >> >>>> Akila Ravihansa Perera
> >> >>>> Software Engineer
> >> >>>> WSO2 Inc.
> >> >>>> http://wso2.com
> >> >>>>
> >> >>>> Blog: http://ravihansa3000.blogspot.com
> >> >>>>
> >> >>>> --
> >> >>>> Thanks and Regards,
> >> >>>>
> >> >>>> Isuru H.
> >> >>>> +94 716 358 048
> >> >>>>
> >> >>>>
> >> >>>>
> >> >>>>
> >> >>>>
> >> >>>
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> Imesh Gunaratne
> >> >>
> >> >> Technical Lead, WSO2
> >> >> Committer & PPMC Member, Apache Stratos
> >> >
> >> >
> >> >
> >> >
> >> > --
> >> >
> >> > Udara Liyanage
> >> > Software Engineer
> >> > WSO2, Inc.: http://wso2.com
> >> > lean. enterprise. middleware
> >> >
> >> > web: http://udaraliyanage.wordpress.com
> >> > phone: +94 71 443 6897
> >>
> >>
> >>
> >> --
> >> Akila Ravihansa Perera
> >> WSO2 Inc
> >>
> >> Blog: http://ravihansa3000.blogspot.com
> >
> >
> >
> >
> > --
> > --
> > Lahiru Sandaruwan
> > Committer and PMC member, Apache Stratos,
> > Senior Software Engineer,
> > WSO2 Inc., http://wso2.com
> > lean.enterprise.middleware
> >
> > email: [email protected] cell: (+94) 773 325 954
> > blog: http://lahiruwrites.blogspot.com/
> > twitter: http://twitter.com/lahirus
> > linked-in: http://lk.linkedin.com/pub/lahiru-sandaruwan/16/153/146
> >
> 
> 
> 
> --
> Akila Ravihansa Perera
> Software Engineer, WSO2
> 
> Blog: http://ravihansa3000.blogspot.com
> 
> 
> 
> -- 
> --
> Lahiru Sandaruwan
> Committer and PMC member, Apache Stratos,
> Senior Software Engineer,
> WSO2 Inc., http://wso2.com
> lean.enterprise.middleware
> 
> email: [email protected] cell: (+94) 773 325 954
> blog: http://lahiruwrites.blogspot.com/
> twitter: http://twitter.com/lahirus
> linked-in: http://lk.linkedin.com/pub/lahiru-sandaruwan/16/153/146
>
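For readers following the thread, the recovery idea discussed above (checking
health stats against the member list that CC publishes in the complete
topology event, and repeating that check periodically) can be sketched
roughly as below. This is only an illustration under stated assumptions, not
the code in FaultHandlingWindowProcessor: the class name, the method names,
and the 60-second timeout are all hypothetical.

```python
import time
from dataclasses import dataclass, field


@dataclass
class FaultDetector:
    """Hypothetical sketch; names and timeout are assumptions, not Stratos code."""
    timeout_s: float = 60.0                       # assumed health-stat timeout
    topology_members: set = field(default_factory=set)
    last_seen: dict = field(default_factory=dict)

    def on_complete_topology(self, member_ids, now=None):
        """Replace the known member list with the one published by CC."""
        now = time.time() if now is None else now
        self.topology_members = set(member_ids)
        for member_id in self.topology_members:
            # Members seen for the first time get a grace period from 'now'.
            self.last_seen.setdefault(member_id, now)
        for member_id in list(self.last_seen):
            # Forget members that CC no longer lists in the topology.
            if member_id not in self.topology_members:
                del self.last_seen[member_id]

    def on_health_stat(self, member_id, now=None):
        """Record that a cartridge agent published a health stat."""
        now = time.time() if now is None else now
        self.last_seen[member_id] = now

    def faulty_members(self, now=None):
        """Members in the topology with no health stat within timeout_s."""
        now = time.time() if now is None else now
        return sorted(
            m for m in self.topology_members
            if now - self.last_seen.get(m, 0.0) > self.timeout_s
        )
```

Because faulty_members() is recomputed from the topology on every periodic
pass, a member whose fault notification was lost while the MB was down is
reported again on the next pass, until the topology no longer lists it.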
