+1. If Stratos, or any component it relies on, fails, and eventually returns to service, Stratos should "orchestrate" the cloud back to the desired state. If any cartridges went missing and after some time T (post failure) Stratos hasn’t re-discovered them, they should be respawned.
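The recovery rule above could be sketched roughly as follows. This is only an illustration and not Stratos code: the `Reconciler` class, its method names, and the member ids are hypothetical, and the grace period stands in for the "time T" mentioned above. It assumes the desired state is a plain list of expected member ids and that re-discovery means receiving a health stat for a member.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass
class Reconciler:
    """Hypothetical helper: remembers when each expected member was last seen."""

    grace_period: float  # the "time T" after a failure, in seconds (assumed unit)
    last_seen: Dict[str, float] = field(default_factory=dict)

    def start(self, expected: Iterable[str], now: float) -> None:
        # After a restart nothing has been re-discovered yet, so every expected
        # member's clock starts at `now` instead of being treated as missing.
        for member_id in expected:
            self.last_seen.setdefault(member_id, now)

    def record_health_stat(self, member_id: str, now: float) -> None:
        # Called whenever a health stat for a member arrives.
        self.last_seen[member_id] = now

    def members_to_respawn(self, expected: Iterable[str], now: float) -> List[str]:
        # A member is flagged only if it is part of the desired state and has
        # not been seen for longer than the grace period.
        missing = []
        for member_id in expected:
            seen = self.last_seen.setdefault(member_id, now)
            if now - seen > self.grace_period:
                missing.append(member_id)
        return sorted(missing)


# Usage: two members are expected, only one reports back after the failure.
reconciler = Reconciler(grace_period=60.0)
desired = ["member-a", "member-b"]
reconciler.start(desired, now=0.0)
reconciler.record_health_stat("member-a", now=50.0)
to_respawn = reconciler.members_to_respawn(desired, now=61.0)
```

The point of driving the check from the desired state, rather than from whatever health stats happen to arrive, is that a member that went silent while messages were being lost still gets noticed.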
Best regards,
Michiel

On 30 Jul 2014, at 05:51, Isuru Haththotuwa <isu...@apache.org> wrote:

> On Wed, Jul 30, 2014 at 9:45 AM, Akila Ravihansa Perera <raviha...@wso2.com> wrote:
>> Hi Devs,
>>
>> The current Stratos architecture relies heavily on high availability of
>> the message broker (MB). We faced a situation where, when the MB is down,
>> some of the published messages are lost forever and the system state
>> never recovers.
>>
>> One such example: when a cartridge instance goes down, the CEP
>> component identifies this event and publishes a MemberFault event to
>> the MB's summarized-health-stat topic. The problem is that the CEP
>> component creates its own list of cartridge instance members by
>> looking at the health stats published to the MB; it does not consider
>> the topology. Hence, when a cartridge instance goes down, the MemberFault
>> event is fired only once. If the MB is down at that time, the message
>> is lost forever, resulting in an unstable system state in which
>> Stratos thinks a member exists when in reality it does not.
>>
>> We can introduce a simple housekeeping task to check whether every
>> member is alive. Ideally this should be the auto-scaler's responsibility.
>> It will allow the system to recover from an unstable situation.
>> I think this is a critical bug and should be given high priority.
>>
>> Please share your thoughts.
>
> +1. We would need to decide on the best method for this, though. If we
> consider CEP the central point of decision making, another option is to make
> it listen to the topology and reach the correct decision. Or else, we can use a
> health check mechanism for the MB which can detect if the MB is down and
> replay any of the messages. This IMO can be very useful since the primary
> communication mechanism in Stratos is the MB.
>
> One other important thing is to have fail-over/HA for the MB. There can be
> many other occasions where, if the MB is down, the system goes to an
> undefined state due to loss of messages.
>
>> --
>> Akila Ravihansa Perera
>> Software Engineer
>> WSO2 Inc.
>> http://wso2.com
>>
>> Blog: http://ravihansa3000.blogspot.com
>
> --
> Thanks and Regards,
>
> Isuru H.
> +94 716 358 048