Hi Imesh,

Yes, no message of any kind will be communicated while the message broker is unavailable.
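To make the replay idea Isuru raises in the quoted thread below a bit more concrete, here is a rough sketch of a publisher that holds events while the MB is down and replays them in order once it is reachable again. Everything in it is an assumption for illustration: `Broker`, `BufferingPublisher` and `InMemoryBroker` are hypothetical names, not existing Stratos or CEP classes, and a real version would sit on top of whatever client the publishing component already uses.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical abstraction over the message broker; not a Stratos API.
interface Broker {
    boolean isAvailable();
    void publish(String topic, String payload);
}

// Hypothetical publisher that keeps events while the broker is down
// and replays them, oldest first, once the broker is reachable again.
class BufferingPublisher {
    private static final class Pending {
        final String topic;
        final String payload;

        Pending(String topic, String payload) {
            this.topic = topic;
            this.payload = payload;
        }
    }

    private final Broker broker;
    private final Deque<Pending> pending = new ArrayDeque<>();

    BufferingPublisher(Broker broker) {
        this.broker = broker;
    }

    // Queue the event, then try to send everything that is queued.
    synchronized void publish(String topic, String payload) {
        pending.addLast(new Pending(topic, payload));
        flush();
    }

    // Meant to be called periodically (for example from a scheduled MB
    // health check). Returns the number of events sent in this call.
    synchronized int flush() {
        int sent = 0;
        while (!pending.isEmpty() && broker.isAvailable()) {
            Pending next = pending.peekFirst();
            broker.publish(next.topic, next.payload);
            // Dequeue only after publish returns, so an exception
            // thrown by the broker leaves the event in the backlog.
            pending.removeFirst();
            sent++;
        }
        return sent;
    }

    synchronized int backlogSize() {
        return pending.size();
    }
}

// In-memory stand-in for a real broker connection, for illustration only.
class InMemoryBroker implements Broker {
    boolean up = true;
    final List<String> delivered = new ArrayList<>();

    public boolean isAvailable() {
        return up;
    }

    public void publish(String topic, String payload) {
        delivered.add(topic + ":" + payload);
    }
}
```

With something like this, a MemberFault event raised while the MB is down would stay in the backlog and go out on the first `flush()` after the MB comes back. The backlog lives in memory only, so it would not survive a restart of the publishing component itself; that case would still need the housekeeping check or MB fail-over discussed below.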
On Wed, Jul 30, 2014 at 7:24 PM, Imesh Gunaratne <im...@apache.org> wrote:

> As I understand it, it's not just the Member Fault event that is affected
> in this scenario; any event that CEP publishes to the message broker will
> encounter the same problem.
>
> On Wed, Jul 30, 2014 at 5:49 AM, Michiel Blokzijl (mblokzij) <
> mblok...@cisco.com> wrote:
>
>> +1.
>>
>> If Stratos, or any component it relies on, fails, and eventually returns
>> to service, Stratos should "orchestrate" the cloud back to the desired
>> state. If any cartridges went missing and after some time T (post failure)
>> Stratos hasn’t re-discovered them, they should be respawned.
>>
>> Best regards,
>>
>> Michiel
>>
>> On 30 Jul 2014, at 05:51, Isuru Haththotuwa <isu...@apache.org> wrote:
>>
>> On Wed, Jul 30, 2014 at 9:45 AM, Akila Ravihansa Perera <
>> raviha...@wso2.com> wrote:
>>
>>> Hi Devs,
>>>
>>> The current Stratos architecture relies heavily on high availability of
>>> the message broker. We faced a situation where, when the MB is down,
>>> some of the published messages are lost forever and the system state
>>> never recovers.
>>>
>>> One such example: when a cartridge instance goes down, the CEP
>>> component identifies this event and publishes a MemberFault event to
>>> the MB's summarized-health-stat topic. The problem is that the CEP
>>> component builds its own list of cartridge instance members by
>>> looking at the health stats published to the MB; it does not consider
>>> the topology. Hence, when a cartridge instance goes down, the
>>> MemberFault event is fired only once. If the MB is down at that time,
>>> the message is lost forever, resulting in an unstable system state in
>>> which Stratos thinks a member exists when in reality it does not.
>>>
>>> We can introduce a simple housekeeping task to check whether every
>>> member is alive. Ideally this should be the auto-scaler's
>>> responsibility. It will allow the system to recover itself from an
>>> unstable situation. I think this is a critical bug and should be given
>>> high priority.
>>>
>>> Please share your thoughts.
>>>
>> +1. We would need to decide what the best method for this is, though. If
>> we consider CEP the central point of decision making, another option is to
>> make it listen to the topology and reach the correct decision. Or else, we
>> can use a health check mechanism for the MB which can detect if the MB is
>> down and replay any of the messages. This IMO can be very useful since the
>> primary communication mechanism in Stratos is the MB.
>>
>> One other important thing is to have fail-over/HA for the MB. There can be
>> many other occasions where, if the MB is down, the system goes into an
>> undefined state due to loss of messages.
>>
>>>
>>> --
>>> Akila Ravihansa Perera
>>> Software Engineer
>>> WSO2 Inc.
>>> http://wso2.com
>>>
>>> Blog: http://ravihansa3000.blogspot.com
>>>
>>> --
>>> Thanks and Regards,
>>>
>>> Isuru H.
>>> +94 716 358 048
>>>
>>
>
> --
> Imesh Gunaratne
>
> Technical Lead, WSO2
> Committer & PPMC Member, Apache Stratos
>

--
Udara Liyanage
Software Engineer
WSO2, Inc.: http://wso2.com
lean. enterprise. middleware

web: http://udaraliyanage.wordpress.com
phone: +94 71 443 6897