+1.

If Stratos, or any component it relies on, fails and later returns to 
service, Stratos should "orchestrate" the cloud back to the desired state. If 
any cartridges went missing and Stratos has not re-discovered them within some 
time T after the failure, they should be respawned.
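
To make the "time T" idea concrete, here is a rough sketch of such a
reconciliation pass. It is only an illustration, not Stratos code: the
Reconciler class, grace_seconds (standing in for T) and the plain member
ids are all made-up names, and the caller is assumed to supply the
desired and observed member sets.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Reconciler:
    """Hypothetical helper: respawn members missing for longer than T."""

    grace_seconds: float  # the "time T" after which a missing member is respawned
    missing_since: dict = field(default_factory=dict)  # member id -> first time seen missing

    def reconcile(self, desired, observed, now=None):
        """Return the member ids that should be respawned on this pass."""
        now = time.monotonic() if now is None else now
        desired, observed = set(desired), set(observed)

        # Members that were re-discovered (or are no longer desired) stop being tracked.
        for member in list(self.missing_since):
            if member in observed or member not in desired:
                del self.missing_since[member]

        to_respawn = []
        for member in sorted(desired - observed):
            first_missing = self.missing_since.setdefault(member, now)
            if now - first_missing >= self.grace_seconds:
                to_respawn.append(member)
                # Restart the clock so the respawned member gets a fresh grace period.
                self.missing_since[member] = now
        return to_respawn
```

Run periodically, reconcile() leaves a missing member alone until it has
been absent for the whole grace period, so a member that is simply slow
to be re-discovered after a restart is not respawned needlessly.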

Best regards,

Michiel

On 30 Jul 2014, at 05:51, Isuru Haththotuwa <isu...@apache.org> wrote:

> 
> On Wed, Jul 30, 2014 at 9:45 AM, Akila Ravihansa Perera <raviha...@wso2.com> 
> wrote:
> Hi Devs,
> 
> The current Stratos architecture relies heavily on high availability
> of the message broker (MB). We ran into a situation where, when the
> MB is down, some published messages are lost forever and the system
> state never recovers.
> 
> One example: when a cartridge instance goes down, the CEP component
> detects this and publishes a MemberFault event to the MB's
> summarized-health-stat topic. The problem is that the CEP component
> builds its own list of cartridge instance members from the
> health stats published to the MB; it does not consult the topology.
> Hence, when a cartridge instance goes down, the MemberFault event is
> fired only once. If the MB is down at that moment, the message is
> lost forever, leaving the system in an unstable state in which
> Stratos thinks a member exists when in reality it does not.
> 
> We can introduce a simple housekeeping task to check whether every
> member is alive. Ideally this should be the auto-scaler's
> responsibility. It would allow the system to recover from an
> unstable state on its own. I think this is a critical bug and should
> be given high priority.
> 
> Please share your thoughts.
> 
> +1. We would need to decide on the best method for this, though. If we 
> consider CEP the central point of decision making, another option is to 
> make it listen to the topology so that it reaches the correct decision. 
> Alternatively, we can use a health check mechanism for the MB that detects 
> when the MB is down and replays any of the messages. IMO this can be very 
> useful, since the primary communication mechanism in Stratos is the MB.
> 
> Another important thing is to have fail-over/HA for the MB. There can be 
> many other occasions where, if the MB is down, the system ends up in an 
> undefined state due to loss of messages.
> 
> --
> Akila Ravihansa Perera
> Software Engineer
> WSO2 Inc.
> http://wso2.com
> 
> Blog: http://ravihansa3000.blogspot.com
> 
> -- 
> Thanks and Regards,
> 
> Isuru H.
> +94 716 358 048
> 
> 
> 
> 
> 
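
The "detect that the MB is down and replay" option quoted above could
look roughly like the sketch below. This is an assumption-laden
illustration rather than anything in Stratos: BufferingPublisher and the
send callable are hypothetical, send is assumed to raise ConnectionError
while the broker is unreachable, and the buffer lives in memory only.

```python
from collections import deque


class BufferingPublisher:
    """Hypothetical wrapper: buffer events while the broker is down, replay in order later."""

    def __init__(self, send, max_buffered=1000):
        # send is an assumed callable(topic, payload) that raises
        # ConnectionError while the broker is unreachable.
        self._send = send
        # Bounded buffer: once full, the oldest events are dropped.
        self._pending = deque(maxlen=max_buffered)

    def publish(self, topic, payload):
        """Queue the event, then try to deliver everything that is pending."""
        self._pending.append((topic, payload))
        return self.flush()

    def flush(self):
        """Deliver buffered events in order; return True if the buffer is now empty."""
        while self._pending:
            topic, payload = self._pending[0]
            try:
                self._send(topic, payload)
            except ConnectionError:
                return False  # broker still down; keep the event for the next attempt
            self._pending.popleft()
        return True

    @property
    def pending(self):
        """Number of events still waiting for delivery."""
        return len(self._pending)
```

With something like this, a MemberFault event published while the MB is
down stays queued and goes out on the next successful flush(). It does
not survive a crash of the publisher itself, so it complements, rather
than replaces, the housekeeping check and MB fail-over discussed in the
thread.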
