Hi Devs,

The current Stratos architecture relies heavily on high availability
of the message broker (MB). We ran into a situation where, while the
MB is down, some of the published messages are lost forever and the
system state is never recovered.

One such example: when a cartridge instance goes down, the CEP
component detects this and publishes a MemberFault event to the MB's
summarized-health-stat topic. The problem is that the CEP component
builds its own list of cartridge instance members from the health
stats published to the MB; it does not consider the topology. Hence,
when a cartridge instance goes down, the MemberFault event is fired
only once. If the MB is down at that time, the message is lost
forever, leaving the system in an unstable state in which Stratos
thinks a member exists when in reality it does not.

We can introduce a simple housekeeping task that periodically checks
whether every member is alive. Ideally this should be the
auto-scaler's responsibility. It would allow the system to recover by
itself from an unstable situation. I think this is a critical bug and
should be given high priority.
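
To make the idea concrete, here is a rough sketch of what such a
housekeeping task could look like. It is only an illustration, not
existing Stratos code: the class name MemberHousekeeper, the track()
and sweep() methods, the liveness probe, and the failure threshold are
all hypothetical. The assumption is that member IDs come from the
topology and that liveness can be probed directly, without going
through the MB.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Predicate;

/**
 * Hypothetical housekeeping task (not part of Stratos): checks every known
 * member with a direct liveness probe so that a fault is still detected
 * even if the one-off MemberFault message was lost while the MB was down.
 */
class MemberHousekeeper {
    // Member ID (assumed to come from the topology) -> consecutive failed probes.
    private final Map<String, Integer> failedProbes = new ConcurrentHashMap<>();
    // Assumption: some direct check, e.g. a ping or an IaaS status lookup.
    private final Predicate<String> livenessProbe;
    // Failed sweeps tolerated before a member is reported as faulty.
    private final int maxFailures;

    MemberHousekeeper(Predicate<String> livenessProbe, int maxFailures) {
        this.livenessProbe = livenessProbe;
        this.maxFailures = maxFailures;
    }

    /** Starts tracking a member; a member already tracked keeps its count. */
    void track(String memberId) {
        failedProbes.putIfAbsent(memberId, 0);
    }

    /** Runs one sweep and returns the members now considered faulty. */
    List<String> sweep() {
        List<String> faulty = new ArrayList<>();
        for (String memberId : failedProbes.keySet()) {
            if (livenessProbe.test(memberId)) {
                // Member answered: reset its failure count.
                failedProbes.computeIfPresent(memberId, (id, n) -> 0);
            } else if (failedProbes.merge(memberId, 1, Integer::sum) >= maxFailures) {
                // Too many missed probes: stop tracking and report the member.
                failedProbes.remove(memberId);
                faulty.add(memberId);
            }
        }
        return faulty;
    }
}
```

The caller would run sweep() periodically, for example from a
ScheduledExecutorService, and handle each returned member ID the same
way a MemberFault event is handled today. Requiring several
consecutive failures is a design choice to avoid removing a member
because of a single missed probe.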

Please share your thoughts.

-- 
Akila Ravihansa Perera
Software Engineer
WSO2 Inc.
http://wso2.com

Blog: http://ravihansa3000.blogspot.com
