Hi Michiel,

Thanks for bringing this up. Please find my comments inline.
On Mon, Jun 23, 2014 at 9:53 PM, Michiel Blokzijl (mblokzij) <[email protected]> wrote:

> Hi all,
>
> Basically, I was stopping and starting Stratos and looking at how it
> handled dying cartridges, and found that Stratos only detected cartridge
> deaths while it was running..
>
> *The problem*
> In steady state, I have some cartridges managed by Stratos,
>
> ./stratos.sh list-subscribed-cartridges | grep samp
> | cisco-sample-vm | cisco-sample-vm | 1 | Single-Tenant | cisco-sample-vm | Active | 1 | cisco-sample-vm.foo.cisco.com |
>
> nova list | grep samp
> | 2b50cc6c-37b1-42d4-9277-e7b624d8b957 | cisco-samp-4a8 | ACTIVE | None | Running | core=172.16.2.17, 10.86.205.231 |
>
> All good. Now I stop Stratos and ActiveMQ, 'nova delete' the sample
> cartridge, and then start ActiveMQ and Stratos again.
>
> Now, at first things look good..:
>
> ./stratos.sh list-subscribed-cartridges | grep samp
> | cisco-sample-vm | cisco-sample-vm | 1 | Single-Tenant | cisco-sample-vm | Inactive | 0 | cisco-sample-vm.foo.cisco.com |
>
> But then,
>
> root@octl-01:/opt/wso2/apache-stratos-cli-4.0.0# ./stratos.sh list-subscribed-cartridges | grep samp
> | cisco-sample-vm | cisco-sample-vm | 1 | Single-Tenant | cisco-sample-vm | Active | 1 | cisco-sample-vm.foo.cisco.com |
>
> # nova list | grep samp
> #
>
> How did the cartridge become active without it actually being there? As
> far as I can tell, Stratos never recovers from this.

Let me explain the behavior here. On restart, Stratos recovers its data from the registry. Until that recovery completes, you won't be able to see the list of subscribed cartridges. AFAIK, once Stratos has loaded the data from the registry, the in-memory data model is updated only by topology events. We periodically write the in-memory model back to the registry in order to persist the data. So this state change could have been made by the cartridge agent. If it was not, then we will have to debug this further.
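To make the data flow above a little more concrete, here is a rough sketch in Python. It is not the Stratos code: the class, the method names and the member id are all made up for illustration, and the registry is modelled as a plain dictionary. It only shows the three steps I described: load from the registry on restart, change the in-memory model through topology events only, and write the model back periodically.

```python
class TopologyModel:
    """Toy in-memory model restored from a persisted snapshot (hypothetical)."""

    def __init__(self, registry):
        self.registry = registry        # stands in for the registry store
        self.members = dict(registry)   # in-memory copy loaded at restart

    def on_topology_event(self, member_id, status):
        # In this sketch, events are the only path that changes in-memory state.
        self.members[member_id] = status

    def persist(self):
        # Periodic write-back of the in-memory model to the registry.
        self.registry.clear()
        self.registry.update(self.members)


registry = {"member-1": "Inactive"}     # state saved before shutdown
model = TopologyModel(registry)         # restart: load from the registry

model.on_topology_event("member-1", "Active")   # some component publishes an event
model.persist()
print(registry)   # {'member-1': 'Active'}
```

In a model like this, nothing ever compares the stored state with what the IaaS actually reports, so whatever the last event said is what the CLI shows.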
I'm not sure whether https://issues.apache.org/jira/browse/STRATOS-685 has any impact here. If you can reproduce this consistently, could you please create a JIRA for it so that we can work on it?

> I found this bug here: https://issues.apache.org/jira/browse/STRATOS-234 -
> is this describing the issue I’m seeing? I was a little bit confused by the
> usage of the word “obsolete”.

I think Lahiru can add more on this.

> *Where to go next?*
> Now, I’ve done a little bit of digging, but I don’t yet have a full mental
> model of how everything fits together in Stratos - please could someone
> help me put the pieces together? :)
>
> What I’m seeing is the following:
> - The cluster monitor appears to be active:
>
> TID: [0] [STRATOS] [2014-06-23 10:12:39,994] DEBUG {org.apache.stratos.autoscaler.monitor.ClusterMonitor} - Cluster monitor is running.. ClusterMonitor [clusterId=cisco-sample-vm.cisco-sample-v, serviceId=cisco-sample-vm, deploymentPolicy=Deployment Policy [id]static-1 [partitions] [org.apache.stratos.cloud.controller.stub.deployment.partition.Partition@48cf06c0], autoscalePolicy=ASPolicy [id=economyPolicy, displayName=null, description=null], lbReferenceType=null] {org.apache.stratos.autoscaler.monitor.ClusterMonitor}
>
> - It looks like the CEP FaultHandlingWindowProcessor usually detects
> inactive members. However, since this member was never active, the
> timeStampMap doesn’t contain an element for this member, so it’s never
> checked.

For CEP's FaultHandlingWindowProcessor (the custom window used by the GradientOfHealthRequest execution plan) to detect an unhealthy member, the member must first have become active at least once. Only after that does the cartridge agent start sending events to CEP, so that CEP can keep track of the events and trigger the execution plan (GradientOfHealthRequest).
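A rough Python sketch of this idea, in case it helps. It is only an illustration and not the actual FaultHandlingWindowProcessor: the class name, the method names, the member ids and the timeout parameter are all assumptions of mine. It keeps a map from member to the time of its last event and reports members that have been silent for too long.

```python
class FaultDetector:
    """Toy model of a timestamp-map based fault window (hypothetical names)."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}   # member_id -> time of the last health event

    def on_health_event(self, member_id, now):
        # A member enters the map only when it sends its first event.
        self.last_seen[member_id] = now

    def find_faulty(self, now):
        # Report members that have been silent longer than the timeout,
        # and stop tracking them.
        faulty = [m for m, t in self.last_seen.items()
                  if now - t > self.timeout_s]
        for m in faulty:
            del self.last_seen[m]
        return faulty


detector = FaultDetector(timeout_s=60.0)
detector.on_health_event("member-a", now=0.0)
# "member-b" never sent an event, so it is never in the map.
print(detector.find_faulty(now=61.0))   # ['member-a']
```

A member that never published an event, such as "member-b" here, cannot be reported by this kind of check, which matches what you observed with the timeStampMap.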
During execution, if FaultHandlingWindowProcessor detects that no events have been received from a member for more than one minute, it identifies that member as inactive and puts it into the fault_message stream. Later on, FaultMessageEventFormatter in CEP reads this stream and publishes the message to the message broker. The autoscaler then receives it from the message broker and performs the necessary actions.

> - I think the fault handling is triggered by a fault_message, but I didn’t
> manage to figure out where it’s coming from. Does anyone know what triggers
> it? (is it the CEP extension?)

I hope the above explains what you have asked.

> Anyway..
>
> *Questions*
> - How should Stratos detect after some downtime which cartridges are still
> there and which ones aren’t? (what was the intended design?)

If the cartridge has become active, then it is CEP's responsibility to detect failures. Otherwise, if the instance has been unhealthy ever since it was spawned, the autoscaler will keep track of it and terminate it after some time.

> - Why did the missing cartridge go “active”? Is this a result from
> restoring persistent state? (If I look in the registry I can see stuff
> under subscriptions/active, but not sure if that’s where it comes from)

I hope the above explains this as well.

> - Who should be responsible for detecting the absence of an instance - the
> ClusterMonitor? That seems to be fed incorrect data, since it clearly
> thinks there are enough instances running. Which component has the
> necessary data?

The absence of an instance will be detected by CEP. ClusterMonitor always keeps track of whether the minimum number of instances is up and running, and whether the cluster needs to be scaled based on the stats received.

> - It looks like it’s possible to snapshot CEP state
> <http://stackoverflow.com/questions/20348326/wso2-cep-siddhi-how-to-make-time-windows-persistent>
> to make it semi-persistent.
> However, if I restarted Stratos after 2min
> downtime, wouldn’t it try to kill all the nodes since the last reply was
> more than 60s ago? Also, snapshots would be periodic, so there’s still a
> window in which cartridges might “disappear".

As you mentioned, we will have to configure CEP to persist its data across restarts. By default, the Stratos configuration doesn't enable persistence for CEP.

Thanks,
Reka

> Thanks a lot and best regards!
>
> Michiel

--
Reka Thirunavukkarasu
Senior Software Engineer,
WSO2, Inc.: http://wso2.com,
Mobile: +94776442007
