Hi all,

Basically, I was stopping and starting Stratos and looking at how it handled dying cartridges, and found that Stratos only detects cartridge deaths while it is running.
The problem

In steady state, I have some cartridges managed by Stratos:

  ./stratos.sh list-subscribed-cartridges | grep samp
  | cisco-sample-vm | cisco-sample-vm | 1 | Single-Tenant | cisco-sample-vm | Active | 1 | cisco-sample-vm.foo.cisco.com |

  nova list | grep samp
  | 2b50cc6c-37b1-42d4-9277-e7b624d8b957 | cisco-samp-4a8 | ACTIVE | None | Running | core=172.16.2.17, 10.86.205.231 |

All good. Now I stop Stratos and ActiveMQ, 'nova delete' the sample cartridge, and then start ActiveMQ and Stratos again. At first, things look good:

  ./stratos.sh list-subscribed-cartridges | grep samp
  | cisco-sample-vm | cisco-sample-vm | 1 | Single-Tenant | cisco-sample-vm | Inactive | 0 | cisco-sample-vm.foo.cisco.com |

But then:

  root@octl-01:/opt/wso2/apache-stratos-cli-4.0.0# ./stratos.sh list-subscribed-cartridges | grep samp
  | cisco-sample-vm | cisco-sample-vm | 1 | Single-Tenant | cisco-sample-vm | Active | 1 | cisco-sample-vm.foo.cisco.com |
  # nova list | grep samp
  #

How did the cartridge become active without actually being there? As far as I can tell, Stratos never recovers from this.

I found this bug: https://issues.apache.org/jira/browse/STRATOS-234 - is this describing the issue I'm seeing? I was a little confused by the use of the word "obsolete".

Where to go next?

Now, I've done a little bit of digging, but I don't yet have a full mental model of how everything fits together in Stratos - could someone please help me put the pieces together? :)

What I'm seeing is the following:

- The cluster monitor appears to be active:

  TID: [0] [STRATOS] [2014-06-23 10:12:39,994] DEBUG {org.apache.stratos.autoscaler.monitor.ClusterMonitor} - Cluster monitor is running..
  ClusterMonitor [clusterId=cisco-sample-vm.cisco-sample-v, serviceId=cisco-sample-vm, deploymentPolicy=Deployment Policy [id]static-1 [partitions] [org.apache.stratos.cloud.controller.stub.deployment.partition.Partition@48cf06c0], autoscalePolicy=ASPolicy [id=economyPolicy, displayName=null, description=null], lbReferenceType=null] {org.apache.stratos.autoscaler.monitor.ClusterMonitor}

- It looks like the CEP FaultHandlingWindowProcessor usually detects inactive members. However, since this member was never active, the timeStampMap doesn't contain an entry for it, so it's never checked.

- I think the fault handling is triggered by a fault_message, but I didn't manage to figure out where that comes from. Does anyone know what triggers it? (Is it the CEP extension?)

Anyway..

Questions

- How should Stratos detect, after some downtime, which cartridges are still there and which ones aren't? (What was the intended design?)

- Why did the missing cartridge go "active"? Is this a result of restoring persistent state? (If I look in the registry I can see entries under subscriptions/active, but I'm not sure whether that's where it comes from.)

- Who should be responsible for detecting the absence of an instance - the ClusterMonitor? It seems to be fed incorrect data, since it clearly thinks there are enough instances running. Which component has the necessary data?

- It looks like it's possible to snapshot CEP state to make it semi-persistent. However, if I restarted Stratos after 2 minutes of downtime, wouldn't it try to kill all the nodes, since the last reply was more than 60 seconds ago? Also, snapshots would be periodic, so there's still a window in which cartridges might "disappear".

Thanks a lot and best regards!

Michiel
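P.S. To make my timeStampMap theory concrete, here is a simplified sketch of the failure mode I suspect. This is my own illustration in Java, not Stratos's actual FaultHandlingWindowProcessor code; the class and method names, and the 60-second threshold, are assumptions on my part:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of timestamp-based fault detection. Members only
// enter timeStampMap when their first health event arrives, so a member
// that was deleted while the system was down never sends an event,
// never enters the map, and is therefore never flagged as faulty.
public class FaultWindowSketch {
    static final long TIMEOUT_MS = 60_000; // assumed 60s fault threshold

    final Map<String, Long> timeStampMap = new ConcurrentHashMap<>();

    // Called whenever a health-stat event arrives for a member.
    void onHealthEvent(String memberId, long nowMs) {
        timeStampMap.put(memberId, nowMs);
    }

    // Periodic sweep: reports members whose last event is older than
    // the timeout. Members absent from the map are silently skipped.
    List<String> findFaultyMembers(long nowMs) {
        List<String> faulty = new ArrayList<>();
        for (Map.Entry<String, Long> e : timeStampMap.entrySet()) {
            if (nowMs - e.getValue() > TIMEOUT_MS) {
                faulty.add(e.getKey());
            }
        }
        return faulty;
    }
}
```

If this matches how the real processor works, it would explain what I'm seeing: the sweep can only age out members it has heard from at least once, so the deleted cartridge escapes detection entirely.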