Hi all,

Basically, I was stopping and starting Stratos and looking at how it handled dying cartridges, and found that Stratos only detects cartridge deaths while it is running.
The problem

In steady state, I have some cartridges managed by Stratos:

  ./stratos.sh list-subscribed-cartridges | grep samp
  | cisco-sample-vm | cisco-sample-vm | 1 | Single-Tenant | cisco-sample-vm | Active | 1 | cisco-sample-vm.foo.cisco.com |

  nova list | grep samp
  | 2b50cc6c-37b1-42d4-9277-e7b624d8b957 | cisco-samp-4a8 | ACTIVE | None | Running | core=172.16.2.17, 10.86.205.231 |

All good. Now I stop Stratos and ActiveMQ, 'nova delete' the sample cartridge, and then start ActiveMQ and Stratos again. At first, things look good:

  ./stratos.sh list-subscribed-cartridges | grep samp
  | cisco-sample-vm | cisco-sample-vm | 1 | Single-Tenant | cisco-sample-vm | Inactive | 0 | cisco-sample-vm.foo.cisco.com |

But then:

  root@octl-01:/opt/wso2/apache-stratos-cli-4.0.0# ./stratos.sh list-subscribed-cartridges | grep samp
  | cisco-sample-vm | cisco-sample-vm | 1 | Single-Tenant | cisco-sample-vm | Active | 1 | cisco-sample-vm.foo.cisco.com |
  # nova list | grep samp
  #

How did the cartridge become active without actually being there? As far as I can tell, Stratos never recovers from this.

I found this bug: https://issues.apache.org/jira/browse/STRATOS-234 - is this describing the issue I'm seeing? I was a little confused by the use of the word "obsolete".

Where to go next?

Now, I've done a little bit of digging, but I don't yet have a full mental model of how everything fits together in Stratos - could someone please help me put the pieces together? :)

What I'm seeing is the following:

- The cluster monitor appears to be active:

  TID: [0] [STRATOS] [2014-06-23 10:12:39,994] DEBUG {org.apache.stratos.autoscaler.monitor.ClusterMonitor} - Cluster monitor is running..
  ClusterMonitor [clusterId=cisco-sample-vm.cisco-sample-v, serviceId=cisco-sample-vm, deploymentPolicy=Deployment Policy [id]static-1 [partitions] [org.apache.stratos.cloud.controller.stub.deployment.partition.Partition@48cf06c0], autoscalePolicy=ASPolicy [id=economyPolicy, displayName=null, description=null], lbReferenceType=null] {org.apache.stratos.autoscaler.monitor.ClusterMonitor}

- It looks like the CEP FaultHandlingWindowProcessor usually detects inactive members. However, since this member was never active, the timeStampMap doesn't contain an entry for it, so it's never checked.

- I think the fault handling is triggered by a fault_message, but I didn't manage to figure out where that comes from. Does anyone know what triggers it? (Is it the CEP extension?)

Anyway..

Questions

- How should Stratos detect, after some downtime, which cartridges are still there and which ones aren't? (What was the intended design?)

- Why did the missing cartridge go "active"? Is this a result of restoring persistent state? (If I look in the registry I can see entries under subscriptions/active, but I'm not sure whether that's where it comes from.)

- Who should be responsible for detecting the absence of an instance - the ClusterMonitor? It seems to be fed incorrect data, since it clearly thinks there are enough instances running. Which component has the necessary data?

- It looks like it's possible to snapshot CEP state to make it semi-persistent. However, if I restarted Stratos after 2 minutes of downtime, wouldn't it try to kill all the nodes, since the last reply was more than 60 seconds ago? Also, snapshots would be periodic, so there's still a window in which cartridges might "disappear".

Thanks a lot and best regards!

Michiel
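P.S. To make my timeStampMap theory concrete, here is a simplified sketch of the failure mode I suspect. This is my own illustration in Java, not Stratos's actual FaultHandlingWindowProcessor code; the class and method names, and the 60-second threshold, are assumptions on my part:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of timestamp-based fault detection. Members only
// enter timeStampMap when their first health event arrives, so a member
// that was deleted while the system was down never sends an event,
// never enters the map, and is therefore never flagged as faulty.
public class FaultWindowSketch {
    static final long TIMEOUT_MS = 60_000; // assumed 60s fault threshold

    final Map<String, Long> timeStampMap = new ConcurrentHashMap<>();

    // Called whenever a health-stat event arrives for a member.
    void onHealthEvent(String memberId, long nowMs) {
        timeStampMap.put(memberId, nowMs);
    }

    // Periodic sweep: reports members whose last event is older than
    // the timeout. Members absent from the map are silently skipped.
    List<String> findFaultyMembers(long nowMs) {
        List<String> faulty = new ArrayList<>();
        for (Map.Entry<String, Long> e : timeStampMap.entrySet()) {
            if (nowMs - e.getValue() > TIMEOUT_MS) {
                faulty.add(e.getKey());
            }
        }
        return faulty;
    }
}
```

If this matches how the real processor works, it would explain what I'm seeing: the sweep can only age out members it has heard from at least once, so the deleted cartridge escapes detection entirely.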