Hi,
I am seeing some issues with Stratos 4.1 around failover on the one hand
(i.e. stop one running instance of Stratos, start another as a replacement),
and autohealing on the other (i.e. a Cartridge dies and needs to be replaced by
Stratos), which feel as though they might have a common cause, so I am lumping
them together for now. I'll explain the two scenarios as best I can, and then
attach some logs covering one scenario in detail.
Failover of Stratos
The situation here is that I have 3 servers, all running MySQL in active-active
mode, and Stratos running on at most one of them under Pacemaker control. The
test case is that I kill the server (actually a VM) running Stratos, and
Pacemaker responds by starting Stratos on one of the remaining servers. Around
the time of the failover there is no ongoing REST activity, i.e. nothing is
defining new Cartridges, Applications or anything like that.
* Expected result: the replacement Stratos instance "acquires" the
running Cartridges, and the system continues without killing/replacing any
Cartridges.
* Actual result: all Cartridges are killed and restarted.
Killing multiple Cartridges (autohealing)
I have a test where several Cartridges are killed at the same time, as follows.
We have a Cartridge Proxy Server (CPS) that hosts the Java Cartridge Agent (JCA)
on behalf of VMs which cannot run one natively. The CPS therefore runs N+1 JCAs,
one for itself plus N for the Cartridges it is proxying. Killing the CPS
therefore causes all N+1 JCAs to disappear, and from a Stratos point of view,
all N+1 VMs need to be restarted.
* Expected result: each Cartridge is replaced by Stratos.
* Actual result: looking at the Stratos logs, it seems that there is a
race between the termination logic, which cleans up old cluster members as they
die, and the startup logic, which spins up replacement VMs for the
terminated ones.
Both cases previously seemed to work fine, where we *think* previously means
versions of Stratos 4.1 dating back ~3 weeks or so. I attach a series of log
extracts for the second scenario; I suspect this also covers the first, but
even if not, at least it gives us somewhere to start.
1. The original spin-up of the CPS is at 2015-06-11 19:50:20,585. The CPS fault
is seen around 2015-06-12 18:54:51, and at about the same time all the other
faults are seen:
TID: [0] [STRATOS] [2015-06-12 18:54:51,788] INFO
{org.apache.stratos.cep.extension.FaultHandlingWindowProcessor} - Faulty
member detected [member-id]
di-000-007.di-000-007.cisco-qvpc-sf-0.domain88ef85b5-5ec6-4d4e-b7b2-1ab9d3f7b160
with [last time-stamp] 1434135185318 [time-out] 60000 milliseconds
...
TID: [0] [STRATOS] [2015-06-12 18:54:51,803] INFO
{org.apache.stratos.cep.extension.FaultHandlingWindowProcessor} - Faulty
member detected [member-id]
di-000-005.di-000-005.cisco-qvpc-sf-0.domain4bcdd49a-985f-4247-a70b-31c8d65153d8
with [last time-stamp] 1434135192556 [time-out] 60000 milliseconds
...
TID: [0] [STRATOS] [2015-06-12 18:54:51,837] INFO
{org.apache.stratos.cep.extension.FaultHandlingWindowProcessor} - Faulty
member detected [member-id]
cartridge-proxy.cartridge-proxy.cartridge-proxy.domain95a58a3e-e7a9-4071-b42b-3551bad25e6a
with [last time-stamp] 1434135189149 [time-out] 60000 milliseconds
...
etc
...
TID: [0] [STRATOS] [2015-06-12 18:54:51,862] INFO
{org.apache.stratos.cep.extension.FaultHandlingWindowProcessor} - Faulty
member detected [member-id]
di-000-010.di-000-010.cisco-qvpc-sf-0.domain2ab24c1d-53cb-452e-8aa2-f02eecf2db78
with [last time-stamp] 1434135193054 [time-out] 60000 milliseconds
...
etc
...
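For what it's worth, my reading of these messages is that the fault detector is
simply comparing each member's last health-stat timestamp against a 60-second
timeout, so all of the members whose agents lived on the CPS trip the check in
the same sweep. Here is a minimal sketch of that logic; the names
(memberHeartbeats, onHealthStat, sweep, TIMEOUT_MS) are purely illustrative and
are not the actual FaultHandlingWindowProcessor internals:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch only -- not the real FaultHandlingWindowProcessor code.
public class FaultSweepSketch {
    private static final long TIMEOUT_MS = 60_000; // matches [time-out] 60000 in the logs
    private final Map<String, Long> memberHeartbeats = new ConcurrentHashMap<>();

    // Record the latest health statistic seen for a member.
    public void onHealthStat(String memberId, long timestampMs) {
        memberHeartbeats.put(memberId, timestampMs);
    }

    // Periodic sweep: any member whose last heartbeat is older than the timeout
    // is reported faulty. If one host (the CPS) carried N+1 agents, all N+1
    // members cross the threshold in the same sweep, which would give exactly
    // the burst of "Faulty member detected" lines above.
    public void sweep(long nowMs) {
        for (Map.Entry<String, Long> entry : memberHeartbeats.entrySet()) {
            if (nowMs - entry.getValue() > TIMEOUT_MS) {
                System.out.println("Faulty member detected [member-id] " + entry.getKey()
                        + " with [last time-stamp] " + entry.getValue()
                        + " [time-out] " + TIMEOUT_MS + " milliseconds");
            }
        }
    }
}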
2. The last message relating to the CPS failure is at 18:54:52. The first
message relating to recovery of the CPS is some 66 seconds later:
TID: [0] [STRATOS] [2015-06-12 18:55:58,516] INFO
{org.apache.stratos.autoscaler.rule.RuleLog} - [min-check] Partition
available, hence trying to spawn an instance to fulfil minimum count! [cluster]
cartridge-proxy.cartridge-proxy.cartridge-proxy.domain
3. However, recovery of the other VMs started anywhere from 54 seconds before
this point to as much as 26 seconds after it:
TID: [0] [STRATOS] [2015-06-12 18:55:04,564] INFO
{org.apache.stratos.autoscaler.client.CloudControllerClient} - Trying to spawn
an instance via cloud controller: [cluster]
di-000-008.di-000-008.cisco-qvpc-sf-0.domain [partition] whole-region
[network-partition-id] RegionOne
...
TID: [0] [STRATOS] [2015-06-12 18:56:20,603] INFO
{org.apache.stratos.autoscaler.client.CloudControllerClient} - Trying to spawn
an instance via cloud controller: [cluster]
di-000-007.di-000-007.cisco-qvpc-sf-0.domain [partition] whole-region
[network-partition-id] RegionOne
4. Between points 2 and 3 (with some fuzz for the races), I see Stratos
cleaning up the old VMs. For example:
TID: [0] [STRATOS] [2015-06-12 18:55:04,580] INFO
{org.apache.stratos.cloud.controller.iaases.JcloudsIaas} - Starting to
terminate member: [cartridge-type] cisco-qvpc-sf-0 [member-id]
di-000-008.di-000-008.cisco-qvpc-sf-0.domaina0f15eed-9599-4f3e-a70a-93ddd02ccf5f
5. However, around the same time, I see this:
TID: [0] [STRATOS] [2015-06-12 18:55:12,835] INFO
{org.apache.stratos.autoscaler.rule.RuleLog} - [scale-up] Trying to scale up
over max, hence not scaling up cluster itself and
notifying to parent for possible group scaling or app
bursting.
[cluster] di-000-005.di-000-005.cisco-qvpc-sf-0.domain
[instance id]di-000-005-1 [max] 1
6. Is it possible that there is a race between the terminations and the
restarts? It certainly seems so (I sketch the suspected interleaving after the
extract below), though detailed knowledge of Stratos is needed to confirm it:
TID: [0] [STRATOS] [2015-06-12 18:55:12,798] INFO
{org.apache.stratos.autoscaler.rule.RuleLog} - [min-check] Partition
available, hence trying to spawn an instance to fulfil minimum count! [cluster]
di-000-005.di-000-005.cisco-qvpc-sf-0.domain
TID: [0] [STRATOS] [2015-06-12 18:55:12,798] INFO
{org.apache.stratos.autoscaler.client.CloudControllerClient} - Trying to spawn
an instance via cloud controller: [cluster]
di-000-005.di-000-005.cisco-qvpc-sf-0.domain [partition] whole-region
[network-partition-id] RegionOne
TID: [0] [STRATOS] [2015-06-12 18:55:12,799] INFO
{org.apache.stratos.common.client.CloudControllerServiceClient} - Terminating
instance via cloud controller: [member]
di-000-005.di-000-005.cisco-qvpc-sf-0.domain4bcdd49a-985f-4247-a70b-31c8d65153d8
TID: [0] [STRATOS] [2015-06-12 18:55:12,812] INFO
{org.apache.stratos.cloud.controller.messaging.publisher.TopologyEventPublisher}
- Publishing member created event: [service-name] cisco-qvpc-sf-0
[cluster-id] di-000-005.di-000-005.cisco-qvpc-sf-0.domain [cluster-instance-id]
di-000-005-1 [member-id]
di-000-005.di-000-005.cisco-qvpc-sf-0.domaine9e4111c-abba-4979-af07-d13d90bbd84d
[instance-id] null [network-partition-id] RegionOne [partition-id]
whole-region [lb-cluster-id] null
TID: [0] [STRATOS] [2015-06-12 18:55:12,817] INFO
{org.apache.stratos.messaging.message.processor.topology.MemberCreatedMessageProcessor}
- Member created: [service-name] cisco-qvpc-sf-0 [cluster-id]
di-000-005.di-000-005.cisco-qvpc-sf-0.domain [member-id]
di-000-005.di-000-005.cisco-qvpc-sf-0.domaine9e4111c-abba-4979-af07-d13d90bbd84d
[cluster-instance-id] di-000-005-1
TID: [0] [STRATOS] [2015-06-12 18:55:12,834] ERROR
{org.apache.stratos.autoscaler.rule.RuleTasksDelegator} - Request in flight
threshold is Zero
TID: [0] [STRATOS] [2015-06-12 18:55:12,835] INFO
{org.apache.stratos.autoscaler.rule.RuleLog} - [scale-up] Trying to scale up
over max, hence not scaling up cluster itself and
notifying to parent for possible group scaling or app
bursting.
[cluster] di-000-005.di-000-005.cisco-qvpc-sf-0.domain
[instance id]di-000-005-1 [max] 1
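To make that suspicion concrete, the pattern I think I am seeing is two
independent actors working from views of the cluster that are updated at
different times: a min-check path that counts members and spawns a replacement
when below the minimum, and a termination path that removes the faulty member
later. If nothing orders those two steps, a spawn can land while the old member
is still counted elsewhere, which would explain seeing a spawn, a terminate, a
member-created event and "scale up over max" with [max] 1 within a few
milliseconds of each other. The sketch below is a schematic of that hypothesis
only; the names (activeMembers, topologyMembers, minCheck, terminateFaulty) are
made up and do not correspond to real Stratos classes:

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Schematic of the suspected race -- not real Stratos code.
// Two views of the same cluster, updated at different times: the autoscaler's
// "active" view (faulty member dropped immediately) and the topology view
// (faulty member still present until termination completes).
public class SpawnTerminateRaceSketch {
    private final Set<String> activeMembers = ConcurrentHashMap.newKeySet();   // autoscaler's view
    private final Set<String> topologyMembers = ConcurrentHashMap.newKeySet(); // topology view
    private final int minCount = 1;
    private final int maxCount = 1; // [max] 1, as in the log above

    // Fault handling: the faulty member is dropped from the active view first...
    public void onFault(String faultyMemberId) {
        activeMembers.remove(faultyMemberId);
        // ...but termination and removal from the topology happen later, on
        // another path ("Terminating instance via cloud controller...").
    }

    // Min-check ("Partition available, hence trying to spawn an instance..."):
    // counts the active view, sees 0 < 1, and spawns a replacement.
    public void minCheck() {
        if (activeMembers.size() < minCount) {
            String replacement = "new-member-" + System.nanoTime();
            activeMembers.add(replacement);
            topologyMembers.add(replacement); // "Member created" event
        }
    }

    // Scale rule: counts the topology view. With the old member not yet removed
    // plus the replacement just created, the cluster already looks at/over max,
    // so the rule would log "Trying to scale up over max ... [max] 1" and defer
    // to the parent instead.
    public boolean atOrOverMax() {
        return topologyMembers.size() >= maxCount;
    }

    // Termination finally catches up and removes the old member from the topology.
    public void terminateFaulty(String faultyMemberId) {
        topologyMembers.remove(faultyMemberId);
    }
}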
(I tried to file a JIRA on this, but I'm having a bit of trouble, hence posting
here to get the ball rolling).
Thanks, Shaheed