I met with several members of the WSO2 team, and we agreed on the following next steps:


1.      Treat the “Stratos failover” issue as separate from the “duplicate 
instances” issue, and use a new thread for the former, keeping this thread for 
the latter.

2.      The lack of synchronisation between the cleanup of the faulty VMs and 
the startup of the replacement VMs was noted.

a.      It was mentioned that the period of overlap should be seconds, perhaps up to 60s.

b.      I noted that for many of our scenarios, where the hardware is 1:1 mapped to VMs (i.e. neither over-subscribed nor under-subscribed), the overlap may be a problem. I will consult with others and confirm. Imesh acknowledged this point.

c.      I will provide the WSO2 logs for my test case (summarised below).

d.      I will check with the test team, where the problem was originally detected, whether this is transient (up to 60s) or long-lived, and gather logs as needed.

From: Imesh Gunaratne [mailto:[email protected]]
Sent: 06 July 2015 17:11
To: dev
Subject: Re: Problems with Stratos 4.1 failover and autohealing

On Mon, Jul 6, 2015 at 8:13 PM, Akila Ravihansa Perera 
<[email protected]<mailto:[email protected]>> wrote:

If it detects that the cartridge agent has not published any health event within the given time window (60 mins), it will publish a MemberFault event.

This should be 60 seconds.

On Mon, Jul 6, 2015 at 8:13 PM, Akila Ravihansa Perera 
<[email protected]<mailto:[email protected]>> wrote:
Hi Shaheed,

Regarding the Stratos failover scenario: looking at the logs, it seems that there are multiple member fault cases detected by CEP. The FaultHandlingWindowProcessor [1] running inside CEP is implemented to check for the health stats published by cartridge agents. It does not check whether the actual VM, or the processes running inside the VM, are active. If it detects that the cartridge agent has not published any health event within the given time window (60 mins), it will publish a MemberFault event.
Ideally this should not happen, regardless of whether the deployment is HA or not, because FaultHandlingWindowProcessor counts the 60 mins interval from the point the Stratos server started. The only reason I can think of that may have caused these faulty member logs is that Stratos (or the external CEP) did not receive the health stats published by the cartridge agents on time.
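
(To illustrate the detection idea described above, here is a minimal sketch in Java. It only models a per-member health window; the class and method names are illustrative assumptions, not the actual FaultHandlingWindowProcessor implementation.)

// Minimal sketch only (assumed names): track the last health event per member
// and flag any member whose agent has been silent longer than the window.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class MemberHealthWindow {

    // 60,000 ms = 60 s; the email text says 60 mins, but Imesh corrects this
    // to 60 seconds earlier in the thread.
    private static final long TIME_OUT_MS = 60_000;
    private final Map<String, Long> lastHealthEvent = new ConcurrentHashMap<>();

    // Called whenever a cartridge agent publishes a health statistic.
    public void onHealthEvent(String memberId, long timestampMs) {
        lastHealthEvent.put(memberId, timestampMs);
    }

    // Called periodically; reports members whose agents have gone silent.
    public void checkForFaults(long nowMs) {
        for (Map.Entry<String, Long> entry : lastHealthEvent.entrySet()) {
            if (nowMs - entry.getValue() > TIME_OUT_MS) {
                // In Stratos this is where a MemberFault event would be published.
                System.out.printf("Faulty member detected [member-id] %s "
                        + "[last time-stamp] %d [time-out] %d milliseconds%n",
                        entry.getKey(), entry.getValue(), TIME_OUT_MS);
            }
        }
    }
}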

Are you running Stratos in single-JVM mode? If so, I'd highly recommend that you deploy the CEP profile separately. Please refer to [2] for the reference deployment architecture of a Stratos HA setup.

Regarding your second concern, "Kill multiple Cartridges": I believe the issue is that the first message relating to recovery of the CPS came 66s after faulty member detection, and that recovery of other VMs started before that. I'd like to summarize a few points regarding the Stratos recovery process which I think will explain that behavior (a simplified sketch of the flow follows the list).

 - Stratos detects faulty members only by checking the health stats published by the cartridge agent, not by checking the existence of the actual VM.
 - Upon detecting such a faulty member, CEP will publish a faulty member event; after receiving that event, the Autoscaler (AS) will move the instance to the obsolete list. AS will iterate over this obsolete list and try to terminate those instances via the Cloud Controller (CC). When the termination is complete, CC will publish a member terminated event; upon receiving that, AS will remove the member from the obsolete list.
 - If CC fails to terminate the instance, it will keep retrying until the obsolete member timeout interval has passed. When the timeout expires, AS will forcefully remove the member from the obsolete list.
 - Stratos will not try to re-spawn the faulty instance right after detecting it. The recovery process is triggered by the cluster monitor object. Drools logic executes periodically (the monitor interval is configurable; the default is 90s) and runs the min-check, scale-up and scale-down rules to decide whether to spawn new instances. So the recovery process could take some time depending on your monitor interval.
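
(To make the sequencing above concrete, here is a minimal sketch in Java of the described flow: faulty members are parked on an obsolete list and repeatedly terminated until a timeout, while an independent, periodic monitor tick spawns replacements to satisfy the minimum count. The class and method names, the 90s tick and the timeout value are illustrative assumptions, not the actual Autoscaler/Cloud Controller code.)

// Minimal sketch only (assumed names); shows why cleanup of old members and
// spawning of replacements run on independent schedules and can overlap.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class RecoverySketch {

    private static final long MONITOR_INTERVAL_S = 90;       // default monitor tick (configurable)
    private static final long OBSOLETE_TIMEOUT_MS = 600_000;  // assumed obsolete-member timeout

    // member-id -> time at which the member was marked obsolete
    private final Map<String, Long> obsoleteMembers = new ConcurrentHashMap<>();
    private int activeInstances = 1;
    private final int minInstances = 1;

    // CEP publishes a faulty-member event; AS moves the member to the obsolete list.
    public void onMemberFault(String memberId) {
        obsoleteMembers.put(memberId, System.currentTimeMillis());
        activeInstances--;
    }

    // AS keeps asking CC to terminate obsolete members; on success (member
    // terminated event) or timeout, the member is removed from the list.
    void drainObsoleteList() {
        long now = System.currentTimeMillis();
        obsoleteMembers.forEach((memberId, since) -> {
            if (tryTerminateViaCloudController(memberId) || now - since > OBSOLETE_TIMEOUT_MS) {
                obsoleteMembers.remove(memberId);
            }
        });
    }

    // The periodic min-check rule: spawn instances until the minimum count is met.
    void minCheck() {
        while (activeInstances < minInstances) {
            spawnInstanceViaCloudController();
            activeInstances++;
        }
    }

    public void start() {
        ScheduledExecutorService timer = Executors.newScheduledThreadPool(1);
        timer.scheduleAtFixedRate(this::drainObsoleteList, 0, 10, TimeUnit.SECONDS);
        timer.scheduleAtFixedRate(this::minCheck, 0, MONITOR_INTERVAL_S, TimeUnit.SECONDS);
    }

    // Placeholders standing in for real CC calls.
    private boolean tryTerminateViaCloudController(String memberId) { return true; }
    private void spawnInstanceViaCloudController() { }
}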


So, to summarize:
- There is no way to control which cluster monitor gets executed first, so it is possible that CPS recovery takes place after the other VMs.
- The reason for seeing a lot of termination logs is that Stratos continuously tries to terminate those faulty members until the obsolete member timeout has expired. This can occur if the actual VM was killed manually or by other means.

However, we need to figure out why you got these faulty member events. Could you send us the CEP trace log for further analysis? Also, you can enable debug logs for the FaultHandlingWindowProcessor class to get more insight into what's going on.

I hope this explains your concerns. We can have a Hangout session to discuss this further.

[1] 
https://github.com/apache/stratos/blob/master/extensions/cep/stratos-cep-extension/src/main/java/org/apache/stratos/cep/extension/FaultHandlingWindowProcessor.java
[2] 
https://cwiki.apache.org/confluence/display/STRATOS/4.1.0+Configuring+HA+in+Single+JVM+Mode+on+EC2

Thanks.

On Mon, Jul 6, 2015 at 5:00 PM, Shaheedur Haque (shahhaqu) 
<[email protected]<mailto:[email protected]>> wrote:
Hi,

I am seeing some issues with Stratos 4.1 around failover on the one hand (i.e. stop one running instance of Stratos, start another as a replacement) and autohealing on the other (i.e. a Cartridge dies and needs to be replaced by Stratos). These feel as though they might have a common cause, so I am lumping them together for now. I'll explain the two scenarios as best I can, and then attach some logs covering one scenario in detail.

Failover of Stratos

The situation here is that I have 3 servers, all running MySQL in active-active 
mode, and Stratos running on at most one of them under Pacemaker control. The 
test case is that I kill the server (actually a VM) running Stratos, and 
Pacemaker responds by starting Stratos on one of the remaining servers. Around 
the time of the action, there is no ongoing REST activity in terms of defining 
new Cartridges, Applications, or anything like that.


•        Expected result: the replacement Stratos instance “acquires” the 
running Cartridges, and the system continues without killing/replacing any 
Cartridges.

•        Actual result: all Cartridges are killed and restarted.

Kill multiple Cartridges

I have a test where several Cartridges are killed at the same time as follows. 
We have a Cartridge Proxy Server (CPS) that hosts the (Java) Cartridge agent 
for VMs which cannot run one natively. The CPS therefore runs N+1 JCAs, one for 
itself plus N for the Cartridges it is proxying. Killing the CPS therefore 
causes all N+1 JCAs to disappear, and from a Stratos point of view, all N+1 VMs 
need to be restarted.


•        Expected result: each Cartridge is replaced by Stratos.

•        Actual result: looking at the Stratos logs, it seems that there is a race between the termination logic, which cleans up as old cluster members die, and the startup logic, which is spinning up replacement VMs for the terminated ones.

Both cases previously seemed to work fine, where we *think* "previously" means versions of Stratos 4.1 dating back ~3 weeks or so. I attach a series of log extracts for the second scenario; I suspect this also covers the first, but even if not, at least we can start here.


1.      The original spin-up of the CPS is at 2015-06-11 19:50:20,585. The CPS fault is seen around 2015-06-12 18:54:51; contemporaneously, all the other faults are seen:

TID: [0] [STRATOS] [2015-06-12 18:54:51,788]  INFO 
{org.apache.stratos.cep.extension.FaultHandlingWindowProcessor} -  Faulty 
member detected [member-id] 
di-000-007.di-000-007.cisco-qvpc-sf-0.domain88ef85b5-5ec6-4d4e-b7b2-1ab9d3f7b160
 with [last time-stamp] 1434135185318 [time-out] 60000 milliseconds
...
TID: [0] [STRATOS] [2015-06-12 18:54:51,803]  INFO 
{org.apache.stratos.cep.extension.FaultHandlingWindowProcessor} -  Faulty 
member detected [member-id] 
di-000-005.di-000-005.cisco-qvpc-sf-0.domain4bcdd49a-985f-4247-a70b-31c8d65153d8
 with [last time-stamp] 1434135192556 [time-out] 60000 milliseconds
...
TID: [0] [STRATOS] [2015-06-12 18:54:51,837]  INFO 
{org.apache.stratos.cep.extension.FaultHandlingWindowProcessor} -  Faulty 
member detected [member-id] 
cartridge-proxy.cartridge-proxy.cartridge-proxy.domain95a58a3e-e7a9-4071-b42b-3551bad25e6a
 with [last time-stamp] 1434135189149 [time-out] 60000 milliseconds
...
etc
...
TID: [0] [STRATOS] [2015-06-12 18:54:51,862]  INFO 
{org.apache.stratos.cep.extension.FaultHandlingWindowProcessor} -  Faulty 
member detected [member-id] 
di-000-010.di-000-010.cisco-qvpc-sf-0.domain2ab24c1d-53cb-452e-8aa2-f02eecf2db78
 with [last time-stamp] 1434135193054 [time-out] 60000 milliseconds
...
etc
...


2.      The last message relating to the CPS failure is at 18:54:52. The first 
message relating to recovery of the CPS is some 66 seconds later:

TID: [0] [STRATOS] [2015-06-12 18:55:58,516]  INFO 
{org.apache.stratos.autoscaler.rule.RuleLog} -  [min-check] Partition 
available, hence trying to spawn an instance to fulfil minimum count! [cluster] 
cartridge-proxy.cartridge-proxy.cartridge-proxy.domain


3.      However, recovery of the other VMs started anywhere from 54 seconds before this point to as much as 26 seconds after it:

TID: [0] [STRATOS] [2015-06-12 18:55:04,564]  INFO 
{org.apache.stratos.autoscaler.client.CloudControllerClient} -  Trying to spawn 
an instance via cloud controller: [cluster] 
di-000-008.di-000-008.cisco-qvpc-sf-0.domain [partition] whole-region 
[network-partition-id] RegionOne
...
TID: [0] [STRATOS] [2015-06-12 18:56:20,603]  INFO 
{org.apache.stratos.autoscaler.client.CloudControllerClient} -  Trying to spawn 
an instance via cloud controller: [cluster] 
di-000-007.di-000-007.cisco-qvpc-sf-0.domain [partition] whole-region 
[network-partition-id] RegionOne


4.      Between points 2 and 3 (with some fuzz due to the races), I see Stratos cleaning up the old VMs. For example:

TID: [0] [STRATOS] [2015-06-12 18:55:04,580]  INFO 
{org.apache.stratos.cloud.controller.iaases.JcloudsIaas} -  Starting to 
terminate member: [cartridge-type] cisco-qvpc-sf-0 [member-id] 
di-000-008.di-000-008.cisco-qvpc-sf-0.domaina0f15eed-9599-4f3e-a70a-93ddd02ccf5f


5.      However, around that point, look at this:

TID: [0] [STRATOS] [2015-06-12 18:55:12,835]  INFO 
{org.apache.stratos.autoscaler.rule.RuleLog} -  [scale-up] Trying to scale up 
over max, hence not scaling up cluster itself and
                        notifying to parent for possible group scaling or app 
bursting.
                        [cluster] di-000-005.di-000-005.cisco-qvpc-sf-0.domain 
[instance id]di-000-005-1 [max] 1



6.      Is it possible that there is a race between the terminations and restarts? It certainly seems so, though detailed knowledge of Stratos is needed to confirm it:

TID: [0] [STRATOS] [2015-06-12 18:55:12,798]  INFO 
{org.apache.stratos.autoscaler.rule.RuleLog} -  [min-check] Partition 
available, hence trying to spawn an instance to fulfil minimum count! [cluster] 
di-000-005.di-000-005.cisco-qvpc-sf-0.domain
TID: [0] [STRATOS] [2015-06-12 18:55:12,798]  INFO 
{org.apache.stratos.autoscaler.client.CloudControllerClient} -  Trying to spawn 
an instance via cloud controller: [cluster] 
di-000-005.di-000-005.cisco-qvpc-sf-0.domain [partition] whole-region 
[network-partition-id] RegionOne
TID: [0] [STRATOS] [2015-06-12 18:55:12,799]  INFO 
{org.apache.stratos.common.client.CloudControllerServiceClient} -  Terminating 
instance via cloud controller: [member] 
di-000-005.di-000-005.cisco-qvpc-sf-0.domain4bcdd49a-985f-4247-a70b-31c8d65153d8
TID: [0] [STRATOS] [2015-06-12 18:55:12,812]  INFO 
{org.apache.stratos.cloud.controller.messaging.publisher.TopologyEventPublisher}
 -  Publishing member created event: [service-name] cisco-qvpc-sf-0 
[cluster-id] di-000-005.di-000-005.cisco-qvpc-sf-0.domain [cluster-instance-id] 
di-000-005-1 [member-id] 
di-000-005.di-000-005.cisco-qvpc-sf-0.domaine9e4111c-abba-4979-af07-d13d90bbd84d
 [instance-id] null [network-partition-id] RegionOne [partition-id] 
whole-region [lb-cluster-id] null
TID: [0] [STRATOS] [2015-06-12 18:55:12,817]  INFO 
{org.apache.stratos.messaging.message.processor.topology.MemberCreatedMessageProcessor}
 -  Member created: [service-name] cisco-qvpc-sf-0 [cluster-id] 
di-000-005.di-000-005.cisco-qvpc-sf-0.domain [member-id] 
di-000-005.di-000-005.cisco-qvpc-sf-0.domaine9e4111c-abba-4979-af07-d13d90bbd84d
 [cluster-instance-id] di-000-005-1
TID: [0] [STRATOS] [2015-06-12 18:55:12,834] ERROR 
{org.apache.stratos.autoscaler.rule.RuleTasksDelegator} -  Request in flight 
threshold is Zero
TID: [0] [STRATOS] [2015-06-12 18:55:12,835]  INFO 
{org.apache.stratos.autoscaler.rule.RuleLog} -  [scale-up] Trying to scale up 
over max, hence not scaling up cluster itself and
                        notifying to parent for possible group scaling or app 
bursting.
                        [cluster] di-000-005.di-000-005.cisco-qvpc-sf-0.domain 
[instance id]di-000-005-1 [max] 1

(I tried to file a JIRA on this, but I’m having a bit of trouble, hence posting 
here to get the ball rolling).

Thanks, Shaheed





--
Akila Ravihansa Perera
Software Engineer, WSO2

Blog: http://ravihansa3000.blogspot.com



--
Imesh Gunaratne

Senior Technical Lead, WSO2
Committer & PMC Member, Apache Stratos
