Hi Shaheed,

Shall we have a call to discuss the issue further?

Thanks,
Reka

On Mon, Jul 13, 2015 at 7:04 PM, Shaheedur Haque (shahhaqu) <
[email protected]> wrote:

>  Hi Reka,
>
>
>
> First, this test was likely done with Stratos from about a month ago.
>
>
>
> Second, as discussed, the test scenario here was that on 12-Jun, around
> 18:53, the “cartridge-proxy” server was killed. Since the cartridge-proxy
> is a Cartridge (with its own Java Cartridge Agent) which also hosts the
> Cartridge Agents for other VMs, when it was killed Stratos detected
> **all** the VMs as down (i.e. the cartridge-proxy as well as the ones
> associated with the extra Cartridge Agents). Thus, from a Stratos point of
> view we expect lots of member termination activity followed by recovery
> activity.
>
>
>
> As per the call, the problem here is that there is a race between the two,
> and sometimes we end up with some subscriptions (each is a single Cartridge
> Application, with max=1) having multiple VMs instead of the max=1 setting
> being honoured.
>
>
>
> (I’m not sure I understand the significance of the logs you quoted from
> 15-Jun, but I am assuming that is not directly related to these events).
>
>
>
> Hope that clears it up; if not, let’s do a call to sync up.
>
>
>
> Thanks, Shaheed
>
>
>
> *From:* Reka Thirunavukkarasu [mailto:[email protected]]
> *Sent:* 13 July 2015 13:22
> *To:* dev
> *Subject:* Re: Problem with Stratos 4.1 autohealing (was RE: Problems
> with Stratos 4.1 failover and autohealing)
>
>
>
> Hi Shaheed,
>
> One more observation is below. This could be the reason why there were
> lots of faulty member detections on the CEP side: the member didn't
> actually get terminated from the IaaS or the topology; rather, it was just
> removed from the obsolete list because the obsolete timeout was exceeded.
> The Autoscaler should keep the member in the obsolete list for up to 1 day,
> since the obsolete timeout is set to 1 day, so it is unlikely that the
> member was removed within a few seconds. When I tested the latest code base
> with 4.1.0-rc4, I could see that the Autoscaler keeps the obsolete member
> for the full day. Can you let us know whether you are testing with the
> latest Stratos build? We have already fixed an issue so that the instance
> is terminated once before it is removed from the obsolete list.
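>
> For clarity, here is a minimal, hedged sketch of the expiry behaviour
> described above (the class and field names are illustrative, not the actual
> ClusterLevelPartitionContext code): an obsolete member should only be
> disposed once the configured obsolete timeout (86400000 ms = 1 day) has
> elapsed, and until then it stays in the list while termination is retried.
>
> import java.util.Iterator;
> import java.util.Map;
> import java.util.concurrent.ConcurrentHashMap;
>
> // Illustrative sketch only; not the real Stratos obsolete-member watcher.
> public class ObsoleteMemberWatcherSketch {
>
>     // member id -> time (ms) at which the member was moved to the obsolete list
>     private final Map<String, Long> obsoleteMembers = new ConcurrentHashMap<>();
>     private final long obsoleteTimeoutMs = 86400000L; // 1 day, as configured
>
>     public void addObsoleteMember(String memberId) {
>         obsoleteMembers.put(memberId, System.currentTimeMillis());
>     }
>
>     // Runs periodically: a member is disposed only after the timeout expires;
>     // until then it stays in the list and termination keeps being retried.
>     public void checkExpiry() {
>         long now = System.currentTimeMillis();
>         Iterator<Map.Entry<String, Long>> it = obsoleteMembers.entrySet().iterator();
>         while (it.hasNext()) {
>             Map.Entry<String, Long> entry = it.next();
>             if (now - entry.getValue() > obsoleteTimeoutMs) {
>                 System.out.println("Obsolete state of member is expired, disposing: " + entry.getKey());
>                 it.remove();
>             }
>         }
>     }
> }
>
> In the logs below, by contrast, the member is disposed only about 15 seconds
> after the faulty-member detection, which is what looks inconsistent with the
> 1-day timeout.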
>
>
>
> Fault member detection
> -------------------------------
> TID: [0] [STRATOS] [2015-06-15 19:47:20,960]  INFO
> {org.apache.stratos.cep.extension.FaultHandlingWindowProcessor} -  Faulty
> member detected [member-id]
> di-000-009.di-000-009.cisco-qvpc-sf-0.domainc8ade134-ff57-4c02-a546-856e7197d78c
> with [last time-stamp] 1434397528340 [time-out] 60000 milliseconds
>
>
> Obsolete member removal
> -----------------------------------
>
> TID: [0] [STRATOS] [2015-06-15 19:47:35,525]  INFO
> {org.apache.stratos.autoscaler.context.partition.ClusterLevelPartitionContext$ObsoletedMemberWatcher}
> -  Obsolete state of member is expired, member will be disposed and will
> not be tracked anymore [obsolete member]
> di-000-009.di-000-009.cisco-qvpc-sf-0.domainc8ade134-ff57-4c02-a546-856e7197d78c
> [expiry time] 86400000 [cluster]
> di-000-009.di-000-009.cisco-qvpc-sf-0.domain [cluster instance] di-000-009-1
>
> If you encounter this again, can you enable debug logs for the autoscaler
> and share the logs with us?
>
> Thanks,
>
> Reka
>
>
>
>
> On Mon, Jul 13, 2015 at 4:45 PM, Reka Thirunavukkarasu <[email protected]>
> wrote:
>
> Hi Shaheed,
>
> As I went through the logs for the "*duplicated cartridges*" case, the
> observation is correct, and it is what is expected from the Stratos side as
> well. The expected flow is currently as below, as Akila explained. I have
> added a point about where the termination and spawning can happen at the
> same time:
>
> 1. A faulty member is detected and the event is received by the Autoscaler.
>
> 2. The Autoscaler immediately moves the member from the active list to the
> obsolete list.
>
> 3. The cluster monitor's periodic task executes the min-check rule, then the
> scaling rule, with the obsolete-check rule running in parallel to the other
> rules.
>
> 4. There is a chance that, while the obsolete-check rule is terminating the
> instance, the min-check rule starts a new instance, because the two run on
> different threads (see the sketch below).
>
> 5. After the min-check rule has executed, the scaling rule executes; based
> on the cluster statistics, it can decide to scale up or down.
>
> Please let us know whether this behavior has any impact on your scenarios.
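>
> As a hedged illustration of step 4 above (the names and structure are mine,
> not the actual cluster monitor code), the following sketch shows how running
> the obsolete-check and min-check logic on separate threads can briefly leave
> two VMs alive for a max=1 cluster:
>
> import java.util.concurrent.ExecutorService;
> import java.util.concurrent.Executors;
> import java.util.concurrent.TimeUnit;
> import java.util.concurrent.atomic.AtomicInteger;
>
> // Illustrative sketch only; not the actual Stratos monitor implementation.
> public class MonitorRaceSketch {
>
>     public static void main(String[] args) throws InterruptedException {
>         AtomicInteger activeMembers = new AtomicInteger(0); // faulty member already moved out of the active list
>         AtomicInteger iaasVmCount = new AtomicInteger(1);   // but its VM may still exist on the IaaS
>         int minCount = 1;
>
>         ExecutorService pool = Executors.newFixedThreadPool(2);
>
>         // Obsolete-check thread: terminating the faulty instance (slow IaaS call).
>         pool.submit(() -> {
>             sleep(500); // simulated IaaS termination latency
>             iaasVmCount.decrementAndGet();
>             System.out.println("Obsolete member terminated");
>         });
>
>         // Min-check thread: it only looks at the active count, so it spawns immediately.
>         pool.submit(() -> {
>             if (activeMembers.get() < minCount) {
>                 activeMembers.incrementAndGet();
>                 iaasVmCount.incrementAndGet();
>                 System.out.println("Replacement spawned; IaaS VM count is now " + iaasVmCount.get());
>             }
>         });
>
>         pool.shutdown();
>         pool.awaitTermination(5, TimeUnit.SECONDS);
>         // For a short window both the old and the new VM exist, even though max=1.
>     }
>
>     private static void sleep(long ms) {
>         try { Thread.sleep(ms); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
>     }
> }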
>
> Thanks,
>
> Reka
>
>
>
>
>
>
>
> On Tue, Jul 7, 2015 at 4:39 PM, Shaheedur Haque (shahhaqu) <
> [email protected]> wrote:
>
> I met with several members of the WSO2 team, and we agreed on the following
> next steps:
>
>
>
> 1.      Treat the “Stratos failover” issue as separate from the
> “duplicate instances” issue, and use a new thread for the former, keeping
> this thread for the latter.
>
> 2.      The lack of synchronisation between the cleanup of the faulty VMs
> and the startup of the replacement VMs was noted.
>
> a.      It was mentioned that the period of overlap should be seconds,
> perhaps up to 60s.
>
> b.      I noted that for many of our scenarios, where the hardware is 1:1
> mapped to VMs (i.e. neither over-subscribed nor under-subscribed), the
> overlap may be a problem. I will consult with others and confirm. Imesh
> ack’d this point.
>
> c.      I will provide the wso2 logs for my test case (summarised below).
>
> d.      I will check with the test team, where the problem was originally
> detected, whether this is transient (up to 60s) or long-lived, and gather
> logs as needed.
>
>
>
> *From:* Imesh Gunaratne [mailto:[email protected]]
> *Sent:* 06 July 2015 17:11
> *To:* dev
> *Subject:* Re: Problems with Stratos 4.1 failover and autohealing
>
>
>
> On Mon, Jul 6, 2015 at 8:13 PM, Akila Ravihansa Perera <[email protected]
> > wrote:
>
>
>
>  If it detects that cartridge agent has not published any health event
> within the given time window (60mins), it will publish the MemberFault
> event.
>
>
>
>  This should be 60 seconds.
>
>
>
> On Mon, Jul 6, 2015 at 8:13 PM, Akila Ravihansa Perera <[email protected]>
> wrote:
>
> Hi Shaheed,
>
>
> Regarding the Stratos fail-over scenario: looking at the logs, it seems that
> there are multiple member fault cases detected by CEP. The
> FaultHandlingWindowProcessor [1] running inside CEP is implemented to check
> for health stats published by cartridge agents. It does not check whether
> the actual VM, or the processes running inside the VM, are active. If it
> detects that cartridge agent has not published any health event within the
> given time window (60mins), it will publish the MemberFault event.
>
> Ideally this should not happen regardless of whether the deployment is HA or
> not, because FaultHandlingWindowProcessor counts the 60mins interval from
> the point the Stratos server started. The only reason I can think of that
> may have caused these faulty member logs is that Stratos (or the external
> CEP) did not receive the health stats published by the cartridge agents on
> time.
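>
> To make the mechanism concrete, here is a minimal sketch of that time-window
> check (my own simplification with assumed names, not the real
> FaultHandlingWindowProcessor, which runs as a CEP window processor): the
> last health-event timestamp is tracked per member, and any member whose
> agent has gone silent for longer than the window is flagged as faulty.
>
> import java.util.Map;
> import java.util.concurrent.ConcurrentHashMap;
>
> // Simplified illustration only; not the actual CEP extension code.
> public class FaultWindowSketch {
>
>     // member id -> timestamp (ms) of the last health event received from its agent
>     private final Map<String, Long> lastHealthEvent = new ConcurrentHashMap<>();
>     private final long timeoutMs;
>
>     public FaultWindowSketch(long timeoutMs) {
>         this.timeoutMs = timeoutMs;
>     }
>
>     public void onHealthEvent(String memberId, long timestampMs) {
>         lastHealthEvent.put(memberId, timestampMs);
>     }
>
>     // Called periodically: flags members whose agents have gone silent,
>     // irrespective of whether the VM itself is still running.
>     public void checkFaultyMembers() {
>         long now = System.currentTimeMillis();
>         for (Map.Entry<String, Long> e : lastHealthEvent.entrySet()) {
>             if (now - e.getValue() > timeoutMs) {
>                 System.out.println("Faulty member detected [member-id] " + e.getKey()
>                         + " [last time-stamp] " + e.getValue()
>                         + " [time-out] " + timeoutMs + " milliseconds");
>                 // This is the point where a MemberFault event would be published.
>             }
>         }
>     }
> }
>
> The "[time-out] 60000 milliseconds" values in the logs quoted further down
> correspond to this window.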
>
>
>
> Are you running Stratos in single-JVM mode? If so, I'd highly recommend
> that you deploy the CEP profile separately. Please refer to [2] for the
> reference deployment architecture of the Stratos HA setup.
>
>
>
> Regarding your second concern, "Kill multiple Cartridges": I believe the
> issue is that the first message relating to recovery of the CPS came 66s
> after faulty member detection, and recovery of other VMs started before
> that. I'd like to summarize a few points regarding the Stratos recovery
> process which I think will explain that behavior.
>
>
>
>  - Stratos detects faulty members only by checking the health stats
> published by the cartridge agent, not by checking the existence of the
> actual VM.
>
>  - Upon detecting such a faulty member, CEP will publish a faulty member
> event, and after receiving that event the AS (Autoscaler) will move the
> instance to its obsolete list. The AS will iterate over this obsolete list
> and try to terminate those instances via the CC (Cloud Controller). When
> the termination is complete, the CC will publish a member terminated event;
> upon receiving it, the AS will remove the member from the obsolete list.
>
>  - If the CC fails to terminate the instance, it will keep retrying until
> the obsolete member timeout interval has passed. When the timeout expires,
> the AS will forcefully remove the member from the obsolete list.
>
>  - Stratos will not try to re-spawn the faulty instance right after
> detecting it. The recovery process is triggered by the Cluster monitor
> object. There is Drools logic which is executed periodically (the monitor
> interval is configurable; the default is 90s) and which runs the min-check,
> scale-up and scale-down rules and decides whether to spawn new instances.
> So this recovery process could take some time, depending on your monitor
> interval (see the sketch after this list).
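>
> As a rough sketch of the timing consequence of that last point (the
> scheduling code here is mine, not the actual Cluster monitor
> implementation), recovery decisions are only taken when the periodic rule
> run fires, so a replacement can lag fault detection by up to one monitor
> interval:
>
> import java.util.concurrent.Executors;
> import java.util.concurrent.ScheduledExecutorService;
> import java.util.concurrent.TimeUnit;
>
> // Illustrative sketch only; not the real Stratos cluster monitor.
> public class ClusterMonitorSketch {
>
>     public static void main(String[] args) {
>         long monitorIntervalSeconds = 90; // configurable; 90s is the default mentioned above
>         ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
>
>         // Each tick evaluates the rules; spawning happens here, not at the
>         // moment the faulty-member event is received, hence the possible delay.
>         scheduler.scheduleAtFixedRate(
>                 () -> System.out.println("Running min-check, scale-up and scale-down rules"),
>                 0, monitorIntervalSeconds, TimeUnit.SECONDS);
>     }
> }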
>
>
>
>
>
> So, to summarize:
>
> - There is no way to control which cluster monitor gets executed first, so
> it is possible that CPS recovery takes place after the other VMs.
>
> - The reason for seeing a lot of termination logs is that Stratos
> continuously tries to terminate those faulty members until the obsolete
> member timeout has expired. This can occur if the actual VM was killed
> manually or by other means.
>
>
>
> However, we need to figure out why you got these faulty member events.
> Could you send us the CEP trace log for further analysis? You can also
> enable debug logs for the FaultHandlingWindowProcessor class to get more
> insight into what's going on.
>
>
>
> I hope this explains your concerns. We can have a Hangout session to
> discuss this further.
>
>
>
> [1]
> https://github.com/apache/stratos/blob/master/extensions/cep/stratos-cep-extension/src/main/java/org/apache/stratos/cep/extension/FaultHandlingWindowProcessor.java
>
> [2]
> https://cwiki.apache.org/confluence/display/STRATOS/4.1.0+Configuring+HA+in+Single+JVM+Mode+on+EC2
>
>
>
> Thanks.
>
>
>
> On Mon, Jul 6, 2015 at 5:00 PM, Shaheedur Haque (shahhaqu) <
> [email protected]> wrote:
>
> Hi,
>
>
>
> I am seeing some issues with Stratos 4.1 around failover on the one hand
> (i.e. stop one running instance of Stratos, start another as a replacement)
> and autohealing on the other (i.e. a Cartridge dies and needs to be replaced
> by Stratos), which feel as though they might have a common cause, so I am
> lumping them together for now. I’ll explain the two scenarios as best I
> can… and then attach some logs explaining one scenario in detail.
>
>
>
> *Failover of Stratos*
>
>
>
> The situation here is that I have 3 servers, all running MySQL in
> active-active mode, and Stratos running on at most one of them under
> Pacemaker control. The test case is that I kill the server (actually a VM)
> running Stratos, and Pacemaker responds by starting Stratos on one of the
> remaining servers. Around the time of the action, there is no ongoing REST
> activity in terms of defining new Cartridges, or Applications or anything
> like that.
>
>
>
> ·        Expected result: the replacement Stratos instance “acquires” the
> running Cartridges, and the system continues without killing/replacing any
> Cartridges.
>
> ·        Actual result: all Cartridges are killed and restarted.
>
>
>
> *Kill multiple Cartridges*
>
>
>
> I have a test where several Cartridges are killed at the same time, as
> follows. We have a Cartridge Proxy Server (CPS) that hosts the (Java)
> Cartridge Agent for VMs which cannot run one natively. The CPS therefore
> runs N+1 JCAs, one for itself plus N for the Cartridges it is proxying.
> Killing the CPS therefore causes all N+1 JCAs to disappear, and from a
> Stratos point of view, all N+1 VMs need to be restarted.
>
>
>
> ·        Expected result: each Cartridge is replaced by Stratos.
>
> ·        Actual result: looking at the Stratos logs, it seems that there
> is a race between the termination logic, which cleans up old cluster
> members as they die, and the startup logic, which is spinning up
> replacement VMs for the terminated ones.
>
>
>
> Both cases previously seemed to work fine, where we **think** “previously”
> means versions of Stratos 4.1 dating back ~3 weeks or so. I attach a series
> of log extracts for the second scenario; I suspect this also covers the
> first, but even if not, at least we can start here.
>
>
>
> 1.      The original spin-up of the CPS is at 2015-06-11 19:50:20,585. The
> CPS fault is seen around 2015-06-12 18:54:51; contemporaneously, all the
> other faults are seen:
>
>
>
> TID: [0] [STRATOS] [2015-06-12 18:54:51,788]  INFO
> {org.apache.stratos.cep.extension.FaultHandlingWindowProcessor} -  Faulty
> member detected [member-id]
> di-000-007.di-000-007.cisco-qvpc-sf-0.domain88ef85b5-5ec6-4d4e-b7b2-1ab9d3f7b160
> with [last time-stamp] 1434135185318 [time-out] 60000 milliseconds
>
> ...
>
> TID: [0] [STRATOS] [2015-06-12 18:54:51,803]  INFO
> {org.apache.stratos.cep.extension.FaultHandlingWindowProcessor} -  Faulty
> member detected [member-id]
> di-000-005.di-000-005.cisco-qvpc-sf-0.domain4bcdd49a-985f-4247-a70b-31c8d65153d8
> with [last time-stamp] 1434135192556 [time-out] 60000 milliseconds
>
> ...
>
> TID: [0] [STRATOS] [2015-06-12 18:54:51,837]  INFO
> {org.apache.stratos.cep.extension.FaultHandlingWindowProcessor} -  Faulty
> member detected [member-id]
> cartridge-proxy.cartridge-proxy.cartridge-proxy.domain95a58a3e-e7a9-4071-b42b-3551bad25e6a
> with [last time-stamp] 1434135189149 [time-out] 60000 milliseconds
>
> ...
>
> etc
>
> ...
>
> TID: [0] [STRATOS] [2015-06-12 18:54:51,862]  INFO
> {org.apache.stratos.cep.extension.FaultHandlingWindowProcessor} -  Faulty
> member detected [member-id]
> di-000-010.di-000-010.cisco-qvpc-sf-0.domain2ab24c1d-53cb-452e-8aa2-f02eecf2db78
> with [last time-stamp] 1434135193054 [time-out] 60000 milliseconds
>
> ...
>
> etc
>
> ...
>
>
>
> 2.      The last message relating to the CPS failure is at 18:54:52. The
> first message relating to recovery of the CPS is some 66 seconds later:
>
>
>
> TID: [0] [STRATOS] [2015-06-12 18:55:58,516]  INFO
> {org.apache.stratos.autoscaler.rule.RuleLog} -  [min-check] Partition
> available, hence trying to spawn an instance to fulfil minimum count!
> [cluster] cartridge-proxy.cartridge-proxy.cartridge-proxy.domain
>
>
>
> 3.      However, recovery of other VMs started anywhere from 54 seconds
> before this point to 26 seconds after it:
>
>
>
> TID: [0] [STRATOS] [2015-06-12 18:55:04,564]  INFO
> {org.apache.stratos.autoscaler.client.CloudControllerClient} -  Trying to
> spawn an instance via cloud controller: [cluster]
> di-000-008.di-000-008.cisco-qvpc-sf-0.domain [partition] whole-region
> [network-partition-id] RegionOne
>
> ...
>
> TID: [0] [STRATOS] [2015-06-12 18:56:20,603]  INFO
> {org.apache.stratos.autoscaler.client.CloudControllerClient} -  Trying to
> spawn an instance via cloud controller: [cluster]
> di-000-007.di-000-007.cisco-qvpc-sf-0.domain [partition] whole-region
> [network-partition-id] RegionOne
>
>
>
> 4.      Between points 2 and 3 (with some fuzz for the races), I see
> Stratos cleaning up the old VMs. For example:
>
>
>
> TID: [0] [STRATOS] [2015-06-12 18:55:04,580]  INFO
> {org.apache.stratos.cloud.controller.iaases.JcloudsIaas} -  Starting to
> terminate member: [cartridge-type] cisco-qvpc-sf-0 [member-id]
> di-000-008.di-000-008.cisco-qvpc-sf-0.domaina0f15eed-9599-4f3e-a70a-93ddd02ccf5f
>
>
>
> 5.      However, around that point, look at this:
>
>
>
> TID: [0] [STRATOS] [2015-06-12 18:55:12,835]  INFO
> {org.apache.stratos.autoscaler.rule.RuleLog} -  [scale-up] Trying to scale
> up over max, hence not scaling up cluster itself and
>
>                         notifying to parent for possible group scaling or
> app bursting.
>
>                         [cluster]
> di-000-005.di-000-005.cisco-qvpc-sf-0.domain [instance id]di-000-005-1
> [max] 1
>
>
>
>
>
> 6.      Is it possible that there is a race between the terminations and
> restarts? It certainly seems so, though a detailed knowledge of Stratos is
> needed to confirm it:
>
>
>
> TID: [0] [STRATOS] [2015-06-12 18:55:12,798]  INFO
> {org.apache.stratos.autoscaler.rule.RuleLog} -  [min-check] Partition
> available, hence trying to spawn an instance to fulfil minimum count!
> [cluster] di-000-005.di-000-005.cisco-qvpc-sf-0.domain
>
> TID: [0] [STRATOS] [2015-06-12 18:55:12,798]  INFO
> {org.apache.stratos.autoscaler.client.CloudControllerClient} -  Trying to
> spawn an instance via cloud controller: [cluster]
> di-000-005.di-000-005.cisco-qvpc-sf-0.domain [partition] whole-region
> [network-partition-id] RegionOne
>
> TID: [0] [STRATOS] [2015-06-12 18:55:12,799]  INFO
> {org.apache.stratos.common.client.CloudControllerServiceClient} -
> Terminating instance via cloud controller: [member]
> di-000-005.di-000-005.cisco-qvpc-sf-0.domain4bcdd49a-985f-4247-a70b-31c8d65153d8
>
> TID: [0] [STRATOS] [2015-06-12 18:55:12,812]  INFO
> {org.apache.stratos.cloud.controller.messaging.publisher.TopologyEventPublisher}
> -  Publishing member created event: [service-name] cisco-qvpc-sf-0
> [cluster-id] di-000-005.di-000-005.cisco-qvpc-sf-0.domain
> [cluster-instance-id] di-000-005-1 [member-id]
> di-000-005.di-000-005.cisco-qvpc-sf-0.domaine9e4111c-abba-4979-af07-d13d90bbd84d
> [instance-id] null [network-partition-id] RegionOne [partition-id]
> whole-region [lb-cluster-id] null
>
> TID: [0] [STRATOS] [2015-06-12 18:55:12,817]  INFO
> {org.apache.stratos.messaging.message.processor.topology.MemberCreatedMessageProcessor}
> -  Member created: [service-name] cisco-qvpc-sf-0 [cluster-id]
> di-000-005.di-000-005.cisco-qvpc-sf-0.domain [member-id]
> di-000-005.di-000-005.cisco-qvpc-sf-0.domaine9e4111c-abba-4979-af07-d13d90bbd84d
> [cluster-instance-id] di-000-005-1
>
> TID: [0] [STRATOS] [2015-06-12 18:55:12,834] ERROR
> {org.apache.stratos.autoscaler.rule.RuleTasksDelegator} -  Request in
> flight threshold is Zero
>
> TID: [0] [STRATOS] [2015-06-12 18:55:12,835]  INFO
> {org.apache.stratos.autoscaler.rule.RuleLog} -  [scale-up] Trying to scale
> up over max, hence not scaling up cluster itself and
>
>                         notifying to parent for possible group scaling or
> app bursting.
>
>                         [cluster]
> di-000-005.di-000-005.cisco-qvpc-sf-0.domain [instance id]di-000-005-1
> [max] 1
>
>
>
> (I tried to file a JIRA on this, but I’m having a bit of trouble, hence
> posting here to get the ball rolling).
>
>
>
> Thanks, Shaheed
>
>
>
>
>
>
>
>
>
> --
>
> Akila Ravihansa Perera
> Software Engineer, WSO2
>
> Blog: http://ravihansa3000.blogspot.com
>
>
>
>
>
> --
>
> Imesh Gunaratne
>
>
>
> Senior Technical Lead, WSO2
>
> Committer & PMC Member, Apache Stratos
>
>
>
>   --
>
> Reka Thirunavukkarasu
> Senior Software Engineer,
> WSO2, Inc.:http://wso2.com,
>
> Mobile: +94776442007
>
>
>
>
>
>
> --
>
> Reka Thirunavukkarasu
> Senior Software Engineer,
> WSO2, Inc.:http://wso2.com,
>
> Mobile: +94776442007
>
>
>



-- 
Reka Thirunavukkarasu
Senior Software Engineer,
WSO2, Inc.:http://wso2.com,
Mobile: +94776442007
