Thanks, Shaheed, for the update on the failover issue. I will go through the logs and share my observations.
Thanks,
Reka

On Tue, Jul 14, 2015 at 2:39 PM, Shaheedur Haque (shahhaqu) <shahh...@cisco.com> wrote:

This thread breaks out the failover issue previously reported in the thread below.

*Scenario*

1. Start 2 orchestration controllers, octl-01 and octl-02. At this stage, Stratos comes up on octl-01.
2. Shortly afterwards, launch two single-VM applications (cisco-sample and cartridge-proxy).
3. Some time later, launch three more single-VM applications (di-000-001 to di-000-003).
4. Some time later, kill octl-01.
5. Wait.
6. About 60 seconds later, DNS switches to point the Cartridge Agents to octl-02, and then Stratos comes up on octl-02.

*Expected result*

· The new Stratos instance picks up the old cartridges and carries on.

*Actual results*

· The new Stratos instance seems to treat the old cartridges as dead, and spins up new VMs.
· All sorts of exceptions are being thrown.

*Supporting info*

Two tarballs are attached. The “before” set was taken before step 4; the “after” set was taken after step 6. Each contains tarballs for octl-01 and octl-02. In the first, the wso2 logs are in:

· tech-support-octl-01.tar, under /var/log/apache-stratos

and in the second, they are in:

· tech-support-octl-02.tar, under /var/log/apache-stratos

*From:* Shaheedur Haque (shahhaqu)
*Sent:* 07 July 2015 12:10
*To:* dev@stratos.apache.org
*Subject:* Problem with Stratos 4.1 autohealing (was RE: Problems with Stratos 4.1 failover and autohealing)

I met with several of the wso2 team, and we agreed the following next steps:

1. Treat the “Stratos failover” issue as separate from the “duplicate instances” issue, and use a new thread for the former, keeping this thread for the latter.
2. The lack of synchronisation between the cleanup of the faulty VMs and the startup of the replacement VMs was noted.
   a. It was mentioned that the period of overlap should be seconds, perhaps up to 60s.
   b. I noted that for many of our scenarios, where the hardware is 1:1 mapped to VMs (i.e. neither over-subscribed nor under-subscribed), the overlap may be a problem. I will consult with others and confirm. Imesh ack’d this point.
   c. I will provide the wso2 logs for my test case (summarised below).
   d. I will check with the test team, where the problem was originally detected, whether this is transient (up to 60s) or long-lived, and gather logs as needed.

*From:* Imesh Gunaratne [mailto:im...@apache.org]
*Sent:* 06 July 2015 17:11
*To:* dev
*Subject:* Re: Problems with Stratos 4.1 failover and autohealing

On Mon, Jul 6, 2015 at 8:13 PM, Akila Ravihansa Perera <raviha...@wso2.com> wrote:

> If it detects that cartridge agent has not published any health event within the given time window (60mins), it will publish the MemberFault event.

This should be 60 seconds.

On Mon, Jul 6, 2015 at 8:13 PM, Akila Ravihansa Perera <raviha...@wso2.com> wrote:

Hi Shaheed,

Regarding the Stratos fail-over scenario: looking at the logs, it seems that there are multiple member fault cases detected by CEP. This FaultHandlingWindowProcessor [1] running inside CEP is implemented to check for health stats published by the cartridge agents. It does not check whether the actual VM, or the processes running inside the VM, are active or not. If it detects that the cartridge agent has not published any health event within the given time window (60 mins), it will publish the MemberFault event.
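In outline, the check behaves like the following sketch. This is illustrative only, not the actual FaultHandlingWindowProcessor source: the class, method names and member id are placeholders, and where the real processor publishes a MemberFault event the sketch simply logs.

// Illustrative sketch of the time-window fault check described above.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class HealthWindowSketch {

    // member id -> epoch millis of the last health event received from its agent
    private final Map<String, Long> lastHealthEvent = new ConcurrentHashMap<>();

    // Assumed timeout; the deployed value is what matters (60s, per the correction above)
    private static final long TIME_OUT_MS = 60_000;

    // Called whenever a health statistics event arrives from a cartridge agent
    public void onHealthEvent(String memberId, long timestampMs) {
        lastHealthEvent.put(memberId, timestampMs);
    }

    // Called periodically; flags members whose agents have gone quiet
    public void checkForFaultyMembers(long nowMs) {
        for (Map.Entry<String, Long> entry : lastHealthEvent.entrySet()) {
            if (nowMs - entry.getValue() > TIME_OUT_MS) {
                // The real processor publishes a MemberFault event here.
                System.out.printf("Faulty member detected [member-id] %s "
                        + "[last time-stamp] %d [time-out] %d milliseconds%n",
                        entry.getKey(), entry.getValue(), TIME_OUT_MS);
            }
        }
    }

    public static void main(String[] args) {
        HealthWindowSketch sketch = new HealthWindowSketch();
        long now = System.currentTimeMillis();
        sketch.onHealthEvent("example-member-id", now - 90_000); // last event 90s ago
        sketch.checkForFaultyMembers(now);                       // reported as faulty
    }
}

The key point is that the decision is based purely on the timestamp of the last received health event, so anything that delays or drops those events produces exactly the same faulty-member log lines as a genuine VM failure.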
Ideally this should not happen regardless of whether the deployment is HA or not, because the FaultHandlingWindowProcessor counts the 60 mins interval from the point the Stratos server started. The only reason I can think of that may have caused these faulty member logs is that Stratos (or the external CEP) did not receive the health stats published by the cartridge agents on time.

Are you running Stratos in single-JVM mode? If so, I'd highly recommend that you deploy the CEP profile separately. Please refer to [2] for the reference deployment architecture of a Stratos HA setup.

Regarding your second concern, "Kill multiple Cartridges": I believe your concern is that the first message relating to recovery of the CPS came 66s after the faulty member detection, and that recovery of the other VMs started before that. I'd like to summarize a few points regarding the Stratos recovery process which I think will explain that behavior.

- Stratos detects faulty members only by checking the health stats published by the cartridge agent, not by checking the existence of the actual VM.
- Upon detecting such a faulty member, CEP will publish a faulty member event; after receiving that event, the Autoscaler (AS) will move the instance to its obsolete list. The AS will iterate over this obsolete list and try to terminate those instances via the Cloud Controller (CC). When the termination is complete, CC will publish a member terminated event, and upon receiving it the AS will remove that member from the obsolete list.
- If CC fails to terminate the instance, it will keep retrying until the obsolete member timeout interval has passed. When the timeout expires, the AS will forcefully remove that member from the obsolete list.
- Stratos will not try to re-spawn the faulty instance right after detecting it. This recovery process is triggered by the cluster monitor object. There is Drools logic which is executed periodically (the monitor interval is configurable, default 90s) which runs the min-check, scale-up and scale-down rules and decides whether to spawn new instances. So this recovery process could take some time, depending on your monitor interval.

So, in summary:

- There is no way to control which cluster monitor gets executed first, so it is possible that CPS recovery takes place after the other VMs.
- The reason you see a lot of termination logs is that Stratos continuously tries to terminate those faulty members until the obsolete member timeout has expired. This can occur if the actual VM was killed manually or by other means.

However, we need to figure out why you got these faulty member events. Could you send us the CEP trace log for further analysis? Also, you can enable debug logs for the FaultHandlingWindowProcessor class (see the log4j snippet after this mail) to get more insight into what's going on.

I hope this explains your concerns. We can have a Hangout session to discuss this further.

[1] https://github.com/apache/stratos/blob/master/extensions/cep/stratos-cep-extension/src/main/java/org/apache/stratos/cep/extension/FaultHandlingWindowProcessor.java
[2] https://cwiki.apache.org/confluence/display/STRATOS/4.1.0+Configuring+HA+in+Single+JVM+Mode+on+EC2

Thanks.
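For reference, the debug logging suggested above can be enabled by adding a logger entry like the one below to the log4j configuration on the node that runs the CEP extension. The file location is an assumption (<STRATOS_HOME>/repository/conf/log4j.properties in a Carbon-based Stratos/CEP distribution); adjust it to your setup.

# Enable DEBUG output for the fault-handling window processor only
log4j.logger.org.apache.stratos.cep.extension.FaultHandlingWindowProcessor=DEBUG

Depending on whether the log4j configuration is hot-reloaded in your build, a restart of the affected profile may be needed for the change to take effect.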
On Mon, Jul 6, 2015 at 5:00 PM, Shaheedur Haque (shahhaqu) <shahh...@cisco.com> wrote:

Hi,

I am seeing some issues with Stratos 4.1 around failover on the one hand (i.e. stop one running instance of Stratos, start another as a replacement) and autohealing on the other (i.e. a Cartridge dies and needs to be replaced by Stratos), which feel as though they might have a common cause, so I am lumping them together for now. I’ll explain the two scenarios as best I can… and then attach some logs covering one scenario in detail.

*Failover of Stratos*

The situation here is that I have 3 servers, all running MySQL in active-active mode, and Stratos running on at most one of them under Pacemaker control. The test case is that I kill the server (actually a VM) running Stratos, and Pacemaker responds by starting Stratos on one of the remaining servers. Around the time of the action, there is no ongoing REST activity in terms of defining new Cartridges, Applications or anything like that.

· Expected result: the replacement Stratos instance “acquires” the running Cartridges, and the system continues without killing/replacing any Cartridges.
· Actual result: all Cartridges are killed and restarted.

*Kill multiple Cartridges*

I have a test where several Cartridges are killed at the same time, as follows. We have a Cartridge Proxy Server (CPS) that hosts the (Java) Cartridge Agent for VMs which cannot run one natively. The CPS therefore runs N+1 JCAs, one for itself plus N for the Cartridges it is proxying. Killing the CPS therefore causes all N+1 JCAs to disappear, and from a Stratos point of view, all N+1 VMs need to be restarted.

· Expected result: each Cartridge is replaced by Stratos.
· Actual result: looking at the Stratos logs, it seems that there is a race between the termination logic, which cleans up as old cluster members die, and the startup logic, which is spinning up replacement VMs for the terminated ones.

Both cases previously seemed to work fine, where we *think* “previously” means versions of Stratos 4.1 dating back ~3 weeks or so. I attach a series of log extracts for the second scenario; I suspect this also covers the first, and even if not, at least we can start here.

1. The original spin-up of the CPS is at 2015-06-11 19:50:20,585. The CPS fault is seen around 2015-06-12 18:54:51 and, contemporaneously, all the other faults are seen:

TID: [0] [STRATOS] [2015-06-12 18:54:51,788] INFO {org.apache.stratos.cep.extension.FaultHandlingWindowProcessor} - Faulty member detected [member-id] di-000-007.di-000-007.cisco-qvpc-sf-0.domain88ef85b5-5ec6-4d4e-b7b2-1ab9d3f7b160 with [last time-stamp] 1434135185318 [time-out] 60000 milliseconds
...
TID: [0] [STRATOS] [2015-06-12 18:54:51,803] INFO {org.apache.stratos.cep.extension.FaultHandlingWindowProcessor} - Faulty member detected [member-id] di-000-005.di-000-005.cisco-qvpc-sf-0.domain4bcdd49a-985f-4247-a70b-31c8d65153d8 with [last time-stamp] 1434135192556 [time-out] 60000 milliseconds
...
TID: [0] [STRATOS] [2015-06-12 18:54:51,837] INFO {org.apache.stratos.cep.extension.FaultHandlingWindowProcessor} - Faulty member detected [member-id] cartridge-proxy.cartridge-proxy.cartridge-proxy.domain95a58a3e-e7a9-4071-b42b-3551bad25e6a with [last time-stamp] 1434135189149 [time-out] 60000 milliseconds
... etc ...
TID: [0] [STRATOS] [2015-06-12 18:54:51,862] INFO {org.apache.stratos.cep.extension.FaultHandlingWindowProcessor} - Faulty member detected [member-id] di-000-010.di-000-010.cisco-qvpc-sf-0.domain2ab24c1d-53cb-452e-8aa2-f02eecf2db78 with [last time-stamp] 1434135193054 [time-out] 60000 milliseconds
... etc ...

2. The last message relating to the CPS failure is at 18:54:52. The first message relating to recovery of the CPS is some 66 seconds later:

TID: [0] [STRATOS] [2015-06-12 18:55:58,516] INFO {org.apache.stratos.autoscaler.rule.RuleLog} - [min-check] Partition available, hence trying to spawn an instance to fulfil minimum count! [cluster] cartridge-proxy.cartridge-proxy.cartridge-proxy.domain

3. However, recovery of the other VMs started anywhere from 54 seconds before this point to as much as 26 seconds after it (the timestamp arithmetic is re-derived in the small sketch after this list):

TID: [0] [STRATOS] [2015-06-12 18:55:04,564] INFO {org.apache.stratos.autoscaler.client.CloudControllerClient} - Trying to spawn an instance via cloud controller: [cluster] di-000-008.di-000-008.cisco-qvpc-sf-0.domain [partition] whole-region [network-partition-id] RegionOne
...
TID: [0] [STRATOS] [2015-06-12 18:56:20,603] INFO {org.apache.stratos.autoscaler.client.CloudControllerClient} - Trying to spawn an instance via cloud controller: [cluster] di-000-007.di-000-007.cisco-qvpc-sf-0.domain [partition] whole-region [network-partition-id] RegionOne

4. Between points 2 and 3 (with some fuzz for the races), I see Stratos cleaning up the old VMs. For example:

TID: [0] [STRATOS] [2015-06-12 18:55:04,580] INFO {org.apache.stratos.cloud.controller.iaases.JcloudsIaas} - Starting to terminate member: [cartridge-type] cisco-qvpc-sf-0 [member-id] di-000-008.di-000-008.cisco-qvpc-sf-0.domaina0f15eed-9599-4f3e-a70a-93ddd02ccf5f

5. However, around that point, look at this:

TID: [0] [STRATOS] [2015-06-12 18:55:12,835] INFO {org.apache.stratos.autoscaler.rule.RuleLog} - [scale-up] Trying to scale up over max, hence not scaling up cluster itself and notifying to parent for possible group scaling or app bursting. [cluster] di-000-005.di-000-005.cisco-qvpc-sf-0.domain [instance id]di-000-005-1 [max] 1

6. Is it possible that there is a race between the terminates and restarts? It certainly seems so, though detailed knowledge of Stratos is needed to confirm it:

TID: [0] [STRATOS] [2015-06-12 18:55:12,798] INFO {org.apache.stratos.autoscaler.rule.RuleLog} - [min-check] Partition available, hence trying to spawn an instance to fulfil minimum count! [cluster] di-000-005.di-000-005.cisco-qvpc-sf-0.domain
TID: [0] [STRATOS] [2015-06-12 18:55:12,798] INFO {org.apache.stratos.autoscaler.client.CloudControllerClient} - Trying to spawn an instance via cloud controller: [cluster] di-000-005.di-000-005.cisco-qvpc-sf-0.domain [partition] whole-region [network-partition-id] RegionOne
TID: [0] [STRATOS] [2015-06-12 18:55:12,799] INFO {org.apache.stratos.common.client.CloudControllerServiceClient} - Terminating instance via cloud controller: [member] di-000-005.di-000-005.cisco-qvpc-sf-0.domain4bcdd49a-985f-4247-a70b-31c8d65153d8
TID: [0] [STRATOS] [2015-06-12 18:55:12,812] INFO {org.apache.stratos.cloud.controller.messaging.publisher.TopologyEventPublisher} - Publishing member created event: [service-name] cisco-qvpc-sf-0 [cluster-id] di-000-005.di-000-005.cisco-qvpc-sf-0.domain [cluster-instance-id] di-000-005-1 [member-id] di-000-005.di-000-005.cisco-qvpc-sf-0.domaine9e4111c-abba-4979-af07-d13d90bbd84d [instance-id] null [network-partition-id] RegionOne [partition-id] whole-region [lb-cluster-id] null
TID: [0] [STRATOS] [2015-06-12 18:55:12,817] INFO {org.apache.stratos.messaging.message.processor.topology.MemberCreatedMessageProcessor} - Member created: [service-name] cisco-qvpc-sf-0 [cluster-id] di-000-005.di-000-005.cisco-qvpc-sf-0.domain [member-id] di-000-005.di-000-005.cisco-qvpc-sf-0.domaine9e4111c-abba-4979-af07-d13d90bbd84d [cluster-instance-id] di-000-005-1
TID: [0] [STRATOS] [2015-06-12 18:55:12,834] ERROR {org.apache.stratos.autoscaler.rule.RuleTasksDelegator} - Request in flight threshold is Zero
TID: [0] [STRATOS] [2015-06-12 18:55:12,835] INFO {org.apache.stratos.autoscaler.rule.RuleLog} - [scale-up] Trying to scale up over max, hence not scaling up cluster itself and notifying to parent for possible group scaling or app bursting. [cluster] di-000-005.di-000-005.cisco-qvpc-sf-0.domain [instance id]di-000-005-1 [max] 1
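As a quick sanity check on the gaps quoted in points 2 and 3, the small Java sketch below re-derives them from the timestamps in the excerpts above. It is purely illustrative: the 18:54:52 timestamp is padded with “,000” because point 2 omits the milliseconds, and the last comparison only covers the excerpts shown here, so it yields ~22s rather than the 26s mentioned in point 3 (that figure presumably comes from a line not excerpted).

import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class LogGapCheck {

    // Timestamp layout used in the wso2 log lines above, e.g. 2015-06-12 18:55:58,516
    private static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    private static double gapSeconds(String from, String to) {
        return Duration.between(LocalDateTime.parse(from, FMT),
                                LocalDateTime.parse(to, FMT)).toMillis() / 1000.0;
    }

    public static void main(String[] args) {
        // Point 2: last CPS-failure message vs first CPS recovery message (~66s)
        System.out.println(gapSeconds("2015-06-12 18:54:52,000", "2015-06-12 18:55:58,516"));
        // Point 3: first other-VM spawn vs CPS recovery (~54s before)
        System.out.println(gapSeconds("2015-06-12 18:55:04,564", "2015-06-12 18:55:58,516"));
        // Point 3: last excerpted other-VM spawn vs CPS recovery (~22s after)
        System.out.println(gapSeconds("2015-06-12 18:55:58,516", "2015-06-12 18:56:20,603"));
    }
}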
(I tried to file a JIRA on this, but I’m having a bit of trouble, hence posting here to get the ball rolling.)

Thanks, Shaheed

--
Akila Ravihansa Perera
Software Engineer, WSO2
Blog: http://ravihansa3000.blogspot.com

--
Imesh Gunaratne
Senior Technical Lead, WSO2
Committer & PMC Member, Apache Stratos

--
Reka Thirunavukkarasu
Senior Software Engineer, WSO2, Inc.: http://wso2.com, Mobile: +94776442007