Hi Shaheed,
As i went through the logs(wso2carbon.log and wso2-cep-trace.log) of
octl-02-after, identified that there is a possibility for the member fault
to occur as cartridge agent didn't publish the events for more than one
minute from 00:53:24 to 00:55:09 ([2], [3]) which is obviously more than
one minute where the cartridge agent health publishing interval is 15s
([1], [2]). I observed this sequence([1], [2], [3]) in the
wso2-cep-trace.log. It has happened intermittently. Please see below for
the details. Member fault occur[4] in stratos side at 00:54:34 as the last
event from cartridge agent was marked as 00:53:24 [2]. The next event from
cartridge agent received at 00:55:09. So, we will need to investigate at
the cartridge agent side why it failed sent the health stats intermittently
even after everything back to normal as i could see that cartridge agent
was publishing health stat after octl-02 came without any issue for nearly
two minutes Eg: [1] and [2]
Is it possible for you to get the cartridge agent logs? I know that it is
little bit hard as stratos will immediately terminate the instance. But if
you could tail the logs, then i believe that you can get the logs from
where you tailed.
[1]
00:53:09,347 [-] [DataBridge-Core-pool-2-thread-8] INFO EVENT_TRACE_LOGGER
TenantId=-1234 : Input Event Adaptor : DefaultWSO2EventInputAdaptor,
received
Event{
streamId='cartridge_agent_health_stats:1.0.0',
timeStamp=0,
metaData=null,
correlationData=null,
payloadData=[cartridge-proxy.cartridge-proxy.cartridge-proxy.domain,
cartridge-proxy-1, RegionOne,
cartridge-proxy.cartridge-proxy.cartridge-proxy.domainc1dd3d8b-95ed-439c-96be-daaac2cb15a3,
whole-region, load_average, 0.0],
arbitraryDataMap={},
}
[2]
00:53:24,345 [-] [DataBridge-Core-pool-2-thread-4] INFO EVENT_TRACE_LOGGER
TenantId=-1234 : Input Event Adaptor : DefaultWSO2EventInputAdaptor,
received
Event{
streamId='cartridge_agent_health_stats:1.0.0',
timeStamp=0,
metaData=null,
correlationData=null,
payloadData=[cartridge-proxy.cartridge-proxy.cartridge-proxy.domain,
cartridge-proxy-1, RegionOne,
cartridge-proxy.cartridge-proxy.cartridge-proxy.domainc1dd3d8b-95ed-439c-96be-daaac2cb15a3,
whole-region, load_average, 0.0],
arbitraryDataMap={},
}
[3]
00:55:09,349 [-] [DataBridge-Core-pool-2-thread-10] INFO
EVENT_TRACE_LOGGER TenantId=-1234 : Input Event Adaptor :
DefaultWSO2EventInputAdaptor, received
Event{
streamId='cartridge_agent_health_stats:1.0.0',
timeStamp=0,
metaData=null,
correlationData=null,
payloadData=[cartridge-proxy.cartridge-proxy.cartridge-proxy.domain,
cartridge-proxy-1, RegionOne,
cartridge-proxy.cartridge-proxy.cartridge-proxy.domainc1dd3d8b-95ed-439c-96be-daaac2cb15a3,
whole-region, load_average, 0.0],
arbitraryDataMap={},
}
[4] TID: [0] [STRATOS] [2015-07-14 00:54:34,818] INFO
{org.apache.stratos.cep.extension.FaultHandlingWindowProcessor} - Faulty
member detected [member-id]
cartridge-proxy.cartridge-proxy.cartridge-proxy.domainc1dd3d8b-95ed-439c-96be-daaac2cb15a3
with [last time-stamp] 1436835204346 [time-out] 60000 milliseconds
1436835204346 ==> 14 Jul 2015 00:53:24 GMT
Thanks,
Reka
On Tue, Jul 14, 2015 at 5:21 PM, Reka Thirunavukkarasu <[email protected]>
wrote:
> Thanks Shaheed for the update on the failover issue..I will go through the
> logs and share the observation further..
>
> Thanks,
> Reka
>
> On Tue, Jul 14, 2015 at 2:39 PM, Shaheedur Haque (shahhaqu) <
> [email protected]> wrote:
>
>> This thread breaks out the failover issue previously reported in the
>> thread below.
>>
>>
>>
>> *Scenario*
>>
>>
>>
>> 1. Start 2 orchestration controllers, octl-01 and octl-02. At this
>> stage, Stratos comes up on octl-01.
>>
>> 2. Shortly later, launch two single VM applications (cisco-sample
>> and cartridge-proxy)
>>
>> 3. Some time later, launch three more single VM applications
>> (di-000-001 to di-000-003)
>>
>> 4. Some time later, kill octl-01
>>
>> 5. Wait
>>
>> 6. About 60 seconds later or so, DNS switched to point Cartridge
>> Agents to octl-02 and then Stratos comes up on octl-02.
>>
>>
>>
>> *Expected result*
>>
>>
>>
>> · New Stratos instance picks up the old cartridges and carries on
>>
>>
>>
>> *Actual results*
>>
>>
>>
>> · The new Startos instance seems to think the old cartridges are
>> dead or something, and spins up new VMs.
>>
>> · All sorts of exceptions are being thrown.
>>
>>
>>
>> *Supporting info*
>>
>>
>>
>> Two tarballs attached. The “before” set was taken before step 4, the
>> “after” set was taken after step 6. Each contains tarballs for octl-01 and
>> octl-02. Thus, in the first, the wso2logs are at
>>
>>
>>
>> · tech-support-octl-01.tar, under /var/log/apache-stratos
>>
>>
>>
>> and in the second, they are at
>>
>>
>>
>> · tech-support-octl-02.tar, under /var/log/apache-stratos
>>
>>
>>
>>
>>
>>
>>
>> *From:* Shaheedur Haque (shahhaqu)
>> *Sent:* 07 July 2015 12:10
>> *To:* [email protected]
>> *Subject:* Problem with Stratos 4.1 autohealing (was RE: Problems with
>> Stratos 4.1 failover and autohealing)
>>
>>
>>
>> I met with several of the wso2 team, and we agreed the following next
>> steps:
>>
>>
>>
>> 1. Treat the “Stratos failover” issue as separate from the
>> “duplicate instances” issue, and use a new thread for the former, keeping
>> this thread for the latter.
>>
>> 2. The lack of synchronisation between the cleanup of the faulty
>> VMs and the startup of the replacement VMs was noted.
>>
>> a. It was mentioned that the period of overlap should be seconds,
>> perhaps upto 60s.
>>
>> b. I noted that for many of our scenarios, where the hardware is
>> 1:1 mapped to VMs (i.e. neither over-subscribed nor under-subscribed), the
>> overlap may be a problem. I will consult with others and confirm. Imesh
>> ack’d this point.
>>
>> c. I will provide the wso2 logs for my test case (summarised below).
>>
>> d. I will check with the test team where the problem was originally
>> detected if this is a transient (upto 60s) or long-lived, and gather logs
>> as needed.
>>
>>
>>
>> *From:* Imesh Gunaratne [mailto:[email protected]]
>> *Sent:* 06 July 2015 17:11
>> *To:* dev
>> *Subject:* Re: Problems with Stratos 4.1 failover and autohealing
>>
>>
>>
>> On Mon, Jul 6, 2015 at 8:13 PM, Akila Ravihansa Perera <
>> [email protected]> wrote:
>>
>>
>>
>> If it detects that cartridge agent has not published any health event
>> within the given time window (60mins), it will publish the MemberFault
>> event.
>>
>>
>>
>> This should be 60 seconds.
>>
>>
>>
>> On Mon, Jul 6, 2015 at 8:13 PM, Akila Ravihansa Perera <
>> [email protected]> wrote:
>>
>> Hi Shaheed,
>>
>>
>> Regarding Stratos fail-over scenario, by looking at the logs it seems
>> that there are multiple member fault cases detected by CEP. This
>> FaultHandlingWindowProcessor [1] running inside CEP is implemented to check
>> for health stats published by cartridge agents. It does not check whether
>> actual VM or processes running inside the VM is active or not. If it
>> detects that cartridge agent has not published any health event within the
>> given time window (60mins), it will publish the MemberFault event.
>>
>> Ideally this should not happen regardless of whether the deployment is HA
>> or not. Because FaultHandlingWindowProcessor will count the 60mins interval
>> from the point Stratos server started. Only reason I could think of which
>> may have caused these faulty member logs is Stratos (or external CEP) did
>> not receive health stats published by cartridge agents on time.
>>
>>
>>
>> Are you running Stratos in single-JVM mode? If so, I'd highly recommend
>> that you deploy the CEP profile separately. Please refer to [2] for
>> reference deployment architecture of Stratos HA setup.
>>
>>
>>
>> Regarding your second concern; Kill multiple Cartridges, I believe your
>> concern is that the first message relating to recovery of the CPS came
>> 66s after faulty member detection and recovery of other VMs started before
>> that. I'd like to summarize few points regarding Stratos recovery process
>> which I think will explain that behavior.
>>
>>
>>
>> - Stratos detects faulty members only by checking the health stats
>> published by cartridge agent and not by checking the existence of the
>> actual VM.
>>
>> - Upon detecting such faulty member, CEP will publish a faulty member
>> event and after receiving such event AS will move that instance to obsolete
>> list. AS will iterate over this obsolete list and try to terminate those
>> instances via CC. When the termination is complete, CC will publish member
>> terminated event, upon receiving that AS will remove that member from
>> obsolete list.
>>
>> - If CC fails to terminate the instance, it will keep retrying until a
>> obsolete member timeout interval has passed. When the timeout expires, AS
>> will forcefully remove that member from obsolete list.
>>
>> - Stratos will not try to re-spawn the faulty instance right after
>> detecting it. This recovery process is triggered by the Cluster monitor
>> object. There is a Drools logic which gets executed periodically (monitor
>> interval is configurable, default is 90s) which will execute the min-check
>> rule, scale-up rule and scale-down rule and decide whether to spawn new
>> instances. So this recovery process could take some time depending on your
>> monitor interval.
>>
>>
>>
>>
>>
>> So in retrospect;
>>
>> - There is no way to control which cluster monitor gets executed first,
>> so therefore it is possible that CPS recovery takes place after other VMs.
>>
>> - The reason for seeing lot of termination logs is because
>> Stratos continuously tries to terminate those faulty members until obsolete
>> member timeout has expired. This could occur if the actual VM was killed
>> manually or by other means.
>>
>>
>>
>> However, we need to figure out why you got these faulty member events.
>> Could you send us the CEP trace log for further analysis? Also you can
>> enable debug logs for FaultHandlingWindowProcessor class to get more
>> insights as to what's going on.
>>
>>
>>
>> I hope this would explain your concerns. We can have a Hangout session to
>> discuss this further.
>>
>>
>>
>> [1]
>> https://cwiki.apache.org/confluence/display/STRATOS/4.1.0+Configuring+HA+in+Single+JVM+Mode+on+EC2
>>
>> [2]
>> https://github.com/apache/stratos/blob/master/extensions/cep/stratos-cep-extension/src/main/java/org/apache/stratos/cep/extension/FaultHandlingWindowProcessor.java
>>
>>
>>
>> Thanks.
>>
>>
>>
>> On Mon, Jul 6, 2015 at 5:00 PM, Shaheedur Haque (shahhaqu) <
>> [email protected]> wrote:
>>
>> Hi,
>>
>>
>>
>> I am seeing some issues around Stratos 4.1 around failover on the one
>> hand (i.e. stop one running instance of Startos, start another as a
>> replacement), and autohealing (i.e. a Cartridge dies and need to be
>> replaced by Stratos) which feel as though they might have a common cause,
>> so I am lumping them together for now. I’ll explain the two scenarios as I
>> can…and the attach some logs explaining one scenario in detail.
>>
>>
>>
>> *Failover of Stratos*
>>
>>
>>
>> The situation here is that I have 3 servers, all running MySQL in
>> active-active mode, and Stratos running on at most one of them under
>> Pacemaker control. The test case is that I kill the server (actually a VM)
>> running Stratos, and Pacemaker responds by starting Stratos on one of the
>> remaining servers. Around the time of the action, there is no ongoing REST
>> activity in terms of defining new Cartridges, or Applications or anything
>> like that.
>>
>>
>>
>> · Expected result: the replacement Stratos instance “acquires”
>> the running Cartridges, and the system continues without killing/replacing
>> any Cartridges.
>>
>> · Actual result: all Cartridges are kill and restarted.
>>
>>
>>
>> *Kill multiple Cartridges*
>>
>>
>>
>> I have a test where several Cartridges are killed at the same time as
>> follows. We have a Cartridge Proxy Server (CPS) that hosts the (Java)
>> Cartridge agent for VMs which cannot run one natively. The CPS therefore
>> runs N+1 JCAs, one for itself plus N for the Cartridges it is proxying.
>> Killing the CPS therefore causes all N+1 JCAs to disappear, and from a
>> Startos point of view, all N+1 VMs need to be restarted.
>>
>>
>>
>> · Expected result: each Cartridges is replaced by Stratos.
>>
>> · Actual result: looking at the Stratos logs, it seems that there
>> is a race between the termination logic which cleans up as old cluster
>> members dies against the startup logic which is spinning up replacement VMs
>> for the terminated ones.
>>
>>
>>
>> Both cases previously seemed to work fine, where we **think** previously
>> means versions of Stratos 4.1 dating back ~3 weeks or so. I attach a series
>> of log extracts for the second scenario; I suspect this also covers the
>> first, even if not, at least we can start here.
>>
>>
>>
>> 1. Original spin up of CPS is at 2015-06-11 19:50:20,585. CPS fault
>> seen around 2015-06-12 18:54:51, contemporaneously, all the other faults
>> are seen:
>>
>>
>>
>> ID: [0] [STRATOS] [2015-06-12 18:54:51,788] INFO
>> {org.apache.stratos.cep.extension.FaultHandlingWindowProcessor} - Faulty
>> member detected [member-id]
>> di-000-007.di-000-007.cisco-qvpc-sf-0.domain88ef85b5-5ec6-4d4e-b7b2-1ab9d3f7b160
>> with [last time-stamp] 1434135185318 [time-out] 60000 milliseconds
>>
>> ...
>>
>> TID: [0] [STRATOS] [2015-06-12 18:54:51,803] INFO
>> {org.apache.stratos.cep.extension.FaultHandlingWindowProcessor} - Faulty
>> member detected [member-id]
>> di-000-005.di-000-005.cisco-qvpc-sf-0.domain4bcdd49a-985f-4247-a70b-31c8d65153d8
>> with [last time-stamp] 1434135192556 [time-out] 60000 milliseconds
>>
>> ...
>>
>> TID: [0] [STRATOS] [2015-06-12 18:54:51,837] INFO
>> {org.apache.stratos.cep.extension.FaultHandlingWindowProcessor} - Faulty
>> member detected [member-id]
>> cartridge-proxy.cartridge-proxy.cartridge-proxy.domain95a58a3e-e7a9-4071-b42b-3551bad25e6a
>> with [last time-stamp] 1434135189149 [time-out] 60000 milliseconds
>>
>> ...
>>
>> etc
>>
>> ...
>>
>> TID: [0] [STRATOS] [2015-06-12 18:54:51,862] INFO
>> {org.apache.stratos.cep.extension.FaultHandlingWindowProcessor} - Faulty
>> member detected [member-id]
>> di-000-010.di-000-010.cisco-qvpc-sf-0.domain2ab24c1d-53cb-452e-8aa2-f02eecf2db78
>> with [last time-stamp] 1434135193054 [time-out] 60000 milliseconds
>>
>> ...
>>
>> etc
>>
>> ...
>>
>>
>>
>> 2. The last message relating to the CPS failure is at 18:54:52. The
>> first message relating to recovery of the CPS is some 66 seconds later:
>>
>>
>>
>> TID: [0] [STRATOS] [2015-06-12 18:55:58,516] INFO
>> {org.apache.stratos.autoscaler.rule.RuleLog} - [min-check] Partition
>> available, hence trying to spawn an instance to fulfil minimum count!
>> [cluster] cartridge-proxy.cartridge-proxy.cartridge-proxy.domain
>>
>>
>>
>> 3. However, recovery of other VMs started as much as 54 seconds
>> before this point to as much as 26 seconds after:
>>
>>
>>
>> TID: [0] [STRATOS] [2015-06-12 18:55:04,564] INFO
>> {org.apache.stratos.autoscaler.client.CloudControllerClient} - Trying to
>> spawn an instance via cloud controller: [cluster]
>> di-000-008.di-000-008.cisco-qvpc-sf-0.domain [partition] whole-region
>> [network-partition-id] RegionOne
>>
>> ...
>>
>> TID: [0] [STRATOS] [2015-06-12 18:56:20,603] INFO
>> {org.apache.stratos.autoscaler.client.CloudControllerClient} - Trying to
>> spawn an instance via cloud controller: [cluster]
>> di-000-007.di-000-007.cisco-qvpc-sf-0.domain [partition] whole-region
>> [network-partition-id] RegionOne
>>
>>
>>
>> 4. Between point 2 and 3 (with fuzz for the races), I see Stratos
>> cleaning the old VMs. For example:
>>
>>
>>
>> TID: [0] [STRATOS] [2015-06-12 18:55:04,580] INFO
>> {org.apache.stratos.cloud.controller.iaases.JcloudsIaas} - Starting to
>> terminate member: [cartridge-type] cisco-qvpc-sf-0 [member-id]
>> di-000-008.di-000-008.cisco-qvpc-sf-0.domaina0f15eed-9599-4f3e-a70a-93ddd02ccf5f
>>
>>
>>
>> 5. However, around that point, look at this:
>>
>>
>>
>> TID: [0] [STRATOS] [2015-06-12 18:55:12,835] INFO
>> {org.apache.stratos.autoscaler.rule.RuleLog} - [scale-up] Trying to scale
>> up over max, hence not scaling up cluster itself and
>>
>> notifying to parent for possible group scaling or
>> app bursting.
>>
>> [cluster]
>> di-000-005.di-000-005.cisco-qvpc-sf-0.domain [instance id]di-000-005-1
>> [max] 1
>>
>>
>>
>>
>>
>> 6. Is it possible that there is a race between the terminates and
>> restarts? It certainly seems so, though a detailed knowledge of Startos is
>> needed to confirm it:
>>
>>
>>
>> TID: [0] [STRATOS] [2015-06-12 18:55:12,798] INFO
>> {org.apache.stratos.autoscaler.rule.RuleLog} - [min-check] Partition
>> available, hence trying to spawn an instance to fulfil minimum count!
>> [cluster] di-000-005.di-000-005.cisco-qvpc-sf-0.domain
>>
>> TID: [0] [STRATOS] [2015-06-12 18:55:12,798] INFO
>> {org.apache.stratos.autoscaler.client.CloudControllerClient} - Trying to
>> spawn an instance via cloud controller: [cluster]
>> di-000-005.di-000-005.cisco-qvpc-sf-0.domain [partition] whole-region
>> [network-partition-id] RegionOne
>>
>> TID: [0] [STRATOS] [2015-06-12 18:55:12,799] INFO
>> {org.apache.stratos.common.client.CloudControllerServiceClient} -
>> Terminating instance via cloud controller: [member]
>> di-000-005.di-000-005.cisco-qvpc-sf-0.domain4bcdd49a-985f-4247-a70b-31c8d65153d8
>>
>> TID: [0] [STRATOS] [2015-06-12 18:55:12,812] INFO
>> {org.apache.stratos.cloud.controller.messaging.publisher.TopologyEventPublisher}
>> - Publishing member created event: [service-name] cisco-qvpc-sf-0
>> [cluster-id] di-000-005.di-000-005.cisco-qvpc-sf-0.domain
>> [cluster-instance-id] di-000-005-1 [member-id]
>> di-000-005.di-000-005.cisco-qvpc-sf-0.domaine9e4111c-abba-4979-af07-d13d90bbd84d
>> [instance-id] null [network-partition-id] RegionOne [partition-id]
>> whole-region [lb-cluster-id] null
>>
>> TID: [0] [STRATOS] [2015-06-12 18:55:12,817] INFO
>> {org.apache.stratos.messaging.message.processor.topology.MemberCreatedMessageProcessor}
>> - Member created: [service-name] cisco-qvpc-sf-0 [cluster-id]
>> di-000-005.di-000-005.cisco-qvpc-sf-0.domain [member-id]
>> di-000-005.di-000-005.cisco-qvpc-sf-0.domaine9e4111c-abba-4979-af07-d13d90bbd84d
>> [cluster-instance-id] di-000-005-1
>>
>> TID: [0] [STRATOS] [2015-06-12 18:55:12,834] ERROR
>> {org.apache.stratos.autoscaler.rule.RuleTasksDelegator} - Request in
>> flight threshold is Zero
>>
>> TID: [0] [STRATOS] [2015-06-12 18:55:12,835] INFO
>> {org.apache.stratos.autoscaler.rule.RuleLog} - [scale-up] Trying to scale
>> up over max, hence not scaling up cluster itself and
>>
>> notifying to parent for possible group scaling or
>> app bursting.
>>
>> [cluster]
>> di-000-005.di-000-005.cisco-qvpc-sf-0.domain [instance id]di-000-005-1
>> [max] 1
>>
>>
>>
>> (I tried to file a JIRA on this, but I’m having a bit of trouble, hence
>> posting here to get the ball rolling).
>>
>>
>>
>> Thanks, Shaheed
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> --
>>
>> Akila Ravihansa Perera
>> Software Engineer, WSO2
>>
>> Blog: http://ravihansa3000.blogspot.com
>>
>>
>>
>>
>>
>> --
>>
>> Imesh Gunaratne
>>
>>
>>
>> Senior Technical Lead, WSO2
>>
>> Committer & PMC Member, Apache Stratos
>>
>
>
>
> --
> Reka Thirunavukkarasu
> Senior Software Engineer,
> WSO2, Inc.:http://wso2.com,
> Mobile: +94776442007
>
>
>
--
Reka Thirunavukkarasu
Senior Software Engineer,
WSO2, Inc.:http://wso2.com,
Mobile: +94776442007