Re: stop stratos, kill cartridge instance, start stratos -> cartridge still active?

Michiel Blokzijl (mblokzij) Fri, 04 Jul 2014 01:54:24 -0700

Hi all,

Apologies for the radio silence since my initial email; I’ve been very busy. :(

Thank you Reka for your detailed explanations, I now have a much better understanding of how it’s supposed to work!

I’m not actually using the BAM (yet)*, so STRATOS-685 shouldn’t affect me, right? Even if it doesn’t affect me, I think the fix would still be nice to have in the 4.0.0 branch.

> Just to narrow down the cause of this issue, will you be able to list down the actions that you carried out from the very beginning please? Then we could try to re-produce this problem by going through them.

I’ve attached an annotated log of the steps I’ve taken to reproduce the issue.

I think there’s still an issue in this area, since I’m hitting the problem without using the BAM. I could try Reka’s suggestion of enabling CEP persistence, but since restarting Stratos takes more than 1 min, I suspect the fault handler will think that ALL cartridges are inactive and kill them all. Does anyone know if this is the right documentation for setting up CEP snapshotting?

*: The <BamServerURL> is commented out in <stratos>/repository/conf/carbon.xml.

Best regards,

Michiel

stratos-restart-cartridge-lost.log
Description: Binary data

On 29 Jun 2014, at 18:02, Reka Thirunavukkarasu <[email protected]> wrote:

Hi

On Sun, Jun 29, 2014 at 9:28 PM, Lakmal Warusawithana <[email protected]> wrote:

Hi Reka,

We can double commit these into the 4.0.0 branch and master, and do a 4.0.1 minor release with these fixes. I would also like to suggest some UX improvements for the 4.0.1 release. I had some offline discussions with several folks and will send suggestions on UX improvements, with user stories, in a separate thread.

+1 for the 4.0.1 release with all the minor fixes and UI improvements. I will then commit the fixes to the 4.0.0 branch as well.

Thanks,
Reka

thanks

On Sun, Jun 29, 2014 at 9:05 PM, Reka Thirunavukkarasu <[email protected]> wrote:

Hi Chris,

On Sat, Jun 28, 2014 at 11:54 AM, chris snow <[email protected]> wrote:

Hi Reka, will this fix also need to get applied to 4.0.0?

Yes. As Isuru mentioned, we can apply it as a patch to 4.0.0. The issue occurs only when you publish events to BAM from the cloud controller and when you unsubscribe from an instance. I will create a patch from the 4.0.0 branch with the fix and update the JIRA with the patch.

Thanks,
Reka


On 26 Jun 2014 06:43, "Reka Thirunavukkarasu" <[email protected]> wrote:

Hi all,

On Wed, Jun 25, 2014 at 11:44 PM, Nirmal Fernando <[email protected]> wrote:

On Wed, Jun 25, 2014 at 10:51 PM, Imesh Gunaratne <[email protected]> wrote:

Hi Michiel,

As Reka has pointed out, there is a potential issue in the CloudControllerServiceImpl class. It seems the cloud controller retrieves its state from the registry in the CloudControllerServiceImpl constructor, and that constructor is being invoked in two places other than where it is expected to be:

<Screen Shot 2014-06-25 at 10.36.07 PM.png>

<Screen Shot 2014-06-25 at 10.14.01 PM.png>

This was a bug we identified recently; someone made this commit without properly analyzing the way CC is implemented. :-(

AFAIK Reka has already filed a JIRA and is on her way to removing that broken logic.
I have fixed this issue in master and updated the JIRA (STRATOS-685). I have removed the CloudControllerServiceImpl initialization that was used in the cloud controller when publishing events to BAM and when terminating an instance in response to the MemberReadyToShutdownEvent.

The fix was to get the relevant cartridge information from FasterLookupDataHolder when publishing events to BAM, instead of the buggy way used earlier. Instance termination in response to MemberReadyToShutdownEvent is now handled via the Autoscaler, instead of the CC terminating the member itself. I think this is a good approach, as the Autoscaler is the one that requests starting or terminating a member in all scenarios.
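To illustrate why constructing the service in extra places can be harmful, here is a minimal Python sketch. It is not Stratos code: `DataHolder`, `Service` and the dictionary standing in for the registry are hypothetical names, and the failure mode shown (a stale registry snapshot overwriting newer in-memory state) is an assumption about what such a constructor side effect could cause.

```python
class DataHolder:
    """Hypothetical stand-in for in-memory state shared by the process."""

    def __init__(self):
        self.members = {}


class Service:
    """Hypothetical service whose constructor reloads state from a registry."""

    def __init__(self, holder, registry):
        # Side effect: every construction replaces the in-memory state
        # with whatever was last persisted.
        holder.members = dict(registry)
        self.holder = holder


registry = {"member-a": "Active"}   # last persisted snapshot (assumed stale)
holder = DataHolder()

Service(holder, registry)           # the intended one-time initialisation
holder.members["member-a"] = "Terminated"  # newer update, not yet persisted

Service(holder, registry)           # extra construction, e.g. in an event handler
state_after_extra_init = holder.members["member-a"]

# Reading from the shared holder without constructing the service
# leaves the newer in-memory state untouched.
holder.members["member-a"] = "Terminated"
state_via_holder = holder.members["member-a"]

print(state_after_extra_init, state_via_holder)
```

In this sketch the second construction silently reverts the member to its persisted state, while reading the holder directly does not.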

Thanks,
Reka

However, the above logic does not retrieve the topology from the registry. It is retrieved by the Topology Manager:

<Screen Shot 2014-06-25 at 10.45.36 PM.png>

Therefore the above issue may have very little effect on the problem you have noticed. However, I wonder whether we have an issue in the Autoscaler refreshing its state once restarted.

Just to narrow down the cause of this issue, will you be able to list down the actions that you carried out from the very beginning please? Then we could try to re-produce this problem by going through them.

Many Thanks
Imesh

On Mon, Jun 23, 2014 at 9:53 PM, Michiel Blokzijl (mblokzij) <[email protected]> wrote:

Hi all,

Basically, I was stopping and starting Stratos and looking at how it handled dying cartridges, and found that Stratos only detected cartridge deaths while it was running.

The problem
In steady state, I have some cartridges managed by Stratos,

./stratos.sh list-subscribed-cartridges | grep samp
| cisco-sample-vm | cisco-sample-vm | 1     | Single-Tenant | cisco-sample-vm | Active | 1     | cisco-sample-vm.foo.cisco.com |

nova list | grep samp
| 2b50cc6c-37b1-42d4-9277-e7b624d8b957 | cisco-samp-4a8 | ACTIVE | None     | Running     | core=172.16.2.17, 10.86.205.231  |

All good. Now I stop Stratos and ActiveMQ, 'nova delete' the sample cartridge, and then start ActiveMQ and Stratos again.

Now, at first things look good..:

./stratos.sh list-subscribed-cartridges | grep samp
| cisco-sample-vm | cisco-sample-vm | 1     | Single-Tenant | cisco-sample-vm | Inactive | 0     | cisco-sample-vm.foo.cisco.com |

But then,

root@octl-01:/opt/wso2/apache-stratos-cli-4.0.0# ./stratos.sh list-subscribed-cartridges | grep samp
| cisco-sample-vm | cisco-sample-vm | 1     | Single-Tenant | cisco-sample-vm | Active | 1     | cisco-sample-vm.foo.cisco.com |

# nova list | grep samp
#

How did the cartridge become active without it actually being there? As far as I can tell, Stratos never recovers from this.

I found this bug here: https://issues.apache.org/jira/browse/STRATOS-234 - is this describing the issue I’m seeing? I was a little bit confused by the usage of the word “obsolete”.

Where to go next?
Now, I’ve done a little bit of digging, but I don’t yet have a full mental model of how everything fits together in Stratos - please could someone help me put the pieces together? :)

What I’m seeing is the following:
- The cluster monitor appears to be active:

TID: [0] [STRATOS] [2014-06-23 10:12:39,994] DEBUG {org.apache.stratos.autoscaler.monitor.ClusterMonitor} -  Cluster monitor is running.. ClusterMonitor [clusterId=cisco-sample-vm.cisco-sample-v, serviceId=cisco-sample-vm, deploymentPolicy=Deployment Policy [id]static-1 [partitions] [org.apache.stratos.cloud.controller.stub.deployment.partition.Partition@48cf06c0], autoscalePolicy=ASPolicy [id=economyPolicy, displayName=null, description=null], lbReferenceType=null] {org.apache.stratos.autoscaler.monitor.ClusterMonitor}

- It looks like the CEP FaultHandlingWindowProcessor usually detects inactive members. However, since this member was never active, the timeStampMap doesn’t contain an element for this member, so it’s never checked.

- I think the fault handling is triggered by a fault_message, but I didn’t manage to figure out where it’s coming from. Does anyone know what triggers it? (is it the CEP extension?)
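The timeStampMap observation above can be sketched in a few lines of Python. This is a toy model, not the actual FaultHandlingWindowProcessor: the class `FaultWindow`, its methods and the member ids are hypothetical, and the 60-second timeout is assumed from the figure mentioned later in this email.

```python
class FaultWindow:
    """Toy timestamp-map fault detector (hypothetical, not Stratos code)."""

    def __init__(self, timeout_s=60.0):
        self.timeout_s = timeout_s
        self.last_seen = {}  # member id -> time of last health event

    def on_health_event(self, member_id, now):
        self.last_seen[member_id] = now

    def find_faulty(self, now):
        # Only members that have reported at least once are ever examined.
        return [m for m, t in self.last_seen.items()
                if now - t > self.timeout_s]


window = FaultWindow(timeout_s=60.0)   # fresh, empty state after a restart

# "member-a" survived the downtime and keeps reporting.
window.on_health_event("member-a", now=100.0)

# "member-b" was deleted while the detector was down, so it never reports
# and never gets an entry in the map.
faulty = window.find_faulty(now=120.0)
print(faulty)
```

Because "member-b" has no entry in the map, the detector has nothing to time out, so the missing member is never reported as faulty in this model.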

Anyway..

Questions
- How should Stratos detect after some downtime which cartridges are still there and which ones aren’t? (what was the intended design?)
- Why did the missing cartridge go “active”? Is this a result of restoring persistent state? (If I look in the registry I can see entries under subscriptions/active, but I’m not sure if that’s where it comes from.)

- Who should be responsible for detecting the absence of an instance - the ClusterMonitor? That seems to be fed incorrect data, since it clearly thinks there are enough instances running. Which component has the necessary data?

- It looks like it’s possible to snapshot CEP state to make it semi-persistent. However, if I restarted Stratos after 2 min of downtime, wouldn’t it try to kill all the nodes, since the last reply was more than 60s ago? Also, snapshots would be periodic, so there’s still a window in which cartridges might “disappear”.
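The snapshot concern can be made concrete with a small Python sketch. It assumes a detector that flags any member whose last recorded health event is older than 60 seconds; the function `faulty_members`, the member ids and the timestamps are all hypothetical and do not reflect how CEP persistence is actually implemented.

```python
TIMEOUT_S = 60.0  # assumed fault timeout


def faulty_members(last_seen, now, timeout_s=TIMEOUT_S):
    """Return the ids whose last health event is older than the timeout."""
    return sorted(m for m, t in last_seen.items() if now - t > timeout_s)


# Snapshot persisted at t=1000 while both members were healthy.
snapshot = {"member-a": 995.0, "member-b": 998.0}

# Stratos comes back 120 seconds later and restores the snapshot as-is.
restored = dict(snapshot)
flagged_on_restart = faulty_members(restored, now=1120.0)
print(flagged_on_restart)
```

In this model every restored timestamp is already past the timeout at the moment of restart, so all members would be flagged before they have had a chance to report again.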

Thanks a lot and best regards!

Michiel

--
Imesh Gunaratne

Technical Lead, WSO2
Committer & PPMC Member, Apache Stratos

--
Best Regards,
Nirmal

Nirmal Fernando.
PPMC Member & Committer of Apache Stratos,
Senior Software Engineer, WSO2 Inc.

Blog: http://nirmalfdo.blogspot.com/

--
Reka Thirunavukkarasu
Senior Software Engineer,
WSO2, Inc.:http://wso2.com,
Mobile: +94776442007


--
Lakmal Warusawithana
Vice President, Apache Stratos
Director - Cloud Architecture; WSO2 Inc.

Mobile : +94714289692
Blog : http://lakmalsview.blogspot.com/

