Re: Member termination took 30 minutes

Rajkumar Rajaratnam Wed, 11 Feb 2015 17:28:29 -0800

Hi Sajith,

Please find my comments inline.

On Thu, Feb 12, 2015 at 1:06 AM, Sajith Kariyawasam <saj...@wso2.com> wrote:

> Hi Devs,
>
> While testing group scaling, I noticed when scaling down it takes 30
> minutes from the moment scaling rule decides to terminate an instance.
>
> An active member, which was selected by the rule, first moves to a
> "termination pending member map", and after a certain period
> (terminationPendingMemberExpiryTime) that member
> moves to an "obsolete member map". Then by the obsolete check rule, that
> member will be terminated via cloud controller.
>
> It seems that because of the property terminationPendingMemberExpiryTime,
> whose default value is 30 minutes, the member takes that long to get
> terminated.
>
> Sorry for asking, I might have missed some past discussions regarding
> this, could someone explain the purpose of moving the member to an
> intermediary map "termination pending member map", rather than moving
> directly to "obsolete member map"?
>

The reason is to avoid losing events and to allow graceful termination. Let
me explain the logic.

   - When scaling down, the autoscaler (AS) moves the member from the
   "active member list" to the "termination pending member map".
   - A Drools rule, "Cleanup Instances which are pending termination", runs
   periodically, takes all the members that are in the "termination pending
   member map", and publishes the instance clean up event for them.
   - When the cartridge agent (CA) receives the instance clean up event, it
   publishes the instance ready to shutdown event.
   - When the cloud controller (CC) receives the instance ready to shutdown
   event, it publishes the member ready to shutdown event.
   - When AS receives the member ready to shutdown event, it moves the
   member from the "termination pending member map" to the "obsolete member
   map".
   - Hence, until AS receives the member ready to shutdown event, it keeps
   publishing the instance clean up event in every cluster monitor interval
   (each time the Drools rule runs).
   - If AS does not receive the member ready to shutdown event for a member
   in the "termination pending member map" within 30 min (the upper limit),
   the member is moved to the "obsolete member map" without waiting for the
   member ready to shutdown event.
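
To make the steps above concrete, here is a minimal sketch of the AS side
of the cycle. It is not the actual Stratos code: the class, field, and
method names are all made up for illustration, time is passed in explicitly
instead of read from a clock, and publishing the instance clean up event is
reduced to incrementing a counter.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical model of the AS member maps; not the real Stratos classes. */
class TerminationFlowSketch {
    // Upper limit a member may stay in the pending map (30 min, as described above).
    static final long EXPIRY_MS = 30L * 60 * 1000;

    final Set<String> active = new HashSet<>();
    // memberId -> time (ms) at which the member entered the pending map
    final Map<String, Long> terminationPending = new HashMap<>();
    final Set<String> obsolete = new HashSet<>();
    // Stand-in for "instance clean up event published"
    int cleanupEventsPublished = 0;

    /** Scale-down decision: active -> termination pending. */
    void scaleDown(String memberId, long nowMs) {
        if (active.remove(memberId)) {
            terminationPending.put(memberId, nowMs);
        }
    }

    /** Runs once per cluster monitor interval. */
    void monitorTick(long nowMs) {
        // Iterate over a copy so entries can be removed while looping.
        for (Map.Entry<String, Long> e : new HashMap<>(terminationPending).entrySet()) {
            if (nowMs - e.getValue() >= EXPIRY_MS) {
                // Upper limit reached: stop waiting for the event.
                moveToObsolete(e.getKey());
            } else {
                // Re-publish the clean up event for every pending member.
                cleanupEventsPublished++;
            }
        }
    }

    /** Called when the member ready to shutdown event arrives. */
    void onMemberReadyToShutdown(String memberId) {
        moveToObsolete(memberId);
    }

    private void moveToObsolete(String memberId) {
        if (terminationPending.remove(memberId) != null) {
            obsolete.add(memberId);
        }
    }
}
```

In this sketch a member leaves the pending map in exactly two ways: the
member ready to shutdown event arrives, or the 30 min limit is reached.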

The reason for this complete cycle is graceful termination. If we put the
member directly into the "obsolete member map", it will not be terminated
gracefully.

The reason we move the member from the "active member list" to the
"termination pending member map" is to avoid losing events. We have had
situations where some event was lost in the above cycle. These events are
published only once. If we lose one event in this cycle, that member will
never be terminated. That is why we put the member in the map. In every
cluster monitor interval, we take all the members in the "termination
pending member map" and send the instance clean up event again. This
overcomes the lost event.
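
A toy illustration of why re-sending helps, assuming a channel that can
drop events. The names here are hypothetical and the "channel" is just a
predicate over the attempt number, not a real message broker:

```java
import java.util.function.IntPredicate;

/** Hypothetical illustration of publish-once versus publish-every-interval. */
class RepublishSketch {
    /**
     * Publishes once per monitor interval, for at most maxTicks intervals,
     * and stops at the first attempt that gets through.
     *
     * @param maxTicks  number of monitor intervals to try
     * @param delivered stand-in for the channel: true if the given attempt
     *                  (1-based) is delivered, false if it is lost
     * @return the attempt number that was delivered, or -1 if none was
     */
    static int publishUntilDelivered(int maxTicks, IntPredicate delivered) {
        for (int tick = 1; tick <= maxTicks; tick++) {
            if (delivered.test(tick)) {
                return tick;
            }
        }
        return -1;
    }
}
```

With a channel that loses the first two attempts, publishing only once
(maxTicks = 1) never gets through, while publishing in every interval gets
through on the third attempt.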

30 min is the upper limit, the maximum time a member can reside in the
"termination pending member map". You have hit the edge case where AS
didn't receive the member ready to shutdown event, so AS took 30 min to
move the member to the "obsolete member map".

>
> Also, is the terminationPendingMemberExpiryTime parameter configurable?
> (It seems not.) And is there any reason for it to be set to 30 minutes?
>

This is not configurable yet. But other member list/map expiry times are
configurable AFAIR.

>
> Further, we should make sleep times of  PendingMemberWatcher,
> ObsoletedMemberWatcher and TerminationPendingMemberWatcher configurable.
> WDYT?
>

Yes, we have to.

>
> We need to document those configurable parameters as well, @Mari please
> note.
>
>
> Thanks,
> Sajith
>
>
>

-- 
Rajkumar Rajaratnam
Committer & PMC Member, Apache Stratos
Software Engineer, WSO2

Mobile : +94777568639
Blog : rajkumarr.com
