[ 
https://issues.apache.org/jira/browse/STRATOS-1234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14487361#comment-14487361
 ] 

Shaheed Haque commented on STRATOS-1234:
----------------------------------------

Hi Imesh, Sandaruwan,

Here is a written-up proposal. I *think* it covers the various use cases 
suggested both here and in JIRA STRATOS-1234, but as always, your thoughts on 
the matter are welcome. The write-up has the form of a “spec” and a “Q&A”. As a 
next step, I guess we could do a hang-out or con-call or something?

Thoughts welcome…

Thanks, Shaheed

OPERATIONAL STATE COMMANDS

The following commands, with the defined effects, are needed:


·        No command directly affects what I call the “major state” of the 
Application/Group/Cluster/Cartridge, i.e. the state as reflected in the 
information CURRENTLY returned by application/{appId}/runtime.

·        Each command affects what I call the “operational state” only. The 
commands and their operational states are:

o   Autoscaling on, off. Autoscaling on is current behaviour.

o   Autohealing on, off. Autohealing on is current behaviour.

o   Maintenance off, restart, replace. Maintenance off is current behaviour.

o   (We can add more later if needed)

Command: Autoscaling off.

    Server effect: CEP gathers stats and history as usual. Autoscaler operates
    as usual, except that no scaling is done. Instead, a cluster state variable
    tracks the normal, overload or underload state and logs messages when this
    state variable changes value.

    Cartridge effect: No effect on running cartridges. No new cartridges are
    spun up, no existing cartridges are spun down EXCEPT for autohealing.

Command: Autohealing off.

    Server effect: CEP ignores any heartbeat timeout other than to log that it
    happened, and to set an instance state variable to track this. When
    autohealing is turned back on, the timeout will happen again, and the
    failure will be acted upon normally, except that the log shall make it
    clear (using the instance state variable) that the autohealing had been
    delayed.

    Cartridge effect: No new cartridges are spun up until after the autohealing
    is enabled.

Command: Maintenance restart.

    Server effect: Like autohealing off except that an extra state variable is
    set indicating maintenance mode is in effect. Both state variables are
    cleared when the Cartridge resume event is seen.

    Cartridge effect: Cartridge is signalled with an *event*, not a blocking
    callout. Cartridge application must be able to reboot or just restart, and
    have the cartridge agent resume its previous (active/inactive) state. When
    resuming, the agent signals the server with a resume *event*. Note this
    implies the cartridge agent is restartable (because the application can
    choose to reboot).

Command: Maintenance replace.

    Server effect: Like maintenance restart except that the cartridge instance
    is replaced.

    Cartridge effect: The difference between “restart” and “replace” is that
    the latter is for applications that cannot update themselves, but expect
    essentially a new VM instance with the new software. In other words, this
    is the big hammer/most general approach to upgrades (e.g. this is more
    likely to work than an apt-get downgrade ☺).
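
To make the server-side effects of "autoscaling off" and "autohealing off" a 
bit more concrete, here is a minimal Python sketch. It is only an illustration 
under my own assumptions: the class, field and method names are hypothetical 
and are not existing Stratos code, and the real CEP/Autoscaler logic would 
look different.

```python
import logging
from dataclasses import dataclass, field
from enum import Enum
from typing import Set

log = logging.getLogger("operational_state")


class Load(Enum):
    NORMAL = "normal"
    OVERLOAD = "overload"
    UNDERLOAD = "underload"


@dataclass
class ClusterOperationalState:
    # Hypothetical model; the defaults reproduce the current behaviour.
    autoscaling: bool = True
    autohealing: bool = True
    load: Load = Load.NORMAL
    # Instances whose heartbeat timeout was seen while autohealing was off.
    deferred_heals: Set[str] = field(default_factory=set)

    def on_load_evaluated(self, new_load: Load) -> bool:
        """Track the load state; return True only if scaling should happen."""
        if new_load != self.load:
            log.info("cluster load: %s -> %s", self.load.value, new_load.value)
            self.load = new_load
        return self.autoscaling and new_load != Load.NORMAL

    def on_heartbeat_timeout(self, instance_id: str) -> bool:
        """Return True if the failed instance should be healed now."""
        if not self.autohealing:
            # Only log and remember; the timeout will fire again later.
            log.info("heartbeat timeout for %s, autohealing is off", instance_id)
            self.deferred_heals.add(instance_id)
            return False
        if instance_id in self.deferred_heals:
            log.info("healing %s; autohealing had been delayed", instance_id)
            self.deferred_heals.discard(instance_id)
        return True
```

With autoscaling off, on_load_evaluated() still records and logs the 
overload/underload transitions but always answers "do not scale".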



·        Each command referred to here is a REST API call.

·        Each command can apply to an entire Application, or any nested level 
(group or cartridge) within it.

·        Arguments for application-wide use case:

o   application={appId}, operationalState={command}

·        Arguments for nested-level use case:

o   application={appId}, nesting={0}/{1}/{2}/…/{n}, operationalState={command}
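
As a rough sketch of how a client might assemble these arguments (in Python): 
the argument names application, nesting and operationalState come from the 
list above, but the command spellings such as "autoscaling-off" and the 
function name are my own assumptions, since the exact REST syntax is not yet 
defined.

```python
from typing import Dict, Optional, Sequence

# Hypothetical spellings for the commands listed in the spec.
COMMANDS = {
    "autoscaling-on", "autoscaling-off",
    "autohealing-on", "autohealing-off",
    "maintenance-off", "maintenance-restart", "maintenance-replace",
}


def operational_state_args(
    app_id: str,
    command: str,
    nesting: Optional[Sequence[str]] = None,
) -> Dict[str, str]:
    """Build the arguments for an application-wide or nested-level call."""
    if command not in COMMANDS:
        raise ValueError(f"unknown operational state command: {command!r}")
    args = {"application": app_id, "operationalState": command}
    if nesting:
        # Nested level is addressed as {0}/{1}/.../{n} below the application.
        args["nesting"] = "/".join(nesting)
    return args
```

Omitting nesting gives the application-wide form; passing e.g. 
["group1", "cartridgeA"] gives the nested-level form.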

Q&A


1.      What’s the point of restart/replace, over and above auto* off?



These are to actually make the application software in the VM instance take 
some action. Typically, I would expect this to result in an 
internally-managed software update. For example, think of VMs running Ubuntu 
and pointing to a known repository of, say, security patches: they could all 
just do an “apt-get update/upgrade”.



The Cartridge logic is defined to be event-based rather than blocking, because 
making the thing blocking would be a problem if a reboot was involved. (Also, 
generally, blocking operations in a distributed system raise too many edge 
cases like: can this operation be cancelled? Repeated? etc.).
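
To illustrate the event-based flow, here is a minimal Python sketch of a 
restartable agent. Everything in it is an assumption for illustration only: 
the class name, the JSON state file and the shape of the resume event are 
hypothetical, not the real cartridge agent.

```python
import json
from pathlib import Path
from typing import Callable


class CartridgeAgent:
    """Hypothetical agent that survives a restart or reboot."""

    def __init__(self, state_file: Path, publish: Callable[[dict], None]):
        self.state_file = state_file
        self.publish = publish  # sends an event to the server
        self.active = True

    def on_maintenance_restart(self, update: Callable[[], None]) -> None:
        # Persist the active/inactive state first, because update() may
        # reboot the VM and never return. The server is not blocked on us.
        self.state_file.write_text(json.dumps({"active": self.active}))
        update()

    def on_startup(self) -> None:
        # After a restart, restore the previous state and tell the server.
        if self.state_file.exists():
            self.active = json.loads(self.state_file.read_text())["active"]
            self.state_file.unlink()
            self.publish({"event": "resume", "active": self.active})
```

The server never waits on a call into the cartridge; it just clears its state 
variables whenever the resume event eventually arrives.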


2.      Propagation/inheritance rules



I see two options:



·        Use hierarchy. If you apply a command at hierarchy level n, and n has 
internal structure (i.e. it is a group not a cartridge), the command propagates 
all the way down (note: this is implied in what I said for the application 
level command).

·        Do not use hierarchy. The command only applies to the level to which 
it was addressed by the REST call.

In either case, the effect of contradictory commands is UNDEFINED, i.e. 
toggling the flags in quick succession will likely result in an unhelpful 
outcome.

I think the normal approach is NOT to use hierarchy; after all, just because 
there is an upgrade to be applied for application code in a given set of VMs, 
there is nothing to say that any elements lower down the hierarchy should be 
upgraded at the same time. Even in the case where (say) security patches to a 
common OS are to be applied, I would doubt the sanity of anybody doing this 
across every VM in the whole system in one go ☺. OTOH, maybe I am wrong!
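
The two propagation options can be sketched as follows (Python). The Node 
type and function names are hypothetical, purely to show which levels a 
command would touch in each case.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """An application, group or cartridge in the nesting hierarchy."""
    name: str
    children: List["Node"] = field(default_factory=list)


def find(root: Node, nesting: List[str]) -> Node:
    """Walk down from the application; raises StopIteration if not found."""
    node = root
    for name in nesting:
        node = next(c for c in node.children if c.name == name)
    return node


def affected(root: Node, nesting: List[str], use_hierarchy: bool) -> List[str]:
    """Names of the levels that a command addressed at `nesting` applies to."""
    target = find(root, nesting)
    if not use_hierarchy:
        return [target.name]
    names, stack = [], [target]
    while stack:  # depth-first walk over everything below the target
        node = stack.pop()
        names.append(node.name)
        stack.extend(reversed(node.children))
    return names
```

With hierarchy, a command addressed at a group reaches every cartridge under 
it; without hierarchy, only the addressed level is returned.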


3.      Should these commands apply to “deployed” or only to “configured” 
Applications?



I think the commands can be applied whether the Application is deployed or 
not. Clearly, the stuff that sets flags on instances has to set those flags on 
all current and future instances that may spin up under a given deployment.
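
A tiny Python sketch of the "current and future instances" rule, with 
hypothetical names throughout (this is not how Stratos stores its state):

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Deployment:
    """Hypothetical holder for operational flags and per-instance copies."""
    flags: Dict[str, str] = field(default_factory=dict)
    instances: Dict[str, Dict[str, str]] = field(default_factory=dict)

    def set_flag(self, name: str, value: str) -> None:
        # Works before deployment too: there are simply no instances yet.
        self.flags[name] = value
        for instance_flags in self.instances.values():
            instance_flags[name] = value

    def spin_up(self, instance_id: str) -> Dict[str, str]:
        # A new instance starts with a copy of the flags already in effect.
        self.instances[instance_id] = dict(self.flags)
        return self.instances[instance_id]
```

Setting a flag before any instance exists is harmless, and any instance that 
spins up later starts with the flags already in effect.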



From: Imesh Gunaratne [mailto:im...@apache.org]
Sent: 27 March 2015 04:21
To: dev
Subject: Re: Maintenance modes (was RE: [jira] [Commented] (STRATOS-1234) 
Software Update Management Solution for Stratos)

Hi Shaheed,

A really good suggestion! I think we could manage what you have suggested in 
the same implementation, as they overlap. I'm +1 for the idea of putting a 
cluster into the "Maintenance Mode" manually for diagnostic purposes and 
stopping autoscaling for it. We could introduce new API methods to manage 
this. The only question is whether we could use the same instance state for 
all the scenarios:

1. Update platform (might need to use the term platform here as it may get 
confused with the software that may run on the platform)
2. Apply patches
3. Pause a cluster for diagnostic purposes

I would like to suggest changing the updateSoftware API method to 
updatePlatform:
POST /applications/{applicationId}/updatePlatform

Maybe we could introduce a new API method as follows to put a cluster into 
"Maintenance/Diagnostic Mode":
POST /clusters/{clusterId}/pause

Thanks
Imesh

On Thu, Mar 26, 2015 at 3:01 PM, Shaheedur Haque (shahhaqu) 
<shahh...@cisco.com> wrote:

First, let me say that I like a lot of what is proposed in this JIRA, but I am 
forking the thread here because I would like to suggest that we generalise just 
one part of it, the API into Stratos to cover a set of related use cases.

In the current version of this JIRA, the proposed API into Stratos looks like 
this:

PUT /api/applications/{applicationId}/updateSoftware

(see the JIRA section 2.3 for the details). I think this is actually one of a 
set of possible runtime states that we would like to put VM instances and 
various parts of Stratos in. Notice that I am deliberately not using specific 
terms such as "cluster" or "Autoscaler" because working that out is the point 
of this email.

So, the sorts of use cases I have in mind are:

  *   Updating the cartridge software as per this JIRA
  *   Putting a cluster (or maybe an instance) into a "maintenance mode" for 
diagnostic reasons. There could be multiple versions of this maintenance mode 
where (for example)

     *   The instance(s) might still handle traffic and deliver "I'm alive" 
health stats but no autoscaling is done.
     *   The instance(s) don't deliver health stats but no health stats

  *   Some of these would deliver notifications to the cartridge agent, others 
might only affect Stratos component(s).
  *   etc...other ideas anybody?

Thus, it might make sense to generalise the API to support a set of closely 
related cases. Is there interest in taking such an approach to address this 
JIRA as well as clarifying and addressing the other use cases?


Thanks, Shaheed



>  Software Update Management Solution for Stratos
> ------------------------------------------------
>
>                 Key: STRATOS-1234
>                 URL: https://issues.apache.org/jira/browse/STRATOS-1234
>             Project: Stratos
>          Issue Type: New Feature
>            Reporter: Imesh Gunaratne
>              Labels: gsoc2015, mentor
>
> Stratos uses Virtual Machines and Containers for hosting platform services on 
> different Infrastructure as a Service (IaaS) solutions. At present Puppet is 
> used for orchestration management on Virtual Machine based systems and 
> manages all required software in Puppet Master. Container based systems 
> create Docker images for each platform service by including required 
> software in the Docker image itself.
> In Virtual Machine use-case VM instances will communicate with Puppet master 
> and execute the software installation. The same approach can be used for 
> applying software updates. 
> In Docker use-case we do not use Puppet because a new container with required 
> software can be started in a few seconds. This is very efficient compared to 
> using Puppet and installing software on demand.
> The requirement of this project is to implement a core Stratos feature to 
> propagate software updates in a live PaaS environment.
> 1. Puppet based solution:
> - Push software updates of a cartridge to Puppet Master (might not need to 
> automate).
> - Invoke the software update process via the Stratos API for a given 
> application.
> - Stratos Manager could send a new event to trigger puppet agent in each 
> instance to apply the updates.
> 2. Docker based solution
> - Create a new docker image (with a new image id) for the cartridge with 
> software updates (might not need to automate).
> - Invoke the software update process via the Stratos API for a given 
> application.
> - Autoscaler can implement a new feature to bring down existing instances and 
> create new instances with the new docker image id.
> Important!
> - In each scenario if updates are backward compatible, software update 
> process should execute in phases, it should not bring down the entire cluster 
> to apply the updates. If so the service will be unavailable for a certain 
> time period. The idea is to apply the updates to set of members at a time.
> - If the updates are not backward compatible, we could make the entire 
> cluster unavailable at once and apply the updates.
> - Member's state needs to be changed to a new state called "Updating" when 
> applying the updates.
> If there is an interest in doing this project please send a mail to imesh at 
> apache dot org, copying the Apache Dev mailing list [1]. Please refer to the 
> Stratos Wiki [2] for more information on Stratos architecture and how it works.
> [1] http://stratos.apache.org/community/mailing-lists.html
> [2] https://cwiki.apache.org/confluence/display/STRATOS



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
