Aled Sage created BROOKLYN-16:
---------------------------------

             Summary: Quarantine group: improve functionality and usability
                 Key: BROOKLYN-16
                 URL: https://issues.apache.org/jira/browse/BROOKLYN-16
             Project: Brooklyn
          Issue Type: Improvement
            Reporter: Aled Sage


I'd like us to clean up the behaviour and appearance of the "quarantine group" 
of clusters. My recent experience with some enterprise users highlights that 
it's confusing!

The configuration "dynamiccluster.quarantineFailedEntities" controls whether 
failed members of the cluster should be quarantined, or just deleted straight 
away.

Unquarantining
-------------------
Once an entity goes into quarantine, there is currently no way to get it out 
again (except deleting or discarding the entity).

However, it is good that we don't unquarantine nodes automatically (e.g. on the 
entity going to service-up again) because the entity may have been quarantined 
for good reason, such as going up and down.

PROPOSAL 1: We should have an explicit effector on the quarantine group entity 
to move the member back into the cluster's group of healthy members.

PROPOSAL 2: We should add a dynamic effector to each member of the quarantined 
group for "restoreFromQuarantine", which would add the member back into the 
cluster's group.
A user could invoke this effector by selecting the member in the web-console.

PROPOSAL 3: We could add an effector "restartMembers(boolean parallel)" on the 
quarantine group. Invoking this would restart the process for each member of 
the quarantine group. If parallel==true then this would be done in parallel, 
otherwise one member at a time.
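
A minimal sketch of how the proposed restartMembers(boolean parallel) could 
behave. The class RestartMembersSketch and the interface Restartable are 
hypothetical stand-ins, not Brooklyn types, and a real effector would 
presumably use Brooklyn's own task framework rather than CompletableFuture 
(which also assumes Java 8 or later).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hypothetical sketch only; not Brooklyn's real API.
class RestartMembersSketch {

    // Stand-in for a quarantined member that can be restarted.
    interface Restartable {
        void restart();
    }

    // Restarts every member, either one at a time or all concurrently.
    static void restartMembers(List<? extends Restartable> members, boolean parallel) {
        if (!parallel) {
            // Sequential: the next restart begins only when the previous one returns.
            for (Restartable member : members) {
                member.restart();
            }
            return;
        }
        // Parallel: submit every restart, then wait for all of them to finish.
        List<CompletableFuture<Void>> futures = new ArrayList<>();
        for (Restartable member : members) {
            futures.add(CompletableFuture.runAsync(member::restart));
        }
        // join() throws CompletionException if any restart failed.
        CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
    }
}
```

In both modes the call returns only after every member has been restarted, so 
the caller can inspect the members' state afterwards.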

PROPOSAL 4: We should have an explicit effector on the cluster to quarantine a 
member.
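
Proposals 1, 2 and 4 all amount to explicit, user-triggered moves between the 
healthy group and the quarantine group. A rough sketch of that behaviour 
follows; the class QuarantineSketch and its methods are hypothetical and track 
members by a plain string id rather than by real Brooklyn entities.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical stand-in for a cluster and its quarantine group;
// not Brooklyn's real classes.
class QuarantineSketch {
    private final Set<String> healthy = new LinkedHashSet<>();
    private final Set<String> quarantined = new LinkedHashSet<>();

    void addHealthy(String memberId) {
        healthy.add(memberId);
    }

    // Proposal 4: explicitly quarantine a healthy member.
    // Returns false if the member is not currently in the healthy group.
    boolean quarantine(String memberId) {
        if (!healthy.remove(memberId)) {
            return false;
        }
        quarantined.add(memberId);
        return true;
    }

    // Proposals 1 and 2: explicitly move a member back to the healthy group.
    // Nothing calls this automatically, e.g. on service-up.
    boolean restoreFromQuarantine(String memberId) {
        if (!quarantined.remove(memberId)) {
            return false;
        }
        healthy.add(memberId);
        return true;
    }

    Set<String> healthyMembers() {
        return healthy;
    }

    Set<String> quarantinedMembers() {
        return quarantined;
    }
}
```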


QuarantineGroup.expungeMembers
----------------------------------------------
There is an expungeMembers effector on the quarantine group. This takes a 
single parameter of "boolean firstStop", which controls whether it calls 
entity.stop() before unmanaging each entity.

The parameter name is confusing. Also, the behaviour is very different for 
the two parameter values, so it potentially deserves two separate effectors.

Note this feels related to the "expunge" operation under the "lifecycle" tab of 
the web-console. There, it brings up a modal dialog with "Unmanage an entity and 
(optionally) clean up resources, such as releasing a VM" and a checkbox for 
"Release resources".
The user feedback there was that it isn't the behaviour they expected when 
clicking "expunge", and that the behaviour was so different with the box ticked 
or unticked that it deserved two different operations.

PROPOSAL 5: replace the existing effector with two effectors: 
`unmanageMembers()` would just unmanage the entities without stopping or 
freeing the resources; `stopAndUnmanageMembers()` would first release the 
resources of each member (e.g. VMs etc, by calling entity.stop) and would then 
unmanage each.
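
A rough sketch of how the two effectors from Proposal 5 might differ. The 
class ExpungeSketch and the interface Member are hypothetical, and "unmanage" 
is modelled here simply as removing the member from the group's list.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only; not Brooklyn's real QuarantineGroup.
class ExpungeSketch {

    // Stand-in for an entity whose stop() releases its resources (e.g. a VM).
    interface Member {
        void stop();
    }

    private final List<Member> members = new ArrayList<>();

    void add(Member member) {
        members.add(member);
    }

    int size() {
        return members.size();
    }

    // Unmanages every member without stopping it or freeing its resources.
    // Returns the members that were removed.
    List<Member> unmanageMembers() {
        List<Member> removed = new ArrayList<>(members);
        members.clear();
        return removed;
    }

    // Stops every member first (releasing its resources), then unmanages it.
    List<Member> stopAndUnmanageMembers() {
        for (Member member : members) {
            member.stop();
        }
        return unmanageMembers();
    }
}
```

With two names, the caller states up front whether resources are released, 
rather than relying on a boolean flag.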


Quarantine alternative
----------------------------
In our use-case, we're using docker. What we really want for this kind of 
failed node is to... generate a dump of the running process, and then stop the 
container (thus preserving the disk). We want the entity to be discarded from 
the cluster.

PROPOSAL 6: Add another config option to DynamicCluster for 
failedEntityHandler. This would take an instance of something like:

    public interface FailedEntityHandler {
        public enum HandlerResponse {
            DISCARD_ENTITY,
            STOP_AND_DISCARD_ENTITY,
            ADD_TO_QUARANTINE,
            KEEP_IN_GROUP;
        }
       
        HandlerResponse onFailedEntity(DynamicCluster cluster, Entity failedMember);
    }
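
As an illustration, a handler for the docker use-case above might look 
something like the following. Entity and DynamicCluster are redeclared as 
minimal stand-ins so the sketch compiles on its own. DumpThenStopHandler and 
Dumper are hypothetical names, and falling back to quarantine when the dump 
fails is an assumption rather than part of the proposal.

```java
// Stand-in types so the sketch is self-contained; in Brooklyn these would
// be the real Entity and DynamicCluster.
interface Entity {
    String getId();
}

interface DynamicCluster extends Entity {
}

// The interface proposed above, repeated so the sketch compiles alone.
interface FailedEntityHandler {
    enum HandlerResponse {
        DISCARD_ENTITY,
        STOP_AND_DISCARD_ENTITY,
        ADD_TO_QUARANTINE,
        KEEP_IN_GROUP
    }

    HandlerResponse onFailedEntity(DynamicCluster cluster, Entity failedMember);
}

// Hypothetical handler: dump the running process, then ask the cluster to
// stop and discard the member.
class DumpThenStopHandler implements FailedEntityHandler {

    // Hypothetical hook that generates the process dump; returns true on success.
    interface Dumper {
        boolean dump(Entity failedMember);
    }

    private final Dumper dumper;

    DumpThenStopHandler(Dumper dumper) {
        this.dumper = dumper;
    }

    @Override
    public HandlerResponse onFailedEntity(DynamicCluster cluster, Entity failedMember) {
        boolean dumped = dumper.dump(failedMember);
        // Assumption: if no dump could be taken, quarantine the member so it
        // can still be inspected by hand.
        return dumped
                ? HandlerResponse.STOP_AND_DISCARD_ENTITY
                : HandlerResponse.ADD_TO_QUARANTINE;
    }
}
```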


Visualization
----------------
Currently, if an entity is quarantined, the entity tree (in the web-console) 
shows a "quarantine group" underneath (i.e. as a child of) the cluster.

All entities in the cluster (be they members of the quarantine group or healthy 
members of the cluster) appear under the cluster itself. This is because their 
*parent* is the cluster. An entity's parent never changes. What the user is 
really interested in here is seeing the group membership.

There's a separate conversation to be had (or resurrected) about visualising 
groups (and other relationships) in the web-console. This use-case should be 
considered there.




--
This message was sent by Atlassian JIRA
(v6.2#6252)
