Re: [users] Avoid rebooting payload modules after losing system controller

Mathivanan Naickan Palanivelu Tue, 13 Oct 2015 06:56:05 -0700

I don't think this is a case of cattle! Even in those scenarios, in the
cloud management stacks, the "controller" software itself is 'placed' on
physical nodes in appropriate redundancy models and not inside those
cattle VMs!


I think the case here is about avoiding a reboot of the node!
This can be achieved by setting the NOACTIVE timer to a longer value, until
OpenSAF on the controller comes back up.
Upon detecting that the controllers are up, some entity on the local node
restarts OpenSAF (/etc/init.d/opensafd restart).
The CLC-CLI scripts of the applications must then differentiate a usual
restart from this spoof-restart!
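
A minimal sketch of that "some entity on the local node", under stated
assumptions: how to detect that the controllers are back is left to a
caller-supplied check (controllers_up), and the marker file that the CLC-CLI
scripts would look at is a hypothetical convention of mine, not something
OpenSAF defines. Only the restart command itself comes from the text above.

```python
import subprocess
import time
from pathlib import Path
from typing import Callable, Sequence

# Hypothetical marker file: CLC-CLI scripts could test for it to tell a
# spoof-restart from a usual restart. This is NOT an OpenSAF convention.
SPOOF_MARKER = Path("/var/run/opensaf_spoof_restart")

# Restart command as quoted in the mail above.
RESTART_CMD = ["/etc/init.d/opensafd", "restart"]


def wait_then_restart(
    controllers_up: Callable[[], bool],
    run: Callable[[Sequence[str]], int] = subprocess.call,
    marker: Path = SPOOF_MARKER,
    poll_seconds: float = 5.0,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the controllers are back, then restart OpenSAF locally.

    Returns True if the restart command was issued and exited with 0.
    Returns False if the controllers never came back within max_polls
    attempts, or if the restart command reported a failure.
    """
    for _ in range(max_polls):
        if controllers_up():
            # Leave a marker so application scripts can recognise the
            # spoof-restart; removing it is left to whoever consumes it.
            marker.touch()
            return run(RESTART_CMD) == 0
        sleep(poll_seconds)
    return False
```

The total waiting time (poll_seconds * max_polls) would have to stay below
whatever the NOACTIVE timer is set to, otherwise the node reboots before the
restart is attempted. The check passed as controllers_up could be anything
from a ping to something specific to your transport; that part is left open
here.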

Mathi.

> -----Original Message-----
> From: Anders Widell [mailto:[email protected]]
> Sent: Tuesday, October 13, 2015 5:36 PM
> To: Anders Björnerstedt; Tony Hart
> Cc: [email protected]
> Subject: Re: [users] Avoid rebooting payload modules after losing system
> controller
> 
> Yes, I agree that the best fit for this feature is an application using
> either the NWay-Active or the No-Redundancy models, and where you view the
> system more as a collection of nodes rather than as a cluster. This kind of
> architecture is quite common when you write applications for cloud. The
> redundancy models are suitable for scaling, and the architecture fits into the
> "cattle" philosophy which is common in cloud.
> Such an application can tolerate any number of node failures, and the
> remaining nodes would still be able to continue functioning and provide their
> service. However, if we put the OpenSAF middleware on the nodes it
> becomes the weakest link, since OpenSAF will reboot all the nodes just
> because the two controller nodes fail. What a pity on a system with one
> hundred nodes!
> 
> / Anders Widell
> 
> On 10/13/2015 01:19 PM, Anders Björnerstedt wrote:
> >
> >
> > On 10/13/2015 12:27 PM, Anders Widell wrote:
> >> The possibility to have more than two system controllers (one active
> >> + several standby and/or spare controller nodes) is also something
> >> that has been investigated. For scalability reasons, we probably
> >> can't turn all nodes into standby controllers in a large cluster -
> >> but it may be feasible to have a system with one or several standby
> >> controllers and the rest of the nodes are spares that are ready to
> >> take an active or standby assignment when needed.
> >>
> >> However, the "headless" feature will still be needed in some systems
> >> where you need dedicated controller node(s).
> >
> > That sounds as if some deployments have a special requirement that can
> > only be supported by the headless feature.
> > But you also have to say that the headless feature places
> > anti-requirements on the deployments/applications that are to use it.
> >
> > For example not needing cluster coherence among the payloads.
> >
> > If the payloads only run independent application instances where each
> > instance is implemented at one processor or at least does not
> > communicate in any state-sensitive way with peer processes at other
> > payloads; and no such instance is unique or if it is unique it is
> > still expendable (non critical to the service), then it could work.
> >
> > > It is important that the deployments that end up thinking they need the
> > > headless feature also understand what they lose with the headless
> > > feature and that this loss is acceptable for that deployment.
> >
> > So headless is not a fancy feature needed by some exclusive and picky
> > subset of applications.
> > It is a relaxation that drops all requirements on distributed
> > consistency and may be acceptable to some applications with weaker
> > demands so they can accept the anti requirements.
> >
> > Besides requiring "dedicated" controller nodes, the deployment must of
> > course NOT require any *availability* of those dedicated controller
> > nodes, i.e. not have any requirements on service availability in
> > general.
> >
> > > It may work for some "dumb" applications that are stateless, or state
> > > stable (frozen in state), or have no requirements on availability of
> > > state. In other words, some applications that really don't need SAF.
> >
> > They may still want to use SAF as a way of managing and monitoring the
> > > system when it happens to be healthy, but can live with long periods
> > of not being able to manage or monitor that system, which can then be
> > degrading in any way that is possible.
> >
> >
> > /AndersBJ
> >
> >
> >
> >>
> >> / Anders Widell
> >>
> >> On 10/13/2015 12:07 PM, Tony Hart wrote:
> >>> Understood.  The assumption is that this is temporary but we allow
> >>> the payloads to continue to run (with reduced osaf functionality)
> >>> until a replacement controller is found.  At that point they can
> >>> reboot to get the system back into sync.
> >>>
> >>> Or allow more than 2 controllers in the system so we can have one or
> >>> more usually-payload cards be controllers to reduce the probability
> >>> of no-controllers to an acceptable level.
> >>>
> >>>
> >>>> On Oct 12, 2015, at 11:05 AM, Anders Björnerstedt
> >>>> <[email protected]> wrote:
> >>>>
> >>>> The headless state is also vulnerable to split-brain scenarios.
> >>>> That is network partitions and joins can occur and will not be
> >>>> detected as such and thus not handled properly (isolated) when they
> >>>> occur.
> > >>>> Basically you cannot be sure you have a continuously coherent
> > >>>> cluster while in the headless state.
> > >>>>
> > >>>> On paper you may get a very resilient system in the sense that it
> > >>>> "stays up" and replies to ping etc.
> >>>> But typically a customer wants not just availability but reliable
> >>>> behavior also.
> >>>>
> >>>> /AndersBj
> >>>>
> >>>>
> >>>> -----Original Message-----
> > >>>> From: Anders Björnerstedt [mailto:[email protected]]
> >>>> Sent: den 12 oktober 2015 16:42
> >>>> To: Anders Widell; Tony Hart; [email protected]
> >>>> Subject: Re: [users] Avoid rebooting payload modules after losing
> >>>> system controller
> >>>>
> > >>>> Note that this headless variant is a very questionable feature.
> > >>>> This is for the reasons explained earlier, i.e. you *will* get a
> > >>>> reduction in service availability.
> > >>>> It was never accepted into OpenSAF for that reason.
> >>>>
> > >>>> On top of that, the unreliability will typically not be
> > >>>> explicit/handled. That is, the operator will probably not even know
> > >>>> what is working and what is not during the SC absence since the
> > >>>> alarm/notification function is gone. No OpenSAF director services
> > >>>> are executing.
> >>>>
> >>>> It is truly a headless system, i.e. a zombie system and thus not
> >>>> working at full monitoring and availability functionality.
> > >>>> It begs the question of what OpenSAF and SAF are there for in the
> > >>>> first place.
> >>>>
> >>>> The SCs don’t have to run any special software and don’t have to
> >>>> have any special hardware.
> >>>> They do need file system access, at least for a cluster restart,
> >>>> but not necessarily to handle single SC failure.
> > >>>> The headless variant, when headless, is also in that
> > >>>> not-able-to-cluster-restart state, but with even less functionality.
> >>>>
> >>>> An SC can of course run other (non OpenSAF specific) software.  And
> >>>> the two SCs don’t necessarily have to be symmetric in terms of
> >>>> software.
> >>>>
> >>>> Providing file system access via NFS is typically a non issue. They
> > >>>> have three nodes. Ergo they should be able to assign two of them
> > >>>> the role of SC in the OpenSAF domain.
> >>>>
> >>>> /AndersBj
> >>>>
> >>>> -----Original Message-----
> >>>> From: Anders Widell [mailto:[email protected]]
> >>>> Sent: den 12 oktober 2015 16:08
> >>>> To: Tony Hart; [email protected]
> >>>> Subject: Re: [users] Avoid rebooting payload modules after losing
> >>>> system controller
> >>>>
> >>>> We have actually implemented something very similar to what you are
> >>>> talking about. With this feature, the payloads can survive without
> >>>> a cluster restart even if both system controllers restart (or the
> >>>> single system controller, in your case). If you want to try it out,
> >>>> you can clone this Mercurial repository:
> >>>>
> >>>> https://sourceforge.net/u/anders-w/opensaf-headless/
> >>>>
> > >>>> To enable the feature, set the variable IMMSV_SC_ABSENCE_ALLOWED in
> > >>>> immd.conf to the number of seconds you wish the payloads to wait
> > >>>> for the system controllers to come back. Note: we have only
> > >>>> implemented this feature for the "core" OpenSAF services (plus
> > >>>> CKPT), so you need to disable the optional services.
> >>>>
> >>>> / Anders Widell
> >>>>
> >>>> On 10/11/2015 02:30 PM, Tony Hart wrote:
> >>>>> We have been using opensaf in our product for a couple of years
> >>>>> now.  One of the issues we have is the fact that payload cards
> >>>>> reboot when the system controllers are lost.  Although our payload
> >>>>> card hardware will continue to perform its functions whilst the
> >>>>> software is down (which is desirable) the functions that the
> >>>>> software performs are obviously not performed (which is not
> >>>>> desirable).
> >>>>>
> > >>>>> Why would we lose both controllers, surely that is a rare
> >>>>> circumstance?  Not if you only have one controller to begin with.
> >>>>> Removing the second controller is a significant cost saving for us
> >>>>> so we want to support a product that only has one controller.  The
> >>>>> most significant impediment to that is the loss of payload
> >>>>> software functions when the system controller fails.
> >>>>>
> >>>>> I’m looking for suggestions from this email list as to what could
> >>>>> be done for this issue.
> >>>>>
> >>>>> One suggestion, that would work for us, is if we could convince
> >>>>> the payload card to only reboot when the controller reappears
> >>>>> after a loss rather than when the loss initially occurs.  Is that
> >>>>> possible?
> >>>>>
> >>>>> Another possibility is if we could support more than 2
> >>>>> controllers, for example if we could support 4 (one active and 3
> >>>>> standbys) that would also provide a solution for us (our current
> >>>>> payloads would instead become controllers). I know that this is
> >>>>> not currently possible with opensaf.
> >>>>>
> >>>>> thanks for any suggestions,
> >>>>> —
> >>>>> tony
> >>>>> ------------------------------------------------------------------
> >>>>> ----
> >>>>>
> >>>>> -------- _______________________________________________
> >>>>> Opensaf-users mailing list
> >>>>> [email protected]
> >>>>> https://lists.sourceforge.net/lists/listinfo/opensaf-users
> >>>>
> >>>>
> >>>> -------------------------------------------------------------------
> >>>> -----------
> >>>>
> >>>> _______________________________________________
> >>>> Opensaf-users mailing list
> >>>> [email protected]
> >>>> https://lists.sourceforge.net/lists/listinfo/opensaf-users
> >>>> -------------------------------------------------------------------
> >>>> -----------
> >>>>
> >>>> _______________________________________________
> >>>> Opensaf-users mailing list
> >>>> [email protected]
> >>>> https://lists.sourceforge.net/lists/listinfo/opensaf-users
> >>
> >>
> >
> >
> >
> 
> 
> 

