I don't think this is a case of cattle! Even in those scenarios, the cloud management stacks (the "controller" software itself) are placed on physical nodes in appropriate redundancy models, not inside those cattle VMs!
I think the case here is about avoiding a reboot of the node! This can be achieved by setting the NOACTIVE timer to a longer value, until OpenSAF on the controller comes back up. Upon detecting that the controllers are up, some entity on the local node restarts OpenSAF (/etc/init.d/opensafd restart). Also ensure that the CLC-CLI scripts of the applications differentiate a usual restart from this spoof-restart!

Mathi.

> -----Original Message-----
> From: Anders Widell [mailto:[email protected]]
> Sent: Tuesday, October 13, 2015 5:36 PM
> To: Anders Björnerstedt; Tony Hart
> Cc: [email protected]
> Subject: Re: [users] Avoid rebooting payload modules after losing system controller
>
> Yes, I agree that the best fit for this feature is an application using either the NWay-Active or the No-Redundancy models, and where you view the system more as a collection of nodes rather than as a cluster. This kind of architecture is quite common when you write applications for cloud. The redundancy models are suitable for scaling, and the architecture fits into the "cattle" philosophy which is common in cloud.
> Such an application can tolerate any number of node failures, and the remaining nodes would still be able to continue functioning and provide their service. However, if we put the OpenSAF middleware on the nodes it becomes the weakest link, since OpenSAF will reboot all the nodes just because the two controller nodes fail. What a pity on a system with one hundred nodes!
>
> / Anders Widell
>
> On 10/13/2015 01:19 PM, Anders Björnerstedt wrote:
> >
> > On 10/13/2015 12:27 PM, Anders Widell wrote:
> >> The possibility to have more than two system controllers (one active + several standby and/or spare controller nodes) is also something that has been investigated.
> >> For scalability reasons, we probably can't turn all nodes into standby controllers in a large cluster - but it may be feasible to have a system with one or several standby controllers and the rest of the nodes as spares that are ready to take an active or standby assignment when needed.
> >>
> >> However, the "headless" feature will still be needed in some systems where you need dedicated controller node(s).
> >
> > That sounds as if some deployments have a special requirement that can only be supported by the headless feature.
> > But you also have to say that the headless feature places anti-requirements on the deployments/applications that are to use it.
> >
> > For example not needing cluster coherence among the payloads.
> >
> > If the payloads only run independent application instances where each instance is implemented at one processor or at least does not communicate in any state-sensitive way with peer processes at other payloads; and no such instance is unique or if it is unique it is still expendable (non critical to the service), then it could work.
> >
> > It is important that the deployments that end up thinking they need the headless feature also understand what they lose with the headless feature and that this loss is acceptable for that deployment.
> >
> > So headless is not a fancy feature needed by some exclusive and picky subset of applications.
> > It is a relaxation that drops all requirements on distributed consistency and may be acceptable to some applications with weaker demands so they can accept the anti requirements.
> >
> > Besides requiring "dedicated" controller nodes, the deployment must of course NOT require any *availability* of those dedicated controller nodes, i.e. not have any requirements on service availability in general.
> >
> > It may work for some "dumb" applications that are stateless, or state stable (frozen in state), or have no requirements on availability of state. In other words, some applications that really don't need SAF.
> >
> > They may still want to use SAF as a way of managing and monitoring the system when it happens to be healthy, but can live with long periods of not being able to manage or monitor that system, which can then be degrading in any way that is possible.
> >
> > /AndersBJ
> >
> >>
> >> / Anders Widell
> >>
> >> On 10/13/2015 12:07 PM, Tony Hart wrote:
> >>> Understood. The assumption is that this is temporary but we allow the payloads to continue to run (with reduced osaf functionality) until a replacement controller is found. At that point they can reboot to get the system back into sync.
> >>>
> >>> Or allow more than 2 controllers in the system so we can have one or more usually-payload cards be controllers to reduce the probability of no-controllers to an acceptable level.
> >>>
> >>>> On Oct 12, 2015, at 11:05 AM, Anders Björnerstedt <[email protected]> wrote:
> >>>>
> >>>> The headless state is also vulnerable to split-brain scenarios. That is, network partitions and joins can occur and will not be detected as such and thus not handled properly (isolated) when they occur.
> >>>> Basically you can not be sure you have a continuously coherent cluster while in the headless state.
> >>>>
> >>>> On paper you may get a very resilient system in the sense that it "stays up" and replies on ping etc.
> >>>> But typically a customer wants not just availability but reliable behavior also.
> >>>>
> >>>> /AndersBj
> >>>>
> >>>> -----Original Message-----
> >>>> From: Anders Björnerstedt [mailto:[email protected]]
> >>>> Sent: 12 October 2015 16:42
> >>>> To: Anders Widell; Tony Hart; [email protected]
> >>>> Subject: Re: [users] Avoid rebooting payload modules after losing system controller
> >>>>
> >>>> Note that this headless variant is a very questionable feature.
> >>>> This is for the reasons explained earlier, i.e. you *will* get a reduction in service availability.
> >>>> It was never accepted into OpenSAF for that reason.
> >>>>
> >>>> On top of that the unreliability will typically not be explicit/handled. That is, the operator will probably not even know what is working and what is not during the SC absence since the alarm/notification function is gone. No OpenSAF director services are executing.
> >>>>
> >>>> It is truly a headless system, i.e. a zombie system and thus not working at full monitoring and availability functionality.
> >>>> It begs the question of what OpenSAF and SAF is there for in the first place.
> >>>>
> >>>> The SCs don’t have to run any special software and don’t have to have any special hardware.
> >>>> They do need file system access, at least for a cluster restart, but not necessarily to handle single SC failure.
> >>>> The headless variant, when headless, is also in that not-able-to-cluster-restart state, but with even less functionality.
> >>>>
> >>>> An SC can of course run other (non OpenSAF specific) software. And the two SCs don’t necessarily have to be symmetric in terms of software.
> >>>>
> >>>> Providing file system access via NFS is typically a non issue. They have three nodes. Ergo they should be able to assign two of them the role of SC in the OpenSAF domain.
> >>>>
> >>>> /AndersBj
> >>>>
> >>>> -----Original Message-----
> >>>> From: Anders Widell [mailto:[email protected]]
> >>>> Sent: 12 October 2015 16:08
> >>>> To: Tony Hart; [email protected]
> >>>> Subject: Re: [users] Avoid rebooting payload modules after losing system controller
> >>>>
> >>>> We have actually implemented something very similar to what you are talking about. With this feature, the payloads can survive without a cluster restart even if both system controllers restart (or the single system controller, in your case). If you want to try it out, you can clone this Mercurial repository:
> >>>>
> >>>> https://sourceforge.net/u/anders-w/opensaf-headless/
> >>>>
> >>>> To enable the feature, set the variable IMMSV_SC_ABSENCE_ALLOWED in immd.conf to the number of seconds you wish the payloads to wait for the system controllers to come back. Note: we have only implemented this feature for the "core" OpenSAF services (plus CKPT), so you need to disable the optional services.
> >>>>
> >>>> / Anders Widell
> >>>>
> >>>> On 10/11/2015 02:30 PM, Tony Hart wrote:
> >>>>> We have been using opensaf in our product for a couple of years now. One of the issues we have is the fact that payload cards reboot when the system controllers are lost. Although our payload card hardware will continue to perform its functions whilst the software is down (which is desirable) the functions that the software performs are obviously not performed (which is not desirable).
> >>>>>
> >>>>> Why would we lose both controllers, surely that is a rare circumstance? Not if you only have one controller to begin with. Removing the second controller is a significant cost saving for us so we want to support a product that only has one controller.
> >>>>> The most significant impediment to that is the loss of payload software functions when the system controller fails.
> >>>>>
> >>>>> I’m looking for suggestions from this email list as to what could be done for this issue.
> >>>>>
> >>>>> One suggestion, that would work for us, is if we could convince the payload card to only reboot when the controller reappears after a loss rather than when the loss initially occurs. Is that possible?
> >>>>>
> >>>>> Another possibility is if we could support more than 2 controllers, for example if we could support 4 (one active and 3 standbys) that would also provide a solution for us (our current payloads would instead become controllers). I know that this is not currently possible with opensaf.
> >>>>>
> >>>>> thanks for any suggestions,
> >>>>> --
> >>>>> tony
> >>>>> ------------------------------------------------------------------------------
> >>>>> _______________________________________________
> >>>>> Opensaf-users mailing list
> >>>>> [email protected]
> >>>>> https://lists.sourceforge.net/lists/listinfo/opensaf-users
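
For anyone who wants to experiment with the suggestion at the top of this thread (keep the payload running, restart OpenSAF locally once the controllers are back, and let the CLC-CLI scripts tell that restart apart from a normal one), here is a rough Python sketch. It is only an illustration under assumptions: the marker file path, the `controllers_up` check and all function names are hypothetical and are not part of OpenSAF. The only thing taken from the thread is the restart command itself. How you detect that the controllers are back is left to the caller.

```python
import os
import subprocess
import tempfile
import time
from typing import Callable

# Hypothetical marker path, chosen for this sketch; not an OpenSAF convention.
MARKER = os.path.join(tempfile.gettempdir(), "osaf_headless_restart")


def restart_opensaf() -> None:
    """Run the restart command quoted in the thread."""
    subprocess.run(["/etc/init.d/opensafd", "restart"], check=True)


def wait_and_restart(controllers_up: Callable[[], bool],
                     restart: Callable[[], None] = restart_opensaf,
                     marker: str = MARKER,
                     poll_seconds: float = 5.0,
                     max_polls: int = 720) -> bool:
    """Poll until the controllers are back, then restart OpenSAF once.

    Returns True if a restart was triggered, False if polling gave up.
    """
    for _ in range(max_polls):
        if controllers_up():
            # Leave a marker so component scripts can recognise this restart.
            with open(marker, "w") as f:
                f.write("headless-recovery\n")
            restart()
            return True
        time.sleep(poll_seconds)
    return False


def is_recovery_restart(marker: str = MARKER) -> bool:
    """For use from a CLC-CLI script: was this restart the 'spoof' one?"""
    return os.path.exists(marker)


def clear_marker(marker: str = MARKER) -> None:
    """Remove the marker once all components have been started again."""
    if os.path.exists(marker):
        os.remove(marker)
```

A component's instantiate script could call `is_recovery_restart()` and skip any disruptive initialisation when it returns True. When to call `clear_marker()` depends on how the components are started in your deployment, so that decision is deliberately left open here.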
