Yes, this is yet another approach. But it is also another use-case for the headless feature. When we have moved the system controllers out of the cluster (into the cloud infrastructure), I would expect controllers and payloads to have independent life cycles. You have servers (i.e. system controllers), and clients (payloads). They can be installed and upgraded separately from each other, and I wouldn't expect a restart of the servers to cause all the clients to restart as well, in the same way as I don't expect my web browser to restart just because the web server has crashed.
/ Anders Widell

On 10/13/2015 03:54 PM, Mathivanan Naickan Palanivelu wrote:
> I don't think this is a case of cattle! Even in those scenarios
> the cloud management stacks, the "**controller" software themselves, are
> 'placed' on physical nodes in appropriate redundancy models and not
> inside those cattle VMs!
>
> I think the case here is about avoiding a reboot of the node!
> This can be achieved by setting the NOACTIVE timer to a longer value until
> OpenSAF on the controller comes back up.
> Upon detecting that the controllers are up, some entity on the local node
> restarts OpenSAF (/etc/init.d/opensafd restart)
> and ensures the CLC-CLI scripts of the applications differentiate a usual
> restart from this spoof-restart!
>
> Mathi.
>
>> -----Original Message-----
>> From: Anders Widell [mailto:[email protected]]
>> Sent: Tuesday, October 13, 2015 5:36 PM
>> To: Anders Björnerstedt; Tony Hart
>> Cc: [email protected]
>> Subject: Re: [users] Avoid rebooting payload modules after losing system
>> controller
>>
>> Yes, I agree that the best fit for this feature is an application using
>> either the NWay-Active or the No-Redundancy models, and where you view the
>> system more as a collection of nodes rather than as a cluster. This kind of
>> architecture is quite common when you write applications for cloud. The
>> redundancy models are suitable for scaling, and the architecture fits into
>> the "cattle" philosophy which is common in cloud.
>> Such an application can tolerate any number of node failures, and the
>> remaining nodes would still be able to continue functioning and provide their
>> service. However, if we put the OpenSAF middleware on the nodes it
>> becomes the weakest link, since OpenSAF will reboot all the nodes just
>> because the two controller nodes fail. What a pity on a system with one
>> hundred nodes!
>>
>> / Anders Widell
>>
>> On 10/13/2015 01:19 PM, Anders Björnerstedt wrote:
>>>
>>> On 10/13/2015 12:27 PM, Anders Widell wrote:
>>>> The possibility to have more than two system controllers (one active
>>>> + several standby and/or spare controller nodes) is also something
>>>> that has been investigated. For scalability reasons, we probably
>>>> can't turn all nodes into standby controllers in a large cluster -
>>>> but it may be feasible to have a system with one or several standby
>>>> controllers and the rest of the nodes are spares that are ready to
>>>> take an active or standby assignment when needed.
>>>>
>>>> However, the "headless" feature will still be needed in some systems
>>>> where you need dedicated controller node(s).
>>> That sounds as if some deployments have a special requirement that can
>>> only be supported by the headless feature.
>>> But you also have to say that the headless feature places
>>> anti-requirements on the deployments/applications that are to use it.
>>>
>>> For example not needing cluster coherence among the payloads.
>>>
>>> If the payloads only run independent application instances where each
>>> instance is implemented at one processor or at least does not
>>> communicate in any state-sensitive way with peer processes at other
>>> payloads; and no such instance is unique or if it is unique it is
>>> still expendable (non critical to the service), then it could work.
>>>
>>> It is important that the deployments that end up thinking they need the
>>> headless feature also understand what they lose with the headless
>>> feature and that this loss is acceptable for that deployment.
>>>
>>> So headless is not a fancy feature needed by some exclusive and picky
>>> subset of applications.
>>> It is a relaxation that drops all requirements on distributed
>>> consistency and may be acceptable to some applications with weaker
>>> demands so they can accept the anti requirements.
>>>
>>> Besides requiring "dedicated" controller nodes, the deployment must of
>>> course NOT require any *availability* of those dedicated controller
>>> nodes, i.e. not have any requirements on service availability in
>>> general.
>>>
>>> It may work for some "dumb" applications that are stateless, or state
>>> stable (frozen in state), or have no requirements on availability of
>>> state. In other words some applications that really don't need SAF.
>>>
>>> They may still want to use SAF as a way of managing and monitoring the
>>> system when it happens to be healthy, but can live with long periods
>>> of not being able to manage or monitor that system, which can then be
>>> degrading in any way that is possible.
>>>
>>>
>>> /AndersBJ
>>>
>>>
>>>
>>>> / Anders Widell
>>>>
>>>> On 10/13/2015 12:07 PM, Tony Hart wrote:
>>>>> Understood. The assumption is that this is temporary but we allow
>>>>> the payloads to continue to run (with reduced osaf functionality)
>>>>> until a replacement controller is found. At that point they can
>>>>> reboot to get the system back into sync.
>>>>>
>>>>> Or allow more than 2 controllers in the system so we can have one or
>>>>> more usually-payload cards be controllers to reduce the probability
>>>>> of no-controllers to an acceptable level.
>>>>>
>>>>>
>>>>>> On Oct 12, 2015, at 11:05 AM, Anders Björnerstedt
>>>>>> <[email protected]> wrote:
>>>>>>
>>>>>> The headless state is also vulnerable to split-brain scenarios.
>>>>>> That is, network partitions and joins can occur and will not be
>>>>>> detected as such and thus not handled properly (isolated) when they
>>>>>> occur.
>>>>>> Basically you can not be sure you have a continuously coherent
>>>>>> cluster while in the headless state.
>>>>>>
>>>>>> On paper you may get a very resilient system in the sense that it
>>>>>> "stays up" and replies on ping etc.
>>>>>> But typically a customer wants not just availability but reliable
>>>>>> behavior also.
>>>>>>
>>>>>> /AndersBj
>>>>>>
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Anders Björnerstedt [mailto:[email protected]]
>>>>>> Sent: 12 October 2015 16:42
>>>>>> To: Anders Widell; Tony Hart; [email protected]
>>>>>> Subject: Re: [users] Avoid rebooting payload modules after losing
>>>>>> system controller
>>>>>>
>>>>>> Note that this headless variant is a very questionable feature.
>>>>>> This for the reasons explained earlier, i.e. you *will* get a
>>>>>> reduction in service availability.
>>>>>> It was never accepted into OpenSAF for that reason.
>>>>>>
>>>>>> On top of that the unreliability will typically not be
>>>>>> explicit/handled. That is, the operator will probably not even know
>>>>>> what is working and what is not during the SC absence since the
>>>>>> alarm/notification function is gone. No OpenSAF director services
>>>>>> are executing.
>>>>>>
>>>>>> It is truly a headless system, i.e. a zombie system and thus not
>>>>>> working at full monitoring and availability functionality.
>>>>>> It begs the question of what OpenSAF and SAF is there for in the
>>>>>> first place.
>>>>>>
>>>>>> The SCs don’t have to run any special software and don’t have to
>>>>>> have any special hardware.
>>>>>> They do need file system access, at least for a cluster restart,
>>>>>> but not necessarily to handle single SC failure.
>>>>>> The headless variant, when headless, is also in that
>>>>>> not-able-to-cluster-restart state, but with even less functionality.
>>>>>>
>>>>>> An SC can of course run other (non OpenSAF specific) software. And
>>>>>> the two SCs don’t necessarily have to be symmetric in terms of
>>>>>> software.
>>>>>>
>>>>>> Providing file system access via NFS is typically a non issue. They
>>>>>> have three nodes. Ergo they should be able to assign two of them
>>>>>> the role of SC in the OpenSAF domain.
>>>>>>
>>>>>> /AndersBj
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Anders Widell [mailto:[email protected]]
>>>>>> Sent: 12 October 2015 16:08
>>>>>> To: Tony Hart; [email protected]
>>>>>> Subject: Re: [users] Avoid rebooting payload modules after losing
>>>>>> system controller
>>>>>>
>>>>>> We have actually implemented something very similar to what you are
>>>>>> talking about. With this feature, the payloads can survive without
>>>>>> a cluster restart even if both system controllers restart (or the
>>>>>> single system controller, in your case). If you want to try it out,
>>>>>> you can clone this Mercurial repository:
>>>>>>
>>>>>> https://sourceforge.net/u/anders-w/opensaf-headless/
>>>>>>
>>>>>> To enable the feature, set the variable IMMSV_SC_ABSENCE_ALLOWED in
>>>>>> immd.conf to the number of seconds you wish the payloads to wait
>>>>>> for the system controllers to come back. Note: we have only
>>>>>> implemented this feature for the "core" OpenSAF services (plus
>>>>>> CKPT), so you need to disable the optional services.
>>>>>>
>>>>>> / Anders Widell
>>>>>>
>>>>>> On 10/11/2015 02:30 PM, Tony Hart wrote:
>>>>>>> We have been using opensaf in our product for a couple of years
>>>>>>> now. One of the issues we have is the fact that payload cards
>>>>>>> reboot when the system controllers are lost. Although our payload
>>>>>>> card hardware will continue to perform its functions whilst the
>>>>>>> software is down (which is desirable) the functions that the
>>>>>>> software performs are obviously not performed (which is not
>>>>>>> desirable).
>>>>>>>
>>>>>>> Why would we lose both controllers, surely that is a rare
>>>>>>> circumstance? Not if you only have one controller to begin with.
>>>>>>> Removing the second controller is a significant cost saving for us
>>>>>>> so we want to support a product that only has one controller. The
>>>>>>> most significant impediment to that is the loss of payload
>>>>>>> software functions when the system controller fails.
>>>>>>>
>>>>>>> I’m looking for suggestions from this email list as to what could
>>>>>>> be done for this issue.
>>>>>>>
>>>>>>> One suggestion, that would work for us, is if we could convince
>>>>>>> the payload card to only reboot when the controller reappears
>>>>>>> after a loss rather than when the loss initially occurs. Is that
>>>>>>> possible?
>>>>>>>
>>>>>>> Another possibility is if we could support more than 2
>>>>>>> controllers, for example if we could support 4 (one active and 3
>>>>>>> standbys) that would also provide a solution for us (our current
>>>>>>> payloads would instead become controllers). I know that this is
>>>>>>> not currently possible with opensaf.
>>>>>>>
>>>>>>> thanks for any suggestions,
>>>>>>> --
>>>>>>> tony
>>>>>>> ------------------------------------------------------------------------------
>>>>>>> _______________________________________________
>>>>>>> Opensaf-users mailing list
>>>>>>> [email protected]
>>>>>>> https://lists.sourceforge.net/lists/listinfo/opensaf-users
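
For anyone trying the headless repository mentioned earlier in this thread, the setting Anders Widell describes (IMMSV_SC_ABSENCE_ALLOWED in immd.conf) might look roughly like the fragment below. This is only a sketch: the file location, the "export" form of the assignment, and the value 900 are assumptions made for illustration and are not stated anywhere in the thread, so check the immd.conf shipped with your build.

```shell
# immd.conf on the system controller(s).
# The usual install location (/etc/opensaf/immd.conf) is an assumption here.

# Number of seconds the payloads may keep running with no system controller
# present before giving up. 900 (15 minutes) is an arbitrary example value,
# not a recommendation.
export IMMSV_SC_ABSENCE_ALLOWED=900
```

As noted in the thread, the feature only covers the "core" OpenSAF services plus CKPT, so the optional services still have to be disabled separately.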
