> On Oct 13, 2015, at 6:23 AM, Mathivanan Naickan Palanivelu
> <[email protected]> wrote:
>
> I don't think this scenario is directly relevant to the multiple-standby
> concepts yet ("headless" sounds like a misnomer to me, that apart)!
> Could you please explain the following:
> - What is the redundancy model of the distributed applications running on
>   those 3 nodes, i.e. (Server, linecard1, linecard2)?

Currently Server = system-controller, linecard* = payload. This leads to a
SPOF; what I would like is for all cards to be system-controllers. Technically
two controllers would be enough, but then I have to decide which linecard is
the other controller and handle the case where that linecard is removed.

The redundancy model would be active-standby for the “controller” processes
and no-redundancy for the linecard processes; here we would run both
controller and linecard processes on the linecards.

> - Are your applications running as SA-AWARE or non-SA-AWARE?

Most are HA-aware naturally. A small number are not, and for those we have
HA-aware wrapper processes. So I think the answer is that all processes are
HA-aware.

> - What would be the typical heartbeat interval period.

If you mean the TIPC timeout, it's currently set to 10 seconds.

>
> You would require a customized handling (like delaying or avoiding reboot for
> a while)
>
> Thanks,
> Mathi.
>
>> -----Original Message-----
>> From: Tony Hart [mailto:[email protected]]
>> Sent: Tuesday, October 13, 2015 3:37 PM
>> To: Anders Björnerstedt
>> Cc: [email protected]
>> Subject: Re: [users] Avoid rebooting payload modules after losing system
>> controller
>>
>> Understood. The assumption is that this is temporary but we allow the
>> payloads to continue to run (with reduced osaf functionality) until a
>> replacement controller is found. At that point they can reboot to get the
>> system back into sync.
>>
>> Or allow more than 2 controllers in the system so we can have one or more
>> usually-payload cards be controllers to reduce the probability of
>> no-controllers to an acceptable level.
>>
>>> On Oct 12, 2015, at 11:05 AM, Anders Björnerstedt
>>> <[email protected]> wrote:
>>>
>>> The headless state is also vulnerable to split-brain scenarios.
>>> That is, network partitions and joins can occur and will not be detected as
>>> such and thus not handled properly (isolated) when they occur.
>>> Basically you cannot be sure you have a continuously coherent cluster
>>> while in the headless state.
>>>
>>> On paper you may get a very resilient system in the sense that it "stays up"
>>> and replies to ping etc.
>>> But typically a customer wants not just availability but reliable behavior
>>> also.
>>>
>>> /AndersBj
>>>
>>>
>>> -----Original Message-----
>>> From: Anders Björnerstedt [mailto:[email protected]]
>>> Sent: 12 October 2015 16:42
>>> To: Anders Widell; Tony Hart; [email protected]
>>> Subject: Re: [users] Avoid rebooting payload modules after losing
>>> system controller
>>>
>>> Note that this headless variant is a very questionable feature. This is for
>>> the reasons explained earlier, i.e. you *will* get a reduction in service
>>> availability.
>>> It was never accepted into OpenSAF for that reason.
>>>
>>> On top of that the unreliability will typically not be explicit/handled.
>>> That is, the operator will probably not even know what is working and what
>>> is not during the SC absence since the alarm/notification function is gone.
>>> No OpenSAF director services are executing.
>>>
>>> It is truly a headless system, i.e. a zombie system and thus not working at
>>> full monitoring and availability functionality.
>>> It begs the question of what OpenSAF and SAF is there for in the first
>>> place.
>>>
>>> The SCs don’t have to run any special software and don’t have to have any
>>> special hardware.
>>> They do need file system access, at least for a cluster restart, but not
>>> necessarily to handle single SC failure.
>>> The headless variant, when headless, is also in that
>>> not-able-to-cluster-restart state, but with even less functionality.
>>>
>>> An SC can of course run other (non OpenSAF specific) software. And the
>>> two SCs don’t necessarily have to be symmetric in terms of software.
>>>
>>> Providing file system access via NFS is typically a non-issue. They have
>>> three nodes.
>>> Ergo they should be able to assign two of them the role of SC in the
>>> OpenSAF domain.
>>>
>>> /AndersBj
>>>
>>> -----Original Message-----
>>> From: Anders Widell [mailto:[email protected]]
>>> Sent: 12 October 2015 16:08
>>> To: Tony Hart; [email protected]
>>> Subject: Re: [users] Avoid rebooting payload modules after losing
>>> system controller
>>>
>>> We have actually implemented something very similar to what you are
>>> talking about. With this feature, the payloads can survive without a cluster
>>> restart even if both system controllers restart (or the single system
>>> controller, in your case). If you want to try it out, you can clone this
>>> Mercurial repository:
>>>
>>> https://sourceforge.net/u/anders-w/opensaf-headless/
>>>
>>> To enable the feature, set the variable IMMSV_SC_ABSENCE_ALLOWED in
>>> immd.conf to the number of seconds you wish the payloads to wait for the
>>> system controllers to come back. Note: we have only implemented this
>>> feature for the "core" OpenSAF services (plus CKPT), so you need to disable
>>> the optional services.
>>>
>>> / Anders Widell
>>>
>>> On 10/11/2015 02:30 PM, Tony Hart wrote:
>>>> We have been using opensaf in our product for a couple of years now.
>>>> One of the issues we have is the fact that payload cards reboot when the
>>>> system controllers are lost. Although our payload card hardware will
>>>> continue to perform its functions whilst the software is down (which is
>>>> desirable) the functions that the software performs are obviously not
>>>> performed (which is not desirable).
>>>>
>>>> Why would we lose both controllers, surely that is a rare circumstance?
>>>> Not if you only have one controller to begin with. Removing the second
>>>> controller is a significant cost saving for us so we want to support a
>>>> product that only has one controller. The most significant impediment to
>>>> that is the loss of payload software functions when the system controller
>>>> fails.
>>>>
>>>> I’m looking for suggestions from this email list as to what could be done
>>>> for this issue.
>>>>
>>>> One suggestion, that would work for us, is if we could convince the
>>>> payload card to only reboot when the controller reappears after a loss
>>>> rather than when the loss initially occurs. Is that possible?
>>>>
>>>> Another possibility is if we could support more than 2 controllers, for
>>>> example if we could support 4 (one active and 3 standbys) that would also
>>>> provide a solution for us (our current payloads would instead become
>>>> controllers). I know that this is not currently possible with opensaf.
>>>>
>>>> thanks for any suggestions,
>>>> --
>>>> tony
>>>> ----------------------------------------------------------------------
>>>> _______________________________________________
>>>> Opensaf-users mailing list
>>>> [email protected]
>>>> https://lists.sourceforge.net/lists/listinfo/opensaf-users
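
For anyone who wants to try the headless repository Anders Widell mentions,
the setting he describes might look roughly like the fragment below. This is
only a sketch under assumptions: the thread names just the variable and the
file, so the `export` form (assuming immd.conf is sourced by a shell) and the
900-second value are illustrative guesses, not taken from the thread.

```shell
# immd.conf (fragment)
# Assumption: this file is sourced by a shell, so settings use "export".
# Assumption: 900 is a hypothetical value; pick whatever wait suits you.
# Number of seconds the payloads wait for the system controllers to return
# before giving up.
export IMMSV_SC_ABSENCE_ALLOWED=900
```

Per the same message, this only covers the "core" OpenSAF services (plus
CKPT), so the optional services would need to be disabled as well.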
