Understood. The assumption is that this is temporary but we allow the payloads to continue to run (with reduced osaf functionality) until a replacement controller is found. At that point they can reboot to get the system back into sync.
Or allow more than 2 controllers in the system so we can have one or more usually-payload cards be controllers to reduce the probability of no-controllers to an acceptable level.

> On Oct 12, 2015, at 11:05 AM, Anders Björnerstedt <[email protected]> wrote:
>
> The headless state is also vulnerable to split-brain scenarios.
> That is, network partitions and joins can occur and will not be detected as
> such and thus not handled properly (isolated) when they occur.
> Basically you cannot be sure you have a continuously coherent cluster while
> in the headless state.
>
> On paper you may get a very resilient system in the sense that it "stays up"
> and replies to ping etc.
> But typically a customer wants not just availability but reliable behavior
> also.
>
> /AndersBj
>
>
> -----Original Message-----
> From: Anders Björnerstedt [mailto:[email protected]]
> Sent: 12 October 2015 16:42
> To: Anders Widell; Tony Hart; [email protected]
> Subject: Re: [users] Avoid rebooting payload modules after losing system controller
>
> Note that this headless variant is a very questionable feature. This for the
> reasons explained earlier, i.e. you *will* get a reduction in service
> availability.
> It was never accepted into OpenSAF for that reason.
>
> On top of that the unreliability will typically not be explicit/handled. That
> is, the operator will probably not even know what is working and what is not
> during the SC absence since the alarm/notification function is gone. No
> OpenSAF director services are executing.
>
> It is truly a headless system, i.e. a zombie system and thus not working at
> full monitoring and availability functionality.
> It begs the question of what OpenSAF and SAF is there for in the first place.
>
> The SCs don’t have to run any special software and don’t have to have any
> special hardware.
> They do need file system access, at least for a cluster restart, but not
> necessarily to handle single SC failure.
> The headless variant, when headless, is also in that
> not-able-to-cluster-restart state, but with even less functionality.
>
> An SC can of course run other (non OpenSAF specific) software. And the two
> SCs don’t necessarily have to be symmetric in terms of software.
>
> Providing file system access via NFS is typically a non issue. They have
> three nodes. Ergo they should be able to assign two of them the role of SC
> in the OpenSAF domain.
>
> /AndersBj
>
> -----Original Message-----
> From: Anders Widell [mailto:[email protected]]
> Sent: 12 October 2015 16:08
> To: Tony Hart; [email protected]
> Subject: Re: [users] Avoid rebooting payload modules after losing system controller
>
> We have actually implemented something very similar to what you are talking
> about. With this feature, the payloads can survive without a cluster restart
> even if both system controllers restart (or the single system controller, in
> your case). If you want to try it out, you can clone this Mercurial
> repository:
>
> https://sourceforge.net/u/anders-w/opensaf-headless/
>
> To enable the feature, set the variable IMMSV_SC_ABSENCE_ALLOWED in immd.conf
> to the number of seconds you wish the payloads to wait for the system
> controllers to come back. Note: we have only implemented this feature for the
> "core" OpenSAF services (plus CKPT), so you need to disable the optional
> services.
>
> / Anders Widell
>
> On 10/11/2015 02:30 PM, Tony Hart wrote:
>> We have been using opensaf in our product for a couple of years now. One of
>> the issues we have is the fact that payload cards reboot when the system
>> controllers are lost. Although our payload card hardware will continue to
>> perform its functions whilst the software is down (which is desirable) the
>> functions that the software performs are obviously not performed (which is
>> not desirable).
>>
>> Why would we lose both controllers, surely that is a rare circumstance?
>> Not if you only have one controller to begin with. Removing the second
>> controller is a significant cost saving for us so we want to support a
>> product that only has one controller. The most significant impediment to
>> that is the loss of payload software functions when the system controller
>> fails.
>>
>> I’m looking for suggestions from this email list as to what could be done
>> for this issue.
>>
>> One suggestion, that would work for us, is if we could convince the payload
>> card to only reboot when the controller reappears after a loss rather than
>> when the loss initially occurs. Is that possible?
>>
>> Another possibility is if we could support more than 2 controllers, for
>> example if we could support 4 (one active and 3 standbys) that would also
>> provide a solution for us (our current payloads would instead become
>> controllers). I know that this is not currently possible with opensaf.
>>
>> thanks for any suggestions,
>> --
>> tony
>> ------------------------------------------------------------------------------
>> _______________________________________________
>> Opensaf-users mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/opensaf-users
