Please see a correction inline:

----- [email protected] wrote:
> Hi Tony,
>
> Yes, it has been agreed to support more than two controllers in a
> phased manner.
>
> The ticket http://sourceforge.net/p/opensaf/tickets/79/ will be
> available as a part of the next major release, 5.0, in April 2016.
> This ticket #79 will provide support for configuring spare system
> controllers, i.e. every time a controller dies, a spare will be
> promoted as an ACTIVE.

A spare node will be promoted as an SC.
Mathi.

> In your case, one of the line cards can be configured as a spare!
> Please see the attachment in the ticket for an overview of the
> feature.
>
> At the same time, it has been agreed to also include the solution in
> https://sourceforge.net/u/anders-w/opensaf-headless/ (currently
> maintained as a fork) in the main branch (in 5.0). This feature
> will be configurable, so it can be turned on or off.
>
> Subsequently, 5.1 will have support for a mix of multiple standbys +
> spares.
>
> The details of this have been updated in the following two tickets:
> http://sourceforge.net/p/opensaf/tickets/439/
> and
> http://sourceforge.net/p/opensaf/tickets/1170/
>
> Cheers,
> Mathi.
>
> ----- [email protected] wrote:
>
> > Hi Mathi,
> > Any updates on this topic?
> > thanks
> > --
> > tony
> >
> > > On Oct 14, 2015, at 7:50 AM, Mathivanan Naickan Palanivelu
> > > <[email protected]> wrote:
> > >
> > > Tony,
> > >
> > > We (in the TLC) have discussed this before. At this point in time
> > > you indeed have the solution that AndersW has pointed to.
> > >
> > > There are also other discussions going on w.r.t. multiple standbys
> > > etc. for the next major release. Will tag you in the appropriate
> > > ticket (one such ticket is
> > > http://sourceforge.net/p/opensaf/tickets/1170/)
> > > once those discussions conclude.
> > >
> > > Cheers,
> > > Mathi.
> > >
> > > ----- [email protected] wrote:
> > >
> > >> Thanks all, this is a good discussion.
> > >>
> > >> Let me again state my use case, which as I said I don’t think is
> > >> unique, because selfishly I want to make sure that this is
> > >> covered :-)
> > >>
> > >> I want to keep the system running in the case where both
> > >> controllers fail, or more specifically in my case, when the only
> > >> server in the system fails (cost-reduced system).
> > >>
> > >> The "don’t reboot until a server re-appears" option (Mathi’s
> > >> suggestion) would work for me. I think it’s a poorer solution but
> > >> it would be workable, especially if it’s available sooner.
> > >>
> > >> A better option is to allow one of the linecards to take over the
> > >> controller role when the preferred one dies. Linecards are
> > >> assumed to be transitory (they may or may not be there, or could
> > >> be removed), so I don’t want to have to pick a linecard for this
> > >> role at boot time. Much better would be to allow some number
> > >> (more than 2) of nodes to be controllers (e.g. Active, Standby,
> > >> Spare, Spare …). Then OSAF takes responsibility for electing the
> > >> next Active/Standby when necessary. There would have to be a
> > >> concept of preference such that the Server node is chosen given
> > >> a choice.
> > >>
> > >> For the solutions discussed so far…
> > >>
> > >> Making the OSAF processes restartable and able to be active on
> > >> either controller is good and will increase the MTBF of an OSAF
> > >> system. However, it won’t fix my problem if there are still only
> > >> two controllers allowed.
> > >>
> > >> I’m not familiar with the “cattle” concept.
> > >>
> > >> I wasn’t advocating the “headless” solution as such, only to the
> > >> extent that it solved my immediate problem. Even if we did have
> > >> a “headless” mode it would be a temporary situation until the
> > >> failed controller could be replaced.
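The "Active, Standby, Spare, Spare …" idea with a node preference, as described above, can be sketched roughly as follows. This is only an illustration of the proposal, not how OpenSAF elects controllers; the Node class, the numeric preference field and the elect() function are all hypothetical names made up for the example. It also assumes no preemption: a preferred node that comes back rejoins as a spare.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    name: str
    preference: int       # hypothetical: higher value = preferred as controller
    alive: bool = True
    role: str = "SPARE"   # ACTIVE, STANDBY, SPARE or DOWN


def elect(nodes: List[Node]) -> None:
    """Refill the ACTIVE and STANDBY roles after a membership change."""
    for n in nodes:
        if not n.alive:
            n.role = "DOWN"
        elif n.role == "DOWN":
            n.role = "SPARE"          # a returning node rejoins as a spare

    active = next((n for n in nodes if n.role == "ACTIVE"), None)
    standby = next((n for n in nodes if n.role == "STANDBY"), None)

    # A surviving standby takes over before any spare is considered.
    if active is None and standby is not None:
        standby.role = "ACTIVE"
        active, standby = standby, None

    # Remaining vacancies are filled from the spares, most preferred first.
    spares = sorted((n for n in nodes if n.role == "SPARE"),
                    key=lambda n: n.preference, reverse=True)
    if active is None and spares:
        active = spares.pop(0)
        active.role = "ACTIVE"
    if standby is None and spares:
        spares.pop(0).role = "STANDBY"
```

With a server node given the highest preference and two linecards as spares, the server is chosen as Active at start-up, and a linecard is promoted only when a controller role becomes vacant.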
> > >>
> > >> thanks
> > >> --
> > >> tony
> > >>
> > >>> On Oct 14, 2015, at 6:44 AM, Mathivanan Naickan Palanivelu
> > >>> <[email protected]> wrote:
> > >>>
> > >>> In a clustered environment, all nodes need to have the same
> > >>> consistent view from a membership perspective.
> > >>> So, loss of a leader will indeed result in state changes on the
> > >>> nodes following the leader. Therefore some nodes cannot have
> > >>> lifecycles that are independent of other nodes.
> > >>> B.T.W., the restart mentioned below meant a 'transparent restart
> > >>> of OpenSAF' without affecting applications, and for this the
> > >>> application's CLC-CLI scripts need to handle it.
> > >>>
> > >>> While we could be inclined to support a (unique!) use case such
> > >>> as this, it is the solution (headless fork) that would be under
> > >>> the lens! ;-)
> > >>>
> > >>> Let's discuss this!
> > >>>
> > >>> Cheers,
> > >>> Mathi.
> > >>>
> > >>> ----- [email protected] wrote:
> > >>>
> > >>>> Yes, this is yet another approach. But it is also another use
> > >>>> case for the headless feature. When we have moved the system
> > >>>> controllers out of the cluster (into the cloud infrastructure),
> > >>>> I would expect controllers and payloads to have independent
> > >>>> life cycles. You have servers (i.e. system controllers), and
> > >>>> clients (payloads). They can be installed and upgraded
> > >>>> separately from each other, and I wouldn't expect a restart of
> > >>>> the servers to cause all the clients to restart as well, in the
> > >>>> same way as I don't expect my web browser to restart just
> > >>>> because the web server has crashed.
> > >>>>
> > >>>> / Anders Widell
> > >>>>
> > >>>> On 10/13/2015 03:54 PM, Mathivanan Naickan Palanivelu wrote:
> > >>>>> I don't think this is a case of cattle!
> > >>>>> Even in those scenarios, the cloud management stacks, the
> > >>>>> "**controller" software themselves, are 'placed' on physical
> > >>>>> nodes in appropriate redundancy models and not inside those
> > >>>>> cattle VMs!
> > >>>>>
> > >>>>> I think the case here is about avoiding rebooting of the node!
> > >>>>> This can be achieved by setting the NOACTIVE timer to a longer
> > >>>>> value until OpenSAF on the controller comes back up.
> > >>>>> Upon detecting that the controllers are up, some entity on the
> > >>>>> local node restarts OpenSAF (/etc/init.d/opensafd restart).
> > >>>>> And ensure that the CLC-CLI scripts of the applications
> > >>>>> differentiate a usual restart from this spoof-restart!
> > >>>>>
> > >>>>> Mathi.
> > >>>>>
> > >>>>>> -----Original Message-----
> > >>>>>> From: Anders Widell [mailto:[email protected]]
> > >>>>>> Sent: Tuesday, October 13, 2015 5:36 PM
> > >>>>>> To: Anders Björnerstedt; Tony Hart
> > >>>>>> Cc: [email protected]
> > >>>>>> Subject: Re: [users] Avoid rebooting payload modules after
> > >>>>>> losing system controller
> > >>>>>>
> > >>>>>> Yes, I agree that the best fit for this feature is an
> > >>>>>> application using either the NWay-Active or the No-Redundancy
> > >>>>>> models, and where you view the system more as a collection of
> > >>>>>> nodes rather than as a cluster. This kind of architecture is
> > >>>>>> quite common when you write applications for cloud. The
> > >>>>>> redundancy models are suitable for scaling, and the
> > >>>>>> architecture fits into the "cattle" philosophy which is
> > >>>>>> common in cloud.
> > >>>>>> Such an application can tolerate any number of node failures,
> > >>>>>> and the remaining nodes would still be able to continue
> > >>>>>> functioning and provide their service.
> > >>>>>> However, if we put the OpenSAF middleware on the nodes it
> > >>>>>> becomes the weakest link, since OpenSAF will reboot all the
> > >>>>>> nodes just because the two controller nodes fail. What a pity
> > >>>>>> on a system with one hundred nodes!
> > >>>>>>
> > >>>>>> / Anders Widell
> > >>>>>>
> > >>>>>> On 10/13/2015 01:19 PM, Anders Björnerstedt wrote:
> > >>>>>>>
> > >>>>>>> On 10/13/2015 12:27 PM, Anders Widell wrote:
> > >>>>>>>> The possibility to have more than two system controllers
> > >>>>>>>> (one active + several standby and/or spare controller
> > >>>>>>>> nodes) is also something that has been investigated. For
> > >>>>>>>> scalability reasons, we probably can't turn all nodes into
> > >>>>>>>> standby controllers in a large cluster - but it may be
> > >>>>>>>> feasible to have a system with one or several standby
> > >>>>>>>> controllers and the rest of the nodes are spares that are
> > >>>>>>>> ready to take an active or standby assignment when needed.
> > >>>>>>>>
> > >>>>>>>> However, the "headless" feature will still be needed in
> > >>>>>>>> some systems where you need dedicated controller node(s).
> > >>>>>>> That sounds as if some deployments have a special
> > >>>>>>> requirement that can only be supported by the headless
> > >>>>>>> feature.
> > >>>>>>> But you also have to say that the headless feature places
> > >>>>>>> anti-requirements on the deployments/applications that are
> > >>>>>>> to use it.
> > >>>>>>>
> > >>>>>>> For example not needing cluster coherence among the payloads.
> > >>>>>>>
> > >>>>>>> If the payloads only run independent application instances
> > >>>>>>> where each instance is implemented at one processor or at
> > >>>>>>> least does not communicate in any state-sensitive way with
> > >>>>>>> peer processes at other payloads; and no such instance is
> > >>>>>>> unique or, if it is unique, it is still expendable
> > >>>>>>> (non-critical to the service), then it could work.
> > >>>>>>>
> > >>>>>>> It is important that the deployments that end up thinking
> > >>>>>>> they need the headless feature also understand what they
> > >>>>>>> lose with the headless feature and that this loss is
> > >>>>>>> acceptable for that deployment.
> > >>>>>>>
> > >>>>>>> So headless is not a fancy feature needed by some exclusive
> > >>>>>>> and picky subset of applications.
> > >>>>>>> It is a relaxation that drops all requirements on
> > >>>>>>> distributed consistency and may be acceptable to some
> > >>>>>>> applications with weaker demands so they can accept the
> > >>>>>>> anti-requirements.
> > >>>>>>>
> > >>>>>>> Besides requiring "dedicated" controller nodes, the
> > >>>>>>> deployment must of course NOT require any *availability* of
> > >>>>>>> those dedicated controller nodes, i.e. not have any
> > >>>>>>> requirements on service availability in general.
> > >>>>>>>
> > >>>>>>> It may work for some "dumb" applications that are stateless,
> > >>>>>>> or state stable (frozen in state), or have no requirements
> > >>>>>>> on availability of state. In other words, some applications
> > >>>>>>> that really don't need SAF.
> > >>>>>>>
> > >>>>>>> They may still want to use SAF as a way of managing and
> > >>>>>>> monitoring the system when it happens to be healthy, but can
> > >>>>>>> live with long periods of not being able to manage or
> > >>>>>>> monitor that system, which can then be degrading in any way
> > >>>>>>> that is possible.
> > >>>>>>>
> > >>>>>>> /AndersBJ
> > >>>>>>>
> > >>>>>>>> / Anders Widell
> > >>>>>>>>
> > >>>>>>>> On 10/13/2015 12:07 PM, Tony Hart wrote:
> > >>>>>>>>> Understood. The assumption is that this is temporary but
> > >>>>>>>>> we allow the payloads to continue to run (with reduced
> > >>>>>>>>> osaf functionality) until a replacement controller is
> > >>>>>>>>> found. At that point they can reboot to get the system
> > >>>>>>>>> back into sync.
> > >>>>>>>>>
> > >>>>>>>>> Or allow more than 2 controllers in the system so we can
> > >>>>>>>>> have one or more usually-payload cards be controllers to
> > >>>>>>>>> reduce the probability of no-controllers to an acceptable
> > >>>>>>>>> level.
> > >>>>>>>>>
> > >>>>>>>>>> On Oct 12, 2015, at 11:05 AM, Anders Björnerstedt
> > >>>>>>>>>> <[email protected]> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>> The headless state is also vulnerable to split-brain
> > >>>>>>>>>> scenarios. That is, network partitions and joins can
> > >>>>>>>>>> occur and will not be detected as such and thus not
> > >>>>>>>>>> handled properly (isolated) when they occur.
> > >>>>>>>>>> Basically you cannot be sure you have a continuously
> > >>>>>>>>>> coherent cluster while in the headless state.
> > >>>>>>>>>>
> > >>>>>>>>>> On paper you may get a very resilient system in the sense
> > >>>>>>>>>> that it "stays up" and replies to ping etc.
> > >>>>>>>>>> But typically a customer wants not just availability but
> > >>>>>>>>>> reliable behavior also.
> > >>>>>>>>>>
> > >>>>>>>>>> /AndersBj
> > >>>>>>>>>>
> > >>>>>>>>>> -----Original Message-----
> > >>>>>>>>>> From: Anders Björnerstedt
> > >>>>>>>>>> [mailto:[email protected]]
> > >>>>>>>>>> Sent: 12 October 2015 16:42
> > >>>>>>>>>> To: Anders Widell; Tony Hart;
> > >>>>>>>>>> [email protected]
> > >>>>>>>>>> Subject: Re: [users] Avoid rebooting payload modules
> > >>>>>>>>>> after losing system controller
> > >>>>>>>>>>
> > >>>>>>>>>> Note that this headless variant is a very questionable
> > >>>>>>>>>> feature. This is for the reasons explained earlier, i.e.
> > >>>>>>>>>> you *will* get a reduction in service availability.
> > >>>>>>>>>> It was never accepted into OpenSAF for that reason.
> > >>>>>>>>>>
> > >>>>>>>>>> On top of that, the unreliability will typically not be
> > >>>>>>>>>> explicit/handled. That is, the operator will probably not
> > >>>>>>>>>> even know what is working and what is not during the SC
> > >>>>>>>>>> absence, since the alarm/notification function is gone.
> > >>>>>>>>>> No OpenSAF director services are executing.
> > >>>>>>>>>>
> > >>>>>>>>>> It is truly a headless system, i.e. a zombie system, and
> > >>>>>>>>>> thus not working at full monitoring and availability
> > >>>>>>>>>> functionality.
> > >>>>>>>>>> It begs the question of what OpenSAF and SAF are there
> > >>>>>>>>>> for in the first place.
> > >>>>>>>>>>
> > >>>>>>>>>> The SCs don’t have to run any special software and don’t
> > >>>>>>>>>> have to have any special hardware.
> > >>>>>>>>>> They do need file system access, at least for a cluster
> > >>>>>>>>>> restart, but not necessarily to handle single SC failure.
> > >>>>>>>>>> The headless variant, when headless, is also in that
> > >>>>>>>>>> not-able-to-cluster-restart state, but with even less
> > >>>>>>>>>> functionality.
> > >>>>>>>>>>
> > >>>>>>>>>> An SC can of course run other (non-OpenSAF-specific)
> > >>>>>>>>>> software. And the two SCs don’t necessarily have to be
> > >>>>>>>>>> symmetric in terms of software.
> > >>>>>>>>>>
> > >>>>>>>>>> Providing file system access via NFS is typically a
> > >>>>>>>>>> non-issue. They have three nodes. Ergo they should be
> > >>>>>>>>>> able to assign two of them the role of SC in the OpenSAF
> > >>>>>>>>>> domain.
> > >>>>>>>>>>
> > >>>>>>>>>> /AndersBj
> > >>>>>>>>>>
> > >>>>>>>>>> -----Original Message-----
> > >>>>>>>>>> From: Anders Widell [mailto:[email protected]]
> > >>>>>>>>>> Sent: 12 October 2015 16:08
> > >>>>>>>>>> To: Tony Hart; [email protected]
> > >>>>>>>>>> Subject: Re: [users] Avoid rebooting payload modules
> > >>>>>>>>>> after losing system controller
> > >>>>>>>>>>
> > >>>>>>>>>> We have actually implemented something very similar to
> > >>>>>>>>>> what you are talking about. With this feature, the
> > >>>>>>>>>> payloads can survive without a cluster restart even if
> > >>>>>>>>>> both system controllers restart (or the single system
> > >>>>>>>>>> controller, in your case). If you want to try it out,
> > >>>>>>>>>> you can clone this Mercurial repository:
> > >>>>>>>>>>
> > >>>>>>>>>> https://sourceforge.net/u/anders-w/opensaf-headless/
> > >>>>>>>>>>
> > >>>>>>>>>> To enable the feature, set the variable
> > >>>>>>>>>> IMMSV_SC_ABSENCE_ALLOWED in immd.conf to the number of
> > >>>>>>>>>> seconds you wish the payloads to wait for the system
> > >>>>>>>>>> controllers to come back. Note: we have only implemented
> > >>>>>>>>>> this feature for the "core" OpenSAF services (plus
> > >>>>>>>>>> CKPT), so you need to disable the optional services.
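For anyone scripting the immd.conf change described above, here is a minimal sketch. It assumes that immd.conf uses shell-style "export NAME=value" lines; that format is an assumption and is not stated in this thread. The helper name set_sc_absence is hypothetical, and the function only transforms text, so reading and writing the real file is left to the caller.

```python
import re

VAR = "IMMSV_SC_ABSENCE_ALLOWED"


def set_sc_absence(conf_text: str, seconds: int) -> str:
    """Return conf_text with VAR set to `seconds`.

    Assumes shell-style "export NAME=value" lines (an assumption about
    the immd.conf format). An existing assignment, commented out or
    not, is replaced; otherwise a new line is appended.
    """
    if seconds < 0:
        raise ValueError("seconds must not be negative")
    line = f"export {VAR}={seconds}"
    pattern = re.compile(rf"^[ \t]*#?[ \t]*export[ \t]+{VAR}=.*$",
                         re.MULTILINE)
    if pattern.search(conf_text):
        return pattern.sub(line, conf_text, count=1)
    sep = "" if (not conf_text or conf_text.endswith("\n")) else "\n"
    return f"{conf_text}{sep}{line}\n"
```

Keeping the function free of file I/O makes it easy to check the result before touching the configuration on a live node.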
> > >>>>>>>>>>
> > >>>>>>>>>> / Anders Widell
> > >>>>>>>>>>
> > >>>>>>>>>> On 10/11/2015 02:30 PM, Tony Hart wrote:
> > >>>>>>>>>>> We have been using opensaf in our product for a couple
> > >>>>>>>>>>> of years now. One of the issues we have is the fact
> > >>>>>>>>>>> that payload cards reboot when the system controllers
> > >>>>>>>>>>> are lost. Although our payload card hardware will
> > >>>>>>>>>>> continue to perform its functions whilst the software
> > >>>>>>>>>>> is down (which is desirable), the functions that the
> > >>>>>>>>>>> software performs are obviously not performed (which is
> > >>>>>>>>>>> not desirable).
> > >>>>>>>>>>>
> > >>>>>>>>>>> Why would we lose both controllers, surely that is a
> > >>>>>>>>>>> rare circumstance? Not if you only have one controller
> > >>>>>>>>>>> to begin with. Removing the second controller is a
> > >>>>>>>>>>> significant cost saving for us so we want to support a
> > >>>>>>>>>>> product that only has one controller. The most
> > >>>>>>>>>>> significant impediment to that is the loss of payload
> > >>>>>>>>>>> software functions when the system controller fails.
> > >>>>>>>>>>>
> > >>>>>>>>>>> I’m looking for suggestions from this email list as to
> > >>>>>>>>>>> what could be done for this issue.
> > >>>>>>>>>>>
> > >>>>>>>>>>> One suggestion that would work for us is if we could
> > >>>>>>>>>>> convince the payload card to only reboot when the
> > >>>>>>>>>>> controller reappears after a loss rather than when the
> > >>>>>>>>>>> loss initially occurs. Is that possible?
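The "reboot only when the controller reappears" behaviour asked about above can be pictured as a small payload-side state machine. This is purely an illustration of the request, not existing OpenSAF behaviour; PayloadPolicy, Action and on_controller_state are hypothetical names.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"
    REBOOT = "reboot"


class PayloadPolicy:
    """Hypothetical policy: keep running while the controller is gone,
    and reboot to resynchronize only once it is seen again."""

    def __init__(self) -> None:
        self.loss_pending = False

    def on_controller_state(self, present: bool) -> Action:
        if not present:
            # Remember the loss, but let the payload software keep running.
            self.loss_pending = True
            return Action.NONE
        if self.loss_pending:
            # The controller is back after a loss: reboot to get in sync.
            self.loss_pending = False
            return Action.REBOOT
        return Action.NONE
```

The payload therefore stays in service for the whole absence and takes the reboot hit at the moment a controller is available again to bring it back into the cluster.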
> > >>>>>>>>>>>
> > >>>>>>>>>>> Another possibility is if we could support more than 2
> > >>>>>>>>>>> controllers, for example if we could support 4 (one
> > >>>>>>>>>>> active and 3 standbys) that would also provide a
> > >>>>>>>>>>> solution for us (our current payloads would instead
> > >>>>>>>>>>> become controllers). I know that this is not currently
> > >>>>>>>>>>> possible with opensaf.
> > >>>>>>>>>>>
> > >>>>>>>>>>> thanks for any suggestions,
> > >>>>>>>>>>> --
> > >>>>>>>>>>> tony
> > >>>>>>>>>>>
> > >>>>>>>>>>> ------------------------------------------------------------------------------
> > >>>>>>>>>>> _______________________________________________
> > >>>>>>>>>>> Opensaf-users mailing list
> > >>>>>>>>>>> [email protected]
> > >>>>>>>>>>> https://lists.sourceforge.net/lists/listinfo/opensaf-users
