Re: [users] Avoid rebooting payload modules after losing system controller

Mathivanan Naickan Palanivelu Tue, 22 Dec 2015 03:10:55 -0800

Please see a correction inline:

----- [email protected] wrote:


> Hi Tony,
> 
> Yes, it has been agreed to support more than two controllers in a
> phased manner.
> 
> The ticket http://sourceforge.net/p/opensaf/tickets/79/ will be
> available as part of the next major release, 5.0, in April 2016.
> Ticket #79 will provide support for configuring spare system
> controllers, i.e. every time a controller dies, a spare will be
> promoted to ACTIVE.
A spare node will be promoted as an SC.

Mathi.

> In your case, one of the line cards can be configured as a spare!
> Please see the attachment in the ticket for an overview of the
> feature.
> 
> 
> At the same time, it has been agreed to also merge the solution in
> https://sourceforge.net/u/anders-w/opensaf-headless/ (currently
> maintained as a fork) into the main branch in 5.0. This feature
> will be configurable so that it can be turned on/off.
> 
> 
> Subsequently, 5.1 will have support for a mix of multiple standbys +
> spares.
> 
> The details of this have been updated in the following two tickets:
> http://sourceforge.net/p/opensaf/tickets/439/
> and
> http://sourceforge.net/p/opensaf/tickets/1170/
> 
> 
> Cheers,
> Mathi.
> 
> 
> 
> 
> 
> 
> ----- [email protected] wrote:
> 
> > Hi Mathi,
> > Any updates on this topic?
> > thanks
> > —
> > tony
> > 
> > 
> > > On Oct 14, 2015, at 7:50 AM, Mathivanan Naickan Palanivelu
> > <[email protected]> wrote:
> > > 
> > > Tony,
> > > 
> > > We (in the TLC) have discussed this before. At this point in time
> > > you do indeed have the solution that AndersW has pointed to.
> > > 
> > > There are also other discussions going on w.r.t. multiple standbys
> > > etc. for the next major release. I will tag you in the appropriate
> > > ticket (one such ticket is
> > > http://sourceforge.net/p/opensaf/tickets/1170/) once those
> > > discussions conclude.
> > > 
> > > Cheers,
> > > Mathi.
> > > 
> > > ----- [email protected] wrote:
> > > 
> > >> Thanks all, this is a good discussion.
> > >> 
> > >> Let me again state my use case, which as I said I don’t think is
> > >> unique, because selfishly I want to make sure that this is covered
> > >> :-)
> > >> 
> > >> I want to keep the system running in the case where both
> > >> controllers fail, or more specifically in my case, when the only
> > >> server in the system fails (cost-reduced system).
> > >> 
> > >> The "don’t reboot until a server re-appears" option (Mathi’s
> > >> suggestion) would work for me. I think it's a poorer solution but
> > >> it would be workable, especially if it's available sooner.
> > >> 
> > >> A better option is to allow one of the linecards to take over the
> > >> controller role when the preferred one dies. Linecards are assumed
> > >> to be transitory (they may or may not be there, or could be
> > >> removed), so I don’t want to have to pick a linecard for this role
> > >> at boot time. Much better would be to allow some number (more than
> > >> 2) of nodes to be controllers (e.g. Active, Standby, Spare, Spare
> > >> …). Then OSAF takes responsibility for electing the next
> > >> Active/Standby when necessary. There would have to be a concept of
> > >> preference such that the Server node is chosen given a choice.
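
To make the Active/Standby/Spare idea above concrete, here is a minimal
sketch in Python. It is not OpenSAF code and says nothing about how the
real election will be implemented; the Node class, the "preference"
field, the DOWN role and reassign_roles() are all hypothetical names
made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    preference: int      # hypothetical: higher value is preferred
    role: str = "SPARE"  # one of ACTIVE, STANDBY, SPARE, DOWN
    alive: bool = True


def reassign_roles(nodes):
    """Fill the ACTIVE and STANDBY roles from the live nodes, in place."""
    for n in nodes:
        if not n.alive:
            n.role = "DOWN"
        elif n.role == "DOWN":
            n.role = "SPARE"  # a returning node rejoins as a spare

    def holder(role):
        return next((n for n in nodes if n.role == role), None)

    # A surviving standby takes over before any spare is considered.
    if holder("ACTIVE") is None and holder("STANDBY") is not None:
        holder("STANDBY").role = "ACTIVE"

    # Remaining gaps go to the live spares, most preferred first.
    spares = sorted((n for n in nodes if n.role == "SPARE"),
                    key=lambda n: n.preference, reverse=True)
    for role in ("ACTIVE", "STANDBY"):
        if holder(role) is None and spares:
            spares.pop(0).role = role
```

With one server (high preference) and two linecards, the server is
elected ACTIVE at startup. If it dies, the standby linecard takes over
and the remaining linecard becomes STANDBY. In this sketch a returning
server rejoins as a spare rather than preempting the current
controllers; whether preemption is wanted is a separate design choice.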
> > >> 
> > >> For the solutions discussed so far…
> > >> 
> > >> Making the OSAF processes restartable and able to be active on
> > >> either controller is good and will increase the MTBF of an OSAF
> > >> system. However, it won’t fix my problem if there are still only
> > >> two controllers allowed.
> > >> 
> > >> I’m not familiar with the “cattle” concept.
> > >> 
> > >> I wasn’t advocating the “headless” solution other than to the
> > >> extent that it solved my immediate problem. Even if we did have a
> > >> “headless” mode, it would be a temporary situation until the failed
> > >> controller could be replaced.
> > >> 
> > >> thanks
> > >> —
> > >> tony
> > >> 
> > >> 
> > >>> On Oct 14, 2015, at 6:44 AM, Mathivanan Naickan Palanivelu
> > >> <[email protected]> wrote:
> > >>> 
> > >>> In a clustered environment, all nodes need to have the same
> > >>> consistent view from a membership perspective.
> > >>> So, loss of a leader will indeed result in state changes to the
> > >>> nodes following the leader. Therefore there cannot be independent
> > >>> lifecycles for some nodes and other nodes.
> > >>> B.T.W. the restart mentioned below meant a 'transparent restart
> > >>> of OpenSAF' without affecting applications, and the applications'
> > >>> CLC-CLI scripts need to handle this.
> > >>> 
> > >>> While we could be inclined to support a (unique!) use case such as
> > >>> this, it is the solution (headless fork) that would be under the
> > >>> lens! ;-)
> > >>> Let's discuss this!
> > >>> 
> > >>> Cheers,
> > >>> Mathi.
> > >>> 
> > >>> 
> > >>> ----- [email protected] wrote:
> > >>> 
> > >>>> Yes, this is yet another approach. But it is also another use
> > >>>> case for the headless feature. When we have moved the system
> > >>>> controllers out of the cluster (into the cloud infrastructure), I
> > >>>> would expect controllers and payloads to have independent life
> > >>>> cycles. You have servers (i.e. system controllers) and clients
> > >>>> (payloads). They can be installed and upgraded separately from
> > >>>> each other, and I wouldn't expect a restart of the servers to
> > >>>> cause all the clients to restart as well, in the same way as I
> > >>>> don't expect my web browser to restart just because the web
> > >>>> server has crashed.
> > >>>> 
> > >>>> / Anders Widell
> > >>>> 
> > >>>> On 10/13/2015 03:54 PM, Mathivanan Naickan Palanivelu wrote:
> > >>>>> I don't think this is a case of cattle! Even in those scenarios,
> > >>>>> the "**controller" software of the cloud management stacks is
> > >>>>> itself 'placed' on physical nodes in appropriate redundancy
> > >>>>> models and not inside those cattle VMs!
> > >>>>> 
> > >>>>> I think the case here is about avoiding a reboot of the node!
> > >>>>> This can be achieved by setting the NOACTIVE timer to a longer
> > >>>>> value, until OpenSAF on the controller comes back up.
> > >>>>> Upon detecting that the controllers are up, some entity on the
> > >>>>> local node restarts OpenSAF (/etc/init.d/opensafd restart).
> > >>>>> And ensure that the CLC-CLI scripts of the applications
> > >>>>> differentiate a usual restart from this spoof-restart!
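
The "some entity on the local node" mentioned above could be as small
as the Python sketch below. It is only a sketch under assumptions: the
mail does not say how to detect that the controllers are back, so
controllers_up is a hypothetical callable the deployment must supply,
and wait_then_restart(), poll_seconds and max_polls are likewise
invented names. Only the restart command itself is taken from the mail.

```python
import subprocess
import time
from typing import Callable


def restart_opensaf() -> None:
    # Restart command as quoted in the mail above.
    subprocess.run(["/etc/init.d/opensafd", "restart"], check=True)


def wait_then_restart(
    controllers_up: Callable[[], bool],
    restart: Callable[[], None] = restart_opensaf,
    poll_seconds: float = 5.0,
    max_polls: int = 720,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until controllers_up() is true, then call restart() once.

    Returns True if a restart was triggered, False if max_polls ran out
    first (for example because the NOACTIVE timer is about to expire).
    """
    for _ in range(max_polls):
        if controllers_up():
            restart()
            return True
        sleep(poll_seconds)
    return False
```

The restart and sleep functions are passed in as parameters so the loop
can be exercised without touching a real node. The CLC-CLI scripts
still need their own way to tell this restart from a normal one.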
> > >>>>> 
> > >>>>> Mathi.
> > >>>>> 
> > >>>>>> -----Original Message-----
> > >>>>>> From: Anders Widell [mailto:[email protected]]
> > >>>>>> Sent: Tuesday, October 13, 2015 5:36 PM
> > >>>>>> To: Anders Björnerstedt; Tony Hart
> > >>>>>> Cc: [email protected]
> > >>>>>> Subject: Re: [users] Avoid rebooting payload modules after
> > >>>>>> losing system controller
> > >>>>>> Yes, I agree that the best fit for this feature is an
> > >>>>>> application using either the NWay-Active or the No-Redundancy
> > >>>>>> models, and where you view the system more as a collection of
> > >>>>>> nodes rather than as a cluster. This kind of architecture is
> > >>>>>> quite common when you write applications for cloud. The
> > >>>>>> redundancy models are suitable for scaling, and the
> > >>>>>> architecture fits into the "cattle" philosophy which is common
> > >>>>>> in cloud.
> > >>>>>> Such an application can tolerate any number of node failures,
> > >>>>>> and the remaining nodes would still be able to continue
> > >>>>>> functioning and provide their service. However, if we put the
> > >>>>>> OpenSAF middleware on the nodes it becomes the weakest link,
> > >>>>>> since OpenSAF will reboot all the nodes just because the two
> > >>>>>> controller nodes fail. What a pity on a system with one
> > >>>>>> hundred nodes!
> > >>>>>> 
> > >>>>>> / Anders Widell
> > >>>>>> 
> > >>>>>> On 10/13/2015 01:19 PM, Anders Björnerstedt wrote:
> > >>>>>>> 
> > >>>>>>> On 10/13/2015 12:27 PM, Anders Widell wrote:
> > >>>>>>>> The possibility to have more than two system controllers (one
> > >>>>>>>> active + several standby and/or spare controller nodes) is
> > >>>>>>>> also something that has been investigated. For scalability
> > >>>>>>>> reasons, we probably can't turn all nodes into standby
> > >>>>>>>> controllers in a large cluster, but it may be feasible to
> > >>>>>>>> have a system with one or several standby controllers where
> > >>>>>>>> the rest of the nodes are spares that are ready to take an
> > >>>>>>>> active or standby assignment when needed.
> > >>>>>>>> 
> > >>>>>>>> However, the "headless" feature will still be needed in some
> > >>>>>>>> systems where you need dedicated controller node(s).
> > >>>>>>> That sounds as if some deployments have a special requirement
> > >>>>>>> that can only be supported by the headless feature.
> > >>>>>>> But you also have to say that the headless feature places
> > >>>>>>> anti-requirements on the deployments/applications that are to
> > >>>>>>> use it.
> > >>>>>>> 
> > >>>>>>> For example, not needing cluster coherence among the payloads.
> > >>>>>>> 
> > >>>>>>> If the payloads only run independent application instances,
> > >>>>>>> where each instance is implemented on one processor or at
> > >>>>>>> least does not communicate in any state-sensitive way with
> > >>>>>>> peer processes on other payloads, and no such instance is
> > >>>>>>> unique, or if it is unique it is still expendable (non-critical
> > >>>>>>> to the service), then it could work.
> > >>>>>>> 
> > >>>>>>> It is important that the deployments that end up thinking they
> > >>>>>>> need the headless feature also understand what they lose with
> > >>>>>>> the headless feature, and that this loss is acceptable for
> > >>>>>>> that deployment.
> > >>>>>>> 
> > >>>>>>> So headless is not a fancy feature needed by some exclusive
> > >>>>>>> and picky subset of applications.
> > >>>>>>> It is a relaxation that drops all requirements on distributed
> > >>>>>>> consistency and may be acceptable to some applications with
> > >>>>>>> weaker demands, so they can accept the anti-requirements.
> > >>>>>>> 
> > >>>>>>> Besides requiring "dedicated" controller nodes, the deployment
> > >>>>>>> must of course NOT require any *availability* of those
> > >>>>>>> dedicated controller nodes, i.e. not have any requirements on
> > >>>>>>> service availability in general.
> > >>>>>>> 
> > >>>>>>> It may work for some "dumb" applications that are stateless,
> > >>>>>>> or state stable (frozen in state), or have no requirements on
> > >>>>>>> availability of state. In other words, some applications that
> > >>>>>>> really don't need SAF.
> > >>>>>>> 
> > >>>>>>> They may still want to use SAF as a way of managing and
> > >>>>>>> monitoring the system when it happens to be healthy, but can
> > >>>>>>> live with long periods of not being able to manage or monitor
> > >>>>>>> that system, which can then be degrading in any way that is
> > >>>>>>> possible.
> > >>>>>>> 
> > >>>>>>> 
> > >>>>>>> /AndersBJ
> > >>>>>>> 
> > >>>>>>> 
> > >>>>>>> 
> > >>>>>>>> / Anders Widell
> > >>>>>>>> 
> > >>>>>>>> On 10/13/2015 12:07 PM, Tony Hart wrote:
> > >>>>>>>>> Understood. The assumption is that this is temporary, but we
> > >>>>>>>>> allow the payloads to continue to run (with reduced osaf
> > >>>>>>>>> functionality) until a replacement controller is found. At
> > >>>>>>>>> that point they can reboot to get the system back into sync.
> > >>>>>>>>> 
> > >>>>>>>>> Or allow more than 2 controllers in the system so we can
> > >>>>>>>>> have one or more usually-payload cards be controllers, to
> > >>>>>>>>> reduce the probability of no-controllers to an acceptable
> > >>>>>>>>> level.
> > >>>>>>>>> 
> > >>>>>>>>> 
> > >>>>>>>>>> On Oct 12, 2015, at 11:05 AM, Anders Björnerstedt
> > >>>>>>>>>> <[email protected]> wrote:
> > >>>>>>>>>> 
> > >>>>>>>>>> The headless state is also vulnerable to split-brain
> > >>>>>>>>>> scenarios. That is, network partitions and joins can occur
> > >>>>>>>>>> and will not be detected as such, and thus will not be
> > >>>>>>>>>> handled properly (isolated) when they occur.
> > >>>>>>>>>> Basically you cannot be sure you have a continuously
> > >>>>>>>>>> coherent cluster while in the headless state.
> > >>>>>>>>>> 
> > >>>>>>>>>> On paper you may get a very resilient system in the sense
> > >>>>>>>>>> that it "stays up" and replies to ping etc.
> > >>>>>>>>>> But typically a customer wants not just availability but
> > >>>>>>>>>> reliable behavior also.
> > >>>>>>>>>> 
> > >>>>>>>>>> /AndersBj
> > >>>>>>>>>> 
> > >>>>>>>>>> 
> > >>>>>>>>>> -----Original Message-----
> > >>>>>>>>>> From: Anders Björnerstedt [mailto:[email protected]]
> > >>>>>>>>>> Sent: 12 October 2015 16:42
> > >>>>>>>>>> To: Anders Widell; Tony Hart; [email protected]
> > >>>>>>>>>> Subject: Re: [users] Avoid rebooting payload modules after
> > >>>>>>>>>> losing system controller
> > >>>>>>>>>> 
> > >>>>>>>>>> Note that this headless variant is a very questionable
> > >>>>>>>>>> feature, for the reasons explained earlier, i.e. you *will*
> > >>>>>>>>>> get a reduction in service availability.
> > >>>>>>>>>> It was never accepted into OpenSAF for that reason.
> > >>>>>>>>>> 
> > >>>>>>>>>> On top of that, the unreliability will typically not be
> > >>>>>>>>>> explicit/handled. That is, the operator will probably not
> > >>>>>>>>>> even know what is working and what is not during the SC
> > >>>>>>>>>> absence, since the alarm/notification function is gone. No
> > >>>>>>>>>> OpenSAF director services are executing.
> > >>>>>>>>>> 
> > >>>>>>>>>> It is truly a headless system, i.e. a zombie system, and
> > >>>>>>>>>> thus not working at full monitoring and availability
> > >>>>>>>>>> functionality.
> > >>>>>>>>>> It raises the question of what OpenSAF and SAF are there
> > >>>>>>>>>> for in the first place.
> > >>>>>>>>>> 
> > >>>>>>>>>> The SCs don’t have to run any special software and don’t
> > >>>>>>>>>> have to have any special hardware.
> > >>>>>>>>>> They do need file system access, at least for a cluster
> > >>>>>>>>>> restart, but not necessarily to handle single SC failure.
> > >>>>>>>>>> The headless variant, when headless, is also in that
> > >>>>>>>>>> not-able-to-cluster-restart state, but with even less
> > >>>>>>>>>> functionality.
> > >>>>>>>>>> 
> > >>>>>>>>>> An SC can of course run other (non OpenSAF specific)
> > >>>>>>>>>> software. And the two SCs don’t necessarily have to be
> > >>>>>>>>>> symmetric in terms of software.
> > >>>>>>>>>> 
> > >>>>>>>>>> Providing file system access via NFS is typically a
> > >>>>>>>>>> non-issue. They have three nodes. Ergo they should be able
> > >>>>>>>>>> to assign two of them the role of SC in the OpenSAF domain.
> > >>>>>>>>>> 
> > >>>>>>>>>> /AndersBj
> > >>>>>>>>>> 
> > >>>>>>>>>> -----Original Message-----
> > >>>>>>>>>> From: Anders Widell [mailto:[email protected]]
> > >>>>>>>>>> Sent: 12 October 2015 16:08
> > >>>>>>>>>> To: Tony Hart; [email protected]
> > >>>>>>>>>> Subject: Re: [users] Avoid rebooting payload modules after
> > >>>>>>>>>> losing system controller
> > >>>>>>>>>> 
> > >>>>>>>>>> We have actually implemented something very similar to what
> > >>>>>>>>>> you are talking about. With this feature, the payloads can
> > >>>>>>>>>> survive without a cluster restart even if both system
> > >>>>>>>>>> controllers restart (or the single system controller, in
> > >>>>>>>>>> your case). If you want to try it out, you can clone this
> > >>>>>>>>>> Mercurial repository:
> > >>>>>>>>>> 
> > >>>>>>>>>> https://sourceforge.net/u/anders-w/opensaf-headless/
> > >>>>>>>>>> 
> > >>>>>>>>>> To enable the feature, set the variable
> > >>>>>>>>>> IMMSV_SC_ABSENCE_ALLOWED in immd.conf to the number of
> > >>>>>>>>>> seconds you wish the payloads to wait for the system
> > >>>>>>>>>> controllers to come back. Note: we have only implemented
> > >>>>>>>>>> this feature for the "core" OpenSAF services (plus CKPT), so
> > >>>>>>>>>> you need to disable the optional services.
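
As an illustration of the setting described above, the Python sketch
below rewrites the text of immd.conf so that the variable is set. The
variable name comes from the mail; the "export NAME=value" line format
is an assumption about how immd.conf is written, and set_sc_absence()
is a made-up helper, not part of OpenSAF. Check the file shipped with
your build before relying on it.

```python
import re

# Variable name is taken from the mail above; the "export NAME=value"
# line format is an assumption about how immd.conf is written.
VAR = "IMMSV_SC_ABSENCE_ALLOWED"


def set_sc_absence(conf_text: str, seconds: int) -> str:
    """Return conf_text with VAR set to `seconds`.

    An existing assignment (even a commented-out one) is replaced in
    place; otherwise a new line is appended at the end.
    """
    if seconds < 0:
        raise ValueError("seconds must be non-negative")
    new_line = f"export {VAR}={seconds}"
    pattern = re.compile(
        rf"^[ \t]*#?[ \t]*export[ \t]+{VAR}=.*$", re.MULTILINE
    )
    if pattern.search(conf_text):
        return pattern.sub(new_line, conf_text, count=1)
    sep = "" if conf_text.endswith("\n") or not conf_text else "\n"
    return f"{conf_text}{sep}{new_line}\n"
```

For example, set_sc_absence(text, 900) would ask the payloads to wait
up to 15 minutes for the system controllers to return.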
> > >>>>>>>>>> 
> > >>>>>>>>>> / Anders Widell
> > >>>>>>>>>> 
> > >>>>>>>>>> On 10/11/2015 02:30 PM, Tony Hart wrote:
> > >>>>>>>>>>> We have been using opensaf in our product for a couple of
> > >>>>>>>>>>> years now. One of the issues we have is the fact that
> > >>>>>>>>>>> payload cards reboot when the system controllers are lost.
> > >>>>>>>>>>> Although our payload card hardware will continue to perform
> > >>>>>>>>>>> its functions whilst the software is down (which is
> > >>>>>>>>>>> desirable), the functions that the software performs are
> > >>>>>>>>>>> obviously not performed (which is not desirable).
> > >>>>>>>>>>> 
> > >>>>>>>>>>> Why would we lose both controllers, surely that is a rare
> > >>>>>>>>>>> circumstance? Not if you only have one controller to begin
> > >>>>>>>>>>> with. Removing the second controller is a significant cost
> > >>>>>>>>>>> saving for us, so we want to support a product that only
> > >>>>>>>>>>> has one controller. The most significant impediment to
> > >>>>>>>>>>> that is the loss of payload software functions when the
> > >>>>>>>>>>> system controller fails.
> > >>>>>>>>>>> 
> > >>>>>>>>>>> I’m looking for suggestions from this email list as to what
> > >>>>>>>>>>> could be done for this issue.
> > >>>>>>>>>>> 
> > >>>>>>>>>>> One suggestion, that would work for us, is if we could
> > >>>>>>>>>>> convince the payload card to only reboot when the
> > >>>>>>>>>>> controller reappears after a loss rather than when the loss
> > >>>>>>>>>>> initially occurs. Is that possible?
> > >>>>>>>>>>> 
> > >>>>>>>>>>> Another possibility is if we could support more than 2
> > >>>>>>>>>>> controllers, for example if we could support 4 (one active
> > >>>>>>>>>>> and 3 standbys), that would also provide a solution for us
> > >>>>>>>>>>> (our current payloads would instead become controllers). I
> > >>>>>>>>>>> know that this is not currently possible with opensaf.
> > >>>>>>>>>>> 
> > >>>>>>>>>>> thanks for any suggestions,
> > >>>>>>>>>>> —
> > >>>>>>>>>>> tony
> > >>>>>>>>>>> 
> > >>>>>>>>>>> ------------------------------------------------------------------------------
> > >>>>>>>>>>> _______________________________________________
> > >>>>>>>>>>> Opensaf-users mailing list
> > >>>>>>>>>>> [email protected]
> > >>>>>>>>>>> https://lists.sourceforge.net/lists/listinfo/opensaf-users
> > >>>>>>>>>> 
> > >>>>>>>>>> 
> > >>>>>>>> 
> > >>>>>>> 
> > >>>>>>> 
> > >>>>>> 
> > >>>>>> 
> > >>>>>> 

------------------------------------------------------------------------------
_______________________________________________
Opensaf-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/opensaf-users
