Re: [users] Avoid rebooting payload modules after losing system controller

Tony Hart Tue, 13 Oct 2015 04:00:08 -0700

> On Oct 13, 2015, at 6:23 AM, Mathivanan Naickan Palanivelu 
> <[email protected]> wrote:
> 
> I don't think this scenario is directly relevant to the multiple-standby
> concepts yet ("headless" sounds like a misnomer to me, that apart)!
> Could you please explain the following:
> - What is the redundancy model of the distributed applications running on
> those 3 nodes, i.e. (Server, linecard1, linecard2)?


Currently Server=system-controller, linecard* = payload.  This leads to a SPOF;
what I would like is for all cards to be system-controllers.  Technically two
controllers would be enough, but then I have to decide which linecard is the
other controller and handle the case where that linecard is removed.

The redundancy model would be active-standby for the “controller” processes and
no-redundancy for the linecard processes; here we would run both controller and
linecard processes on the linecards.

> - Are your applications running as SA-AWARE or non-SA-AWARE?

Most are HA-aware natively; a small number are not, and for those we have
HA-aware wrapper processes.  So I think the answer is that all processes are
HA-aware.

> - What would be the typical heartbeat interval period.

If you mean the TIPC timeout, it's currently set to 10 seconds.

> 
> You would require customized handling (like delaying or avoiding the reboot
> for a while).
> 
> Thanks,
> Mathi.
> 
>> -----Original Message-----
>> From: Tony Hart [mailto:[email protected]]
>> Sent: Tuesday, October 13, 2015 3:37 PM
>> To: Anders Björnerstedt
>> Cc: [email protected]
>> Subject: Re: [users] Avoid rebooting payload modules after losing system
>> controller
>> 
>> 
>> Understood.  The assumption is that this is temporary but we allow the
>> payloads to continue to run (with reduced osaf functionality) until a
>> replacement controller is found.  At that point they can reboot to get the
>> system back into sync.
>> 
>> Or allow more than 2 controllers in the system so we can have one or more
>> usually-payload cards be controllers to reduce the probability of no-
>> controllers to an acceptable level.
>> 
>> 
>>> On Oct 12, 2015, at 11:05 AM, Anders Björnerstedt
>>> <[email protected]> wrote:
>>> 
>>> The headless state is also vulnerable to split-brain scenarios.
>>> That is, network partitions and joins can occur and will not be detected
>>> as such, and thus will not be handled properly (isolated) when they occur.
>>> Basically, you cannot be sure you have a continuously coherent cluster
>>> while in the headless state.
>>> 
>>> On paper you may get a very resilient system in the sense that it "stays
>>> up" and replies to ping etc.
>>> But typically a customer wants not just availability but reliable behavior
>>> as well.
>>> 
>>> /AndersBj
>>> 
>>> 
>>> -----Original Message-----
>>> From: Anders Björnerstedt [mailto:[email protected]]
>>> Sent: den 12 oktober 2015 16:42
>>> To: Anders Widell; Tony Hart; [email protected]
>>> Subject: Re: [users] Avoid rebooting payload modules after losing
>>> system controller
>>> 
>>> Note that this headless variant is a very questionable feature, for the
>>> reasons explained earlier, i.e. you *will* get a reduction in service
>>> availability.
>>> It was never accepted into OpenSAF for that reason.
>>> 
>>> On top of that, the unreliability will typically not be explicit/handled.
>>> That is, the operator will probably not even know what is working and what
>>> is not during the SC absence, since the alarm/notification function is
>>> gone. No OpenSAF director services are executing.
>>> 
>>> It is truly a headless system, i.e. a zombie system, and thus not working
>>> at full monitoring and availability functionality.
>>> It begs the question of what OpenSAF and SAF are there for in the first
>>> place.
>>> 
>>> The SCs don’t have to run any special software and don’t have to have any
>>> special hardware.
>>> They do need file system access, at least for a cluster restart, but not
>>> necessarily to handle a single SC failure.
>>> The headless variant, when headless, is also in that
>>> not-able-to-cluster-restart state, but with even less functionality.
>>> 
>>> An SC can of course run other (non-OpenSAF-specific) software.  And the
>>> two SCs don’t necessarily have to be symmetric in terms of software.
>>> 
>>> Providing file system access via NFS is typically a non-issue. They have
>>> three nodes. Ergo they should be able to assign two of them the role of SC
>>> in the OpenSAF domain.
>>> 
>>> /AndersBj
>>> 
>>> -----Original Message-----
>>> From: Anders Widell [mailto:[email protected]]
>>> Sent: den 12 oktober 2015 16:08
>>> To: Tony Hart; [email protected]
>>> Subject: Re: [users] Avoid rebooting payload modules after losing
>>> system controller
>>> 
>>> We have actually implemented something very similar to what you are
>>> talking about. With this feature, the payloads can survive without a
>>> cluster restart even if both system controllers restart (or the single
>>> system controller, in your case). If you want to try it out, you can clone
>>> this Mercurial repository:
>>> 
>>> https://sourceforge.net/u/anders-w/opensaf-headless/
>>> 
>>> To enable the feature, set the variable IMMSV_SC_ABSENCE_ALLOWED in
>>> immd.conf to the number of seconds you wish the payloads to wait for the
>>> system controllers to come back. Note: we have only implemented this
>>> feature for the "core" OpenSAF services (plus CKPT), so you need to disable
>>> the optional services.
>>> 
>>> / Anders Widell
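
A minimal sketch of the setting described above, assuming immd.conf follows the
shell-style "export" convention used by OpenSAF service configuration files.
The value 900 is purely illustrative and is not taken from this thread; pick
whatever wait time suits your system:

```shell
# immd.conf (fragment, illustrative only)
# Number of seconds the payloads should wait for the system controllers
# to come back. 900 is an assumed example value, not a recommendation.
export IMMSV_SC_ABSENCE_ALLOWED=900
```

As noted in the message, this only applies to the "core" services (plus CKPT)
in the linked repository, so the optional services would still need to be
disabled separately.
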
>>> 
>>> On 10/11/2015 02:30 PM, Tony Hart wrote:
>>>> We have been using opensaf in our product for a couple of years now.
>>>> One of the issues we have is the fact that payload cards reboot when the
>>>> system controllers are lost.  Although our payload card hardware will
>>>> continue to perform its functions whilst the software is down (which is
>>>> desirable), the functions that the software performs are obviously not
>>>> performed (which is not desirable).
>>>> 
>>>> Why would we lose both controllers? Surely that is a rare circumstance?
>>>> Not if you only have one controller to begin with.  Removing the second
>>>> controller is a significant cost saving for us, so we want to support a
>>>> product that only has one controller.  The most significant impediment to
>>>> that is the loss of payload software functions when the system controller
>>>> fails.
>>>> 
>>>> I’m looking for suggestions from this email list as to what could be done
>>>> about this issue.
>>>> 
>>>> One suggestion, that would work for us, is if we could convince the
>>>> payload card to only reboot when the controller reappears after a loss,
>>>> rather than when the loss initially occurs.  Is that possible?
>>>> 
>>>> Another possibility is if we could support more than 2 controllers; for
>>>> example, if we could support 4 (one active and 3 standbys), that would also
>>>> provide a solution for us (our current payloads would instead become
>>>> controllers).  I know that this is not currently possible with opensaf.
>>>> 
>>>> thanks for any suggestions,
>>>> —
>>>> tony
>>>> ------------------------------------------------------------------------------
>>>> _______________________________________________
>>>> Opensaf-users mailing list
>>>> [email protected]
>>>> https://lists.sourceforge.net/lists/listinfo/opensaf-users

