Re: [users] Avoid rebooting payload modules after losing system controller

Tony Hart Tue, 13 Oct 2015 04:05:26 -0700

For us 4 system-controllers would be a sweet spot.

Agreed, you would also want headless mode if you had dedicated system 
controllers.  We, however, don’t need dedicated controllers.



> On Oct 13, 2015, at 6:27 AM, Anders Widell <[email protected]> wrote:
> 
> The possibility to have more than two system controllers (one active + 
> several standby and/or spare controller nodes) is also something that has 
> been investigated. For scalability reasons, we probably can't turn all nodes 
> into standby controllers in a large cluster - but it may be feasible to have 
> a system with one or several standby controllers, with the rest of the nodes 
> acting as spares that are ready to take an active or standby assignment when 
> needed.
> 
> However, the "headless" feature will still be needed in some systems where 
> you need dedicated controller node(s).
> 
> / Anders Widell
> 
> On 10/13/2015 12:07 PM, Tony Hart wrote:
>> Understood.  The assumption is that this is temporary but we allow the 
>> payloads to continue to run (with reduced osaf functionality) until a 
>> replacement controller is found.  At that point they can reboot to get the 
>> system back into sync.
>> 
>> Or allow more than 2 controllers in the system so we can have one or more 
>> usually-payload cards be controllers to reduce the probability of 
>> no-controllers to an acceptable level.
>> 
>> 
>>> On Oct 12, 2015, at 11:05 AM, Anders Björnerstedt 
>>> <[email protected]> wrote:
>>> 
>>> The headless state is also vulnerable to split-brain scenarios.
>>> That is, network partitions and joins can occur without being detected as 
>>> such, and thus will not be handled properly (isolated) when they occur.
>>> Basically, you cannot be sure you have a continuously coherent cluster 
>>> while in the headless state.
>>> 
>>> On paper you may get a very resilient system in the sense that it "stays 
>>> up" and replies to ping, etc.
>>> But typically a customer wants not just availability but also reliable 
>>> behavior.
>>> 
>>> /AndersBj
>>> 
>>> 
>>> -----Original Message-----
>>> From: Anders Björnerstedt [mailto:[email protected]]
>>> Sent: 12 October 2015 16:42
>>> To: Anders Widell; Tony Hart; [email protected]
>>> Subject: Re: [users] Avoid rebooting payload modules after losing system 
>>> controller
>>> 
>>> Note that this headless variant is a very questionable feature, for the 
>>> reasons explained earlier: you *will* get a reduction in service 
>>> availability.
>>> It was never accepted into OpenSAF for that reason.
>>> 
>>> On top of that, the unreliability will typically not be explicit/handled. 
>>> That is, the operator will probably not even know what is working and what 
>>> is not during the SC absence, since the alarm/notification function is 
>>> gone. No OpenSAF director services are executing.
>>> 
>>> It is truly a headless system, i.e. a zombie system and thus not working at 
>>> full monitoring and availability functionality.
>>> It begs the question of what OpenSAF and SAF is there for in the first 
>>> place.
>>> 
>>> The SCs don’t have to run any special software and don’t have to have any 
>>> special hardware.
>>> They do need file system access, at least for a cluster restart, but not 
>>> necessarily to handle single SC failure.
>>> The headless variant, when headless, is also in that 
>>> not-able-to-cluster-restart state, but with even less functionality.
>>> 
>>> An SC can of course run other (non OpenSAF specific) software.  And the two 
>>> SCs don’t necessarily have to be symmetric in terms of software.
>>> 
>>> Providing file system access via NFS is typically a non-issue. They have 
>>> three nodes. Ergo, they should be able to assign two of them the role of SC 
>>> in the OpenSAF domain.
>>> 
>>> /AndersBj
>>> 
>>> -----Original Message-----
>>> From: Anders Widell [mailto:[email protected]]
>>> Sent: 12 October 2015 16:08
>>> To: Tony Hart; [email protected]
>>> Subject: Re: [users] Avoid rebooting payload modules after losing system 
>>> controller
>>> 
>>> We have actually implemented something very similar to what you are talking 
>>> about. With this feature, the payloads can survive without a cluster 
>>> restart even if both system controllers restart (or the single system 
>>> controller, in your case). If you want to try it out, you can clone this 
>>> Mercurial repository:
>>> 
>>> https://sourceforge.net/u/anders-w/opensaf-headless/
>>> 
>>> To enable the feature, set the variable IMMSV_SC_ABSENCE_ALLOWED in 
>>> immd.conf to the number of seconds you wish the payloads to wait for the 
>>> system controllers to come back. Note: we have only implemented this 
>>> feature for the "core" OpenSAF services (plus CKPT), so you need to disable 
>>> the optional services.
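>>> 
>>> As a minimal sketch, assuming immd.conf uses the same shell-style
>>> "export" syntax as other OpenSAF configuration files (the value 900 is
>>> only an illustrative choice, not a recommendation), the setting could
>>> look like this:
>>> 
>>> ```shell
>>> # Hypothetical example value: number of seconds the payloads wait
>>> # for the system controllers to come back.
>>> export IMMSV_SC_ABSENCE_ALLOWED=900
>>> ```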
>>> 
>>> / Anders Widell
>>> 
>>> On 10/11/2015 02:30 PM, Tony Hart wrote:
>>>> We have been using opensaf in our product for a couple of years now.  One 
>>>> of the issues we have is the fact that payload cards reboot when the 
>>>> system controllers are lost.  Although our payload card hardware will 
>>>> continue to perform its functions whilst the software is down (which is 
>>>> desirable) the functions that the software performs are obviously not 
>>>> performed (which is not desirable).
>>>> 
>>>> Why would we lose both controllers? Surely that is a rare circumstance?  
>>>> Not if you only have one controller to begin with.  Removing the second 
>>>> controller is a significant cost saving for us so we want to support a 
>>>> product that only has one controller.  The most significant impediment to 
>>>> that is the loss of payload software functions when the system controller 
>>>> fails.
>>>> 
>>>> I’m looking for suggestions from this email list as to what could be done 
>>>> for this issue.
>>>> 
>>>> One suggestion, that would work for us, is if we could convince the 
>>>> payload card to only reboot when the controller reappears after a loss 
>>>> rather than when the loss initially occurs.  Is that possible?
>>>> 
>>>> Another possibility is if we could support more than 2 controllers, for 
>>>> example if we could support 4 (one active and 3 standbys) that would also 
>>>> provide a solution for us (our current payloads would instead become 
>>>> controllers).  I know that this is not currently possible with opensaf.
>>>> 
>>>> thanks for any suggestions,
>>>> —
>>>> tony
>>>> ------------------------------------------------------------------------------
>>>> _______________________________________________
>>>> Opensaf-users mailing list
>>>> [email protected]
>>>> https://lists.sourceforge.net/lists/listinfo/opensaf-users
>>> 
>>> 
> 
> 

