Re: [users] Avoid rebooting payload modules after losing system controller

Tony Hart Tue, 13 Oct 2015 03:08:17 -0700

Understood.  The assumption is that this is temporary but we allow the payloads 
to continue to run (with reduced osaf functionality) until a replacement 
controller is found.  At that point they can reboot to get the system back into 
sync.


Or allow more than 2 controllers in the system so we can have one or more 
usually-payload cards be controllers to reduce the probability of 
no-controllers to an acceptable level.
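
To make the first option concrete, here is a rough sketch of the payload-side
policy I have in mind: keep running when the controller is lost, and reboot
only once a controller is back. It is a toy model, not OpenSAF code. The class
name, the method names, and the rule that a payload gives up and reboots after
a maximum absence are all my own assumptions for illustration.

```python
from enum import Enum


class State(Enum):
    IN_SYNC = "in_sync"      # controller present, full functionality
    HEADLESS = "headless"    # controller lost, payload keeps running
    REBOOTING = "rebooting"  # payload restarts to resynchronise


class PayloadPolicy:
    """Hypothetical deferred-reboot policy for a payload node."""

    def __init__(self, max_absence_s):
        # Assumed limit on how long a payload tolerates having no controller.
        self.max_absence_s = max_absence_s
        self.state = State.IN_SYNC
        self.lost_at = None

    def controller_lost(self, now):
        # Do not reboot on loss; just remember when the controller went away.
        if self.state is State.IN_SYNC:
            self.state = State.HEADLESS
            self.lost_at = now

    def controller_found(self, now):
        # A controller is back: reboot now to get back into sync.
        if self.state is State.HEADLESS:
            self.state = State.REBOOTING

    def tick(self, now):
        # Assumed fallback: reboot anyway once the absence limit is reached.
        if (self.state is State.HEADLESS
                and now - self.lost_at >= self.max_absence_s):
            self.state = State.REBOOTING
```

The timestamps are passed in by the caller so the logic can be exercised
without real timers, e.g. controller_lost(now=0) followed by
controller_found(now=20).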


> On Oct 12, 2015, at 11:05 AM, Anders Björnerstedt 
> <[email protected]> wrote:
> 
> The headless state is also vulnerable to split-brain scenarios.
> That is, network partitions and joins can occur without being detected as 
> such, and thus are not handled properly (isolated) when they occur.
> Basically you cannot be sure you have a continuously coherent cluster while 
> in the headless state.
> 
> On paper you may get a very resilient system in the sense that it "stays up" 
> and replies to ping etc.
> But typically a customer wants not just availability but also reliable 
> behavior.
> 
> /AndersBj
> 
> 
> -----Original Message-----
> From: Anders Björnerstedt [mailto:[email protected]] 
> Sent: den 12 oktober 2015 16:42
> To: Anders Widell; Tony Hart; [email protected]
> Subject: Re: [users] Avoid rebooting payload modules after losing system 
> controller
> 
> Note that this headless variant is a very questionable feature, for the 
> reasons explained earlier, i.e. you *will* get a reduction in service 
> availability.
> It was never accepted into OpenSAF for that reason. 
> 
> On top of that, the unreliability will typically not be explicit/handled. That 
> is, the operator will probably not even know what is working and what is not 
> during the SC absence, since the alarm/notification function is gone. No 
> OpenSAF director services are executing.
> 
> It is truly a headless system, i.e. a zombie system, and thus not working at 
> full monitoring and availability functionality.
> It raises the question of what OpenSAF and SAF are there for in the first place.
> 
> The SCs don’t have to run any special software and don’t have to have any 
> special hardware.
> They do need file system access, at least for a cluster restart, but not 
> necessarily to handle single SC failure.
> The headless variant, when headless, is also in that 
> not-able-to-cluster-restart state, but with even less functionality.
> 
> An SC can of course run other (non OpenSAF specific) software.  And the two 
> SCs don’t necessarily have to be symmetric in terms of software. 
> 
> Providing file system access via NFS is typically a non-issue. They have 
> three nodes. Ergo they should be able to assign two of them the role of SC 
> in the OpenSAF domain.
> 
> /AndersBj
> 
> -----Original Message-----
> From: Anders Widell [mailto:[email protected]]
> Sent: den 12 oktober 2015 16:08
> To: Tony Hart; [email protected]
> Subject: Re: [users] Avoid rebooting payload modules after losing system 
> controller
> 
> We have actually implemented something very similar to what you are talking 
> about. With this feature, the payloads can survive without a cluster restart 
> even if both system controllers restart (or the single system controller, in 
> your case). If you want to try it out, you can clone this Mercurial 
> repository:
> 
> https://sourceforge.net/u/anders-w/opensaf-headless/
> 
> To enable the feature, set the variable IMMSV_SC_ABSENCE_ALLOWED in immd.conf 
> to the number of seconds you wish the payloads to wait for the system 
> controllers to come back. Note: we have only implemented this feature for the 
> "core" OpenSAF services (plus CKPT), so you need to disable the optional 
> services.
> 
> / Anders Widell
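
For illustration, a minimal sketch of what that setting might look like. It
assumes immd.conf uses the shell-style "export NAME=value" lines found in
other OpenSAF configuration files. The value 900 (15 minutes) is an arbitrary
example, not a figure recommended anywhere in this thread.

```
# immd.conf (fragment)
# Seconds the payloads should wait for the system controllers to come back.
# 900 is an example value only.
export IMMSV_SC_ABSENCE_ALLOWED=900
```

The optional services still have to be disabled separately, as noted above.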
> 
> On 10/11/2015 02:30 PM, Tony Hart wrote:
>> We have been using opensaf in our product for a couple of years now.  One of 
>> the issues we have is the fact that payload cards reboot when the system 
>> controllers are lost.  Although our payload card hardware will continue to 
>> perform its functions whilst the software is down (which is desirable) the 
>> functions that the software performs are obviously not performed (which is 
>> not desirable).
>> 
>> Why would we lose both controllers? Surely that is a rare circumstance.  
>> Not if you only have one controller to begin with.  Removing the second 
>> controller is a significant cost saving for us so we want to support a 
>> product that only has one controller.  The most significant impediment to 
>> that is the loss of payload software functions when the system controller 
>> fails.
>> 
>> I’m looking for suggestions from this email list as to what could be done 
>> for this issue.
>> 
>> One suggestion, that would work for us, is if we could convince the payload 
>> card to only reboot when the controller reappears after a loss rather than 
>> when the loss initially occurs.  Is that possible?
>> 
>> Another possibility is if we could support more than 2 controllers, for 
>> example if we could support 4 (one active and 3 standbys) that would also 
>> provide a solution for us (our current payloads would instead become 
>> controllers).  I know that this is not currently possible with opensaf.
>> 
>> thanks for any suggestions,
>> —
>> tony
>> ----------------------------------------------------------------------
>> -------- _______________________________________________
>> Opensaf-users mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/opensaf-users