Re: [users] Avoid rebooting payload modules after losing system controller

Anders Björnerstedt Tue, 13 Oct 2015 06:46:00 -0700

Yes, and there is a third approach to the SC failure issue (besides headless 
and "roaming" SC): drastically increasing the MTBF for SCs by:


a) Making director processes re-startable, i.e. the failure or termination of a 
director process should not have to pull down the whole SC.

b) While keeping the concept of active and standby director processes, one can 
drop the concept of active and standby SC. That is, the different active (or 
standby) directors do not necessarily have to run on the same SC. The only real 
requirement is that the active and standby directors for any given service are 
not located on the same SC.

When a director terminates or crashes, instead of pulling down the SC, you 
would fail-over only that service.
This would of course also open up for si-swap for individual service directors.
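
As a rough illustration of (b) and of per-service fail-over, here is a minimal 
Python sketch. It is not OpenSAF code: the DirectorPair class, the 
placement_ok and fail_over helpers, and the service and SC names are all 
hypothetical, chosen only to show that swapping roles for one service leaves 
the other services' placement untouched.

```python
from dataclasses import dataclass


@dataclass
class DirectorPair:
    """Hypothetical model: where one service's active/standby directors run."""
    service: str
    active_sc: str
    standby_sc: str


def placement_ok(pair: DirectorPair) -> bool:
    # The only hard rule from (b): the active and standby directors of a
    # given service must not be located on the same SC.
    return pair.active_sc != pair.standby_sc


def fail_over(pairs: dict[str, DirectorPair], service: str) -> DirectorPair:
    # Swap roles for this one service only; all other services keep
    # their current placement and the SC itself is not pulled down.
    pair = pairs[service]
    pair.active_sc, pair.standby_sc = pair.standby_sc, pair.active_sc
    return pair


# Illustrative placement (assumed names): the actives are spread over both SCs.
pairs = {
    "imm": DirectorPair("imm", active_sc="SC-1", standby_sc="SC-2"),
    "amf": DirectorPair("amf", active_sc="SC-2", standby_sc="SC-1"),
}

# The active director for "imm" crashes on SC-1: fail over that service only.
fail_over(pairs, "imm")
```

In this sketch an si-swap for a single service director would be the same 
role swap, triggered by an operator instead of by a crash.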

It would of course require a fair bit of work. But I expect it would be less 
work than what has already been put into the headless prototype, and the end 
result would be much cleaner, both conceptually and in the code. 
Almost no one *really* understands how the headless prototype works, or its 
side effects.
For example: are runtime attributes readable or not during SC absence? If 
they are readable, what does the value reflect? 
If not readable, why? Or, more importantly, for how long?

Restartable directors would be easier for users to understand, and there would 
be much less to understand, since the difference in functionality would be 
much smaller. Total availability would improve instead of being degraded.

/AndersBj

-----Original Message-----
From: Anders Widell 
Sent: den 13 oktober 2015 14:06
To: Anders Björnerstedt; Tony Hart
Cc: [email protected]
Subject: Re: [users] Avoid rebooting payload modules after losing system 
controller

Yes, I agree that the best fit for this feature is an application using either 
the NWay-Active or the No-Redundancy models, and where you view the system more 
as a collection of nodes rather than as a cluster. This kind of architecture is 
quite common when you write applications for cloud. The redundancy models are 
suitable for scaling, and the architecture fits into the "cattle" philosophy 
which is common in cloud. 
Such an application can tolerate any number of node failures, and the remaining 
nodes would still be able to continue functioning and provide their service. 
However, if we put the OpenSAF middleware on the nodes it becomes the weakest 
link, since OpenSAF will reboot all the nodes just because the two controller 
nodes fail. What a pity on a system with one hundred nodes!

/ Anders Widell

On 10/13/2015 01:19 PM, Anders Björnerstedt wrote:
>
>
> On 10/13/2015 12:27 PM, Anders Widell wrote:
>> The possibility to have more than two system controllers (one active
>> + several standby and/or spare controller nodes) is also something
>> that has been investigated. For scalability reasons, we probably 
>> can't turn all nodes into standby controllers in a large cluster - 
>> but it may be feasible to have a system with one or several standby 
>> controllers and the rest of the nodes are spares that are ready to 
>> take an active or standby assignment when needed.
>>
>> However, the "headless" feature will still be needed in some systems 
>> where you need dedicated controller node(s).
>
> That sounds as if some deployments have a special requirement that can 
> only be supported by the headless feature.
> But you also have to say that the headless feature places 
> anti-requirements on the deployments/applications that are to use it.
>
> For example not needing cluster coherence among the payloads.
>
> If the payloads only run independent application instances where each 
> instance is implemented at one processor or at least does not 
> communicate in any state-sensitive way with peer processes at other 
> payloads; and no such instance is unique or if it is unique it is 
> still expendable (non critical to the service), then it could work.
>
> It is important that the deployments that end up thinking they need the 
> headless feature also understand what they lose with the headless 
> feature and that this loss is acceptable for that deployment.
>
> So headless is not a fancy feature needed by some exclusive and picky 
> subset of applications.
> It is a relaxation that drops all requirements on distributed 
> consistency and may be acceptable to some applications with weaker 
> demands, so that they can accept the anti-requirements.
>
> Besides requiring "dedicated" controller nodes, the deployment must of 
> course NOT require any *availability* of those dedicated controller 
> nodes, i.e. not have any requirements on service availability in 
> general.
>
> It may work for some "dumb" applications that are stateless, or state 
> stable (frozen in state), or have no requirements on availability of 
> state. In other words, some applications that really don't need SAF.
>
> They may still want to use SAF as a way of managing and monitoring the 
> system when it happens to be healthy, but can live with long periods 
> of not being able to manage or monitor that system, which may meanwhile 
> be degrading in any way that is possible.
>
>
> /AndersBJ
>
>
>
>>
>> / Anders Widell
>>
>> On 10/13/2015 12:07 PM, Tony Hart wrote:
>>> Understood.  The assumption is that this is temporary but we allow 
>>> the payloads to continue to run (with reduced osaf functionality) 
>>> until a replacement controller is found.  At that point they can 
>>> reboot to get the system back into sync.
>>>
>>> Or allow more than 2 controllers in the system so we can have one or 
>>> more usually-payload cards be controllers to reduce the probability 
>>> of no-controllers to an acceptable level.
>>>
>>>
>>>> On Oct 12, 2015, at 11:05 AM, Anders Björnerstedt 
>>>> <[email protected]> wrote:
>>>>
>>>> The headless state is also vulnerable to split-brain scenarios.
>>>> That is, network partitions and joins can occur and will not be 
>>>> detected as such, and thus not handled properly (isolated) when they 
>>>> occur.
>>>> Basically, you cannot be sure you have a continuously coherent 
>>>> cluster while in the headless state.
>>>>
>>>> On paper you may get a very resilient system in the sense that it 
>>>> "stays up" and replies to ping etc.
>>>> But typically a customer wants not just availability but reliable 
>>>> behavior also.
>>>>
>>>> /AndersBj
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Anders Björnerstedt [mailto:[email protected]]
>>>> Sent: den 12 oktober 2015 16:42
>>>> To: Anders Widell; Tony Hart; [email protected]
>>>> Subject: Re: [users] Avoid rebooting payload modules after losing 
>>>> system controller
>>>>
>>>> Note that this headless variant is a very questionable feature, 
>>>> for the reasons explained earlier, i.e. you *will* get a 
>>>> reduction in service availability.
>>>> It was never accepted into OpenSAF for that reason.
>>>>
>>>> On top of that, the unreliability will typically not be 
>>>> explicit/handled. That is, the operator will probably not even know 
>>>> what is working and what is not during the SC absence, since the 
>>>> alarm/notification function is gone. No OpenSAF director services 
>>>> are executing.
>>>>
>>>> It is truly a headless system, i.e. a zombie system, and thus not 
>>>> working at full monitoring and availability functionality.
>>>> It begs the question of what OpenSAF and SAF are there for in the 
>>>> first place.
>>>>
>>>> The SCs don’t have to run any special software and don’t have to 
>>>> have any special hardware.
>>>> They do need file system access, at least for a cluster restart, 
>>>> but not necessarily to handle single SC failure.
>>>> The headless variant, when headless, is also in that 
>>>> not-able-to-cluster-restart state, but with even less functionality.
>>>>
>>>> An SC can of course run other (non OpenSAF specific) software.  And 
>>>> the two SCs don’t necessarily have to be symmetric in terms of 
>>>> software.
>>>>
>>>> Providing file system access via NFS is typically a non-issue. They 
>>>> have three nodes. Ergo they should be able to assign two of them 
>>>> the role of SC in the OpenSAF domain.
>>>>
>>>> /AndersBj
>>>>
>>>> -----Original Message-----
>>>> From: Anders Widell [mailto:[email protected]]
>>>> Sent: den 12 oktober 2015 16:08
>>>> To: Tony Hart; [email protected]
>>>> Subject: Re: [users] Avoid rebooting payload modules after losing 
>>>> system controller
>>>>
>>>> We have actually implemented something very similar to what you are 
>>>> talking about. With this feature, the payloads can survive without 
>>>> a cluster restart even if both system controllers restart (or the 
>>>> single system controller, in your case). If you want to try it out, 
>>>> you can clone this Mercurial repository:
>>>>
>>>> https://sourceforge.net/u/anders-w/opensaf-headless/
>>>>
>>>> To enable the feature, set the variable IMMSV_SC_ABSENCE_ALLOWED in 
>>>> immd.conf to the number of seconds you wish the payloads to wait 
>>>> for the system controllers to come back. Note: we have only 
>>>> implemented this feature for the "core" OpenSAF services (plus 
>>>> CKPT), so you need to disable the optional services.
>>>>
>>>> / Anders Widell
>>>>
>>>> On 10/11/2015 02:30 PM, Tony Hart wrote:
>>>>> We have been using opensaf in our product for a couple of years 
>>>>> now.  One of the issues we have is the fact that payload cards 
>>>>> reboot when the system controllers are lost.  Although our payload 
>>>>> card hardware will continue to perform its functions whilst the 
>>>>> software is down (which is desirable) the functions that the 
>>>>> software performs are obviously not performed (which is not 
>>>>> desirable).
>>>>>
>>>>> Why would we lose both controllers, surely that is a rare 
>>>>> circumstance?  Not if you only have one controller to begin with.
>>>>> Removing the second controller is a significant cost saving for us 
>>>>> so we want to support a product that only has one controller.  The 
>>>>> most significant impediment to that is the loss of payload 
>>>>> software functions when the system controller fails.
>>>>>
>>>>> I’m looking for suggestions from this email list as to what could 
>>>>> be done for this issue.
>>>>>
>>>>> One suggestion, that would work for us, is if we could convince 
>>>>> the payload card to only reboot when the controller reappears 
>>>>> after a loss rather than when the loss initially occurs.  Is that 
>>>>> possible?
>>>>>
>>>>> Another possibility is if we could support more than 2 
>>>>> controllers, for example if we could support 4 (one active and 3
>>>>> standbys) that would also provide a solution for us (our current 
>>>>> payloads would instead become controllers). I know that this is 
>>>>> not currently possible with opensaf.
>>>>>
>>>>> thanks for any suggestions,
>>>>> —
>>>>> tony
>>>>> ----------------------------------------------------------------------
>>>>> _______________________________________________
>>>>> Opensaf-users mailing list
>>>>> [email protected]
>>>>> https://lists.sourceforge.net/lists/listinfo/opensaf-users
>>>>
>>>>
>>
>>
>
>
>


