Re: [users] Avoid rebooting payload modules after losing system controller

Anders Widell Tue, 13 Oct 2015 07:34:03 -0700

Yes, this is yet another approach. But it is also another use-case for 
the headless feature. When we have moved the system controllers out of 
the cluster (into the cloud infrastructure), I would expect controllers 
and payloads to have independent life cycles. You have servers (i.e. 
system controllers), and clients (payloads). They can be installed and 
upgraded separately from each other, and I wouldn't expect a restart of 
the servers to cause all the clients to restart as well, in the same way 
as I don't expect my web browser to restart just because the web 
server has crashed.


/ Anders Widell

On 10/13/2015 03:54 PM, Mathivanan Naickan Palanivelu wrote:
> I don't think this is a case of cattle! Even in those scenarios, in the
> cloud management stacks, the "**controller" software is itself
> 'placed' on physical nodes
> in appropriate redundancy models and not inside those cattle VMs!
>
> I think the case here is about avoiding a reboot of the node!
> This can be achieved by setting the NOACTIVE timer to a value long enough to
> last until OpenSAF on the controller comes back up.
> Upon detecting that the controllers are up, some entity on the local node
> would restart OpenSAF (/etc/init.d/opensafd restart)
> and ensure the CLC-CLI scripts of the applications differentiate a usual
> restart from this spoof-restart!
>
> Mathi.
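
A minimal sketch of the "restart OpenSAF only once the controllers are back" idea above, in Python. The class and method names are hypothetical and not part of OpenSAF. How a payload actually detects that the controllers are up is left to the caller; the sketch only captures the decision logic.

```python
from dataclasses import dataclass


@dataclass
class DeferredRestart:
    """Decides when a payload should restart OpenSAF (hypothetical helper)."""

    lost_once: bool = False  # set when the controllers have been seen absent

    def observe(self, controller_up: bool) -> bool:
        """Feed one observation; return True when a restart should be triggered.

        A restart is requested only on the first observation where the
        controllers are up again after having been absent.
        """
        if not controller_up:
            # Controllers are gone: keep running, remember the loss.
            self.lost_once = True
            return False
        if self.lost_once:
            # Controllers reappeared after a loss: resync by restarting.
            self.lost_once = False
            return True
        return False
```

When observe() returns True, the caller would run the restart command mentioned above (/etc/init.d/opensafd restart); the CLC-CLI scripts would still need their own way to tell this restart apart from a normal one.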
>
>> -----Original Message-----
>> From: Anders Widell [mailto:[email protected]]
>> Sent: Tuesday, October 13, 2015 5:36 PM
>> To: Anders Björnerstedt; Tony Hart
>> Cc: [email protected]
>> Subject: Re: [users] Avoid rebooting payload modules after losing system
>> controller
>>
>> Yes, I agree that the best fit for this feature is an application using
>> either the NWay-Active or the No-Redundancy models, and where you view the
>> system more as a collection of nodes rather than as a cluster. This kind of
>> architecture is quite common when you write applications for cloud. The
>> redundancy models are suitable for scaling, and the architecture fits into
>> the "cattle" philosophy which is common in cloud.
>> Such an application can tolerate any number of node failures, and the
>> remaining nodes would still be able to continue functioning and provide their
>> service. However, if we put the OpenSAF middleware on the nodes it
>> becomes the weakest link, since OpenSAF will reboot all the nodes just
>> because the two controller nodes fail. What a pity on a system with one
>> hundred nodes!
>>
>> / Anders Widell
>>
>> On 10/13/2015 01:19 PM, Anders Björnerstedt wrote:
>>>
>>> On 10/13/2015 12:27 PM, Anders Widell wrote:
>>>> The possibility to have more than two system controllers (one active
>>>> + several standby and/or spare controller nodes) is also something
>>>> that has been investigated. For scalability reasons, we probably
>>>> can't turn all nodes into standby controllers in a large cluster -
>>>> but it may be feasible to have a system with one or several standby
>>>> controllers and the rest of the nodes are spares that are ready to
>>>> take an active or standby assignment when needed.
>>>>
>>>> However, the "headless" feature will still be needed in some systems
>>>> where you need dedicated controller node(s).
>>> That sounds as if some deployments have a special requirement that can
>>> only be supported by the headless feature.
>>> But you also have to say that the headless feature places
>>> anti-requirements on the deployments/applications that are to use it.
>>>
>>> For example not needing cluster coherence among the payloads.
>>>
>>> If the payloads only run independent application instances where each
>>> instance is implemented at one processor or at least does not
>>> communicate in any state-sensitive way with peer processes at other
>>> payloads; and no such instance is unique or if it is unique it is
>>> still expendable (non critical to the service), then it could work.
>>>
>>> It is important that the deployments that end up thinking they need the
>>> headless feature also understand what they lose with the headless
>>> feature and that this loss is acceptable for that deployment.
>>>
>>> So headless is not a fancy feature needed by some exclusive and picky
>>> subset of applications.
>>> It is a relaxation that drops all requirements on distributed
>>> consistency and may be acceptable to some applications with weaker
>>> demands so they can accept the anti requirements.
>>>
>>> Besides requiring "dedicated" controller nodes, the deployment must of
>>> course NOT require any *availability* of those dedicated controller
>>> nodes, i.e. not have any requirements on service availability in
>>> general.
>>>
>>> It may work for some "dumb" applications that are stateless, or state
>>> stable (frozen in state), or have no requirements on availability of
>>> state. In other words, some applications that really don't need SAF.
>>>
>>> They may still want to use SAF as a way of managing and monitoring the
>>> system when it happens to be healthy, but can live with long periods
>>> of not being able to manage or monitor that system, which can then be
>>> degrading in any way that is possible.
>>>
>>>
>>> /AndersBJ
>>>
>>>
>>>
>>>> / Anders Widell
>>>>
>>>> On 10/13/2015 12:07 PM, Tony Hart wrote:
>>>>> Understood.  The assumption is that this is temporary but we allow
>>>>> the payloads to continue to run (with reduced osaf functionality)
>>>>> until a replacement controller is found.  At that point they can
>>>>> reboot to get the system back into sync.
>>>>>
>>>>> Or allow more than 2 controllers in the system so we can have one or
>>>>> more usually-payload cards be controllers to reduce the probability
>>>>> of no-controllers to an acceptable level.
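
The rough arithmetic behind "reduce the probability of no-controllers" can be sketched in Python as below. This assumes every controller is down independently with the same probability, which is a strong assumption (shared power, network or software faults are correlated). The function name and the example numbers are hypothetical.

```python
def prob_no_controllers(p_down: float, n_controllers: int) -> float:
    """Probability that all controllers are down at once.

    Assumes each controller is down independently with probability p_down.
    """
    if not 0.0 <= p_down <= 1.0:
        raise ValueError("p_down must be a probability in [0, 1]")
    if n_controllers < 1:
        raise ValueError("need at least one controller")
    return p_down ** n_controllers


if __name__ == "__main__":
    # Compare one, two and four controllers for an illustrative p_down.
    for n in (1, 2, 4):
        print(n, prob_no_controllers(0.01, n))
```

Under these assumptions each extra controller multiplies the chance of a headless period by p_down, so going from one active and one standby to one active and three standbys shrinks it considerably.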
>>>>>
>>>>>
>>>>>> On Oct 12, 2015, at 11:05 AM, Anders Björnerstedt
>>>>>> <[email protected]> wrote:
>>>>>>
>>>>>> The headless state is also vulnerable to split-brain scenarios.
>>>>>> That is, network partitions and joins can occur and will not be
>>>>>> detected as such and thus not handled properly (isolated) when they
>>>>>> occur.
>>>>>> Basically you cannot be sure you have a continuously coherent
>>>>>> cluster while in the headless state.
>>>>>>
>>>>>> On paper you may get a very resilient system in the sense that it
>>>>>> "stays up" and replies to ping etc.
>>>>>> But typically a customer wants not just availability but reliable
>>>>>> behavior also.
>>>>>>
>>>>>> /AndersBj
>>>>>>
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Anders Björnerstedt [mailto:[email protected]]
>>>>>> Sent: den 12 oktober 2015 16:42
>>>>>> To: Anders Widell; Tony Hart; [email protected]
>>>>>> Subject: Re: [users] Avoid rebooting payload modules after losing
>>>>>> system controller
>>>>>>
>>>>>> Note that this headless variant is a very questionable feature.
>>>>>> This is for the reasons explained earlier, i.e. you *will* get a
>>>>>> reduction in service availability.
>>>>>> It was never accepted into OpenSAF for that reason.
>>>>>>
>>>>>> On top of that, the unreliability will typically not be
>>>>>> explicit/handled. That is, the operator will probably not even know
>>>>>> what is working and what is not during the SC absence since the
>>>>>> alarm/notification function is gone. No OpenSAF director services
>>>>>> are executing.
>>>>>>
>>>>>> It is truly a headless system, i.e. a zombie system and thus not
>>>>>> working at full monitoring and availability functionality.
>>>>>> It begs the question of what OpenSAF and SAF are there for in the
>>>>>> first place.
>>>>>>
>>>>>> The SCs don’t have to run any special software and don’t have to
>>>>>> have any special hardware.
>>>>>> They do need file system access, at least for a cluster restart,
>>>>>> but not necessarily to handle single SC failure.
>>>>>> The headless variant, when headless, is also in that
>>>>>> not-able-to-cluster-restart state, but with even less functionality.
>>>>>>
>>>>>> An SC can of course run other (non OpenSAF specific) software. And
>>>>>> the two SCs don’t necessarily have to be symmetric in terms of
>>>>>> software.
>>>>>>
>>>>>> Providing file system access via NFS is typically a non-issue. They
>>>>>> have three nodes. Ergo they should be able to assign two of them
>>>>>> the role of SC in the OpenSAF domain.
>>>>>>
>>>>>> /AndersBj
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Anders Widell [mailto:[email protected]]
>>>>>> Sent: den 12 oktober 2015 16:08
>>>>>> To: Tony Hart; [email protected]
>>>>>> Subject: Re: [users] Avoid rebooting payload modules after losing
>>>>>> system controller
>>>>>>
>>>>>> We have actually implemented something very similar to what you are
>>>>>> talking about. With this feature, the payloads can survive without
>>>>>> a cluster restart even if both system controllers restart (or the
>>>>>> single system controller, in your case). If you want to try it out,
>>>>>> you can clone this Mercurial repository:
>>>>>>
>>>>>> https://sourceforge.net/u/anders-w/opensaf-headless/
>>>>>>
>>>>>> To enable the feature, set the variable IMMSV_SC_ABSENCE_ALLOWED in
>>>>>> immd.conf to the number of seconds you wish the payloads to wait
>>>>>> for the system controllers to come back. Note: we have only
>>>>>> implemented this feature for the "core" OpenSAF services (plus
>>>>>> CKPT), so you need to disable the optional services.
>>>>>>
>>>>>> / Anders Widell
>>>>>>
>>>>>> On 10/11/2015 02:30 PM, Tony Hart wrote:
>>>>>>> We have been using opensaf in our product for a couple of years
>>>>>>> now.  One of the issues we have is the fact that payload cards
>>>>>>> reboot when the system controllers are lost.  Although our payload
>>>>>>> card hardware will continue to perform its functions whilst the
>>>>>>> software is down (which is desirable) the functions that the
>>>>>>> software performs are obviously not performed (which is not
>>>>>>> desirable).
>>>>>>>
>>>>>>> Why would we lose both controllers, surely that is a rare
>>>>>>> circumstance?  Not if you only have one controller to begin with.
>>>>>>> Removing the second controller is a significant cost saving for us
>>>>>>> so we want to support a product that only has one controller.  The
>>>>>>> most significant impediment to that is the loss of payload
>>>>>>> software functions when the system controller fails.
>>>>>>>
>>>>>>> I’m looking for suggestions from this email list as to what could
>>>>>>> be done for this issue.
>>>>>>>
>>>>>>> One suggestion, that would work for us, is if we could convince
>>>>>>> the payload card to only reboot when the controller reappears
>>>>>>> after a loss rather than when the loss initially occurs.  Is that
>>>>>>> possible?
>>>>>>>
>>>>>>> Another possibility is if we could support more than 2
>>>>>>> controllers, for example if we could support 4 (one active and 3
>>>>>>> standbys) that would also provide a solution for us (our current
>>>>>>> payloads would instead become controllers). I know that this is
>>>>>>> not currently possible with opensaf.
>>>>>>>
>>>>>>> thanks for any suggestions,
>>>>>>> —
>>>>>>> tony
>>>>>>> ----------------------------------------------------------------------
>>>>>>> _______________________________________________
>>>>>>> Opensaf-users mailing list
>>>>>>> [email protected]
>>>>>>> https://lists.sourceforge.net/lists/listinfo/opensaf-users
>>>>>>
>>>>
>>>
>>>
>>
>>



