Re: [Users] Controller HA mechanisms

Hans Feldt Thu, 29 Nov 2007 02:02:43 -0800

What do you mean by "configure NID config"?

Change nid so that it stays alive and supervises its children?


Thanks,
Hans
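
A minimal sketch of the "nid stays alive and supervises its children" idea, for illustration only. It assumes the supervisor starts each service as a child process and treats any child exit as a failure. Python is used for brevity and says nothing about how nid is actually written. The names `supervise` and `on_failure` and the stand-in child commands are hypothetical; a real nid would reboot the node where the callback is invoked.

```python
import subprocess
import sys
import time


def supervise(commands, on_failure, poll_interval=0.05):
    """Start every command, then block until the first child exits.

    commands maps a service name to its argv list. When a child exits,
    on_failure(name, returncode) is called once and (name, returncode)
    is returned. A real supervisor would reboot the node at that point.
    """
    children = {name: subprocess.Popen(cmd) for name, cmd in commands.items()}
    try:
        while True:
            for name, proc in children.items():
                code = proc.poll()  # None while the child is still running
                if code is not None:
                    on_failure(name, code)
                    return name, code
            time.sleep(poll_interval)
    finally:
        # Stop the surviving children so the sketch leaves nothing behind.
        for proc in children.values():
            if proc.poll() is None:
                proc.kill()
            proc.wait()


if __name__ == "__main__":
    # Stand-ins for rde and scap: one long-running child, one that crashes.
    demo = {
        "rde": [sys.executable, "-c", "import time; time.sleep(30)"],
        "scap": [sys.executable, "-c", "import sys; sys.exit(3)"],
    }
    supervise(demo, lambda name, code: print(f"{name} exited with {code}: reboot node"))
```

Polling keeps the sketch short and portable; a production supervisor would more likely block in a wait call or react to SIGCHLD.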

Gudipalli S.-G19449 wrote:
> Hi Hans,
> 
> In a system, the controller nodes that run RDE/SCAP power on by
> themselves.
> 
> Case: both blades in the box, power is applied to the box:
> -------------------------------------------------------------
> 
> Sub case1:
> ----------
> RDE on node 1 becomes active.
> SCAP on node 1 starts but fails to complete init successfully.
> 
> We expect the platform vendor porting OpenSAF to configure NID config to
> reboot the node on failure, or to have the platform's own mechanisms do
> that.
> 
> Sub case2:
> ----------
> RDE on node 1 becomes active.
> SCAP on node 1 starts and completes init successfully, then crashes
> immediately afterwards.
> 
> Since both node 1 and node 2 were in the box when the power was applied,
> we expect that, given small variations in boot times, node 2 will be at
> SCAP initialization before node 1 is successfully initialized. Since the
> other RDE/SCAP is there, this situation is also solved.
> 
> Case: one blade in the box, power is applied to the box:
> ----------------------------------------------------------------
> 
> Sub case1:
> ----------
> RDE on node 1 becomes active.
> SCAP on node 1 starts but fails to complete init successfully.
> 
> We expect the platform vendor porting OpenSAF to configure NID config to
> reboot the node on failure, or to have the platform's own mechanisms do
> that.
> 
> Sub case2:
> ----------
> RDE on node 1 becomes active.
> SCAP on node 1 starts and completes init successfully, then crashes
> immediately afterwards.
> 
> This is a double-fault case; a manual repair, restarting this single
> node, is required. If the platform is normally run like this, then the
> platform vendor can have a fault manager track SCAP and, on its death,
> take the necessary recovery/repair actions.
> 
> Regards
> Sugadeesh
> 
>> -----Original Message-----
>> From: [EMAIL PROTECTED] 
>> [mailto:[EMAIL PROTECTED] On Behalf Of Hans Feldt
>> Sent: Thursday, November 29, 2007 2:43 AM
>> To: Saha Sayandeb-G19428
>> Cc: [email protected]
>> Subject: Re: [Users] Controller HA mechanisms
>>
>>  
>>
>>> -----Original Message-----
>>> From: Saha Sayandeb-G19428 [mailto:[EMAIL PROTECTED]
>>> Sent: den 28 november 2007 18:23
>>> To: Hans Feldt
>>> Cc: [email protected]
>>> Subject: RE: [Users] Controller HA mechanisms
>>>
>>> Hans,
>>>
>>> Comments below ... 
>>>
>>>> How does OpenSAF handle the following scenario:
>>>>
>>>> - Controller 1 (C1) power on
>>>> - C1 RDE starts and decides to be active since it is alone in the 
>>>> cluster
>>>> - C1 PSR or AMF dies due to some reason
>>>> - Controller 2 (C2) power on
>>>> - C2 RDE starts and gets the role standby from RDE on C1
>>>> - C2 waits forever to get synced from C1
>>>>
>>>> Some issues:
>>>> C1 RDE claims to be active although it is not
>>>> C1 does not reboot
>>>> C2 does not reboot when it loses contact with the active controller
>>>> and is not in sync.
>>>> C2 cannot become active if we reboot C1
>>>>
>>>> Comments?
>>> [SS] I simulated this condition quite easily by killing the ncs_scap
>>> process on the one and only active controller and then running the
>>> get_ha_state command. As you say, the RDE on this controller keeps
>>> thinking that it is active, which prevents the second controller from
>>> obtaining the active state. So this is a hole, as the RDE has no clue
>>> that the AvD+AvM has crashed. I guess we could add a role heart-beat
>>> from the AvD+AvM to the RDE to ensure that the RDE is always in sync
>>> with what's going on and can relinquish the active state so that the
>>> other controller can become active under such a circumstance.
>>> But this whole scenario of having only one controller which crashes,
>>> and then a second one that tries to come up, is probably not so
>>> common. Or do you think it will be, because of the way OpenSAF waits
>>> 3 minutes before rebooting payload blades when AvD goes down?
>> No, I just stumbled on this since we're doing a lot of power on/off
>> of controllers and fail-overs at the moment.
>>
>> As a solution, what if nid stays alive and supervises its children?
>> If rde or scap dies, nid reboots the system.
>>
>> Cheers,
>> Hans
>>
>>> Sayan
>>>
>>>> Regards,
>>>> Hans
>>>> _______________________________________________
>>>> Users mailing list
>>>> [email protected]
>>>> http://list.opensaf.org/maillist/listinfo/users
>>>>
>>
> 

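The role heart-beat suggested in the quoted thread (AvD+AvM reporting to the RDE, which gives up the active role when the reports stop) could look roughly like the sketch below. This is only an illustration under stated assumptions: the class name `RoleWatchdog`, the `"relinquished"` role label, and the timeout value are hypothetical, and nothing here reflects an existing OpenSAF interface. The clock is injectable so the behaviour can be checked without waiting.

```python
import time


class RoleWatchdog:
    """Tracks heartbeats and gives up the active role when they stop."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout      # seconds allowed between heartbeats
        self.clock = clock          # injectable time source
        self.role = "active"
        self.last_beat = clock()

    def heartbeat(self):
        """Called whenever the supervised director reports it is alive."""
        self.last_beat = self.clock()

    def check(self):
        """Relinquish the active role if the heartbeat is overdue."""
        overdue = self.clock() - self.last_beat > self.timeout
        if self.role == "active" and overdue:
            self.role = "relinquished"  # hypothetical label, not an OpenSAF role
        return self.role
```

Once the role is relinquished the sketch does not take it back on a late heartbeat, so a peer controller that has claimed the active role in the meantime is not contested.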
