Hans,
I agree with Sugadeesh. This also answers the question "what happens
when C1 is in an AMF failure state and C2 comes up and also ends up in
an AMF failure state?"

Regards
-Nagendra 

-----Original Message-----
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Gudipalli S.-G19449
Sent: Thursday, November 29, 2007 11:12 AM
To: Hans Feldt; Saha Sayandeb-G19428
Cc: [email protected]
Subject: Re: [Users] Controller HA mechanisms

Hi Hans,

In a system, the controller nodes that run RDE/SCAP power on by
themselves.

Case: both blades are in the box when power is applied:
-------------------------------------------------------

Sub case 1:
-----------
RDE on node 1 becomes active.
SCAP on node 1 starts but fails to complete Init successfully.

We expect the platform vendor porting OpenSAF to configure NID to reboot
the node on failure, or to have the platform's own mechanisms do that.

Sub case 2:
-----------
RDE on node 1 becomes active.
SCAP on node 1 starts and completes Init successfully.
Immediately afterwards it crashes.

Since both nodes, node 1 and node 2, were in the box when power was
applied, we expect that, given only small variations in boot times,
node 2 will be at SCAP initialization before node 1 has initialized
successfully. Since the other RDE/SCAP is there, this situation is also
solved.

Case: one blade is in the box when power is applied:
----------------------------------------------------

Sub case 1:
-----------
RDE on node 1 becomes active.
SCAP on node 1 starts but fails to complete Init successfully.

We expect the platform vendor porting OpenSAF to configure NID to reboot
the node on failure, or to have the platform's own mechanisms do that.

Sub case 2:
-----------
RDE on node 1 becomes active.
SCAP on node 1 starts and completes Init successfully.
Immediately afterwards it crashes.

This is a double-fault case; a manual repair, restarting this single
node, is required. If the platform is normally run like this, the
platform vendor can have their fault manager track SCAP and, on its
death, take the necessary recovery/repair actions.
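
As a rough illustration of that last point (a sketch only, not OpenSAF
or NID code): a fault manager can start the tracked process, wait for it
to die, and then call a recovery hook. The names supervise and on_death
are made up for this example, and the child command is only a stand-in
for SCAP. The hook is where a platform would trigger its own restart or
repair action.

```python
import subprocess
import sys


def supervise(cmd, on_death):
    """Start cmd, block until it exits, then report its exit code.

    cmd      -- argument list for the process to track (hypothetical
                stand-in for SCAP in this sketch)
    on_death -- callable invoked with the exit code once the child is gone
    """
    child = subprocess.Popen(cmd)
    exit_code = child.wait()   # returns only when the child has died
    on_death(exit_code)        # platform-specific recovery goes here
    return exit_code


if __name__ == "__main__":
    # Stand-in for SCAP: a child that "crashes" right after starting.
    fake_scap = [sys.executable, "-c", "import sys; sys.exit(3)"]

    def recover(exit_code):
        # Assumption: a real platform would restart or repair the node here.
        print(f"tracked process died with exit code {exit_code}")

    supervise(fake_scap, recover)
```

The sketch treats any exit of the tracked process as a fault; whether
the right reaction is a node reboot or something milder is left to the
platform.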

Regards
Sugadeesh

> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of Hans Feldt
> Sent: Thursday, November 29, 2007 2:43 AM
> To: Saha Sayandeb-G19428
> Cc: [email protected]
> Subject: Re: [Users] Controller HA mechanisms
> 
>  
> 
> > -----Original Message-----
> > From: Saha Sayandeb-G19428 [mailto:[EMAIL PROTECTED]
> > Sent: den 28 november 2007 18:23
> > To: Hans Feldt
> > Cc: [email protected]
> > Subject: RE: [Users] Controller HA mechanisms
> > 
> > Hans,
> > 
> > Comments below ... 
> > 
> > > How does OpenSAF handle the following scenario:
> > > 
> > > - Controller 1 (C1) power on
> > > - C1 RDE starts and decides to be active since it is alone in the 
> > > cluster
> > > - C1 PSR or AMF dies due to some reason
> > > - Controller 2 (C2) power on
> > > - C2 RDE starts and gets the role standby from RDE on C1
> > > - C2 waits forever to get synced from C1
> > > 
> > > Some issues:
> > > C1 RDE claims to be active although it is not
> > > C1 does not reboot
> > > C2 does not reboot when it loses contact with the active controller
> > > and is not in sync.
> > > C2 cannot become active if we reboot C1
> > > 
> > > Comments?
> > 
> > [SS] I simulated this condition quite easily by simply killing the
> > ncs_scap process on the one and only active controller and then
> > running the get_ha_state command. As you say, the RDE on this
> > controller still thinks that it is active, which prevents the
> > second controller from obtaining the active state. So this is a
> > hole, as the RDE has no clue that the AvD+AvM has crashed. I guess
> > we could add a role heartbeat from the AvD+AvM to the RDE to ensure
> > that the RDE is always in sync with what's going on and can
> > relinquish the active state so that the other controller can become
> > active under such a circumstance.
> > But this whole scenario, in which the only controller crashes and
> > the second one then tries to come up, is probably not so common. Or
> > do you think it will be, because of the way OpenSAF waits 3 minutes
> > before rebooting payload blades when AvD goes down?
> 
> No, I just stumbled on this since we're doing a lot of power on/off of
> controllers and fail-overs at the moment.
> 
> As a solution, what if nid stays alive and supervises its children? If
> rde or scap dies, nid reboots the system.
> 
> Cheers,
> Hans
> 
> > 
> > Sayan
> > 
> > > Regards,
> > > Hans
> > > _______________________________________________
> > > Users mailing list
> > > [email protected]
> > > http://list.opensaf.org/maillist/listinfo/users
> > > 
> > 