What do you mean by "configure NID config"? Do you mean changing nid so that it stays alive and supervises its children?
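To make the question concrete, here is a rough sketch of what I understand by "nid stays alive and supervises its children". It is not the actual nid code. It assumes the children run in the foreground (real daemons that fork into the background would need another tracking mechanism, such as pid files), and the names supervise, commands and on_failure are made up for illustration. Any exit of a child is treated as a failure, and the callback is where a reboot would be triggered.

```python
import subprocess
import sys
import time


def supervise(commands, on_failure, poll_interval=0.1):
    """Start every command, then block until the first child exits.

    commands:   dict mapping a service name to its argv list (hypothetical).
    on_failure: called once with (name, exit_code) for the child that exited;
                a real supervisor would reboot the node here.
    Returns the name of the child that exited.
    """
    children = {name: subprocess.Popen(argv) for name, argv in commands.items()}
    try:
        while True:
            for name, proc in children.items():
                code = proc.poll()  # None while the child is still running
                if code is not None:
                    on_failure(name, code)
                    return name
            time.sleep(poll_interval)
    finally:
        # Stop the remaining children so nothing is left half-running.
        for proc in children.values():
            if proc.poll() is None:
                proc.terminate()
            proc.wait()


if __name__ == "__main__":
    # Stand-ins for rde and scap: one keeps running, one fails at once.
    demo = {
        "rde": [sys.executable, "-c", "import time; time.sleep(30)"],
        "scap": [sys.executable, "-c", "import sys; sys.exit(1)"],
    }
    supervise(demo, lambda name, code: print(f"{name} exited ({code}); would reboot"))
```

With something like this, both sub cases (SCAP failing during init, and SCAP crashing right after init) end in the same place: the supervisor notices the exit and the node is rebooted, so RDE cannot keep claiming the active role.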
Thanks,
Hans

Gudipalli S.-G19449 wrote:
> Hi hans,
>
> In a system the controller nodes that will have RDE/SCAP will power
> on by self.
>
> Case both blades in the box, the power to the box is applied:
> -------------------------------------------------------------
>
> Sub case1:
> ----------
> RDE on node 1 becomes active.
> SCAP on node 1 starts but fails to complete Init successfully.
>
> We expect the platform vendor porting openSAF to configure NID config
> to reboot the node on failure or have his platform mechanisms do that
> for him.
>
> Sub case2:
> ----------
> RDE on node 1 becomes active.
> SCAP on node 1 starts completes Init successfully.
> Immediately afterwards crashes.
>
> Since the two nodes node 1 and node 2 were in the box when the power
> was applied we expect that given small variations in the boot times
> node 2 will be at SCAP initialization before node 1 is successfully
> initialized. Since the other RDE/SCAP is there this situation is also
> solved.
>
> Case one blade in the box, the power to the box is applied:
> ----------------------------------------------------------------
>
> Sub case1:
> ----------
> RDE on node 1 becomes active.
> SCAP on node 1 starts but fails to complete Init successfully.
>
> We expect the platform vendor porting openSAF to configure NID config
> to reboot the node on failure or have his platform mechanisms do that
> for him.
>
> Sub case2:
> ----------
> RDE on node 1 becomes active.
> SCAP on node 1 starts completes Init successfully.
> Immediately afterwards crashes.
>
> This is a double fault case a manual repair of restarting this single
> node is required. If the platform is normally run like this then the
> platform vendor can have his fault manager track SCAP and on its death
> take the necessary recover/repair actions.
>
> Regards
> Sugadeesh
>
>> -----Original Message-----
>> From: [EMAIL PROTECTED]
>> [mailto:[EMAIL PROTECTED] On Behalf Of Hans Feldt
>> Sent: Thursday, November 29, 2007 2:43 AM
>> To: Saha Sayandeb-G19428
>> Cc: [email protected]
>> Subject: Re: [Users] Controller HA mechanisms
>>
>>> -----Original Message-----
>>> From: Saha Sayandeb-G19428 [mailto:[EMAIL PROTECTED]
>>> Sent: den 28 november 2007 18:23
>>> To: Hans Feldt
>>> Cc: [email protected]
>>> Subject: RE: [Users] Controller HA mechanisms
>>>
>>> Hans,
>>>
>>> Comments below ...
>>>
>>>> How does OpenSAF handle the following scenario:
>>>>
>>>> - Controller 1 (C1) power on
>>>> - C1 RDE starts and decides to be active since it is alone in the
>>>>   cluster
>>>> - C1 PSR or AMF dies due to some reason
>>>> - Controller 2 (C2) power on
>>>> - C2 RDE starts and gets the role standby from RDE on C1
>>>> - C2 waits forever to get synced from C1
>>>>
>>>> Some issues:
>>>> C1 RDE claims to be active although it is not
>>>> C1 does not reboot
>>>> C2 does not reboot when its looses contact with the active
>>>> controller and not in sync.
>>>> C2 cannot become active if we reboot C1
>>>>
>>>> Comments?
>>> [SS] I simulated this condition quite easily by simply killing the
>>> ncs_scap process in the one and only active controller and then
>>> running the get_ha_state command and as you say the RDE in this
>>> controller still keeps thinking that it is active which prevents the
>>> second controller to obtain the active state. So this is a hole as
>>> the RDE has no clue that the Avd+AvM has crashed. I guess we could
>>> add a role heart-beat from the Avd+AvM to the RDE to ensure that the
>>> RDE is always in-synch with what's going on and can relinquish the
>>> active state so that the other controller can become active under
>>> such a circumstance.
>>> But this whole scenario of having only one controller which crashes
>>> and then the second one that tries to come up is probably not so
>>> common or do you think it will be because of the way OpenSAF waits 3
>>> minutes before rebooting payload blades when AvD goes down?
>> No I just stumbled on this since we're doing a lot power
>> on/off of controllers and fail-overs at the moment.
>>
>> As a solution, what if nid stays alive and supervise its
>> children? If rde or scap dies, nid reboots the system.
>>
>> Cheers,
>> Hans
>>
>>> Sayan
>>>
>>>> Regards,
>>>> Hans
>>>> _______________________________________________
>>>> Users mailing list
>>>> [email protected]
>>>> http://list.opensaf.org/maillist/listinfo/users
