Ok, now I see what you mean. In the case of failover without OS reboot, the FM process is terminated and will not perform the waiting.
/ Anders Widell

2014-02-04 12:56, Mathivanan Naickan Palanivelu wrote:
> ----- anders.wid...@ericsson.com wrote:
>
>> I am not sure if I fully understand the use case. You say that this is
>> for failover without OS reboot, but at the same time you say that FM
>> will stay alive until it is killed by the OS reboot, and the peer FM
>> will not take over until it receives a service down of FM. Could you
>> elaborate a bit on this?
>
> Support for failover without OS reboot would mean support for
> failover by means of /etc/init.d/opensafd 'stop' without rmmod tipc
> (i.e. the OS is up, non-OpenSAF applications are running and using TIPC).
> Now, a mechanism that supports this scenario should also handle the
> case when AMFND 'crashes' and a node reboot gets triggered.
>
> In both cases, the peer FM will fail over only after downs of
> all of AMFD, AMFND, IMMD, IMMND and FM are received.
>
> - In the 'stop' case, all processes are terminated by AMFND first.
>   Subsequently, AMFND and AMFD exit last.
> - In the 'amfnd crash' case, all middleware processes terminate themselves
>   upon detecting the AMFND crash. But the local FM will not exit upon
>   detecting the AMFND crash, because there is no guarantee that application
>   processes have exited at that point in time; therefore the local FM
>   will 'wait' until the OS terminates it.
>
> It is for this 'waiting' period that the proposal to run a system_fencer
> script on systems with delayed (by design or fault) reboots has been made
> in a separate patch for TLC approval.
>
> Cheers,
> Mathi.
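The gating rule described in the quoted reply above (the peer FM fails over only after downs of AMFD, AMFND, IMMD, IMMND and FM have all been received) could be sketched roughly as below. This is only an illustration under assumptions, not the code from the patch: the enum, the struct and the function name are made up here, and the patch itself appears to handle the down events in FM's MDS code (fm_mds.c in the diffstat).

```c
#include <stdbool.h>

/* Hypothetical identifiers for the critical services named in the thread.
 * The real FM code uses its own service IDs; these are illustration only. */
enum svc {
    SVC_AMFD  = 1u << 0,
    SVC_AMFND = 1u << 1,
    SVC_IMMD  = 1u << 2,
    SVC_IMMND = 1u << 3,
    SVC_FM    = 1u << 4
};

#define SVC_ALL (SVC_AMFD | SVC_AMFND | SVC_IMMD | SVC_IMMND | SVC_FM)

/* Hypothetical per-peer bookkeeping kept by the standby FM. */
struct peer_state {
    unsigned down_mask; /* bit set for every service already seen down */
};

/* Record one down event for the peer node.
 * Returns true only once every critical service has been seen down,
 * i.e. when it would be safe to start failover processing.
 * Down events may arrive in any order and may be repeated. */
bool peer_svc_down(struct peer_state *p, enum svc s)
{
    p->down_mask |= (unsigned)s;
    return (p->down_mask & SVC_ALL) == SVC_ALL;
}
```

With a zero-initialised struct peer_state, the function keeps returning false while only some of the services have gone down, and returns true on the call that completes the set. Since the thread points out that down events are not guaranteed to be ordered or immediate, the sketch deliberately does not depend on the order of arrival.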
>
>> thanks
>> / Anders Widell
>>
>> 2014-01-29 19:00, mathi.naic...@oracle.com wrote:
>>> Summary: failover using OPENSAF_MANAGE_TIPC flag and failover after down of critical services [#721]
>>> Review request for Trac Ticket(s): #721
>>> Peer Reviewer(s): ramesh.bet...@oracle.com, anders.wid...@ericsson.com
>>> Pull request to: <<LIST THE PERSON WITH PUSH ACCESS HERE>>
>>> Affected branch(es): opensaf-4.4.x, default
>>> Development branch: <<IF ANY GIVE THE REPO URL>>
>>>
>>> --------------------------------
>>> Impacted area       Impact y/n
>>> --------------------------------
>>>  Docs                    n
>>>  Build system            n
>>>  RPM/packaging           n
>>>  Configuration files     n
>>>  Startup scripts         n
>>>  SAF services            n
>>>  OpenSAF services        y
>>>  Core libraries          n
>>>  Samples                 n
>>>  Tests                   n
>>>  Other                   n
>>>
>>>
>>> Comments (indicate scope for each "y" above):
>>> ---------------------------------------------
>>> Two cases are supported by this patch
>>>
>>> 1) Failover in 4.4.x and default will be the same as in previous releases,
>>> i.e. as usual FM triggers failover upon receiving the NODE_DOWN event.
>>> 2) The main goal of the patch: failover without OS reboot through
>>> /etc/init.d/opensafd stop is now more controlled in the scenario
>>> "when amfnd itself crashes".
>>> FM triggers failover after DOWNs of AMF, IMM and FM.
>>>
>>> More info on these cases:
>>> 1) The flag OPENSAF_MANAGE_TIPC=yes is used to control when failover is triggered.
>>> This way, the default failover behaviour will now be the same as in the previous releases.
>>> There is no change involved.
>>>
>>> 2) In the use case of failover (involving /etc/init.d/opensafd stop) without an OS reboot cycle,
>>> the flag in nid.conf is set to OPENSAF_MANAGE_TIPC=no. This use case is currently
>>> intact, however some considerations need to be made to handle the scenario when amfnd itself crashes.
>>> For this, FM shall subscribe to service downs of AMFD, AMFND, IMMD, IMMND and FM, and "FM,
>>> by way of installing amfnd_down_callback, shall exit only when the OS reboot terminates FM".
>>> The peer FM will then fail over once downs of all these services are received.
>>>
>>> changeset c1875b7073b5fa2a1b9ae8755a15f6e8a6bf1aaf
>>> Author: Mathivanan N.P.<mathi.naic...@oracle.com>
>>> Date:   Wed, 29 Jan 2014 23:08:05 +0530
>>>
>>>     fm: failover using OPENSAF_MANAGE_TIPC flag and subscribe to AMF, IMM downs [#721]
>>>
>>>     To support failover without OS reboot, FM subscribed to AMFND down
>>>     events. But this may not be sufficient in scenarios when AMFND itself
>>>     crashes or exits. In this scenario, the exit/kill of OpenSAF processes need
>>>     not be in order and immediate. This can create a scenario where some OpenSAF
>>>     process may still be running, but FM has already started failover
>>>     processing. To avoid this, FM subscribes to the down events of critical
>>>     OpenSAF services: AMFD, AMFND, IMMD, IMMND. In a false/quick failover
>>>     scenario, these are the services that can lead to problems like
>>>     implementerset not cleared or dangling AMF state assignments. Currently, all
>>>     OpenSAF services exit upon receiving the AMFND down events, but even this
>>>     does not guarantee immediate and ordered delivery of down events. The next
>>>     patch in this series of 2 patches makes FM install
>>>     ava_amfnd_down_callback such that FM waits forever till the OS kills FM.
>>>
>>> changeset 77232a084d58b68c8b470f7b90fa6a2df541183d
>>> Author: Mathivanan N.P.<mathi.naic...@oracle.com>
>>> Date:   Wed, 29 Jan 2014 23:13:27 +0530
>>>
>>>     fm: install ava_install_amf_down_cb and wait till OS reboot terminates FM [#721]
>>>
>>>     FM installs the amfnd down callback and, upon receiving the amfnd down
>>>     event, shall wait till the OS reboot (triggered by amfnd crash or exit)
>>>     terminates FM.
>>>     The STANDBY FM will now start the failover after receiving
>>>     the downs of all of AMFD, AMFND, IMMD, IMMND and FM.
>>>
>>>
>>> Complete diffstat:
>>> ------------------
>>>  osaf/services/infrastructure/fm/fms/fm.h         |    2 +
>>>  osaf/services/infrastructure/fm/fms/fm_amf.c     |   18 +++++++++++
>>>  osaf/services/infrastructure/fm/fms/fm_cb.h      |    8 +++++
>>>  osaf/services/infrastructure/fm/fms/fm_evt.h     |    2 +
>>>  osaf/services/infrastructure/fm/fms/fm_main.c    |   86 ++++++++++++++++++++++++++++++++++++++++++++++++++++--
>>>  osaf/services/infrastructure/fm/fms/fm_mds.c     |  155 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++------------------------------
>>>  osaf/services/infrastructure/fm/fms/fm_mds.h     |    1 +
>>>  osaf/services/infrastructure/nid/config/nid.conf |    2 +-
>>>  8 files changed, 222 insertions(+), 52 deletions(-)
>>>
>>>
>>> Testing Commands:
>>> -----------------
>>> 1) Start an OpenSAF cluster. Trigger failover by all known means,
>>>    i.e. /etc/init.d/opensafd stop, or kill of director components
>>>    or of amfnd.
>>>
>>> 2) Edit nid.conf and set OPENSAF_MANAGE_TIPC=no.
>>>    /usr/lib64/opensaf/configure_tipc start eth0 1234
>>>    Start OpenSAF.
>>>    Trigger failover by all known means.
>>>
>>> Testing, Expected Results:
>>> --------------------------
>>> Failover should work.
>>>
>>> Conditions of Submission:
>>> -------------------------
>>> Ack from Ramesh and/or AndersW
>>>
>>> Arch      Built     Started    Linux distro
>>> -------------------------------------------
>>> mips        n          n
>>> mips64      n          n
>>> x86         n          n
>>> x86_64      y          y
>>> powerpc     n          n
>>> powerpc64   n          n
>>>
>>>
>>> Reviewer Checklist:
>>> -------------------
>>> [Submitters: make sure that your review doesn't trigger any checkmarks!]
>>>
>>> Your checkin has not passed review because (see checked entries):
>>>
>>> ___ Your RR template is generally incomplete; it has too many blank entries
>>>     that need proper data filled in.
>>>
>>> ___ You have failed to nominate the proper persons for review and push.
>>>
>>> ___ Your patches do not have proper short+long header
>>>
>>> ___ You have grammar/spelling in your header that is unacceptable.
>>>
>>> ___ You have exceeded a sensible line length in your headers/comments/text.
>>>
>>> ___ You have failed to put in a proper Trac Ticket # into your commits.
>>>
>>> ___ You have incorrectly put/left internal data in your comments/files
>>>     (i.e. internal bug tracking tool IDs, product names etc)
>>>
>>> ___ You have not given any evidence of testing beyond basic build tests.
>>>     Demonstrate some level of runtime or other sanity testing.
>>>
>>> ___ You have ^M present in some of your files. These have to be removed.
>>>
>>> ___ You have needlessly changed whitespace or added whitespace crimes
>>>     like trailing spaces, or spaces before tabs.
>>>
>>> ___ You have mixed real technical changes with whitespace and other
>>>     cosmetic code cleanup changes. These have to be separate commits.
>>>
>>> ___ You need to refactor your submission into logical chunks; there is
>>>     too much content in a single commit.
>>>
>>> ___ You have extraneous garbage in your review (merge commits etc)
>>>
>>> ___ You have giant attachments which should never have been sent;
>>>     instead you should place your content in a public tree to be pulled.
>>>
>>> ___ You have too many commits attached to an e-mail; resend as threaded
>>>     commits, or place in a public tree for a pull.
>>>
>>> ___ You have resent this content multiple times without a clear indication
>>>     of what has changed between each re-send.
>>>
>>> ___ You have failed to adequately and individually address all of the
>>>     comments and change requests that were proposed in the initial review.
>>>
>>> ___ You have a misconfigured ~/.hgrc file (i.e. username, email etc)
>>>
>>> ___ Your computer has a badly configured date and time, confusing
>>>     the threaded patch review.
>>>
>>> ___ Your changes affect IPC mechanism, and you don't present any results
>>>     for in-service upgradability test.
>>>
>>> ___ Your changes affect user manual and documentation, but your patch series
>>>     does not contain the patch that updates the Doxygen manual.
>>>

_______________________________________________
Opensaf-devel mailing list
Opensaf-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-devel