Re: [devel] [PATCH 0 of 2] Review Request for fm: failover using OPENSAF_MANAGE_TIPC flag and failover after down of critical services [#721]

Anders Widell Tue, 04 Feb 2014 04:40:24 -0800

Ok, now I see what you mean. In the case of failover without OS reboot, 
the FM process is terminated and will not perform the waiting.
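
To make the two cases concrete, the peer FM's decision could be modelled roughly as below. This is only an illustrative Python sketch, not the FM code: the names `PeerFm`, `on_node_down` and `on_service_down` are hypothetical. Only the list of services and the OPENSAF_MANAGE_TIPC flag are taken from the review request quoted further down.

```python
from dataclasses import dataclass, field

# Services whose DOWN events the peer FM waits for (list taken from the thread).
CRITICAL_SERVICES = frozenset({"AMFD", "AMFND", "IMMD", "IMMND", "FM"})


@dataclass
class PeerFm:
    """Hypothetical model of the standby FM's failover decision."""

    manage_tipc: bool  # mirrors OPENSAF_MANAGE_TIPC=yes/no in nid.conf
    downs_seen: set = field(default_factory=set)
    failover_started: bool = False

    def on_node_down(self) -> bool:
        # OPENSAF_MANAGE_TIPC=yes: behave as in previous releases and
        # fail over as soon as NODE_DOWN is received.
        if self.manage_tipc:
            self.failover_started = True
        return self.failover_started

    def on_service_down(self, service: str) -> bool:
        # OPENSAF_MANAGE_TIPC=no: record each service DOWN and fail over
        # only once every critical service on the peer has gone down.
        if service in CRITICAL_SERVICES:
            self.downs_seen.add(service)
        if not self.manage_tipc and self.downs_seen == CRITICAL_SERVICES:
            self.failover_started = True
        return self.failover_started
```

In this model the DOWN of FM is simply the last event needed. In the 'stop' case it arrives when the FM process is terminated. In the 'amfnd crash' case it arrives only when the OS reboot kills the waiting FM.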


/ Anders Widell

2014-02-04 12:56, Mathivanan Naickan Palanivelu skrev:
> ----- anders.wid...@ericsson.com wrote:
>
>> I am not sure if I fully understand the use case. You say that this is
>> for failover without OS reboot, but at the same time you say that FM
>> will stay alive until it is killed by the OS reboot, and the peer FM
>> will not take over until it receives a service down of FM. Could you
>> elaborate a bit on this?
>>
> Support for failover without OS reboot would mean support for
> failover by means of /etc/init.d/opensafd 'stop' without rmmod tipc
> (i.e. the OS is up, and non-OpenSAF applications are running and using TIPC).
> Now, a mechanism that supports this scenario should also handle the
> case when AMFND 'crashes' and a node reboot is triggered.
>
> In both cases, the peer FM will fail over only after the downs of
> all of AMFD, AMFND, IMMD, IMMND and FM are received.
>
> - In the 'stop' case, all processes are terminated by AMFND first.
>   AMFND and AMFD then exit last.
> - In the 'amfnd crash' case, all middleware processes terminate themselves
>   upon detecting the AMFND crash. But the local FM will not exit upon
>   detecting the AMFND crash, because there is no guarantee that application
>   processes have exited at that point in time. Therefore the local FM will
>   'wait' until the OS terminates it.
>
> It is for this 'waiting' period that the proposal to run a system_fencer
> script on systems with delayed (by design or fault) reboots has been made
> in a separate patch for TLC approval.
>
>
> Cheers,
> Mathi.
>
>> thanks
>> / Anders Widell
>>
>> 2014-01-29 19:00, mathi.naic...@oracle.com skrev:
>>> Summary: failover using OPENSAF_MANAGE_TIPC flag and failover after down of critical services [#721]
>>> Review request for Trac Ticket(s): #721
>>> Peer Reviewer(s): ramesh.bet...@oracle.com, anders.wid...@ericsson.com
>>> Pull request to: <<LIST THE PERSON WITH PUSH ACCESS HERE>>
>>> Affected branch(es): opensaf-4.4.x, default
>>> Development branch: <<IF ANY GIVE THE REPO URL>>
>>>
>>> --------------------------------
>>> Impacted area       Impact y/n
>>> --------------------------------
>>>    Docs                    n
>>>    Build system            n
>>>    RPM/packaging           n
>>>    Configuration files     n
>>>    Startup scripts         n
>>>    SAF services            n
>>>    OpenSAF services        y
>>>    Core libraries          n
>>>    Samples                 n
>>>    Tests                   n
>>>    Other                   n
>>>
>>>
>>> Comments (indicate scope for each "y" above):
>>> ---------------------------------------------
>>> Two cases are supported by this patch:
>>>
>>> 1) Failover in 4.4.x and default will be the same as in previous releases,
>>> i.e. as usual FM triggers failover upon receiving the NODE_DOWN event.
>>> 2) The main goal of the patch is that failover without OS reboot through
>>> /etc/init.d/opensafd stop is more controlled now in the scenario
>>> "when amfnd itself crashes".
>>> FM triggers failover after DOWNs of AMF, IMM and FM.
>>>
>>> More info on these cases:
>>> 1) The flag OPENSAF_MANAGE_TIPC=yes is used to control when failover is
>>> triggered. This way, the default failover behaviour will now be the same
>>> as in the previous releases. There is no change involved.
>>>
>>> 2) In the use case of failover (involving /etc/init.d/opensafd stop)
>>> without an OS reboot cycle, the flag in nid.conf is set to
>>> OPENSAF_MANAGE_TIPC=no. This use case is currently intact; however, some
>>> considerations need to be made to handle the scenario when amfnd itself
>>> crashes. For this, FM shall subscribe to service downs of AMFD, AMFND,
>>> IMMD, IMMND and FM, and "FM, by way of installing amfnd_down_callback,
>>> shall exit only when the OS reboot terminates FM".
>>> The peer FM will then fail over once downs of all these services are
>>> received.
>>> changeset c1875b7073b5fa2a1b9ae8755a15f6e8a6bf1aaf
>>> Author:     Mathivanan N.P.<mathi.naic...@oracle.com>
>>> Date:       Wed, 29 Jan 2014 23:08:05 +0530
>>>
>>>     fm: failover using OPENSAF_MANAGE_TIPC flag and subscribe to AMF, IMM downs
>>>     [#721] To support failover without OS reboot, FM subscribed to AMFND down
>>>     events. But this may not be sufficient in scenarios when AMFND itself
>>>     crashes or exits. In this scenario, the exit/kill of OpenSAF processes need
>>>     not be in order and immediate. This can create a scenario where some OpenSAF
>>>     process may still be running, but FM has already started failover
>>>     processing. To avoid this, FM subscribes to the down events of critical
>>>     opensaf services - AMFD, AMFND, IMMD, IMMND. In a false/quick failover
>>>     scenario, these are the services that can lead to problems like
>>>     implementerset not cleared or dangling AMF state assignments. Currently, all
>>>     OpenSAF services exit upon receiving the AMFND down events, but even this
>>>     does not guarantee immediate and ordered delivery of down events. The next
>>>     patch in this series of 2 patches makes FM install
>>>     ava_amfnd_down_callback such that FM waits forever till the OS kills FM.
>>> changeset 77232a084d58b68c8b470f7b90fa6a2df541183d
>>> Author:     Mathivanan N.P.<mathi.naic...@oracle.com>
>>> Date:       Wed, 29 Jan 2014 23:13:27 +0530
>>>
>>>     fm: install ava_install_amf_down_cb and wait till OS reboot terminates FM
>>>     [#721] FM installs amfnd down callback and upon receiving the amfnd down
>>>     event, shall wait till the OS reboot (triggered by amfnd crash or exit)
>>>     terminates FM. The STANDBY FM will now start the failover after receiving
>>>     the downs of all of AMFD, AMFND, IMMD, IMMND and FM.
>>>
>>>
>>> Complete diffstat:
>>> ------------------
>>>    osaf/services/infrastructure/fm/fms/fm.h         |    2 +
>>>    osaf/services/infrastructure/fm/fms/fm_amf.c     |   18 +++++++++++
>>>    osaf/services/infrastructure/fm/fms/fm_cb.h      |    8 +++++
>>>    osaf/services/infrastructure/fm/fms/fm_evt.h     |    2 +
>>>    osaf/services/infrastructure/fm/fms/fm_main.c    |   86 ++++++++++++++++++++++++++++++++++++++++++++++++++++--
>>>    osaf/services/infrastructure/fm/fms/fm_mds.c     |  155 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++------------------------------
>>>    osaf/services/infrastructure/fm/fms/fm_mds.h     |    1 +
>>>    osaf/services/infrastructure/nid/config/nid.conf |    2 +-
>>>    8 files changed, 222 insertions(+), 52 deletions(-)
>>>
>>>
>>> Testing Commands:
>>> -----------------
>>> 1) Start an OpenSAF cluster. Trigger failover by all known means,
>>> i.e. /etc/init.d/opensafd stop, or kill of director components
>>> or of amfnd.
>>>
>>> 2) Edit nid.conf and set OPENSAF_MANAGE_TIPC=no.
>>> /usr/lib64/opensaf/configure_tipc start eth0 1234
>>> Start opensaf.
>>> Trigger failover by all known means.
>>>
>>> Testing, Expected Results:
>>> --------------------------
>>> Failover should work.
>>>
>>> Conditions of Submission:
>>> -------------------------
>>> Ack from Ramesh and/or AndersW
>>>
>>> Arch      Built     Started    Linux distro
>>> -------------------------------------------
>>> mips        n          n
>>> mips64      n          n
>>> x86         n          n
>>> x86_64      y          y
>>> powerpc     n          n
>>> powerpc64   n          n
>>>
>>>
>>> Reviewer Checklist:
>>> -------------------
>>> [Submitters: make sure that your review doesn't trigger any checkmarks!]
>>>
>>> Your checkin has not passed review because (see checked entries):
>>>
>>> ___ Your RR template is generally incomplete; it has too many blank entries
>>>       that need proper data filled in.
>>>
>>> ___ You have failed to nominate the proper persons for review and push.
>>>
>>> ___ Your patches do not have proper short+long header
>>>
>>> ___ You have grammar/spelling in your header that is unacceptable.
>>>
>>> ___ You have exceeded a sensible line length in your headers/comments/text.
>>>
>>> ___ You have failed to put in a proper Trac Ticket # into your commits.
>>>
>>> ___ You have incorrectly put/left internal data in your comments/files
>>>       (i.e. internal bug tracking tool IDs, product names etc)
>>>
>>> ___ You have not given any evidence of testing beyond basic build tests.
>>>       Demonstrate some level of runtime or other sanity testing.
>>>
>>> ___ You have ^M present in some of your files. These have to be removed.
>>>
>>> ___ You have needlessly changed whitespace or added whitespace crimes
>>>       like trailing spaces, or spaces before tabs.
>>>
>>> ___ You have mixed real technical changes with whitespace and other
>>>       cosmetic code cleanup changes. These have to be separate commits.
>>>
>>> ___ You need to refactor your submission into logical chunks; there is
>>>       too much content into a single commit.
>>>
>>> ___ You have extraneous garbage in your review (merge commits etc)
>>>
>>> ___ You have giant attachments which should never have been sent;
>>>       Instead you should place your content in a public tree to be pulled.
>>>
>>> ___ You have too many commits attached to an e-mail; resend as threaded
>>>       commits, or place in a public tree for a pull.
>>>
>>> ___ You have resent this content multiple times without a clear indication
>>>       of what has changed between each re-send.
>>>
>>> ___ You have failed to adequately and individually address all of the
>>>       comments and change requests that were proposed in the initial review.
>>>
>>> ___ You have a misconfigured ~/.hgrc file (i.e. username, email etc)
>>>
>>> ___ Your computer has a badly configured date and time; confusing the
>>>       threaded patch review.
>>>
>>> ___ Your changes affect IPC mechanism, and you don't present any results
>>>       for in-service upgradability test.
>>>
>>> ___ Your changes affect user manual and documentation, your patch series
>>>       do not contain the patch that updates the Doxygen manual.
>>>


_______________________________________________
Opensaf-devel mailing list
Opensaf-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-devel
