Summary: amfd: add support for delaying node failover [#2918]
Review request for Ticket(s): 2918
Peer Reviewer(s): Hans, Minh, Nagu 
Pull request to: *** LIST THE PERSON WITH PUSH ACCESS HERE ***
Affected branch(es): develop
Development branch: ticket-2918
Base revision: 3b80698770d599bc15b97119cbfd4098943d7643
Personal repository: git://git.code.sf.net/u/userid-2226215/review

--------------------------------
Impacted area       Impact y/n
--------------------------------
 Docs                    n
 Build system            n
 RPM/packaging           n
 Configuration files     n
 Startup scripts         n
 SAF services            y 
 OpenSAF services        n 
 Core libraries          n
 Samples                 n
 Tests                   n
 Other                   n


Comments (indicate scope for each "y" above):
---------------------------------------------

Please see the ticket for more details; a state diagram is available there.

revision 7e04f9bc5aea4f5580e3bdf0551b37c05bfc4025
Author: Gary Lee <gary....@dektech.com.au>
Date:   Wed, 24 Oct 2018 11:37:04 +0000

amfd: add support for delaying node failover [#2918]

OpenSAF has relied on reliable, redundant links between nodes in a cluster.
This can no longer be assumed in virtualised environments.

In order to avoid duplicate assignments, we need to delay
node failover in environments where temporary network partitioning is expected.

When delayed node failover is enabled, AMF will not perform a node
failover until the node has been fenced (if remote fencing is available),
or until the configured period (osafAmfDelayNodeFailoverTimeout) has elapsed.

If MDS connectivity is re-established while waiting, AMF will wait
osafAmfDelayNodeFailoverNodeUpWait seconds for a node up message
(with leds_set == false). Such a message indicates that the node has
already been rebooted, so AMF finishes the node failover.

Otherwise, AMF will send a message to the node
asking it to reboot itself. When AMF sees that the MDS connectivity is
lost again, or after osafAmfDelayNodeFailoverNodeUpWait seconds,
it can consider the fencing to be complete and finish the node failover.
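
The transitions described above can be sketched as a small state machine.
This is only an illustration of the commit message, not the code in
node_state.cc: the state names, event names and the NodeFailoverSketch
class are hypothetical, timers are represented by a single TimerExpired
event, and events that the description does not mention are ignored.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical state and event names, chosen for this sketch only.
enum class NodeState {
  kUp, kFailoverDelayed, kWaitingForNodeUp, kRebootSent, kFailedOver
};
enum class Event { kMdsDown, kMdsUp, kNodeUp, kFenced, kTimerExpired };

class NodeFailoverSketch {
 public:
  NodeState state() const { return state_; }
  const std::vector<std::string>& actions() const { return actions_; }

  void Handle(Event ev) {
    switch (state_) {
      case NodeState::kUp:
        if (ev == Event::kMdsDown) {
          // Do not fail over yet; wait for fencing or the timeout.
          Do("start osafAmfDelayNodeFailoverTimeout");
          state_ = NodeState::kFailoverDelayed;
        }
        break;
      case NodeState::kFailoverDelayed:
        if (ev == Event::kFenced || ev == Event::kTimerExpired) {
          Failover();
        } else if (ev == Event::kMdsUp) {
          // Connectivity is back; give the node a chance to report node up.
          Do("start osafAmfDelayNodeFailoverNodeUpWait");
          state_ = NodeState::kWaitingForNodeUp;
        }
        break;
      case NodeState::kWaitingForNodeUp:
        if (ev == Event::kNodeUp) {
          // Node up with leds_set == false: the node has already rebooted.
          Failover();
        } else if (ev == Event::kTimerExpired) {
          // No node up seen: ask the node to reboot itself.
          Do("send reboot request to node");
          Do("start osafAmfDelayNodeFailoverNodeUpWait");
          state_ = NodeState::kRebootSent;
        }
        break;
      case NodeState::kRebootSent:
        // MDS lost again, or the wait elapsed: fencing is considered complete.
        if (ev == Event::kMdsDown || ev == Event::kTimerExpired) Failover();
        break;
      case NodeState::kFailedOver:
        break;
    }
  }

 private:
  void Do(const std::string& action) { actions_.push_back(action); }
  void Failover() {
    assert(state_ != NodeState::kFailedOver);
    Do("perform node failover");
    state_ = NodeState::kFailedOver;
  }

  NodeState state_ = NodeState::kUp;
  std::vector<std::string> actions_;
};
```

Feeding the sketch MdsDown, MdsUp, TimerExpired and MdsDown in that order
walks the "reboot requested" path and ends in the failed-over state.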



revision 184835903e2c0d4544c69b2348d7095afb91219f
Author: Gary Lee <gary....@dektech.com.au>
Date:   Wed, 24 Oct 2018 11:37:04 +0000

amfd: add checkpointing of node failover state [#2918]



revision 7052963a7b555d256c2674aee0cfa2cb2497dd68
Author: Gary Lee <gary....@dektech.com.au>
Date:   Wed, 24 Oct 2018 11:37:04 +0000

amfnd: allow reboot from any director [#2918]

Allow the reboot message to be sent from any director, for
split-brain recovery situations.



revision 7aeb96aebae4dec85b59a83e0755337ff6be3c28
Author: Gary Lee <gary....@dektech.com.au>
Date:   Wed, 24 Oct 2018 11:36:56 +0000

amfd: add class definitions for new timers [#2918]

osafAmfDelayNodeFailoverTimeout - the number of seconds we wait
after MDS down is received before we consider it truly down.

osafAmfDelayNodeFailoverNodeUpWait - the number of seconds we
wait for Node Up after receiving MDS up, before we send reboot
to the node. After sending reboot to a node, also wait up to
this number of seconds before we consider the node to be
down (unless MDS down is received first).
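
As a rough illustration of how the two attributes combine, the sketch
below estimates the latest point (in seconds after MDS down) at which the
failover would complete. It assumes that remote fencing is not available
and that the timers behave exactly as described above; DelayConfig and
LatestFailoverSeconds are hypothetical names, not part of the patch.

```cpp
#include <cassert>

// Hypothetical holder for the two new attributes (values in seconds).
struct DelayConfig {
  int failover_timeout_s;  // osafAmfDelayNodeFailoverTimeout
  int node_up_wait_s;      // osafAmfDelayNodeFailoverNodeUpWait
};

// Returns the latest time, in seconds after MDS down, at which the node
// failover completes. mds_up_at_s is the time at which MDS up was
// received, or a negative value if connectivity never came back.
int LatestFailoverSeconds(const DelayConfig& cfg, int mds_up_at_s) {
  assert(cfg.failover_timeout_s >= 0 && cfg.node_up_wait_s >= 0);
  if (mds_up_at_s < 0 || mds_up_at_s >= cfg.failover_timeout_s) {
    // No MDS up while waiting: the failover timeout alone decides.
    return cfg.failover_timeout_s;
  }
  // MDS up while waiting: one wait for Node Up, then, after the reboot
  // request is sent, up to one more wait before the node is considered down.
  return mds_up_at_s + 2 * cfg.node_up_wait_s;
}
```

With a timeout of 15 and a node-up wait of 10, an isolation that is never
lifted gives 15, while MDS up at 5 seconds gives at most 25. Receiving
Node Up or MDS down earlier would shorten the second case.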



Added Files:
------------
 src/amf/amfd/node_state.cc
 src/amf/amfd/node_state.h
 src/amf/amfd/node_state_machine.cc
 src/amf/amfd/node_state_machine.h


Complete diffstat:
------------------
 src/amf/Makefile.am                |   6 +
 src/amf/amfd/cb.h                  |  24 ++-
 src/amf/amfd/chkop.cc              |  10 ++
 src/amf/amfd/ckpt.h                |   3 +-
 src/amf/amfd/ckpt_dec.cc           |  40 ++++-
 src/amf/amfd/ckpt_enc.cc           |  26 ++-
 src/amf/amfd/ckpt_msg.h            |   1 +
 src/amf/amfd/clm.cc                |  12 +-
 src/amf/amfd/cluster.cc            |  18 ++
 src/amf/amfd/cluster.h             |   1 +
 src/amf/amfd/config.cc             |  35 +++-
 src/amf/amfd/evt.h                 |   1 +
 src/amf/amfd/main.cc               |  13 +-
 src/amf/amfd/ndfsm.cc              |  70 ++++++--
 src/amf/amfd/ndproc.cc             |  14 +-
 src/amf/amfd/node.cc               |   2 +
 src/amf/amfd/node_state.cc         | 338 +++++++++++++++++++++++++++++++++++++
 src/amf/amfd/node_state.h          | 101 +++++++++++
 src/amf/amfd/node_state_machine.cc |  98 +++++++++++
 src/amf/amfd/node_state_machine.h  |  39 +++++
 src/amf/amfd/proc.h                |   2 +-
 src/amf/amfd/role.cc               |   9 +-
 src/amf/amfd/timer.cc              |   6 +-
 src/amf/amfd/timer.h               |   1 +
 src/amf/amfnd/mds.cc               |   3 +-
 src/amf/config/amf_classes.xml     |  14 +-
 src/amf/config/amf_objects.xml     |   8 +
 27 files changed, 860 insertions(+), 35 deletions(-)


Testing Commands:
-----------------

Test Case 1:

0. Set 'osafAmfDelayNodeFailoverTimeout' to 15s
1. 2N app on PL3 (active) and PL4 (standby)
2. Reboot PL3 (assuming it comes back within 15s)
3. Ensure PL4 is only assigned active after PL3 is up

Test Case 2:

1. NwayActive app on PL3, PL4 and PL5
2. Isolate PL3 from the rest of network
3. Remove isolation
4. Ensure PL3 is rebooted by AMF

Test Case 3:

1. NoRed app hosted on PL3 (active) and PL4, single SI only
2. Isolate PL3 longer than 'osafAmfDelayNodeFailoverTimeout'
3. Check PL4 is assigned active after timer expiry
4. Remove isolation
5. Check PL3 is rebooted

Test Case 4:

1. 2N app on PL3 (active) and PL4 (standby)
2. Isolate PL3
3. Reboot active SC
4. Remove isolation before 'osafAmfDelayNodeFailoverTimeout'
5. Check PL3 is rebooted
6. Check PL4 is assigned active after PL3 is rebooted
7. Check PL3 is assigned standby

Test Case 5:

1. NwayActive on PL3, PL4 and PL5 (saAmfSIPrefActiveAssignments=2)
2. Check PL3 and PL4 are active
3. Isolate PL3
4. Trigger SC switchover with si-swap safSi=SC-2N,safApp=OpenSAF
5. Verify PL5 is assigned active after timer expiry
6. Remove isolation
7. Check PL3 is rebooted


Testing, Expected Results:
--------------------------
See above

Conditions of Submission:
-------------------------
Ack from any reviewer, or in 7 days

Arch      Built     Started    Linux distro
-------------------------------------------
mips        n          n
mips64      n          n
x86         n          n
x86_64      y          y 
powerpc     n          n
powerpc64   n          n


Reviewer Checklist:
-------------------
[Submitters: make sure that your review doesn't trigger any checkmarks!]


Your checkin has not passed review because (see checked entries):

___ Your RR template is generally incomplete; it has too many blank entries
    that need proper data filled in.

___ You have failed to nominate the proper persons for review and push.

___ Your patches do not have proper short+long header

___ You have grammar/spelling in your header that is unacceptable.

___ You have exceeded a sensible line length in your headers/comments/text.

___ You have failed to put in a proper Trac Ticket # into your commits.

___ You have incorrectly put/left internal data in your comments/files
    (i.e. internal bug tracking tool IDs, product names etc)

___ You have not given any evidence of testing beyond basic build tests.
    Demonstrate some level of runtime or other sanity testing.

___ You have ^M present in some of your files. These have to be removed.

___ You have needlessly changed whitespace or added whitespace crimes
    like trailing spaces, or spaces before tabs.

___ You have mixed real technical changes with whitespace and other
    cosmetic code cleanup changes. These have to be separate commits.

___ You need to refactor your submission into logical chunks; there is
    too much content in a single commit.

___ You have extraneous garbage in your review (merge commits etc)

___ You have giant attachments which should never have been sent;
    Instead you should place your content in a public tree to be pulled.

___ You have too many commits attached to an e-mail; resend as threaded
    commits, or place in a public tree for a pull.

___ You have resent this content multiple times without a clear indication
    of what has changed between each re-send.

___ You have failed to adequately and individually address all of the
    comments and change requests that were proposed in the initial review.

___ You have a misconfigured ~/.gitconfig file (i.e. user.name, user.email etc)

___ Your computer has a badly configured date and time, confusing the
    threaded patch review.

___ Your changes affect IPC mechanism, and you don't present any results
    for in-service upgradability test.

___ Your changes affect the user manual and documentation, but your patch
    series does not contain the patch that updates the Doxygen manual.



_______________________________________________
Opensaf-devel mailing list
Opensaf-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-devel
