Summary: amfd: add support for delaying node failover [#2918]
Review request for Ticket(s): 2918
Peer Reviewer(s): Hans, Minh, Nagu
Pull request to: *** LIST THE PERSON WITH PUSH ACCESS HERE ***
Affected branch(es): develop
Development branch: ticket-2918
Base revision: 3b80698770d599bc15b97119cbfd4098943d7643
Personal repository: git://git.code.sf.net/u/userid-2226215/review

--------------------------------
Impacted area       Impact y/n
--------------------------------
 Docs                    n
 Build system            n
 RPM/packaging           n
 Configuration files     n
 Startup scripts         n
 SAF services            y
 OpenSAF services        n
 Core libraries          n
 Samples                 n
 Tests                   n
 Other                   n

Comments (indicate scope for each "y" above):
---------------------------------------------
Please see the ticket for more details; a state diagram is available
there.

revision 7e04f9bc5aea4f5580e3bdf0551b37c05bfc4025
Author: Gary Lee <gary....@dektech.com.au>
Date:   Wed, 24 Oct 2018 11:37:04 +0000

amfd: add support for delaying node failover [#2918]

OpenSAF has relied on reliable, redundant links between nodes in a
cluster. This can no longer be assumed in virtualised environments.
In order to avoid duplicate assignments, we need to delay node
failover in environments where temporary network partitioning is
expected.

When delayed node failover is enabled, AMF will not perform a node
failover until the node has been fenced (if remote fencing is
available), or until the specified period has elapsed
(osafAmfDelayNodeFailoverTimeout).

If MDS connectivity is re-established while waiting, AMF will wait
osafAmfDelayNodeFailoverNodeUpWait seconds for a node up message
(with leds_set == false), which indicates that the node has already
been rebooted, and then finish the node failover. Otherwise, AMF will
send a message to the node asking it to reboot itself. When AMF sees
that MDS connectivity is lost again, or after
osafAmfDelayNodeFailoverNodeUpWait seconds, it can consider the
fencing to be complete and finish the node failover.

revision 184835903e2c0d4544c69b2348d7095afb91219f
Author: Gary Lee <gary....@dektech.com.au>
Date:   Wed, 24 Oct 2018 11:37:04 +0000

amfd: add checkpointing of node failover state [#2918]

revision 7052963a7b555d256c2674aee0cfa2cb2497dd68
Author: Gary Lee <gary....@dektech.com.au>
Date:   Wed, 24 Oct 2018 11:37:04 +0000

amfnd: allow reboot from any director [#2918]

Allow the reboot message to be sent from any director, for split-brain
recovery situations.

revision 7aeb96aebae4dec85b59a83e0755337ff6be3c28
Author: Gary Lee <gary....@dektech.com.au>
Date:   Wed, 24 Oct 2018 11:36:56 +0000

amfd: add class definitions for new timers [#2918]

osafAmfDelayNodeFailoverTimeout - the number of seconds we wait after
MDS down is received before we consider the node truly down.

osafAmfDelayNodeFailoverNodeUpWait - the number of seconds we wait for
Node Up after receiving MDS up, before we send reboot to the node.
After sending reboot to a node, we also wait up to this number of
seconds before we consider the node to be down (unless MDS down is
received first).
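
The behaviour described in the commit messages above can be read as a
small per-node state machine. The following is only a rough sketch of
that reading, not code from the patch: the state, event and action
names are hypothetical, the two timers are reduced to a single
"timer expired" event for whichever timer is currently running, and
anything that happens after the failover has completed is left out.
The state diagram on the ticket remains the authoritative description.

```cpp
#include <cassert>

// Hypothetical names for illustration only; they are not the
// identifiers used in node_state.h or node_state_machine.h.
enum class State { kUp, kLost, kLostFound, kLostRebooting, kFailedOver };
enum class Event { kMdsDown, kMdsUp, kNodeUp, kFenced, kTimerExpired };
enum class Action {
  kNone,
  kStartFailoverTimer,  // osafAmfDelayNodeFailoverTimeout
  kStartNodeUpTimer,    // osafAmfDelayNodeFailoverNodeUpWait
  kSendReboot,          // assumed to restart the NodeUpWait timer too
  kFailover
};

struct Step {
  State next;
  Action action;
};

// Returns the next state and the action to take for one event.
// Combinations the description does not mention are ignored here.
Step Transition(State state, Event event) {
  switch (state) {
    case State::kUp:
      // MDS down: do not fail over yet, start the delay timer.
      if (event == Event::kMdsDown)
        return {State::kLost, Action::kStartFailoverTimer};
      break;
    case State::kLost:
      // Node fenced or delay elapsed: failover is now safe.
      if (event == Event::kFenced || event == Event::kTimerExpired)
        return {State::kFailedOver, Action::kFailover};
      // Connectivity is back: wait for a node up message.
      if (event == Event::kMdsUp)
        return {State::kLostFound, Action::kStartNodeUpTimer};
      break;
    case State::kLostFound:
      // Node up received: the node has already been rebooted.
      if (event == Event::kNodeUp)
        return {State::kFailedOver, Action::kFailover};
      // No node up in time: ask the node to reboot itself.
      if (event == Event::kTimerExpired)
        return {State::kLostRebooting, Action::kSendReboot};
      break;
    case State::kLostRebooting:
      // MDS lost again or wait elapsed: treat fencing as complete.
      if (event == Event::kMdsDown || event == Event::kTimerExpired)
        return {State::kFailedOver, Action::kFailover};
      break;
    case State::kFailedOver:
      break;
  }
  return {state, Action::kNone};
}

// Walks through a partition that heals before the failover timeout.
void WalkThroughHealedPartition() {
  Step s = Transition(State::kUp, Event::kMdsDown);
  assert(s.next == State::kLost);
  s = Transition(s.next, Event::kMdsUp);
  assert(s.action == Action::kStartNodeUpTimer);
  s = Transition(s.next, Event::kTimerExpired);
  assert(s.action == Action::kSendReboot);
  s = Transition(s.next, Event::kMdsDown);
  assert(s.next == State::kFailedOver && s.action == Action::kFailover);
}
```

Keeping the transition function free of side effects, with the timer
and messaging work returned as an action, is a choice made for this
sketch so that each path can be checked in isolation. It is not a
claim about how the patch is structured.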

Added Files:
------------
 src/amf/amfd/node_state.cc
 src/amf/amfd/node_state.h
 src/amf/amfd/node_state_machine.cc
 src/amf/amfd/node_state_machine.h

Complete diffstat:
------------------
 src/amf/Makefile.am                |   6 +
 src/amf/amfd/cb.h                  |  24 ++-
 src/amf/amfd/chkop.cc              |  10 ++
 src/amf/amfd/ckpt.h                |   3 +-
 src/amf/amfd/ckpt_dec.cc           |  40 ++++-
 src/amf/amfd/ckpt_enc.cc           |  26 ++-
 src/amf/amfd/ckpt_msg.h            |   1 +
 src/amf/amfd/clm.cc                |  12 +-
 src/amf/amfd/cluster.cc            |  18 ++
 src/amf/amfd/cluster.h             |   1 +
 src/amf/amfd/config.cc             |  35 +++-
 src/amf/amfd/evt.h                 |   1 +
 src/amf/amfd/main.cc               |  13 +-
 src/amf/amfd/ndfsm.cc              |  70 ++++++--
 src/amf/amfd/ndproc.cc             |  14 +-
 src/amf/amfd/node.cc               |   2 +
 src/amf/amfd/node_state.cc         | 338 +++++++++++++++++++++++++++++++++++++
 src/amf/amfd/node_state.h          | 101 +++++++++++
 src/amf/amfd/node_state_machine.cc |  98 +++++++++++
 src/amf/amfd/node_state_machine.h  |  39 +++++
 src/amf/amfd/proc.h                |   2 +-
 src/amf/amfd/role.cc               |   9 +-
 src/amf/amfd/timer.cc              |   6 +-
 src/amf/amfd/timer.h               |   1 +
 src/amf/amfnd/mds.cc               |   3 +-
 src/amf/config/amf_classes.xml     |  14 +-
 src/amf/config/amf_objects.xml     |   8 +
 27 files changed, 860 insertions(+), 35 deletions(-)

Testing Commands:
-----------------
Test Case 1:
0. Set 'osafAmfDelayNodeFailoverTimeout' to 15s
1. 2N app on PL3 (active) and PL4 (standby)
2. Reboot PL3 (assuming it comes back within 15s)
3. Ensure PL4 is only assigned active after PL3 is up

Test Case 2:
1. NwayActive app on PL3, PL4 and PL5
2. Isolate PL3 from the rest of network
3. Remove isolation
4. Ensure PL3 is rebooted by AMF

Test Case 3:
1. NoRed app hosted on PL3 (active) and PL4, single SI only
2. Isolate PL3 longer than 'osafAmfDelayNodeFailoverTimeout'
3. Check PL4 is assigned active after timer expiry
4. Remove isolation
5. Check PL3 is rebooted

Test Case 4:
1. 2N app on PL3 (active) and PL4 (standby)
2. Isolate PL3
3. Reboot active SC
4. Remove isolation before 'osafAmfDelayNodeFailoverTimeout'
5. Check PL3 is rebooted
6. Check PL4 is assigned active after PL3 is rebooted
7. Check PL3 is assigned standby

Test Case 5:
1. NwayActive on PL3, PL4 and PL5 (saAmfSIPrefActiveAssignments=2)
2. Check PL3 and PL4 are active
3. Isolate PL3
4. Trigger SC switchover with si-swap safSi=SC-2N,safApp=OpenSAF
5. Verify PL5 is assigned active after timer expiry
6. Remove isolation
7. Check PL3 is rebooted

Testing, Expected Results:
--------------------------
See above

Conditions of Submission:
-------------------------
Ack from any reviewer, or in 7 days

Arch      Built     Started    Linux distro
-------------------------------------------
mips        n          n
mips64      n          n
x86         n          n
x86_64      y          y
powerpc     n          n
powerpc64   n          n

Reviewer Checklist:
-------------------
[Submitters: make sure that your review doesn't trigger any checkmarks!]

Your checkin has not passed review because (see checked entries):

___ Your RR template is generally incomplete; it has too many blank
    entries that need proper data filled in.
___ You have failed to nominate the proper persons for review and push.
___ Your patches do not have proper short+long header
___ You have grammar/spelling in your header that is unacceptable.
___ You have exceeded a sensible line length in your
    headers/comments/text.
___ You have failed to put in a proper Trac Ticket # into your commits.
___ You have incorrectly put/left internal data in your comments/files
    (i.e. internal bug tracking tool IDs, product names etc)
___ You have not given any evidence of testing beyond basic build
    tests. Demonstrate some level of runtime or other sanity testing.
___ You have ^M present in some of your files. These have to be
    removed.
___ You have needlessly changed whitespace or added whitespace crimes
    like trailing spaces, or spaces before tabs.
___ You have mixed real technical changes with whitespace and other
    cosmetic code cleanup changes. These have to be separate commits.
___ You need to refactor your submission into logical chunks; there is
    too much content into a single commit.
___ You have extraneous garbage in your review (merge commits etc)
___ You have giant attachments which should never have been sent;
    Instead you should place your content in a public tree to be
    pulled.
___ You have too many commits attached to an e-mail; resend as threaded
    commits, or place in a public tree for a pull.
___ You have resent this content multiple times without a clear
    indication of what has changed between each re-send.
___ You have failed to adequately and individually address all of the
    comments and change requests that were proposed in the initial
    review.
___ You have a misconfigured ~/.gitconfig file (i.e. user.name,
    user.email etc)
___ Your computer has a badly configured date and time, confusing the
    threaded patch review.
___ Your changes affect IPC mechanism, and you don't present any
    results for in-service upgradability test.
___ Your changes affect user manual and documentation, your patch
    series do not contain the patch that updates the Doxygen manual.