From: Anand Sundararaj <s.an...@gethighavailability.com>

Summary: amf: implement node repair admin command [#3204] V2
Review request for Ticket(s): 3204
Peer Reviewer(s): Minh, Thang, Nagendra, Paul
Pull request to: Amf Maintainers
Affected branch(es): develop
Development branch: ticket-3204
Base revision: 59ded7cdf6a431e522229afd5ecb989e4a61c7d8
Personal repository: git://git.code.sf.net/u/s-anand-has/review

--------------------------------
Impacted area       Impact y/n
--------------------------------
 Docs                    n
 Build system            n
 RPM/packaging           n
 Configuration files     n
 Startup scripts         n
 SAF services            y
 OpenSAF services        n
 Core libraries          n
 Samples                 n
 Tests                   n
 Other                   n

NOTE: Patch(es) contain lines longer than 80 characters

Comments (indicate scope for each "y" above):
---------------------------------------------
*** EXPLAIN/COMMENT THE PATCH SERIES HERE ***

revision f0b8d7eff952f620adbceb68fdd21a53b416e572
Author: Anand Sundararaj <s.an...@gethighavailability.com>
Date:   Fri, 31 Jul 2020 02:15:04 +0530

amf: implement node repair admin command [#3204] V2


Complete diffstat:
------------------
 src/amf/amfd/chkop.cc     |  9 +++++++
 src/amf/amfd/ckpt.h       |  3 ++-
 src/amf/amfd/ckpt_dec.cc  | 49 ++++++++++++++++++++++++++++++++++++-
 src/amf/amfd/ckpt_enc.cc  | 38 ++++++++++++++++++++++++++++-
 src/amf/amfd/ckpt_msg.h   |  1 +
 src/amf/amfd/ckpt_updt.cc |  1 +
 src/amf/amfd/clm.cc       |  3 ++-
 src/amf/amfd/node.cc      | 56 +++++++++++++++++++++++++++++++++++++++++-
 src/amf/amfd/node.h       |  2 ++
 src/amf/amfd/sgproc.cc    | 25 +++++++++++++++++++
 src/amf/amfd/util.cc      |  1 +
 src/amf/amfnd/avnd_su.h   |  1 +
 src/amf/amfnd/di.cc       | 62 +++++++++++++++++++++++++++++++++++++++++++++++
 src/amf/amfnd/err.cc      |  2 +-
 src/amf/amfnd/su.cc       |  2 +-
 15 files changed, 248 insertions(+), 7 deletions(-)


Testing Commands:
-----------------
Configure two demo applications (App1 and App2, available in
samples/amf/sa_aware) on SC-1 and PL-3.
Configure saAmfNodeAutoRepair as false on PL-3.
Configure saAmfCtDefRecoveryOnError as 5 (node failover) for the App2 demo
application.
Unlock all 4 SUs: two running on SC-1 (Standby) and two running on PL-3
(Active).

1. Kill demo app App2 on PL-3. Node failover happens. The SUs running on SC-1
   become Active.
   osafamfd[10367]: NO NodeAutorepair disabled for 'safAmfNode=PL-3,safAmfCluster=myAmfCluster', no reboot ordered

   Repair the node PL-3 using:
   amf-adm repaired safAmfNode=PL-3,safAmfCluster=myAmfCluster
   The PL-3 node state is enabled and the 2 SUs running on PL-3 get Standby
   assignments.

2. Repeat the test case just to see if all is well.

3. Repeat #1 up to the point before repair. Then delete both SUs running on
   PL-3. Then repair PL-3. Now add both SUs back and unlock them. They are
   given Standby assignments.

4. Repeat test case #1 up to the point before repair. When the repair command
   is issued, hold it at amfnd on PL-3 using gdb and reboot the machine.
   The repair command will return SA_AIS_ERR_REPAIR_PENDING:

   amf-adm repaired safAmfNode=PL-3,safAmfCluster=myAmfCluster
   error - saImmOmAdminOperationInvoke_2 admin-op RETURNED: SA_AIS_ERR_REPAIR_PENDING (29)
   error-string: node failure

   When the node starts again, its 2 SUs get Standby assignments.

5. Repeat the test case up to the point before repair. Issue node
   lock/lock-in and then issue repair, followed by unlock-in/unlock.
   The 2 SUs on PL-3 get Standby assignments.

6. Repeat #5 for SU/SG/node-group/SI.

7. Make the following changes for App1, with SUFailover set to false:

                <attr>
                        <name>saAmfSgtDefCompRestartProb</name>
                        <value>40000000000</value>
                </attr>
                <attr>
                        <name>saAmfSgtDefCompRestartMax</name>
                        <value>2</value>
                </attr>
                <attr>
                        <name>saAmfSgtDefSuRestartProb</name>
                        <value>40000000000</value>
                </attr>
                <attr>
                        <name>saAmfSgtDefSuRestartMax</name>
                        <value>1</value>
                </attr>

   Kill the demo component of App1 until node failover gets escalated.
   osafamfnd[11249]: NO SU failovers have reached configured limit of 2
   osafamfnd[11249]: NO SU failover probation timer stopped
   osafamfnd[11249]: NO 'safComp=AmfDemo,safSu=SU2,safSg=AmfDemo,safApp=AmfDemo1' recovery action escalated from 'componentRestart' to 'nodeFailover'
   osafamfnd[11249]: NO 'safComp=AmfDemo,safSu=SU2,safSg=AmfDemo,safApp=AmfDemo1' faulted due to 'avaDown' : Recovery is 'nodeFailover'
   osafamfnd[11249]: NO Informing director of node fail-over

   Then repair the node. The 2 SUs get Standby assignments.

8. Repeat all the above test cases with saAmfCtDefRecoveryOnError set to 4
   (node switchover) for the App2 demo application.

Start 3 nodes (SC-1/SC-2/PL-3). Run the application on PL-3 (node autorepair
is disabled) with recovery as 5.
TC #1: Kill demo on PL-3; it will go for node failover. Now repair the node.
       The demo will come up again.
TC #2: Kill demo on PL-3; it will go for node failover. Do a controller
       si-swap. Repair the node. The demo will come up again.
TC #3: Kill demo on PL-3; it will go for node failover. Do a controller
       si-swap, lock the clm node, unlock the clm node. Repair the node.
       The demo will come up again.

Upgrade testing:
1. Start SC-1 without the patch as Active. Start PL-3 with the patch.
   Start SC-2 with the patch.
2. SI-SWAP the controller, so that SC-2 (with the patch) becomes Active and
   SC-1 becomes Standby.
3. Now stop SC-1 and start SC-1 with the patch. SC-1 becomes Standby.
   Now perform a controller si-swap.
Repeat all the above test cases from TC #1 to TC #3.
Everything works fine.


Testing, Expected Results:
--------------------------
After the node repair command, all eligible SUs will get Standby assignments.


Conditions of Submission:
-------------------------
Ack from any one of the reviewers


Arch      Built     Started    Linux distro
-------------------------------------------
mips        n          n
mips64      n          n
x86         n          n
x86_64      y          y
powerpc     n          n
powerpc64   n          n


Reviewer Checklist:
-------------------
[Submitters: make sure that your review doesn't trigger any checkmarks!]


Your checkin has not passed review because (see checked entries):

___ Your RR template is generally incomplete; it has too many blank entries
    that need proper data filled in.

___ You have failed to nominate the proper persons for review and push.

___ Your patches do not have proper short+long header

___ You have grammar/spelling in your header that is unacceptable.

___ You have exceeded a sensible line length in your headers/comments/text.

___ You have failed to put in a proper Trac Ticket # into your commits.

___ You have incorrectly put/left internal data in your comments/files
    (i.e. internal bug tracking tool IDs, product names etc)

___ You have not given any evidence of testing beyond basic build tests.
    Demonstrate some level of runtime or other sanity testing.

___ You have ^M present in some of your files. These have to be removed.

___ You have needlessly changed whitespace or added whitespace crimes
    like trailing spaces, or spaces before tabs.

___ You have mixed real technical changes with whitespace and other
    cosmetic code cleanup changes. These have to be separate commits.

___ You need to refactor your submission into logical chunks; there is
    too much content in a single commit.

___ You have extraneous garbage in your review (merge commits etc)

___ You have giant attachments which should never have been sent;
    Instead you should place your content in a public tree to be pulled.

___ You have too many commits attached to an e-mail; resend as threaded
    commits, or place in a public tree for a pull.

___ You have resent this content multiple times without a clear indication
    of what has changed between each re-send.

___ You have failed to adequately and individually address all of the
    comments and change requests that were proposed in the initial review.

___ You have a misconfigured ~/.gitconfig file (i.e. user.name, user.email etc)

___ Your computer has a badly configured date and time, confusing the
    threaded patch review.

___ Your changes affect IPC mechanism, and you don't present any results
    for in-service upgradability test.

___ Your changes affect user manual and documentation, your patch series
    do not contain the patch that updates the Doxygen manual.


_______________________________________________
Opensaf-devel mailing list
Opensaf-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-devel