From: Anand Sundararaj <s.an...@gethighavailability.com>

Summary: amf: implement node repair admin command [#3204] V2
Review request for Ticket(s): 3204
Peer Reviewer(s): Minh, Thang, Nagendra, Paul
Pull request to: Amf Maintainers 
Affected branch(es): develop
Development branch: ticket-3204
Base revision: 59ded7cdf6a431e522229afd5ecb989e4a61c7d8
Personal repository: git://git.code.sf.net/u/s-anand-has/review

--------------------------------
Impacted area       Impact y/n
--------------------------------
 Docs                    n
 Build system            n
 RPM/packaging           n
 Configuration files     n
 Startup scripts         n
 SAF services            y
 OpenSAF services        n
 Core libraries          n
 Samples                 n
 Tests                   n
 Other                   n

NOTE: Patch(es) contain lines longer than 80 characters

Comments (indicate scope for each "y" above):
---------------------------------------------
*** EXPLAIN/COMMENT THE PATCH SERIES HERE ***

revision f0b8d7eff952f620adbceb68fdd21a53b416e572
Author: Anand Sundararaj <s.an...@gethighavailability.com>
Date:   Fri, 31 Jul 2020 02:15:04 +0530

amf: implement node repair admin command [#3204] V2



Complete diffstat:
------------------
 src/amf/amfd/chkop.cc     |  9 +++++++
 src/amf/amfd/ckpt.h       |  3 ++-
 src/amf/amfd/ckpt_dec.cc  | 49 ++++++++++++++++++++++++++++++++++++-
 src/amf/amfd/ckpt_enc.cc  | 38 ++++++++++++++++++++++++++++-
 src/amf/amfd/ckpt_msg.h   |  1 +
 src/amf/amfd/ckpt_updt.cc |  1 +
 src/amf/amfd/clm.cc       |  3 ++-
 src/amf/amfd/node.cc      | 56 +++++++++++++++++++++++++++++++++++++++++-
 src/amf/amfd/node.h       |  2 ++
 src/amf/amfd/sgproc.cc    | 25 +++++++++++++++++++
 src/amf/amfd/util.cc      |  1 +
 src/amf/amfnd/avnd_su.h   |  1 +
 src/amf/amfnd/di.cc       | 62 +++++++++++++++++++++++++++++++++++++++++++++++
 src/amf/amfnd/err.cc      |  2 +-
 src/amf/amfnd/su.cc       |  2 +-
 15 files changed, 248 insertions(+), 7 deletions(-)


Testing Commands:
-----------------
Configure two demo applications (App1 and App2, available in
samples/amf/sa_aware) on SC-1 and PL-3.
Configure PL-3 saAmfNodeAutoRepair as false.
Configure App2 demo application saAmfCtDefRecoveryOnError as 5
(node failover).
Unlock all 4 SUs: 2 running on SC-1 (Std) and 2 running on PL-3 (Act).

1. Kill demo app App2 on PL-3. Node failover happens. SUs running on SC-1
   become Act.
osafamfd[10367]: NO NodeAutorepair disabled for 
'safAmfNode=PL-3,safAmfCluster=myAmfCluster', no reboot ordered

Repair node PL-3 using:
amf-adm repaired safAmfNode=PL-3,safAmfCluster=myAmfCluster

PL-3 node state is enabled and the 2 SUs running on PL-3 get Standby
assignments.
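
Since the repair step is repeated in most of the test cases, it can be wrapped in a small helper. This is only a sketch: the function names `node_dn` and `repair_node` and the `DRY_RUN` switch are assumptions made for illustration and are not part of the patch. Only the `amf-adm repaired <DN>` invocation and the DN format are taken from the commands in this mail.

```shell
#!/bin/sh
# Hypothetical helper (not part of the patch): builds the AMF node DN and
# issues the "repaired" admin operation used in the test steps.

CLUSTER=myAmfCluster   # cluster name as used in this mail
DRY_RUN=1              # 1 = only print the command; set to 0 to run it

node_dn() {
    # $1 = node name, e.g. PL-3
    printf 'safAmfNode=%s,safAmfCluster=%s\n' "$1" "$CLUSTER"
}

repair_node() {
    dn=$(node_dn "$1")
    if [ "$DRY_RUN" = "1" ]; then
        # Dry run: show what would be executed
        echo "amf-adm repaired $dn"
    else
        amf-adm repaired "$dn"
    fi
}

repair_node PL-3
```

With `DRY_RUN=1` the script only prints the command line, which makes it safe to check the DN before running it against a live cluster.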

2. Repeat test case #1 to confirm the result is repeatable.
3. Repeat #1 up to (but not including) the repair. Then delete both SUs
   running on PL-3 and repair PL-3.
   Now add both SUs back and unlock them. They are given Std assignments.
4. Repeat test case #1 up to (but not including) the repair.
   When the repair command is issued, hold it at amfnd on PL-3 using gdb
   and reboot the machine.
   The repair command will return SA_AIS_ERR_REPAIR_PENDING:

amf-adm repaired safAmfNode=PL-3,safAmfCluster=myAmfCluster
error - saImmOmAdminOperationInvoke_2 admin-op RETURNED: 
SA_AIS_ERR_REPAIR_PENDING (29)
error-string: node failure

When the node starts again, its 2 SUs get Standby assignments.
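
When scripting test case 4, the pending result can be detected from the command output. The sketch below is an assumption-laden illustration: the function name `repair_status` is hypothetical, and it only matches the `SA_AIS_ERR_REPAIR_PENDING` string shown in the output above. Any other output is reported as `other` rather than interpreted.

```shell
#!/bin/sh
# Hypothetical checker (not part of the patch): reads amf-adm output on
# stdin and reports whether the repair was left pending.

repair_status() {
    # grep -q consumes stdin and exits 0 if the error name is present
    if grep -q 'SA_AIS_ERR_REPAIR_PENDING'; then
        echo pending
    else
        echo other
    fi
}
```

A possible use is `amf-adm repaired safAmfNode=PL-3,safAmfCluster=myAmfCluster 2>&1 | repair_status`. The `2>&1` is there because this mail does not show which stream the error text is written to, so both are merged.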

5. Repeat test case #1 up to (but not including) the repair.
   Issue node lock/lock-in and then issue repair, followed by
   unlock-in/unlock. The 2 SUs on PL-3 get Standby assignments.
6. Repeat #5 for SU/SG/node-group/SI.
7. Make the following changes for App1 and set SUFailover to false:
                <attr>
                        <name>saAmfSgtDefCompRestartProb</name>
                        <value>40000000000</value>
                </attr>
                <attr>
                        <name>saAmfSgtDefCompRestartMax</name>
                        <value>2</value>
                </attr>
                <attr>
                        <name>saAmfSgtDefSuRestartProb</name>
                        <value>40000000000</value>
                </attr>
                <attr>
                        <name>saAmfSgtDefSuRestartMax</name>
                        <value>1</value>
                </attr>

   Kill the demo component of App1 until recovery is escalated to node
   failover.
osafamfnd[11249]: NO SU failovers have reached configured limit of 2
osafamfnd[11249]: NO SU failover probation timer stopped
osafamfnd[11249]: NO 'safComp=AmfDemo,safSu=SU2,safSg=AmfDemo,safApp=AmfDemo1' 
recovery action escalated from 'componentRestart' to 'nodeFailover'
osafamfnd[11249]: NO 'safComp=AmfDemo,safSu=SU2,safSg=AmfDemo,safApp=AmfDemo1' 
faulted due to 'avaDown' : Recovery is 'nodeFailover'
osafamfnd[11249]: NO Informing director of node fail-over
   Then repair the node. The 2 SUs get Std assignments.

8. Repeat all the above test cases with App2 demo application
   saAmfCtDefRecoveryOnError as 4 (node switchover).

Start 3 nodes (SC-1/SC-2/PL-3). Run the application on PL-3 (node autorepair
is disabled) with recovery as 5.
TC #1: Kill demo on PL-3; it will go for node failover. Now repair the node.
The demo will come up again.
TC #2: Kill demo on PL-3; it will go for node failover. Do a controller
si-swap. Repair the node. The demo will come up again.
TC #3: Kill demo on PL-3; it will go for node failover. Do a controller
si-swap, lock the clm node, unlock the clm node. Repair the node. The demo
will come up again.

Upgrade testing:
1. Start SC-1 without the patch as Active. Start PL-3 with the patch. Start
   SC-2 with the patch.
2. SI-SWAP the controller, so that SC-2 (with the patch) becomes Act and SC-1
   becomes Standby.
3. Now stop SC-1 and start SC-1 with the patch. SC-1 becomes Standby. Now
   perform a controller si-swap.
Repeat all the above test cases from TC #1 to TC #3. Everything works fine.


Testing, Expected Results:
--------------------------
After the node repair command, all eligible SUs get Standby assignments.

Conditions of Submission:
-------------------------
Ack from any one of the reviewers

Arch      Built     Started    Linux distro
-------------------------------------------
mips        n          n
mips64      n          n
x86         n          n
x86_64      y          y
powerpc     n          n
powerpc64   n          n


Reviewer Checklist:
-------------------
[Submitters: make sure that your review doesn't trigger any checkmarks!]


Your checkin has not passed review because (see checked entries):

___ Your RR template is generally incomplete; it has too many blank entries
    that need proper data filled in.

___ You have failed to nominate the proper persons for review and push.

___ Your patches do not have proper short+long header

___ You have grammar/spelling in your header that is unacceptable.

___ You have exceeded a sensible line length in your headers/comments/text.

___ You have failed to put in a proper Trac Ticket # into your commits.

___ You have incorrectly put/left internal data in your comments/files
    (i.e. internal bug tracking tool IDs, product names etc)

___ You have not given any evidence of testing beyond basic build tests.
    Demonstrate some level of runtime or other sanity testing.

___ You have ^M present in some of your files. These have to be removed.

___ You have needlessly changed whitespace or added whitespace crimes
    like trailing spaces, or spaces before tabs.

___ You have mixed real technical changes with whitespace and other
    cosmetic code cleanup changes. These have to be separate commits.

___ You need to refactor your submission into logical chunks; there is
    too much content into a single commit.

___ You have extraneous garbage in your review (merge commits etc)

___ You have giant attachments which should never have been sent;
    Instead you should place your content in a public tree to be pulled.

___ You have too many commits attached to an e-mail; resend as threaded
    commits, or place in a public tree for a pull.

___ You have resent this content multiple times without a clear indication
    of what has changed between each re-send.

___ You have failed to adequately and individually address all of the
    comments and change requests that were proposed in the initial review.

___ You have a misconfigured ~/.gitconfig file (i.e. user.name, user.email etc)

___ Your computer has a badly configured date and time, confusing
    the threaded patch review.

___ Your changes affect IPC mechanism, and you don't present any results
    for in-service upgradability test.

___ Your changes affect user manual and documentation, your patch series
    do not contain the patch that updates the Doxygen manual.


