Summary: AMF: Recover transient SUSIs from headless (admin continuation, node restart) [#1725] V3
Review request for Trac Ticket(s): 1725
Peer Reviewer(s): AMF devs
Pull request to: <<LIST THE PERSON WITH PUSH ACCESS HERE>>
Affected branch(es): 5.0, default
Development branch: default

--------------------------------
Impacted area       Impact y/n
--------------------------------
 Docs                    n
 Build system            n
 RPM/packaging           n
 Configuration files     n
 Startup scripts         n
 SAF services            y
 OpenSAF services        n
 Core libraries          n
 Samples                 n
 Tests                   n
 Other                   n


Comments (indicate scope for each "y" above):
---------------------------------------------
Additions in V3:
- Add a patch to recover in case of node restart during headless
- Add a patch to validate the cached RTA read from IMM

changeset d921dfed678b396087c46cb3af1249e4f3f5b7ab
Author: minh-chau <minh.c...@dektech.com.au>
Date:   Thu, 18 Aug 2016 09:56:19 +1000

AMFD: Introduce new RTA states for admin operation continuation after
headless [#1725 part 1] V3

If an admin operation is running when the cluster goes headless, the
normal admin operation sequence is interrupted. Since both SCs are down,
the SI assignments at AMFND could be ongoing or completed during the
headless period. After headless, this admin operation should be continued.
This patch series supports admin operation continuation after headless.

To resume the admin operation after headless, the states that need to be
restored are: SUSI fsm states, SG fsm states, SI Dependency states (not
supported in this patch), SU Switch toggle, and the SU operation list in
the SG at the time the cluster goes headless. At the moment, the SG fsm
states are set differently in each specific SG model. Also, the rule by
which an SU is added to the SG's operation list is not consistent. In most
places, an SU is added to the operation list after AMFD sends a
su_si_assign event for this SU. However, there are some scenarios in which
an SU is added to the list for other purposes (failover). These
difficulties make the state deduction logic hard to implement.

This patch introduces new RTA states: osafAmfSGSuOperationList,
osafAmfSGFsmState, osafAmfSISUFsmState and osafAmfSUSwitch, to capture the
SU operation list of the SG, the SG fsm state, the SUSI fsm state, and the
SU Switch from AMFD memory to IMM during AMFD's lifetime. When the cluster
comes back from headless, these RTAs are read from IMM to restore the
states in AMFD's memory. The patch also adds an additional field to the
state_info (headless synchronization) message, which indicates the current
SUSI fsm states. Together, the two SUSI fsm states (the one read from IMM
and the one synced from AMFND) help to validate the new RTA states read
from IMM after headless. Example: if the IMM SUSI fsm state is ASGN and the
synced SUSI fsm state is ASGND, then the HA state must be ACTIVE or
STANDBY. Such validation is necessary since a headless interruption is
unplanned and the recovery depends heavily on the RTAs read from IMM.

changeset 30e3871ace1c014efab53e7428c33cc6ce4aece6
Author: minh-chau <minh.c...@dektech.com.au>
Date:   Thu, 18 Aug 2016 09:56:23 +1000

AMFND: Admin operation continuation if csi completes during headless
[#1725 part 1] V1

There are basically two options for how AMFD can continue an admin
operation with completed csi(s).

First: AMFD can use the synced SUSI fsm state as the latest. AMFD then has
to explore its SUSI assignments together with the adminStates of the
relevant entities to determine on which SU susi_success() should be called.
A deeper level of exploration is needed for csi addition. It also depends
on the SG fsm state, which is used differently in different SG types.

Second: AMFD uses the SUSI fsm state read from IMM as the latest, and AMFND
needs to resend the susi_resp messages which were deferred during headless
so that AMFD can continue the admin operation sequence. Both cases of csi
completion [during or after] headless can run in the same code flow.

The patch buffers susi_resp_msg during the headless stage and resends it to
AMFD after headless. There is a chance that AMFND sent out a susi response
message but AMFD could not receive or process it.
This case could be seen as a defect, which can be fixed by securing the
result of sending the susi_resp message from AMFND toward AMFD.

changeset 683d8522ee2175539f4aa63f2200513fcc6b0022
Author: minh-chau <minh.c...@dektech.com.au>
Date:   Thu, 18 Aug 2016 09:56:31 +1000

AMFD: Failover absent assignment due to node restart or powered off while
headless [#1725 part 2]

When a payload restarts or is powered off during headless, the SUSI
assignments on this payload are removed, which breaks the HA characteristic
of the SUSI assignments after headless. This patch treats the SUSI
assignments removed during headless as ABSENT SUSIs, and reuses node_fail()
to perform a failover on an SU having ABSENT SUSIs, so that the HA of the
SUSI assignments becomes STABLE, which means no QUIESCED/QUIESCING SUSI,
etc. Inside node_fail(), any su_si assignment event on an ABSENT SUSI
toward AMFND, such as modification or deletion, will be ignored.

changeset cfdffd52354ba8d00a2fb53de94f28b5f58bbdf5
Author: minh-chau <minh.c...@dektech.com.au>
Date:   Thu, 18 Aug 2016 09:56:35 +1000

AMFD: Validate headless cached RTA read from IMM [#1725]

A headless interruption is an unplanned action, and writing RTAs to IMM is
currently queued up in the AMFD implementation. This can result in
inappropriate values of the SG fsm state, SUSI fsm state, HA state,
SUOperationList, etc. Eventually, AMFD will run into an unstable SG, a
false assertion, or even SUSIs that become permanently PARTIALLY, which is
hard to debug (even harder without trace).

This patch adds a validation routine to check the headless cached RTAs read
from IMM; more validation rules are to be added. Also, a TODO is left for
discussion about what action should be taken if validation fails.

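The validation rule quoted in the first changeset (IMM SUSI fsm state ASGN
plus synced SUSI fsm state ASGND implies HA state ACTIVE or STANDBY) can be
illustrated with a minimal sketch. This is not the code from the patch: the
enum names, the CachedSusi struct and is_consistent() are hypothetical
stand-ins for the real AMFD types, and only the single rule stated above is
encoded.

```cpp
#include <cassert>

// Hypothetical, simplified stand-ins for the AMFD SUSI fsm and HA states.
// The real AMFD enums and structures are different and richer.
enum class SusiFsm { kUnasgn, kAsgn, kAsgnd, kModify, kAbsent };
enum class HaState { kActive, kStandby, kQuiesced, kQuiescing };

// One SUSI as seen after headless (assumed layout, for illustration only).
struct CachedSusi {
  SusiFsm imm_fsm;     // SUSI fsm state restored from the RTA cached in IMM
  SusiFsm synced_fsm;  // SUSI fsm state reported by AMFND in state_info
  HaState ha_state;    // HA state of the assignment
};

// Returns false when the cached RTA contradicts the state synced from AMFND.
// Only the rule given in the description is checked; combinations that no
// rule covers are accepted.
bool is_consistent(const CachedSusi& s) {
  if (s.imm_fsm == SusiFsm::kAsgn && s.synced_fsm == SusiFsm::kAsgnd) {
    return s.ha_state == HaState::kActive || s.ha_state == HaState::kStandby;
  }
  return true;
}
```

With these definitions, is_consistent({SusiFsm::kAsgn, SusiFsm::kAsgnd,
HaState::kQuiesced}) reports an inconsistency, while the same pair of fsm
states with HaState::kActive passes. What AMFD should do on a failed check
is the open TODO mentioned above and is deliberately left out of the sketch.
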
Complete diffstat:
------------------
 osaf/libs/common/amf/d2nedu.c | 5 +-
 osaf/libs/common/amf/include/amf_d2nmsg.h | 4 +
 osaf/libs/common/amf/include/amf_si_assign.h | 2 +-
 osaf/services/saf/amf/amfd/cluster.cc | 9 +
 osaf/services/saf/amf/amfd/csi.cc | 93 +++++++-----
 osaf/services/saf/amf/amfd/imm.cc | 5 +-
 osaf/services/saf/amf/amfd/include/csi.h | 3 +-
 osaf/services/saf/amf/amfd/include/imm.h | 5 +-
 osaf/services/saf/amf/amfd/include/mds.h | 7 +-
 osaf/services/saf/amf/amfd/include/proc.h | 4 +-
 osaf/services/saf/amf/amfd/include/sg.h | 11 +-
 osaf/services/saf/amf/amfd/include/su.h | 6 +-
 osaf/services/saf/amf/amfd/include/susi.h | 13 +-
 osaf/services/saf/amf/amfd/include/util.h | 2 +
 osaf/services/saf/amf/amfd/mds.cc | 7 +-
 osaf/services/saf/amf/amfd/ndfsm.cc | 24 +++-
 osaf/services/saf/amf/amfd/role.cc | 6 -
 osaf/services/saf/amf/amfd/sg.cc | 180 +++++++++++++++++++++++++-
 osaf/services/saf/amf/amfd/sg_2n_fsm.cc | 21 +-
 osaf/services/saf/amf/amfd/sg_npm_fsm.cc | 2 +-
 osaf/services/saf/amf/amfd/sg_nwayact_fsm.cc | 2 +-
 osaf/services/saf/amf/amfd/sgproc.cc | 168 ++++++++++++++----------
 osaf/services/saf/amf/amfd/siass.cc | 317 ++++++++++++++++++++++++++++++++++++---------
 osaf/services/saf/amf/amfd/su.cc | 128 ++++++++++++++---
 osaf/services/saf/amf/amfnd/di.cc | 213 +++++++++++++++++++++---------
 osaf/services/saf/amf/amfnd/include/avnd_di.h | 1 +
 osaf/services/saf/amf/amfnd/include/avnd_mds.h | 4 +-
 osaf/services/saf/amf/amfnd/mds.cc | 6 +-
 osaf/services/saf/amf/config/amf_classes.xml | 28 ++++
 29 files changed, 962 insertions(+), 314 deletions(-)


Testing Commands:
-----------------
Execute the test list attached to ticket #1725, within the scope of the
tests for admin continuation and node restart. This series still uses
immediate escalation while headless, which means a node will reboot on any
kind of failover or switchover.


Testing, Expected Results:
--------------------------
Some test cases already fail in non-headless mode without #1725.
Tickets were raised for these failing cases, but they haven't been fixed
yet. So, if a test fails with #1725, please rerun the same test in
non-headless mode.


Conditions of Submission:
-------------------------
ack


Arch      Built     Started    Linux distro
-------------------------------------------
mips        n          n
mips64      n          n
x86         n          n
x86_64      y          y
powerpc     n          n
powerpc64   n          n


Reviewer Checklist:
-------------------
[Submitters: make sure that your review doesn't trigger any checkmarks!]


Your checkin has not passed review because (see checked entries):

___ Your RR template is generally incomplete; it has too many blank entries
    that need proper data filled in.

___ You have failed to nominate the proper persons for review and push.

___ Your patches do not have proper short+long header

___ You have grammar/spelling in your header that is unacceptable.

___ You have exceeded a sensible line length in your headers/comments/text.

___ You have failed to put in a proper Trac Ticket # into your commits.

___ You have incorrectly put/left internal data in your comments/files
    (i.e. internal bug tracking tool IDs, product names etc)

___ You have not given any evidence of testing beyond basic build tests.
    Demonstrate some level of runtime or other sanity testing.

___ You have ^M present in some of your files. These have to be removed.

___ You have needlessly changed whitespace or added whitespace crimes
    like trailing spaces, or spaces before tabs.

___ You have mixed real technical changes with whitespace and other
    cosmetic code cleanup changes. These have to be separate commits.

___ You need to refactor your submission into logical chunks; there is
    too much content into a single commit.

___ You have extraneous garbage in your review (merge commits etc)

___ You have giant attachments which should never have been sent;
    Instead you should place your content in a public tree to be pulled.

___ You have too many commits attached to an e-mail; resend as threaded
    commits, or place in a public tree for a pull.

___ You have resent this content multiple times without a clear indication
    of what has changed between each re-send.

___ You have failed to adequately and individually address all of the
    comments and change requests that were proposed in the initial review.

___ You have a misconfigured ~/.hgrc file (i.e. username, email etc)

___ Your computer has a badly configured date and time; confusing the
    threaded patch review.

___ Your changes affect IPC mechanism, and you don't present any results
    for in-service upgradability test.

___ Your changes affect user manual and documentation, your patch series
    do not contain the patch that updates the Doxygen manual.