Summary: IMM: Detach of PBE aborts all non-critical and non-empty CCBs [#1261]
Review request for Trac Ticket(s): 1261
Peer Reviewer(s): Neel; Zoran
Pull request to:
Affected branch(es): default(4.7)
Development branch: default(4.7)

--------------------------------
Impacted area       Impact y/n
--------------------------------
 Docs                    n
 Build system            n
 RPM/packaging           n
 Configuration files     n
 Startup scripts         n
 SAF services            y
 OpenSAF services        n
 Core libraries          n
 Samples                 n
 Tests                   n
 Other                   n

Comments (indicate scope for each "y" above):
---------------------------------------------

changeset 8073b1de0515c7a82cc4cdc89aee39a682a86e06
Author: Anders Bjornerstedt <[email protected]>
Date: Thu, 21 May 2015 14:26:58 +0200

IMM: Detach of PBE aborts all non-critical and non-empty CCBs [#1261]

If the PBE detaches and re-attaches while there are one or more open
non-critical (not yet committing) but non-empty CCBs, then before this
enhancement one would see the following in the syslog at apply of the CCB:

May 20 13:25:33 SC-2 local0.notice osafimmnd[406]: NO STARTING PBE process.
......
May 20 13:25:34 SC-2 local0.notice osafimmnd[406]: NO PBE-OI established on this SC. Dumping incrementally to file imm.db
May 20 13:25:49 SC-2 local0.info osafimmnd[406]: IN GOING FROM IMM_CCB_PREPARE to IMM_CCB_CRITICAL Ccb:4
May 20 13:25:49 SC-2 user.notice osafimmpbed: NO Record for ccb 0x4 not found or found aborted in ok_for_critical
May 20 13:25:49 SC-2 user.warn osafimmpbed: WA WARNING: CCB record for 4 does not have correct op-count
May 20 13:25:49 SC-2 local0.notice osafimmnd[406]: NO Invalid error reported implementer 'OpenSafImmPBE', Ccb 4 will be aborted

While this does catch the problem and abort the CCB, the op-count mechanism
that catches it is not intended for handling regular processing cases. It is
an extra safety harness intended to catch bugs, lost messages, or incorrect
behavior of the PBE.

This enhancement avoids dependence on the op-count safety harness by having
the restarted PBE (primary or slave) invoke the special admin-operation that
aborts all non-critical CCBs in the immsv. See enhancement ticket #1107 or the
IMMSV README for details about this admin-operation.

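The abort rule described above can be illustrated with a small toy model. This is only a sketch in Python, not the C++ change in immpbe.cc: the state names, the `Ccb` record, and the functions `abort_non_critical_ccbs` and `on_pbe_attach` are all hypothetical, and the real immsv state machine has more states than the three assumed here.

```python
from dataclasses import dataclass
from enum import Enum


class CcbState(Enum):
    # Hypothetical, simplified states (assumption, not the immsv enum).
    OPEN = "open"          # non-critical: not yet committing
    CRITICAL = "critical"  # committing
    ABORTED = "aborted"


@dataclass
class Ccb:
    ccb_id: int
    state: CcbState
    op_count: int  # number of operations added to the CCB so far


def abort_non_critical_ccbs(ccbs):
    """Abort every open, non-empty CCB and return the ids that were aborted."""
    aborted = []
    for ccb in ccbs:
        if ccb.state is CcbState.OPEN and ccb.op_count > 0:
            ccb.state = CcbState.ABORTED
            aborted.append(ccb.ccb_id)
    return aborted


def on_pbe_attach(ccbs):
    """Toy stand-in for a (re)started PBE requesting the abort at attach time."""
    return abort_non_critical_ccbs(ccbs)


ccbs = [
    Ccb(4, CcbState.CRITICAL, 3),  # already committing: left alone
    Ccb(5, CcbState.OPEN, 2),      # open and non-empty: aborted
    Ccb(6, CcbState.OPEN, 0),      # open but empty: left alone
]
aborted_ids = on_pbe_attach(ccbs)
print(aborted_ids)
```

In this model only the CCB that is both open and non-empty is aborted, which matches the "non-critical and non-empty" wording of the summary line. It says nothing about how the IMMND coord actually carries out the abort.
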
The newly (re)started PBE invokes the admin-operation asynchronously to avoid
getting blocked waiting for the reply to this admin-op. The risk of the
admin-op failing is minimal, and if it does fail we end up in the same
distributed logic as we have today. That is, we would end up in the op-count
safety harness.

No CCB can get applied without an ack from the PBE, so the admin-operation, if
it is successfully received by the IMMND coord, should result in all currently
non-critical CCBs getting aborted before the PBE can get any completed/apply
for such a CCB over FEVS.

With this enhancement, if the PBE detaches and re-attaches while there are one
or more open non-critical (not yet committing) and non-empty CCBs, then these
CCBs will be aborted. The newly attached PBE may get an abort callback for
such CCBs, but these are ignored by the PBE.

With this enhancement one will see something like the following in the syslog
at an attempt to apply a CCB that was active during detach and attach of the
PBE:

May 21 12:41:34 SC-2 local0.notice osafimmnd[406]: NO Persistent Back End OI attached, pid: 764
May 21 12:41:34 SC-2 local0.notice osafimmnd[406]: NO Received: immadm -o 202 safRdn=immManagement,safApp=safImmService
May 21 12:41:34 SC-2 local0.info osafimmnd[406]: IN sAbortNonCriticalCcbs = true;
May 21 12:41:34 SC-2 local0.notice osafimmnd[406]: NO Implementer connected: 19 (OpenSafImmPBE) <332, 2020f>
May 21 12:41:34 SC-2 user.info osafimmpbed: IN Admop for aborting CCBs result: 1, immsv returned 1
May 21 12:41:34 SC-2 user.notice osafimmpbed: NO Update epoch 21 committing with ccbId:100000014/4294967316
May 21 12:41:34 SC-2 local0.notice osafimmd[396]: NO IMMND coord at 2020f
May 21 12:41:34 SC-2 local0.info osafimmnd[406]: IN Update of epoch is PERSISTENT.
May 21 12:41:35 SC-2 local0.notice osafimmnd[406]: NO PBE-OI established on this SC. Dumping incrementally to file imm.db
May 21 12:41:35 SC-2 local0.info osafimmnd[406]: IN sAbortNonCriticalCcbs is true => set max_oi_timeout to 0
May 21 12:41:35 SC-2 local0.notice osafimmnd[406]: NO CCB 5 aborted by: immadm -o 202 safRdn=immManagement,safApp=safImmService
May 21 12:41:35 SC-2 local0.info osafimmnd[406]: IN sAbortNonCriticalCcbs reset to false
May 21 12:41:35 SC-2 local0.warn osafimmnd[406]: WA Timeout while waiting for implementer, aborting ccb:5
May 21 12:41:35 SC-2 user.warn osafimmpbed: WA Failed to find CCB object for 5/5
May 21 12:41:45 SC-2 local0.notice osafimmnd[406]: NO Ccb <5> not in correct state (12) for Apply ignoring request
May 21 12:41:45 SC-2 local0.warn osafimmnd[406]: WA Spurious and redundant ccb-apply request ignored ccbId:5


Complete diffstat:
------------------
 osaf/services/saf/immsv/immpbed/immpbe.cc |  24 ++++++++++++++++++++++--
 1 files changed, 22 insertions(+), 2 deletions(-)


Testing Commands:
-----------------
I tested using immcfg in explicit commit mode and immapplier for having
some OIs.


Testing, Expected Results:
--------------------------
This enhancement should be tested on top of enhancement #1107.
Killing the PBE (in 2PBE, killing either primary or slave or both) results in
the restarting PBE processes generating the abort-CCBs admin-op.
This should nearly always result in any open CCBs getting aborted before the
OM client can attempt to apply the CCB. An apply by the OM client very close
in time to the re-attachment could get processed before the abort admin-op is
acted on by the IMMND coord, but the probability is low. If it happens, one
could still see the old behavior, i.e. ending up in the op-count safety
harness.

An attempt by the OM client to apply while the PBE is absent will result in
the abort of the CCB due to the missing PBE.


Conditions of Submission:
-------------------------
Ack from Neel and Zoran.


Arch      Built     Started    Linux distro
-------------------------------------------
mips        n          n
mips64      n          n
x86         n          n
x86_64      n          n
powerpc     n          n
powerpc64   n          n


Reviewer Checklist:
-------------------
[Submitters: make sure that your review doesn't trigger any checkmarks!]


Your checkin has not passed review because (see checked entries):

___ Your RR template is generally incomplete; it has too many blank entries
    that need proper data filled in.

___ You have failed to nominate the proper persons for review and push.

___ Your patches do not have proper short+long header

___ You have grammar/spelling in your header that is unacceptable.

___ You have exceeded a sensible line length in your headers/comments/text.

___ You have failed to put in a proper Trac Ticket # into your commits.

___ You have incorrectly put/left internal data in your comments/files
    (i.e. internal bug tracking tool IDs, product names etc)

___ You have not given any evidence of testing beyond basic build tests.
    Demonstrate some level of runtime or other sanity testing.

___ You have ^M present in some of your files. These have to be removed.

___ You have needlessly changed whitespace or added whitespace crimes
    like trailing spaces, or spaces before tabs.

___ You have mixed real technical changes with whitespace and other
    cosmetic code cleanup changes. These have to be separate commits.

___ You need to refactor your submission into logical chunks; there is
    too much content into a single commit.

___ You have extraneous garbage in your review (merge commits etc)

___ You have giant attachments which should never have been sent;
    Instead you should place your content in a public tree to be pulled.

___ You have too many commits attached to an e-mail; resend as threaded
    commits, or place in a public tree for a pull.

___ You have resent this content multiple times without a clear indication
    of what has changed between each re-send.

___ You have failed to adequately and individually address all of the
    comments and change requests that were proposed in the initial review.

___ You have a misconfigured ~/.hgrc file (i.e. username, email etc)

___ Your computer has a badly configured date and time; confusing the
    threaded patch review.

___ Your changes affect IPC mechanism, and you don't present any results
    for in-service upgradability test.

___ Your changes affect user manual and documentation, your patch series
    do not contain the patch that updates the Doxygen manual.
