Summary: IMM: Detach of PBE aborts all non-critical and non-empty CCBs [#1261]
Review request for Trac Ticket(s): 1261
Peer Reviewer(s): Neel; Zoran
Pull request to: 
Affected branch(es): default(4.7)
Development branch: default(4.7)

--------------------------------
Impacted area       Impact y/n
--------------------------------
 Docs                    n
 Build system            n
 RPM/packaging           n
 Configuration files     n
 Startup scripts         n
 SAF services            y
 OpenSAF services        n
 Core libraries          n
 Samples                 n
 Tests                   n
 Other                   n


Comments (indicate scope for each "y" above):
---------------------------------------------

changeset 8073b1de0515c7a82cc4cdc89aee39a682a86e06
Author: Anders Bjornerstedt <[email protected]>
Date:   Thu, 21 May 2015 14:26:58 +0200

        IMM: Detach of PBE aborts all non-critical and non-empty CCBs [#1261]

        If the PBE detaches and re-attaches while there are one or more open
        non-critical (not yet committing) but non-empty CCBs, then before this
        enhancement one would see the following in the syslog at apply of the
        CCB:

        May 20 13:25:33 SC-2 local0.notice osafimmnd[406]: NO STARTING PBE process.
        ......
        May 20 13:25:34 SC-2 local0.notice osafimmnd[406]: NO PBE-OI established on this SC. Dumping incrementally to file imm.db
        May 20 13:25:49 SC-2 local0.info osafimmnd[406]: IN GOING FROM IMM_CCB_PREPARE to IMM_CCB_CRITICAL Ccb:4
        May 20 13:25:49 SC-2 user.notice osafimmpbed: NO Record for ccb 0x4 not found or found aborted in ok_for_critical
        May 20 13:25:49 SC-2 user.warn osafimmpbed: WA WARNING: CCB record for 4 does not have correct op-count
        May 20 13:25:49 SC-2 local0.notice osafimmnd[406]: NO Invalid error reported implementer 'OpenSafImmPBE', Ccb 4 will be aborted

        While this does catch the problem and abort the CCB, the op-count
        mechanism that catches it is not intended for handling regular
        processing cases. It is an extra safety harness intended to catch
        bugs, lost messages, or incorrect behavior of the PBE.

        This enhancement avoids dependence on the op-count safety harness by
        having the restarted PBE (primary or slave) invoke the special
        admin-operation that aborts all non-critical CCBs in the immsv. See
        enhancement ticket #1107 or the IMMSV README for details about this
        admin-operation.

        The newly (re)started PBE invokes the admin-operation asynchronously to
        avoid getting blocked waiting on the reply for this admin-op. The risk
        of the admin-op failing is minimal, and if it does fail then we end up
        in the same distributed logic as we have today. That is, we would end
        up in the op-count safety harness. No CCB can get applied without an
        ack from the PBE, and so the admin-operation, if it is successfully
        received by the IMMND coord, should result in all currently
        non-critical CCBs getting aborted before the PBE can get any
        completed/apply for such a CCB over FEVS.

        With this enhancement, if the PBE detaches and re-attaches while there
        are one or more open non-critical (not yet committing) and non-empty
        CCBs, then these CCBs will be aborted. The newly attached PBE may
        possibly get an abort callback for such CCBs, but these are ignored by
        the PBE.

        With this enhancement one will see something like the following in the
        syslog at an attempt to apply a CCB that was active during detach and
        attach of PBE:

        May 21 12:41:34 SC-2 local0.notice osafimmnd[406]: NO Persistent Back End OI attached, pid: 764
        May 21 12:41:34 SC-2 local0.notice osafimmnd[406]: NO Received: immadm -o 202 safRdn=immManagement,safApp=safImmService
        May 21 12:41:34 SC-2 local0.info osafimmnd[406]: IN sAbortNonCriticalCcbs = true;
        May 21 12:41:34 SC-2 local0.notice osafimmnd[406]: NO Implementer connected: 19 (OpenSafImmPBE) <332, 2020f>
        May 21 12:41:34 SC-2 user.info osafimmpbed: IN Admop for aborting CCBs result: 1, immsv returned 1
        May 21 12:41:34 SC-2 user.notice osafimmpbed: NO Update epoch 21 committing with ccbId:100000014/4294967316
        May 21 12:41:34 SC-2 local0.notice osafimmd[396]: NO IMMND coord at 2020f
        May 21 12:41:34 SC-2 local0.info osafimmnd[406]: IN Update of epoch is PERSISTENT.
        May 21 12:41:35 SC-2 local0.notice osafimmnd[406]: NO PBE-OI established on this SC. Dumping incrementally to file imm.db
        May 21 12:41:35 SC-2 local0.info osafimmnd[406]: IN sAbortNonCriticalCcbs is true => set max_oi_timeout to 0
        May 21 12:41:35 SC-2 local0.notice osafimmnd[406]: NO CCB 5 aborted by: immadm -o 202 safRdn=immManagement,safApp=safImmService
        May 21 12:41:35 SC-2 local0.info osafimmnd[406]: IN sAbortNonCriticalCcbs reset to false
        May 21 12:41:35 SC-2 local0.warn osafimmnd[406]: WA Timeout while waiting for implementer, aborting ccb:5
        May 21 12:41:35 SC-2 user.warn osafimmpbed: WA Failed to find CCB object for 5/5
        May 21 12:41:45 SC-2 local0.notice osafimmnd[406]: NO Ccb <5> not in correct state (12) for Apply ignoring request
        May 21 12:41:45 SC-2 local0.warn osafimmnd[406]: WA Spurious and redundant ccb-apply request ignored ccbId:5
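
Reviewer aid (not part of the changeset): the selection rule described in
the changeset comment can be sketched as a small stand-alone C++ model. This
is only an illustration under stated assumptions. The types Ccb and CcbState
and the functions addOperation and abortNonCriticalNonEmptyCcbs are
hypothetical names invented for this sketch; they are not taken from
immpbe.cc or from the immsv, and the real CCB state machine has more states
than the four shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, heavily simplified CCB states (the real immsv has more).
enum class CcbState { Open, Critical, Committed, Aborted };

// Hypothetical CCB record used only by this sketch.
struct Ccb {
    std::uint32_t id;
    CcbState state;
    unsigned opCount;  // operations added to the CCB so far
};

// In this model, operations may only be added while the CCB is still open.
void addOperation(Ccb& ccb) {
    assert(ccb.state == CcbState::Open);
    ++ccb.opCount;
}

// Abort every CCB that is non-critical (still Open in this model) and
// non-empty. Returns the number of CCBs that were aborted.
std::size_t abortNonCriticalNonEmptyCcbs(std::vector<Ccb>& ccbs) {
    std::size_t aborted = 0;
    for (Ccb& ccb : ccbs) {
        if (ccb.state == CcbState::Open && ccb.opCount > 0) {
            ccb.state = CcbState::Aborted;
            ++aborted;
        }
    }
    return aborted;
}
```

In this model an empty open CCB and a CCB that has already reached the
critical state are both left untouched; only open CCBs that already carry
operations are aborted.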


Complete diffstat:
------------------
 osaf/services/saf/immsv/immpbed/immpbe.cc |  24 ++++++++++++++++++++++--
 1 files changed, 22 insertions(+), 2 deletions(-)


Testing Commands:
-----------------
I tested using immcfg in explicit commit mode, with immapplier used to
provide some OIs.


Testing, Expected Results:
--------------------------
This enhancement should be tested on top of enhancement #1107.
Killing the PBE (in 2PBE, killing either primary or slave or both)
results in the restarting PBE processes generating the abort-CCBs
admin-op. This should nearly always result in any open non-critical CCBs
getting aborted before the OM client can attempt to apply them. An apply by
the OM client very close in time to the re-attachment could get processed
before the abort admin-op is acted on by the IMMND coord, but this has
low probability. If it happens, then one could still see the old behavior,
i.e. ending up in the op-count safety harness. An attempt by the OM client
to apply while the PBE is absent will result in the abort of the CCB due
to the missing PBE.
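
For the low-probability fallback case above, the following C++ sketch shows
the kind of check the op-count safety harness performs, modeled on the two
syslog messages quoted in the changeset comment ("Record for ccb ... not
found" and "does not have correct op-count"). It is a rough model under
stated assumptions, not the code in immpbe.cc: OpRecords, Verdict and
checkBeforeCritical are hypothetical names, and the sketch only considers
non-empty CCBs.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical bookkeeping of a PBE-like process: ccbId -> operations seen.
// A freshly restarted process starts with no records at all.
using OpRecords = std::map<std::uint32_t, unsigned>;

// Hypothetical outcome of the check made before a CCB may go critical.
enum class Verdict { OkForCritical, RecordMissing, OpCountMismatch };

// Compare what this process has seen for ccbId against the number of
// operations the rest of the system expects the CCB to contain.
Verdict checkBeforeCritical(const OpRecords& seen, std::uint32_t ccbId,
                            unsigned expectedOps) {
    assert(expectedOps > 0);  // sketch precondition: non-empty CCBs only
    const auto it = seen.find(ccbId);
    if (it == seen.end()) {
        return Verdict::RecordMissing;    // restarted, never saw this CCB
    }
    if (it->second != expectedOps) {
        return Verdict::OpCountMismatch;  // saw only part of the CCB
    }
    return Verdict::OkForCritical;
}
```

Any verdict other than OkForCritical corresponds to the CCB being rejected
and aborted at apply time, which is the old behavior this enhancement tries
to avoid relying on.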


Conditions of Submission:
-------------------------
Ack from Neel and Zoran.


Arch      Built     Started    Linux distro
-------------------------------------------
mips        n          n
mips64      n          n
x86         n          n
x86_64      n          n
powerpc     n          n
powerpc64   n          n


Reviewer Checklist:
-------------------
[Submitters: make sure that your review doesn't trigger any checkmarks!]


Your checkin has not passed review because (see checked entries):

___ Your RR template is generally incomplete; it has too many blank entries
    that need proper data filled in.

___ You have failed to nominate the proper persons for review and push.

___ Your patches do not have proper short+long header

___ You have grammar/spelling in your header that is unacceptable.

___ You have exceeded a sensible line length in your headers/comments/text.

___ You have failed to put in a proper Trac Ticket # into your commits.

___ You have incorrectly put/left internal data in your comments/files
    (i.e. internal bug tracking tool IDs, product names etc)

___ You have not given any evidence of testing beyond basic build tests.
    Demonstrate some level of runtime or other sanity testing.

___ You have ^M present in some of your files. These have to be removed.

___ You have needlessly changed whitespace or added whitespace crimes
    like trailing spaces, or spaces before tabs.

___ You have mixed real technical changes with whitespace and other
    cosmetic code cleanup changes. These have to be separate commits.

___ You need to refactor your submission into logical chunks; there is
    too much content in a single commit.

___ You have extraneous garbage in your review (merge commits etc)

___ You have giant attachments which should never have been sent;
    Instead you should place your content in a public tree to be pulled.

___ You have too many commits attached to an e-mail; resend as threaded
    commits, or place in a public tree for a pull.

___ You have resent this content multiple times without a clear indication
    of what has changed between each re-send.

___ You have failed to adequately and individually address all of the
    comments and change requests that were proposed in the initial review.

___ You have a misconfigured ~/.hgrc file (i.e. username, email etc)

___ Your computer has a badly configured date and time, confusing the
    threaded patch review.

___ Your changes affect an IPC mechanism, and you don't present any results
    for an in-service upgradability test.

___ Your changes affect the user manual and documentation, but your patch
    series does not contain the patch that updates the Doxygen manual.


_______________________________________________
Opensaf-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/opensaf-devel