Summary: IMM: 2PBE slave uses main/applier thread for PRT data flow [#623]
Review request for Trac Ticket(s): #623
Peer Reviewer(s): Neel
Pull request to: 
Affected branch(es): 4.4 (default(4.5)
Development branch:

--------------------------------
Impacted area       Impact y/n
--------------------------------
 Docs                    n
 Build system            n
 RPM/packaging           n
 Configuration files     n
 Startup scripts         n
 SAF services            y
 OpenSAF services        n
 Core libraries          n
 Samples                 n
 Tests                   n
 Other                   n


Comments (indicate scope for each "y" above):
---------------------------------------------

changeset e6d11de4f7800aee522088ccee189f992bf7a2f0
Author: Anders Bjornerstedt <anders.bjornerst...@ericsson.com>
Date:   Wed, 12 Feb 2014 19:29:47 +0100

        IMM: 2PBE slave uses main/applier thread for PRT data flow [#623]

        The main cause for the unclean cutoff was that the 2PBE slave used the
        runtime thread for dataflow for PRTO-create, PRTO-delete and 
PRTA-update.

        The termination of PBE is triggered by a regular CCB commit, which 
arrives
        via the main/applier thread in the 2PBE slave. As soon as the PBE
        termination CCB has comitted in imm-ram, the IMMNDs at the SCs will 
order
        their PBEs to terminate by sending a SIGTERM to their PBE process. The
        signal is caught and translated to an event on a termination descriptor
        polled by the main/applier thread.

        Because the 2PBE-slave did essentially all processing of PRTO ops in the
        runtime thread, they would esily be interrupted and terminated from the
        main/applier thread, before completing their current PRT job in the 
runtime
        thread.

        Such interruption/termination of the process, by a searate thread is 
very
        timing sensitive. This sometimes caused a discrepancy in termination 
between
        the primary PBE and the slave PBE.

        By transferring all the data flow for PRTO operations to the 
main/applier
        thread in the slave, any on-going PRTO job in the slave will complete 
before
        the termination job is excecuted by the main-thread.

        The spinlock mechanism added for 2PBE to synchronize the use of sqlite
        transactions between the slave's main/applier thread and runtime thread 
has
        been elaborated to eliminate the risk of the sqlite transaction being
        comitted before the its transaction buildup has been completed. This is
        needed now because the buildup of PRTO-create and PRTA-update in the 
slave
        is executed by the runtime thread, but the sqlite commit in the slave is
        handled by the applier/main thread. So the main thead has to wait not 
just
        for the start of the transaction, but also for the completion of the
        prepare/buildup.

        This patch also does a bit of minor cleanup in the fuctions touched.

changeset 5df9398a1210a8144c8d3ea2449f576e4c370896
Author: Anders Bjornerstedt <anders.bjornerst...@ericsson.com>
Date:   Wed, 12 Feb 2014 19:36:52 +0100

        IMM: 2PBE slave wait for prepare from primary in PRT-create & PRT-update
        [#623]

        Risk for discrepancy between 2PBE slave and primary is reduced further 
by
        having the slave wait for ok_to_prepare trigger from primary for the 
weak-
        ccb associated with the PRT operations. The 2PBE primary will also not
        perform the PRT operation if it does not get ok reply from slave on the
        prepare call, This follows the same pattern as existed earlier (in 4.4 
2PBE)
        for regular CCBs and for PRO deletes.

        The ok_to_prepare call is a synchronous admin-opertion invoked from 
primary
        to slave. At primary it blocks the main thread waiting for reply. At 
slave
        it is received and executed by the slave's runtime thread, where the 
slave's
        main thread should have received the data and be blocked waiting for 
this
        ok_to_pripare.

        The risk for discrepancy in the sqlite commit to the two pbe files 
should be
        minimal and only possible by crash or SIGKILL of one or both PBEs during
        sqlite commit processing. A SIGTERM on the PBEs should not be able to 
cause
        a discrepancy.


Complete diffstat:
------------------
 osaf/libs/agents/saf/imma/imma_oi_api.c          |   10 +-
 osaf/libs/agents/saf/imma/imma_proc.c            |    6 +-
 osaf/libs/common/immsv/immpbe_dump.cc            |   60 ++++++++-
 osaf/libs/common/immsv/include/immpbe_dump.hh    |    3 +-
 osaf/services/saf/immsv/immnd/ImmModel.cc        |    2 +-
 osaf/services/saf/immsv/immnd/immnd_evt.c        |    1 -
 osaf/services/saf/immsv/immpbed/immpbe_daemon.cc |  370 
++++++++++++++++++++++++++++++++++-----------------------
 7 files changed, 286 insertions(+), 166 deletions(-)


Testing Commands:
-----------------
Relevant tests would be ones excercising PRTO-create/delete and PRTA-update.
Plus disabling and enabling the PBE. 
The simplest wait to detect if there is a logical discrepancy between the
files is to perform a cluster load. One can see in the syslog if the 2PBE
arbitration detected any discrepancy or not. IF the PBE was terminated 
in a controlled way before the cluster restart, there should never be
any such discrepancy. 

Alternative ways to detec discrepancy, without restarting the cluster exists,
but hey require the PBE to be stopped/disabled and then some processing of 
the sqlite files. There may be spurious (non logical) differences in the
sqlite file related to object-id, a numerical key assigned locally in each
PBE to each imm-object stored in the file. The object-id would be the same
in the two files after a cluster start. But as soon as one or both of the 
PBE processes has restarted (without a cluster restart) then the object-ids
in the sqlite files will diverge. 


Testing, Expected Results:
--------------------------
This is hard to test.
WE have tests here which detected the discrepancy.
We will try to re-run these tests here before pushing.
At the same time I would like to have this fix be part of 4.4.0  so
please run plain regression testing with this fix.

The fix *should* not affect 0PBE and 1PBE, but one never knows for sure. 


Conditions of Submission:
-------------------------
Ack from Neel.


Arch      Built     Started    Linux distro
-------------------------------------------
mips        n          n
mips64      n          n
x86         n          n
x86_64      n          n
powerpc     n          n
powerpc64   n          n


Reviewer Checklist:
-------------------
[Submitters: make sure that your review doesn't trigger any checkmarks!]


Your checkin has not passed review because (see checked entries):

___ Your RR template is generally incomplete; it has too many blank entries
    that need proper data filled in.

___ You have failed to nominate the proper persons for review and push.

___ Your patches do not have proper short+long header

___ You have grammar/spelling in your header that is unacceptable.

___ You have exceeded a sensible line length in your headers/comments/text.

___ You have failed to put in a proper Trac Ticket # into your commits.

___ You have incorrectly put/left internal data in your comments/files
    (i.e. internal bug tracking tool IDs, product names etc)

___ You have not given any evidence of testing beyond basic build tests.
    Demonstrate some level of runtime or other sanity testing.

___ You have ^M present in some of your files. These have to be removed.

___ You have needlessly changed whitespace or added whitespace crimes
    like trailing spaces, or spaces before tabs.

___ You have mixed real technical changes with whitespace and other
    cosmetic code cleanup changes. These have to be separate commits.

___ You need to refactor your submission into logical chunks; there is
    too much content into a single commit.

___ You have extraneous garbage in your review (merge commits etc)

___ You have giant attachments which should never have been sent;
    Instead you should place your content in a public tree to be pulled.

___ You have too many commits attached to an e-mail; resend as threaded
    commits, or place in a public tree for a pull.

___ You have resent this content multiple times without a clear indication
    of what has changed between each re-send.

___ You have failed to adequately and individually address all of the
    comments and change requests that were proposed in the initial review.

___ You have a misconfigured ~/.hgrc file (i.e. username, email etc)

___ Your computer have a badly configured date and time; confusing the
    the threaded patch review.

___ Your changes affect IPC mechanism, and you don't present any results
    for in-service upgradability test.

___ Your changes affect user manual and documentation, your patch series
    do not contain the patch that updates the Doxygen manual.


------------------------------------------------------------------------------
Android apps run on BlackBerry 10
Introducing the new BlackBerry 10.2.1 Runtime for Android apps.
Now with support for Jelly Bean, Bluetooth, Mapview and more.
Get your Android app in front of a whole new audience.  Start now.
http://pubads.g.doubleclick.net/gampad/clk?id=124407151&iu=/4140/ostg.clktrk
_______________________________________________
Opensaf-devel mailing list
Opensaf-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-devel

Reply via email to