Summary: IMM: 2PBE - PBE-A tolerates BAD_OPERATION reply on ccb-prepare from PBE-B [#830]
Review request for Trac Ticket(s): 830
Peer Reviewer(s): Neel
Pull request to:
Affected branch(es): 4.4; default(4.5)
Development branch:

--------------------------------
Impacted area       Impact y/n
--------------------------------
 Docs                    n
 Build system            n
 RPM/packaging           n
 Configuration files     n
 Startup scripts         n
 SAF services            y
 OpenSAF services        n
 Core libraries          n
 Samples                 n
 Tests                   n
 Other                   n


Comments (indicate scope for each "y" above):
---------------------------------------------

changeset 31109d862e0218f9da84d3bbd152916f32aed31f
Author: Anders Bjornerstedt <anders.bjornerst...@ericsson.com>
Date:   Tue, 01 Apr 2014 19:16:46 +0200

IMM: 2PBE - PBE-A tolerates BAD_OPERATION reply on ccb-prepare from PBE-B [#830]

An SMF campaign enables the PBE (with 2PBE) and immediately attempts to
update a PRTA. This fails because the slave PBE (PBE-B) has not completed
its initialization when it receives the prepare message (for the PRTA
update). This causes the PRTA update to be rejected. It also causes the PBE
slave to exit and restart again due to an erroneous abort of an empty sqlite
transaction.

Mar 31 10:33:19 SC-2-1 osafimmnd[13967]: NO ERR_BAD_OPERATION: Mismatch on administrative owner '' != 'safImmService'
Mar 31 10:33:19 SC-2-1 osafimmpbed: WA Start prepare for ccb: 100000078/4294967416 towards slave PBE returned: '20' from Immsv
Mar 31 10:33:19 SC-2-1 osafimmpbed: WA PBE-A failed to prepare PRTA update Ccb:100000078/4294967416 towards PBE-B
Mar 31 10:33:19 SC-2-1 osafimmpbed: NO 2PBE Error (20) in PRTA update (ccbId:100000078)
Mar 31 10:33:19 SC-2-1 osafimmnd[13967]: WA update of PERSISTENT runtime attributes in object 'safSmfCampaign=ERIC-TestAppInstall,safApp=safSmfService' REVERTED. PBE rc:20
Mar 31 10:33:22 SC-2-2 osafimmpbed: IN PBE slave waiting for prepare from primary on PRTA update ccb:100000078
Mar 31 10:33:22 SC-2-2 osafimmnd[5243]: WA update of PERSISTENT runtime attributes in object 'safSmfCampaign=ERIC-TestAppInstall,safApp=safSmfService' REVERTED.
PBE rc:20
Mar 31 10:33:24 SC-2-2 osafimmpbed: IN PBE slave waiting for prepare from primary on PRTA update ccb:100000078
Mar 31 10:33:24 SC-2-2 osafimmpbed: NO Slave PBE time-out in waiting on porepare for PRTA update ccb:100000078 dn:safSmfCampaign=ERIC-TestAppInstall,safApp=safSmfService
Mar 31 10:33:24 SC-2-2 osafimmpbed: ER SQL statement ('ROLLBACK') failed because: cannot rollback - no transaction is active
Mar 31 10:33:24 SC-2-2 osafimmpbed: ER Exiting (line:2827)

The problem is the time gap between the creation of the RTO representing the
slave PBE and the setting of admin-owner by the slave PBE for that RTO.
Admin-owner must be set for an admin-operation on the object to succeed.

The fix is to have the primary PBE tolerate ERR_BAD_OPERATION on the prepare
request, treating it the same way it treats ERR_NOT_EXIST (the slave PBE RTO
does not exist yet) or ERR_TRY_AGAIN (the slave PBE is still busy with some
other transaction).

A fix is also made to the pbeAbortTrans function so that it does nothing if
the transaction is empty.


Complete diffstat:
------------------
 osaf/libs/common/immsv/immpbe_dump.cc            |   9 +++++----
 osaf/services/saf/immsv/immpbed/immpbe_daemon.cc |  12 ++++++++----
 2 files changed, 13 insertions(+), 8 deletions(-)


Testing Commands:
-----------------
Really difficult to test.
I had to do elaborate fault injection of sleeps in the PBE and use a hacked
version of immapplier to generate PRTO operations.


Testing, Expected Results:
--------------------------
Any attempt to perform a PRTO create/delete/update, or any CCB apply, that
arrives at the slave PBE (PBE-B) after it has created its PBE runtime object
but before it has set admin-owner for it (hint: insert a delay between
these) will result in the primary receiving ERR_BAD_OPERATION from the slave
on the admin-op request for preparing the transaction.

Without this patch, the primary PBE (PBE-A) will regard the error as a
failure to process the PRTO operation and revert it. With this patch,
the primary PBE will treat ERR_BAD_OPERATION the same way as ERR_TRY_AGAIN,
with an upper time limit.

In addition, the slave PBE will not restart due to failure to abort an empty
sqlite transaction. Without this patch the abort occurs in the slave because
the slave has received the operational callback but never receives the
prepare.


Conditions of Submission:
-------------------------
Ack from Neel.


Arch      Built     Started    Linux distro
-------------------------------------------
mips        n          n
mips64      n          n
x86         n          n
x86_64      n          n
powerpc     n          n
powerpc64   n          n


Reviewer Checklist:
-------------------
[Submitters: make sure that your review doesn't trigger any checkmarks!]


Your checkin has not passed review because (see checked entries):

___ Your RR template is generally incomplete; it has too many blank entries
    that need proper data filled in.

___ You have failed to nominate the proper persons for review and push.

___ Your patches do not have proper short+long header

___ You have grammar/spelling in your header that is unacceptable.

___ You have exceeded a sensible line length in your headers/comments/text.

___ You have failed to put in a proper Trac Ticket # into your commits.

___ You have incorrectly put/left internal data in your comments/files
    (i.e. internal bug tracking tool IDs, product names etc)

___ You have not given any evidence of testing beyond basic build tests.
    Demonstrate some level of runtime or other sanity testing.

___ You have ^M present in some of your files. These have to be removed.

___ You have needlessly changed whitespace or added whitespace crimes
    like trailing spaces, or spaces before tabs.

___ You have mixed real technical changes with whitespace and other
    cosmetic code cleanup changes. These have to be separate commits.

___ You need to refactor your submission into logical chunks; there is
    too much content into a single commit.

___ You have extraneous garbage in your review (merge commits etc)

___ You have giant attachments which should never have been sent;
    Instead you should place your content in a public tree to be pulled.

___ You have too many commits attached to an e-mail; resend as threaded
    commits, or place in a public tree for a pull.

___ You have resent this content multiple times without a clear indication
    of what has changed between each re-send.

___ You have failed to adequately and individually address all of the
    comments and change requests that were proposed in the initial review.

___ You have a misconfigured ~/.hgrc file (i.e. username, email etc)

___ Your computer has a badly configured date and time; confusing
    the threaded patch review.

___ Your changes affect IPC mechanism, and you don't present any results
    for in-service upgradability test.

___ Your changes affect user manual and documentation, your patch series
    do not contain the patch that updates the Doxygen manual.

------------------------------------------------------------------------------
_______________________________________________
Opensaf-devel mailing list
Opensaf-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-devel
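
Illustration (not part of the patch): the behaviour described in the commit
message can be sketched roughly as below. This is a minimal sketch under
stated assumptions, not the code in immpbe_daemon.cc or immpbe_dump.cc. All
names (AisRc, is_retryable, prepare_with_retry, Trans, abort_trans) are
hypothetical; only the numeric values mirror the SaAisErrorT codes, with 20
being the rc shown in the log excerpt. The upper limit is modelled as a
maximum number of attempts rather than a time limit.

```cpp
#include <cassert>
#include <functional>

// Hypothetical stand-ins for the SaAisErrorT codes involved; the numeric
// values follow saAis.h (20 is the rc seen in the log excerpt).
enum AisRc {
  kOk = 1,
  kTryAgain = 6,
  kNotExist = 12,
  kBadOperation = 20,
  kFailedOperation = 21
};

// Replies that mean "slave PBE not ready yet" rather than a real failure.
inline bool is_retryable(AisRc rc) {
  return rc == kTryAgain || rc == kNotExist || rc == kBadOperation;
}

// Sends the prepare request until the slave gives a non-retryable reply or
// max_attempts is used up (the upper limit). `wait` runs between attempts.
inline AisRc prepare_with_retry(const std::function<AisRc()>& send_prepare,
                                int max_attempts,
                                const std::function<void()>& wait) {
  assert(max_attempts > 0);
  AisRc rc = kTryAgain;
  for (int attempt = 1; attempt <= max_attempts; ++attempt) {
    rc = send_prepare();
    if (!is_retryable(rc)) return rc;
    if (attempt < max_attempts) wait();
  }
  return rc;  // still a retryable code: the caller treats this as failure
}

// Hypothetical model of a transaction handle; `active` stands for
// "a sqlite transaction has actually been started".
struct Trans {
  bool active = false;
  int rollbacks = 0;  // counts how many ROLLBACKs would have been issued
};

// Abort that does nothing when no transaction is active, so no ROLLBACK is
// attempted on an empty transaction. Returns true if a rollback happened.
inline bool abort_trans(Trans& t) {
  if (!t.active) return false;
  ++t.rollbacks;
  t.active = false;
  return true;
}
```

In this sketch the caller supplies `wait`, for example a short sleep, and
chooses `max_attempts` so that the total wait stays within whatever limit
the primary PBE is prepared to accept before reverting the operation.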