Hi AndersBj,

Reviewed and tested the patch. Ack.

While testing one scenario (a delay introduced in the slave at
saImmOiRtObjectCreate_2 and saImmOmAdminOwnerSet), ERR_NOT_EXIST is
returned.

syslog:
---------
Apr 8 14:41:35 Slot-3 osafimmpbed: IN saImmRepositoryInit: SA_IMM_KEEP_REPOSITORY - attaching to repository
Apr 8 14:41:35 Slot-3 osafimmpbed: NO pbeDaemon starting with obj-count:352
Apr 8 14:41:35 Slot-3 osafimmnd[23043]: NO Persistent Back End OI attached, pid: 23440
Apr 8 14:41:35 Slot-3 osafimmnd[23043]: NO Implementer connected: 12 (OpenSafImmPBE) <438, 2010f>
Apr 8 14:41:35 Slot-3 osafimmnd[23043]: NO Persistent Back End OI attached, pid: 23440
Apr 8 14:41:35 Slot-3 osafimmnd[23043]: NO Implementer connected: 13 (OsafImmPbeRt_A) <439, 2010f>
Apr 8 14:41:35 Slot-3 osafimmnd[23043]: NO implementer for class 'OpensafImm' is OpenSafImmPBE => class extent is safe.
Apr 8 14:41:35 Slot-3 osafimmpbed: IN Primary PBE got ERR_NOT_EXIST on atempt to update epoch towards slave PBE - ignoring
Apr 8 14:41:35 Slot-3 osafimmpbed: NO Update epoch 4 committing with ccbId:100000002/4294967298
Apr 8 14:41:36 Slot-3 osafimmnd[23043]: NO PBE-OI established on this SC. Dumping incrementally to file imm.db
Apr 8 14:41:36 Slot-3 osafimmnd[23043]: NO Epoch set to 5 in ImmModel
Apr 8 14:41:36 Slot-3 osafimmd[23028]: NO Successfully announced dump at node 2020f. New Epoch:5
Apr 8 14:41:36 Slot-3 osafimmd[23028]: NO ACT: New Epoch for IMMND process at node 2010f old epoch: 4 new epoch:5
Apr 8 14:41:36 Slot-3 osafimmpbed: IN Primary PBE got ERR_NOT_EXIST on atempt to update epoch towards slave PBE - ignoring
Apr 8 14:41:36 Slot-3 osafimmpbed: NO Update epoch 5 committing with ccbId:100000003/4294967299
Apr 8 14:41:36 Slot-3 osafimmnd[23043]: NO Implementer connected: 14 (implementer_rt_app) <441, 2010f>
Apr 8 14:41:36 Slot-3 osafimmpbed: IN Starting distributed PBE commit for PRTO create Ccb:100000004/4294967300
Apr 8 14:41:36 Slot-3 osafimmpbed: WA Start prepare for ccb: 100000004/4294967300 towards slave PBE returned: '12' from Immsv
Apr 8 14:41:36 Slot-3 osafimmpbed: WA PBE-A failed to prepare PRTO create Ccb:100000004/4294967300 towards PBE-B
Apr 8 14:41:36 Slot-3 osafimmpbed: NO 2PBE Error (20) in PRTO create (ccbId:100000004)
Apr 8 14:41:36 Slot-3 osafimmnd[23043]: WA Create of PERSISTENT runtime object 'parent' REVERTED. PBE rc:20
Apr 8 14:41:36 Slot-3 osafimmnd[23043]: NO Implementer locally disconnected. Marking it as doomed 14 <441, 2010f> (implementer_rt_app)
Apr 8 14:41:36 Slot-3 osafimmnd[23043]: NO Implementer disconnected 14 <441, 2010f> (implementer_rt_app)
Apr 8 14:41:36 Slot-3 osafimmnd[23043]: NO Implementer (applier) connected: 15 (@OpenSafImmPBE) <0, 2020f>
Apr 8 14:41:36 Slot-3 osafimmnd[23043]: NO Implementer connected: 16 (OsafImmPbeRt_B) <0, 2020f>
Apr 8 14:41:37 Slot-3 osafimmnd[23043]: NO PBE slave established on other SC. Dumping incrementally to file imm.db
Apr 8 14:41:40 Slot-3 osafimmpbed: IN Slave PBE replied with OK on attempt to update epoch

ERR_NOT_EXIST is returned because the slave PBE joined late (or slowly):
its runtime object osafImmPbeRt=B does not exist yet when the prepare
admin-op arrives, as this trace shows:

Apr 8 14:41:36.227575 osafimmnd [23043:ImmModel.cc:9840] T5 Admin op on objectName:osafImmPbeRt=B,opensafImm=opensafImm,safApp=safImmService
Apr 8 14:41:36.227591 osafimmnd [23043:ImmModel.cc:9891] T7 ERR_NOT_EXIST: object 'osafImmPbeRt=B,opensafImm=opensafImm,safApp=safImmService' does not exist

In this case, ERR_NOT_EXIST could also be handled the same way as
TRY_AGAIN.

/Neel.

On Tuesday 01 April 2014 10:57 PM, Anders Bjornerstedt wrote:
> Summary: IMM: 2PBE - PBE-A tolerates BAD_OPERATION reply on ccb-prepare from
> PBE-B [#830]
> Review request for Trac Ticket(s): 830
> Peer Reviewer(s): Neel
> Pull request to:
> Affected branch(es): 4.4; default(4.5)
> Development branch:
>
> --------------------------------
> Impacted area       Impact y/n
> --------------------------------
> Docs                       n
> Build system               n
> RPM/packaging              n
> Configuration files        n
> Startup scripts            n
> SAF services               y
> OpenSAF services           n
> Core libraries             n
> Samples                    n
> Tests                      n
> Other                      n
>
>
> Comments (indicate scope for each "y" above):
> ---------------------------------------------
>
> changeset 31109d862e0218f9da84d3bbd152916f32aed31f
> Author: Anders Bjornerstedt <anders.bjornerst...@ericsson.com>
> Date:   Tue, 01 Apr 2014 19:16:46 +0200
>
> IMM: 2PBE - PBE-A tolerates BAD_OPERATION reply on ccb-prepare from
> PBE-B [#830]
>
> An SMF campaign enables the PBE (with 2PBE) and immediately attempts to
> update a PRTA. This fails because the slave PBE (PBE-B) has not completed
> its initialization when it receives the prepare message (for the PRTA
> update). This causes the PRTA update to be rejected. It also causes the
> PBE slave to exit and restart again due to an erroneous abort of an empty
> sqlite transaction.
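The slave-side symptom (a ROLLBACK issued when no transaction was ever begun) can be reproduced with Python's built-in sqlite3 module. This is a minimal sketch, not the actual C++ code from the patch: `pbe_abort_trans` is a hypothetical stand-in for the patched pbeAbortTrans, and it assumes that an "empty transaction" means SQLite is still in autocommit mode, i.e. no BEGIN has been executed.

```python
import sqlite3


def pbe_abort_trans(conn: sqlite3.Connection) -> bool:
    """Hypothetical analogue of the patched pbeAbortTrans.

    Returns True if a ROLLBACK was issued and False if there was nothing
    to abort (the transaction was empty).
    """
    if not conn.in_transaction:
        # No BEGIN has been executed: a ROLLBACK here would fail with
        # "cannot rollback - no transaction is active", the error that
        # made the unpatched slave PBE exit.
        return False
    conn.execute("ROLLBACK")
    return True


# isolation_level=None: the sqlite3 module issues no implicit BEGIN, so a
# transaction exists only between an explicit BEGIN and COMMIT/ROLLBACK.
conn = sqlite3.connect(":memory:", isolation_level=None)

# The slave received the operational callback but never the prepare, so no
# BEGIN was issued. An unconditional ROLLBACK reproduces the logged error.
try:
    conn.execute("ROLLBACK")
except sqlite3.OperationalError as err:
    print("unguarded abort failed:", err)

print(pbe_abort_trans(conn))  # False: empty transaction, nothing done

conn.execute("BEGIN")
conn.execute("CREATE TABLE objects (dn TEXT)")
print(pbe_abort_trans(conn))  # True: a real transaction was rolled back
```

Checking the connection state, rather than remembering whether a callback arrived, keeps the guard correct even if the abort path is reached from several places.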
>
> Mar 31 10:33:19 SC-2-1 osafimmnd[13967]: NO ERR_BAD_OPERATION: Mismatch on administrative owner '' != 'safImmService'
> Mar 31 10:33:19 SC-2-1 osafimmpbed: WA Start prepare for ccb: 100000078/4294967416 towards slave PBE returned: '20' from Immsv
> Mar 31 10:33:19 SC-2-1 osafimmpbed: WA PBE-A failed to prepare PRTA update Ccb:100000078/4294967416 towards PBE-B
> Mar 31 10:33:19 SC-2-1 osafimmpbed: NO 2PBE Error (20) in PRTA update (ccbId:100000078)
> Mar 31 10:33:19 SC-2-1 osafimmnd[13967]: WA update of PERSISTENT runtime attributes in object 'safSmfCampaign=ERIC-TestAppInstall,safApp=safSmfService' REVERTED. PBE rc:20
>
> Mar 31 10:33:22 SC-2-2 osafimmpbed: IN PBE slave waiting for prepare from primary on PRTA update ccb:100000078
> Mar 31 10:33:22 SC-2-2 osafimmnd[5243]: WA update of PERSISTENT runtime attributes in object 'safSmfCampaign=ERIC-TestAppInstall,safApp=safSmfService' REVERTED. PBE rc:20
> Mar 31 10:33:24 SC-2-2 osafimmpbed: IN PBE slave waiting for prepare from primary on PRTA update ccb:100000078
> Mar 31 10:33:24 SC-2-2 osafimmpbed: NO Slave PBE time-out in waiting on porepare for PRTA update ccb:100000078 dn:safSmfCampaign=ERIC-TestAppInstall,safApp=safSmfService
> Mar 31 10:33:24 SC-2-2 osafimmpbed: ER SQL statement ('ROLLBACK') failed because: cannot rollback - no transaction is active
> Mar 31 10:33:24 SC-2-2 osafimmpbed: ER Exiting (line:2827)
>
> The problem is the time gap between the creation of the RTO representing
> the slave PBE and the setting of admin-owner by the slave PBE for that
> RTO. Admin owner must be set for an admin-operation on the object to
> succeed.
>
> The fix is to have the primary PBE be tolerant of receiving
> ERR_BAD_OPERATION on the prepare request, treating it the same way it
> treats ERR_NOT_EXIST for the slave PBE RTO not existing, or ERR_TRY_AGAIN
> for the slave PBE still being busy with some other transaction.
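A minimal Python sketch of the primary-side policy described here: keep re-sending the prepare while the slave answers with a "not ready yet" code, and give up after an upper time limit. The function name `prepare_towards_slave`, the `send_prepare` callback and the 5 s / 0.5 s timing values are assumptions for illustration; the patch's actual loop and limits are not shown in this review. The numeric codes follow the SA Forum AIS saAis.h numbering, which matches the '12' (ERR_NOT_EXIST) and '20' (ERR_BAD_OPERATION) in the logs. Passing NOT_EXIST in `retryable` models the handling suggested at the top of this reply.

```python
import time
from enum import IntEnum
from typing import Callable, FrozenSet


class SaAisError(IntEnum):
    """Subset of SA Forum AIS return codes (numbering as in saAis.h)."""
    OK = 1
    TRY_AGAIN = 6
    NOT_EXIST = 12
    BAD_OPERATION = 20


# Replies treated as "slave PBE not ready yet" (per the patch description).
DEFAULT_RETRYABLE: FrozenSet[SaAisError] = frozenset(
    {SaAisError.TRY_AGAIN, SaAisError.BAD_OPERATION})


def prepare_towards_slave(send_prepare: Callable[[], SaAisError],
                          timeout_s: float = 5.0,
                          retry_interval_s: float = 0.5,
                          retryable: FrozenSet[SaAisError] = DEFAULT_RETRYABLE,
                          clock: Callable[[], float] = time.monotonic,
                          sleep: Callable[[float], None] = time.sleep
                          ) -> SaAisError:
    """Hypothetical retry loop for the prepare admin-op sent to the slave.

    Returns the first reply that is not retryable, or the last retryable
    reply once the time budget is spent (the caller treats that as failure).
    """
    deadline = clock() + timeout_s
    while True:
        rc = send_prepare()
        if rc not in retryable:
            return rc  # OK or a hard error: no point in retrying
        if clock() + retry_interval_s > deadline:
            return rc  # upper time limit reached
        sleep(retry_interval_s)


# Usage: a slave that has created its RTO but not yet set admin-owner
# answers BAD_OPERATION twice before it is ready.
replies = iter([SaAisError.BAD_OPERATION, SaAisError.BAD_OPERATION,
                SaAisError.OK])
print(prepare_towards_slave(lambda: next(replies),
                            retry_interval_s=0.01).name)  # OK

# Also retrying on NOT_EXIST covers a slave whose RTO is not created yet.
late_slave = iter([SaAisError.NOT_EXIST, SaAisError.OK])
print(prepare_towards_slave(lambda: next(late_slave), retry_interval_s=0.01,
                            retryable=DEFAULT_RETRYABLE | {SaAisError.NOT_EXIST}
                            ).name)  # OK
```

Injecting `clock` and `sleep` keeps the deadline logic testable without real waiting.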
> A fix is also made to the pbeAbortTrans function to do nothing if the
> transaction is empty.
>
>
> Complete diffstat:
> ------------------
>  osaf/libs/common/immsv/immpbe_dump.cc            |  9 +++++----
>  osaf/services/saf/immsv/immpbed/immpbe_daemon.cc | 12 ++++++++----
>  2 files changed, 13 insertions(+), 8 deletions(-)
>
>
> Testing Commands:
> -----------------
> Really difficult to test.
> I had to do elaborate fault injection of sleeps in the PBE and use
> a hacked version of immapplier to generate PRTO operations.
>
>
> Testing, Expected Results:
> --------------------------
> Any attempt to perform a PRTO create/delete/update or any CCB apply
> that arrives at the slave PBE (PBE-B) after it has created its PBE
> runtime object, but before it has set admin-owner for it (hint:
> insert a delay between these), will result in the primary
> receiving ERR_BAD_OPERATION from the slave on the admin-op
> request for preparing the transaction. Without this patch, the
> primary PBE (PBE-A) will regard the error as failure to process
> the PRTO operation and revert it. With this patch, the primary
> PBE will treat ERR_BAD_OPERATION the same way as TRY_AGAIN,
> with an upper time limit.
>
> In addition, the slave PBE will not restart due to failure
> to abort an empty sqlite transaction. The abort occurs in the
> slave without this patch because the slave has received the
> operational callback, but never receives the prepare.
>
>
> Conditions of Submission:
> -------------------------
> Ack from Neel.
>
>
> Arch      Built     Started    Linux distro
> -------------------------------------------
> mips        n          n
> mips64      n          n
> x86         n          n
> x86_64      n          n
> powerpc     n          n
> powerpc64   n          n
>
>
> Reviewer Checklist:
> -------------------
> [Submitters: make sure that your review doesn't trigger any checkmarks!]
>
>
> Your checkin has not passed review because (see checked entries):
>
> ___ Your RR template is generally incomplete; it has too many blank entries
>     that need proper data filled in.
>
> ___ You have failed to nominate the proper persons for review and push.
>
> ___ Your patches do not have proper short+long header
>
> ___ You have grammar/spelling in your header that is unacceptable.
>
> ___ You have exceeded a sensible line length in your headers/comments/text.
>
> ___ You have failed to put in a proper Trac Ticket # into your commits.
>
> ___ You have incorrectly put/left internal data in your comments/files
>     (i.e. internal bug tracking tool IDs, product names etc)
>
> ___ You have not given any evidence of testing beyond basic build tests.
>     Demonstrate some level of runtime or other sanity testing.
>
> ___ You have ^M present in some of your files. These have to be removed.
>
> ___ You have needlessly changed whitespace or added whitespace crimes
>     like trailing spaces, or spaces before tabs.
>
> ___ You have mixed real technical changes with whitespace and other
>     cosmetic code cleanup changes. These have to be separate commits.
>
> ___ You need to refactor your submission into logical chunks; there is
>     too much content in a single commit.
>
> ___ You have extraneous garbage in your review (merge commits etc)
>
> ___ You have giant attachments which should never have been sent;
>     instead you should place your content in a public tree to be pulled.
>
> ___ You have too many commits attached to an e-mail; resend as threaded
>     commits, or place in a public tree for a pull.
>
> ___ You have resent this content multiple times without a clear indication
>     of what has changed between each re-send.
>
> ___ You have failed to adequately and individually address all of the
>     comments and change requests that were proposed in the initial review.
>
> ___ You have a misconfigured ~/.hgrc file (i.e. username, email etc)
>
> ___ Your computer has a badly configured date and time, confusing
>     the threaded patch review.
>
> ___ Your changes affect IPC mechanism, and you don't present any results
>     for in-service upgradability test.
>
> ___ Your changes affect user manual and documentation, but your patch
>     series does not contain the patch that updates the Doxygen manual.