[tickets] [opensaf:tickets] #1398 smf: Add capability to redo CCBs that fail
A handler is being created that will contain all the IMM handling needed to make a modification of the IMM model. This includes:

* Create, Modify and Delete of objects
* An easy-to-use generic C++ API where no IMM APIs have to be handled
* Handling all needed IMM (C) APIs
* Handling all rules associated with usage of the IMM APIs
* Handling all possible recovery when IMM APIs return something other than OK
* Etc.

Attached is a .h file with a proposed API for this handling.

Attachments:

- [immccb.h](https://sourceforge.net/p/opensaf/tickets/_discuss/thread/2179c610/dd3c/attachment/immccb.h) (14.1 kB; application/octet-stream)

---

** [tickets:#1398] smf: Add capability to redo CCBs that fail **

**Status:** accepted
**Milestone:** 5.18.01
**Created:** Wed Jul 01, 2015 02:07 PM UTC by Rafael Odzakow
**Last Updated:** Mon Nov 20, 2017 03:46 PM UTC
**Owner:** elunlen

CCBs may fail for a variety of resource-related reasons. SMF campaigns can be made more robust if they are capable of redoing/replaying a CCB that has been aborted. A CCB that is aborted due to a validation error will not succeed when replayed, but no damage will be done either. A CCB that is aborted due to resource reasons may succeed when replayed, avoiding the abandonment of the whole campaign.

During the final stages of an upgrade campaign PBE is enabled. PBE is not ready until it attaches, so CCB operations will get TRY_AGAIN in that window. Once the PBE has attached, the IMM is persistent-write-available and CCB operations are allowed again.

Any CCB that was started and was adding operations *before* the PBE was enabled by a CCB will be a doomed CCB. This is because the CCB generated operations before the PBE was enabled, and thus before the PBE had even started, so the PBE will be unaware of these pre-PBE-enable operations. Such a CCB would fail on an op-count check in the CCB commit processing of that CCB in the PBE.
In 4.7-tentative an enhancement, #1261, was implemented in the IMM service to make this abort cleaner, i.e. to avoid the ugly op-count error in the PBE. The PBE generates an admin-operation to abort *all* open CCBs (all CCBs that are active but not critical) just before attaching.

The problem was that the first implementation of #1261 resulted in the PBE often attaching as OI *before* the abort of non-critical CCBs had been processed. When the abort requested by the PBE was finally processed, it also aborted "innocent" CCBs that had actually started *after* the PBE was attached as PBE-OI.

The syndrome as such, i.e. attach of PBE causing the abort of a valid CCB, could still happen on earlier releases but was quite rare. The syslog would then show the op-count error reported by the PBE.

A possible improvement in SMF is to read the runtime attribute opensafImmNostdFlags in the OpenSAF IMM object opensafImm=opensafImm,safApp=safImmService and check that it is not which would mean that PBE is attached. But it is not really clear why this is needed in 4.7-tentative when it was not needed earlier.

CCBs may actually get aborted due to a resource error at any time, not only in conjunction with PBE enable. A general increase in the robustness of SMF campaigns could be achieved by adding logic for redoing CCBs that fail unexpectedly. If such a CCB was valid, i.e. it was aborted due to a resource error and not a validation error, then it has a high probability of succeeding when retried.

IMM ticket related to this: #1261

Jun 29 10:36:35 SC-2-2 osafimmpbed: IN Admop for aborting CCBs result: 1, immsv returned 1
Jun 29 10:36:35 SC-2-2 osafimmpbed: NO Update epoch 63 committing with ccbId:10185/4294967685
Jun 29 10:36:36 SC-2-2 osafsmfd[4726]: NO CAMP: Start campaign complete actions (95)
Jun 29 10:36:36 SC-2-2 osafimmnd[4476]: NO Create of PERSISTENT runtime object 'smfRollbackElement=CampComplete,safSmfCampaign=ERIC-CMWUpgrade,safApp=safSmfService' (safSmfCampaign).
Jun 29 10:36:36 SC-2-2 osafimmnd[4476]: NO Ccb 305 COMMITTED (immcfg_SC-2-1_14718)
Jun 29 10:36:36 SC-2-2 osafimmnd[4476]: NO Ccb 306 COMMITTED (immcfg_SC-2-1_14741)
Jun 29 10:36:36 SC-2-2 osafimmnd[4476]: NO Ccb 307 COMMITTED (immcfg_SC-2-1_14764)
Jun 29 10:36:36 SC-2-2 osafimmnd[4476]: NO Ccb 308 COMMITTED (immcfg_SC-2-1_14787)
Jun 29 10:36:37 SC-2-2 osafimmnd[4476]: NO Ccb 309 COMMITTED (immcfg_SC-2-1_14810)
Jun 29 10:36:37 SC-2-2 osafimmnd[4476]: NO Ccb 310 COMMITTED (immcfg_SC-2-1_14833)
Jun 29 10:36:37 SC-2-2 osafimmnd[4476]: NO Ccb 311 COMMITTED (immcfg_SC-2-1_14856)
Jun 29 10:36:37 SC-2-2 osafimmnd[4476]: NO Ccb 312 COMMITTED (immcfg_SC-2-1_14879)
Jun 29 10:36:37 SC-2-2 osafimmnd[4476]: NO Create of PERSISTENT runtime object 'smfRollbackElement=ccb_0002,smfRollbackElement=CampComplete,safSmfCampaign=ERIC-CMWUpgrade,safApp=safSmfService' (safSmfCampaign).
Jun 29 10:36:37 SC-2-2 osafimmnd[4476]: NO PBE-OI established on this SC. Dumping incrementally to file imm.db
Jun 29 10:36:37 SC-2-2 osafimmnd[4476]: NO CCB 313 aborted by:
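
The redo behaviour described in this ticket could be sketched roughly as follows. This is only an illustration under stated assumptions, not the API proposed in immccb.h: the `CcbOutcome` enum, the `ApplyWithRedo` function and the `build_and_apply` callback are hypothetical names, and no real IMM calls are made. How an aborted CCB is classified as a resource abort or a validation abort is left to the callback. The point shown is that the whole CCB is rebuilt and re-applied on a transient failure, while a validation abort stops the loop at once because a replay cannot succeed.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical outcome of one attempt to build and apply a complete CCB.
// These values are illustrative and are not IMM return codes.
enum class CcbOutcome {
  kCommitted,          // the CCB was applied
  kTryAgain,           // service temporarily unavailable, e.g. PBE not yet attached
  kAbortedResource,    // aborted for a resource reason; a replay may succeed
  kAbortedValidation   // rejected by validation; a replay would fail the same way
};

// Calls build_and_apply until it commits, reports a validation abort, or
// max_attempts is reached. The callback must recreate the CCB from scratch on
// every call, since an aborted CCB cannot be continued.
// If max_attempts is zero or negative the callback is never called and
// kAbortedResource is returned.
CcbOutcome ApplyWithRedo(const std::function<CcbOutcome()>& build_and_apply,
                         int max_attempts,
                         std::chrono::milliseconds delay) {
  CcbOutcome outcome = CcbOutcome::kAbortedResource;
  for (int attempt = 1; attempt <= max_attempts; ++attempt) {
    outcome = build_and_apply();
    if (outcome == CcbOutcome::kCommitted ||
        outcome == CcbOutcome::kAbortedValidation) {
      return outcome;  // final result: success, or a failure not worth replaying
    }
    if (attempt < max_attempts) {
      std::this_thread::sleep_for(delay);  // back off before the next replay
    }
  }
  return outcome;  // last transient failure after all attempts were used
}
```

A caller would pass a callable that creates the CCB, adds all operations and applies it, for example `ApplyWithRedo(apply_campaign_ccb, 5, std::chrono::milliseconds(500))`, where `apply_campaign_ccb` is again a hypothetical name. The attempt limit keeps a persistently failing CCB from blocking the campaign indefinitely.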
[tickets] [opensaf:tickets] #1398 smf: Add capability to redo CCBs that fail
- **status**: unassigned --> accepted
- **assigned_to**: elunlen
- **Blocker**: --> False
- **Milestone**: future --> 5.18.01

---

** [tickets:#1398] smf: Add capability to redo CCBs that fail **

**Status:** accepted
**Milestone:** 5.18.01
**Created:** Wed Jul 01, 2015 02:07 PM UTC by Rafael Odzakow
**Last Updated:** Wed Jul 15, 2015 12:02 PM UTC
**Owner:** elunlen
[tickets] [opensaf:tickets] #1398 smf: Add capability to redo CCBs that fail
- **Milestone**: 4.7-Tentative --> future

---

** [tickets:#1398] smf: Add capability to redo CCBs that fail **

**Status:** unassigned
**Milestone:** future
**Created:** Wed Jul 01, 2015 02:07 PM UTC by Rafael
**Last Updated:** Mon Jul 13, 2015 09:51 AM UTC
**Owner:** nobody

IMM ticket related to this: #1261

Jun 29 10:36:35 SC-2-2 osafimmpbed: IN Admop for aborting CCBs result: 1, immsv returned 1
Jun 29 10:36:35 SC-2-2 osafimmpbed: NO Update epoch 63 committing with ccbId:10185/4294967685
Jun 29 10:36:36 SC-2-2 osafsmfd[4726]: NO CAMP: Start campaign complete actions (95)
Jun 29 10:36:36 SC-2-2 osafimmnd[4476]: NO Create of PERSISTENT runtime object 'smfRollbackElement=CampComplete,safSmfCampaign=ERIC-CMWUpgrade,safApp=safSmfService' (safSmfCampaign).
Jun 29 10:36:36 SC-2-2 osafimmnd[4476]: NO Ccb 305 COMMITTED (immcfg_SC-2-1_14718)
Jun 29 10:36:36 SC-2-2 osafimmnd[4476]: NO Ccb 306 COMMITTED (immcfg_SC-2-1_14741)
Jun 29 10:36:36 SC-2-2 osafimmnd[4476]: NO Ccb 307 COMMITTED (immcfg_SC-2-1_14764)
Jun 29 10:36:36 SC-2-2 osafimmnd[4476]: NO Ccb 308 COMMITTED (immcfg_SC-2-1_14787)
Jun 29 10:36:37 SC-2-2 osafimmnd[4476]: NO Ccb 309 COMMITTED (immcfg_SC-2-1_14810)
Jun 29 10:36:37 SC-2-2 osafimmnd[4476]: NO Ccb 310 COMMITTED (immcfg_SC-2-1_14833)
Jun 29 10:36:37 SC-2-2 osafimmnd[4476]: NO Ccb 311 COMMITTED (immcfg_SC-2-1_14856)
Jun 29 10:36:37 SC-2-2 osafimmnd[4476]: NO Ccb 312 COMMITTED (immcfg_SC-2-1_14879)
Jun 29 10:36:37 SC-2-2 osafimmnd[4476]: NO Create of PERSISTENT runtime object 'smfRollbackElement=ccb_0002,smfRollbackElement=CampComplete,safSmfCampaign=ERIC-CMWUpgrade,safApp=safSmfService' (safSmfCampaign).
Jun 29 10:36:37 SC-2-2 osafimmnd[4476]: NO PBE-OI established on this SC. Dumping incrementally to file imm.db
Jun 29 10:36:37 SC-2-2 osafimmnd[4476]: NO CCB 313 aborted by: immadm -o 202 safRdn=immManagement,safApp=safImmService
Jun 29 10:36:37 SC-2-2 osafimmnd[4476]: WA Timeout while waiting for implementer, aborting ccb:313
Jun 29 10:36:37 SC-2-2 osafimmnd[4476]: NO Ccb 313 ABORTED (SMFSERVICE)
Jun 29 10:36:37 SC-2-2 osafimmnd[4476]: WA >>s_info->to_svc == 0<< reply context destroyed before this reply could be made
Jun 29 10:36:37 SC-2-2 osafimmnd[4476]: WA Failed to send response to agent/client over MDS
Jun 29 10:36:37 SC-2-2 osafimmnd[4476]: NO Ccb <313> not in correct state (12) for Apply ignoring request
Jun 29 10:36:37 SC-2-2 osafimmnd[4476]: WA Spurious an
[tickets] [opensaf:tickets] #1398 smf: Add capability to redo CCBs that fail
- **summary**: smf: ccb fails after pbe enable --> smf: Add capability to redo CCBs that fail
- Description has changed:

Diff:

--- old
+++ new
@@ -1,3 +1,11 @@
+CCBs may fail for a variety of resource related reasons. SMF campaigns can
+be made more robust if they are capable of redoing/replaying a CCB that has
+been aborted. A CCB that is aborted due to validation error will not succeed
+when replayed, but no damage will be done either. A CCB that is aborted due to
+resource reasons may succeed when replayed, avoiding the abandonement of the
+whole campaign.
+
+
 During the final stages of an upgrade campaign PBE is enabled.
 PBE is not ready until it attaches, so CCB operations will get TRY_AGAIN in that window.
 Once the PBE has attached the IMM is persistent-write-available and CCB operations are allowed again.

- **Comment**:

Now that the second fix/patch for #1261 has been pushed, please try to redo the test behind this ticket. The problem should be less prevalent now. I still think SMF should look into the possibility of redoing CCBs.

Changing the slogan for this ticket to address the more general enhancement of SMF being capable of re-doing/re-playing CCBs.

---

** [tickets:#1398] smf: Add capability to redo CCBs that fail **

**Status:** unassigned
**Milestone:** 4.7-Tentative
**Created:** Wed Jul 01, 2015 02:07 PM UTC by Rafael
**Last Updated:** Thu Jul 09, 2015 11:31 AM UTC
**Owner:** nobody