Hi,
Testing of the patch is in progress.
Verified in the stable SG case with all repair-related attributes true:
  - The node rebooted, and thus AMF performed recovery and repair jointly.
  - Observed alarms and state change notifications, and their clearance.

Some initial comments/observations when the repair-related attributes are false:
1) If one of the components enters the TERM_FAILED state, all other
   healthy components are terminated as well.
   This is old behavior, but it causes a service outage for the
   workloads assigned to these healthy components too.

2) In some cases, such as a lock on the SU, a component that is being
   given quiesced assignments faults and ends up in termination failure.
   AMF still removes the assignments and performs failover.
   Reason: AMFND sends assignment responses to AMFD even after faults.
   AMFD processes the assignment message and sends removal of
   assignments to AMFND, and so it goes on.
   In the patches, su_oper_event for a disabled SU is blocked in the
   TERM_FAILED state, but the same should be done for the assignment
   responses. Alternatively, such assignment responses can be dropped
   at AMFD if it sees the SU in the TERM_FAILED state, like:


diff --git a/osaf/services/saf/amf/amfd/sgproc.cc b/osaf/services/saf/amf/amfd/sgproc.cc
--- a/osaf/services/saf/amf/amfd/sgproc.cc
+++ b/osaf/services/saf/amf/amfd/sgproc.cc
@@ -798,6 +798,10 @@ void avd_su_si_assign_evh(AVD_CL_CB *cb,
                         LOG_ER("%s: no susis", __FUNCTION__);
                         goto done;
                 }
+               if (su->saAmfSUPresenceState == SA_AMF_PRESENCE_TERMINATION_FAILED) {
+                       TRACE("'%s' in TERM_FAILED state, so dropping the assignment response", su->name.value);
+                       goto done;
+               }

                 TRACE("%u", n2d_msg->msg_info.n2d_su_si_assign.msg_act);
                 switch (n2d_msg->msg_info.n2d_su_si_assign.msg_act) {

But with a patch like this, the SG will stay in an unstable state, and AMF
will have to allow only the admin repair operation on an unstable SG.
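
To make the intent of the guard easier to discuss, here is a minimal
standalone sketch of the same decision. It is only a model: the Su struct,
the PresenceState and Verdict enums, and classify_assign_response() are
hypothetical names for illustration and are not part of amfd.

```cpp
#include <cassert>
#include <string>

// Hypothetical, simplified stand-ins for the AMFD data model; the real
// SU object in amfd carries far more state than this.
enum class PresenceState {
  kUninstantiated,
  kInstantiated,
  kTerminationFailed,
};

struct Su {
  std::string name;
  PresenceState presence;
};

enum class Verdict { kProcess, kDrop };

// Decide whether an SU-SI assignment response coming from AMFND should
// be processed or dropped, mirroring the guard proposed in the diff.
Verdict classify_assign_response(const Su &su) {
  if (su.presence == PresenceState::kTerminationFailed) {
    // Dropping the response means no further assignment removal or
    // failover is driven for this SU from the response path.
    return Verdict::kDrop;
  }
  return Verdict::kProcess;
}

// Small self-check covering the two cases discussed in point 2.
void check_assign_response_guard() {
  const Su healthy{"safSu=SU2,safSg=AmfDemo,safApp=AmfDemo1",
                   PresenceState::kInstantiated};
  const Su failed{"safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1",
                  PresenceState::kTerminationFailed};
  assert(classify_assign_response(healthy) == Verdict::kProcess);
  assert(classify_assign_response(failed) == Verdict::kDrop);
}
```

The sketch only covers the drop decision itself; it says nothing about how
the SG leaves the unstable state afterwards, which is the open issue noted
above.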

3) On a TERM_FAILED SU, admin repair was successful, but it leaves the SU
   in the ENABLED and UNINSTANTIATED state while it still has assignments.
   After this, rebooting the node is the only option to perform recovery.
   So in effect it works like the case when all the repair-related
   attributes are true.
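
For reference, the repair decision described in the quoted changesets
(reboot the node only when all repair conditions on SG and node are true,
otherwise wait for manual repair) can be sketched as below. This is an
assumption-laden model, not the amfd code: RepairConfig, RepairAction and
on_term_failed() are hypothetical names, and treating exactly these three
attributes as "all the repair conditions" is my reading of the thread.

```cpp
#include <cassert>

// Hypothetical model of the three repair-related attributes named in
// this thread; each field corresponds to the IMM attribute in the comment.
struct RepairConfig {
  bool sg_auto_repair;                 // saAmfSGAutoRepair
  bool node_auto_repair;               // saAmfNodeAutoRepair
  bool node_failfast_on_term_failure;  // saAmfNodeFailfastOnTerminationFailure
};

enum class RepairAction { kRebootNode, kAwaitAdminRepair };

// Action taken when a component enters TERM_FAILED: order a node reboot
// only if every repair-related attribute is true; otherwise leave the
// SU disabled until an administrator repairs it.
RepairAction on_term_failed(const RepairConfig &cfg) {
  if (cfg.sg_auto_repair && cfg.node_auto_repair &&
      cfg.node_failfast_on_term_failure) {
    return RepairAction::kRebootNode;
  }
  return RepairAction::kAwaitAdminRepair;
}

// Self-check matching test Case 1 (all true) and Case 2
// (saAmfSGAutoRepair=0) from the review request.
void check_repair_decision() {
  const RepairConfig case1{true, true, true};
  const RepairConfig case2{false, true, true};
  assert(on_term_failed(case1) == RepairAction::kRebootNode);
  assert(on_term_failed(case2) == RepairAction::kAwaitAdminRepair);
}
```

In this model, point 3 corresponds to the kAwaitAdminRepair branch: the
admin repair completes, but the node still has to be rebooted to recover.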


Thanks,
Praveen
On 28-Feb-14 1:24 PM, Hans Feldt wrote:
> Summary: Correct AMF support for TERM-FAILED
> Review request for Trac Ticket(s): 538
> Peer Reviewer(s): Praveen, Nags, Hans N
> Pull request to: <<LIST THE PERSON WITH PUSH ACCESS HERE>>
> Affected branch(es): All
> Development branch: default
>
> --------------------------------
> Impacted area       Impact y/n
> --------------------------------
>   Docs                    n
>   Build system            n
>   RPM/packaging           n
>   Configuration files     n
>   Startup scripts         n
>   SAF services            y
>   OpenSAF services        n
>   Core libraries          n
>   Samples                 n
>   Tests                   n
>   Other                   n
>
>
> Comments (indicate scope for each "y" above):
> ---------------------------------------------
>
> It is very important to get this into the pending releases!
>
>
> changeset 7f72f8d9cbd64fa017a71aa73337b8d74128ade8
> Author:       Hans Feldt <hans.fe...@ericsson.com>
> Date: Fri, 28 Feb 2014 08:12:11 +0100
>
>       amfd: allow modification of node repair attributes [#538]
>
>       To prepare for correct handling of TERMINATION-FAILED it is important
>       that all the repair related attributes of the AMF system model can be
>       changed.
>
>       This patch allows changing saAmfNodeAutoRepair and
>       saAmfNodeFailfastOnTerminationFailure and also logs such change to
>       SAF LOG.
>
> changeset 5069ae52df6a857f374c93dcee4dc364f9f4fd0a
> Author:       Hans Feldt <hans.fe...@ericsson.com>
> Date: Fri, 28 Feb 2014 08:20:51 +0100
>
>       amfd: reboot node when term-failed SU [#538]
>
>       When a component enters the TERM-FAILED presence state and if all the
>       repair conditions on SG and node are true, a node reboot request is
>       ordered. The comp presence state is also SAFlogged.
>
> changeset 785f74ff482ef8e6f644f95cd1064b2d22a86ab1
> Author:       Hans Feldt <hans.fe...@ericsson.com>
> Date: Fri, 28 Feb 2014 08:24:08 +0100
>
>       amfnd: correct term-failed behaviour [#538]
>
>       Problem: possible split brain on application level and spec violation.
>
>       Analysis: The AMF node director requests a comp/SU failover from the AMF
>       director despite that a comp is in TERM-FAILED presence state.
>
>       Change: Correct this behavior and just disable the SU and let the AMF
>       director handle possible node reboot or manual repair.
>
> changeset f56cac35542db8d592e48c758269bb5418aced38
> Author:       Hans Feldt <hans.fe...@ericsson.com>
> Date: Fri, 28 Feb 2014 08:35:29 +0100
>
>       amfd: auto clear comp cleanup failed alarm [#538]
>
>
> Complete diffstat:
> ------------------
>   osaf/services/saf/amf/amfd/comp.cc        |  44 +++++++++++++++++++++++++++++++++++++-------
>   osaf/services/saf/amf/amfd/include/util.h |   2 ++
>   osaf/services/saf/amf/amfd/node.cc        |  27 +++++++++++++++++++++++++++
>   osaf/services/saf/amf/amfd/sg.cc          |   4 ++++
>   osaf/services/saf/amf/amfd/sgproc.cc      |  38 --------------------------------------
>   osaf/services/saf/amf/amfd/sgtype.cc      |   6 ++++++
>   osaf/services/saf/amf/amfd/util.cc        |  38 ++++++++++++++++++++++++++++++++++++++
>   osaf/services/saf/amf/amfnd/clc.cc        |   3 +--
>   osaf/services/saf/amf/amfnd/su.cc         |   1 -
>   osaf/services/saf/amf/amfnd/susm.cc       |  45 +++++++--------------------------------------
>   10 files changed, 122 insertions(+), 86 deletions(-)
>
>
> Testing Commands:
> -----------------
>
> Case 1:
> ============
>   2 node cluster, amf demo and the following script run on SC1 (active SC and
>   active demo):
>   
> immcfg -f AppConfig-2N.xml
> amf-adm unlock-in safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1
> amf-adm unlock-in safSu=SU2,safSg=AmfDemo,safApp=AmfDemo1
> amf-adm unlock safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1
> amf-adm unlock safSu=SU2,safSg=AmfDemo,safApp=AmfDemo1
> sleep 2
>
> immcfg -a saAmfSGAutoRepair=1 safSg=AmfDemo,safApp=AmfDemo1
> immcfg -a saAmfNodeAutoRepair=1 safAmfNode=SC-1,safAmfCluster=myAmfCluster
> immcfg -a saAmfNodeFailfastOnTerminationFailure=1 safAmfNode=SC-1,safAmfCluster=myAmfCluster
> immcfg -a saAmfNodeAutoRepair=1 safAmfNode=SC-2,safAmfCluster=myAmfCluster
> immcfg -a saAmfNodeFailfastOnTerminationFailure=1 safAmfNode=SC-2,safAmfCluster=myAmfCluster
>
> pkill demo
>
> Case 2:
> ===========
> The same, but with saAmfSGAutoRepair=0 and an admin repair of the SU
>
>
> Testing, Expected Results:
> --------------------------
>
> Case 1:
> ===============
> SC1 rebooted
> demo failed over to SC2
> "component cleanup failed" alarm raised and cleared
> New SAF LOGs to visualize important changes:
>
>          80 08:29:56 02/28/2014 NO safApp=safAmfService "CCB 3 Modified safSg=AmfDemo,safApp=AmfDemo1
>          81 08:29:56 02/28/2014 NO safApp=safAmfService "safSg=AmfDemo,safApp=AmfDemo1 saAmfSGAutoRepair changed to 1
>          82 08:29:56 02/28/2014 NO safApp=safAmfService "CCB 4 Modified safAmfNode=SC-1,safAmfCluster=myAmfCluster
>          83 08:29:56 02/28/2014 NO safApp=safAmfService "safAmfNode=SC-1,safAmfCluster=myAmfCluster saAmfNodeAutoRepair changed to 1
>          84 08:29:56 02/28/2014 NO safApp=safAmfService "CCB 5 Modified safAmfNode=SC-1,safAmfCluster=myAmfCluster
>          85 08:29:56 02/28/2014 NO safApp=safAmfService "safAmfNode=SC-1,safAmfCluster=myAmfCluster saAmfNodeFailfastOnTerminationFailure changed to 1
>          86 08:29:57 02/28/2014 NO safApp=safAmfService "CCB 6 Modified safAmfNode=SC-2,safAmfCluster=myAmfCluster
>          87 08:29:57 02/28/2014 NO safApp=safAmfService "safAmfNode=SC-2,safAmfCluster=myAmfCluster saAmfNodeAutoRepair changed to 1
>          88 08:29:57 02/28/2014 NO safApp=safAmfService "CCB 7 Modified safAmfNode=SC-2,safAmfCluster=myAmfCluster
>          89 08:29:57 02/28/2014 NO safApp=safAmfService "safAmfNode=SC-2,safAmfCluster=myAmfCluster saAmfNodeFailfastOnTerminationFailure changed to 1
>          90 08:29:57 02/28/2014 NO safApp=safAmfService "safComp=AmfDemo,safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1 PresenceState RESTARTING => TERMINATION_FAILED
>          91 08:29:57 02/28/2014 NO safApp=safAmfService "Ordering reboot of 'safAmfNode=SC-1,safAmfCluster=myAmfCluster' as repair action
>
>
> Case 2:
> =================
>
> Node not rebooted (as expected), repair does not fully work (yet):
>
> Feb 28 08:45:41 SC-1 local0.notice osafimmnd[382]: NO Ccb 6 COMMITTED (immcfg_SC-1_663)
> Feb 28 08:45:41 SC-1 user.notice amf_demo[638]: exiting (caught term signal)
> Feb 28 08:45:41 SC-1 local0.notice osafamfnd[447]: NO 'safComp=AmfDemo,safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' faulted due to 'avaDown' : Recovery is 'componentRestart'
> Feb 28 08:45:41 SC-1 local0.notice osafamfnd[447]: NO Cleanup of 'safComp=AmfDemo,safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' failed
> Feb 28 08:45:41 SC-1 local0.notice osafamfnd[447]: NO Reason:'Exec of script success, but script exits with non-zero status'
> Feb 28 08:45:41 SC-1 local0.notice osafamfnd[447]: NO Exit code: 1
> Feb 28 08:45:41 SC-1 local0.warn osafamfnd[447]: WA 'safComp=AmfDemo,safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' Presence State RESTARTING => TERMINATION_FAILED
> Feb 28 08:45:41 SC-1 local0.notice osafamfnd[447]: NO 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' Presence State INSTANTIATED => TERMINATION_FAILED
> Feb 28 08:45:43 SC-1 local0.notice osafamfnd[447]: NO Repair request for 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'
> Feb 28 08:45:43 SC-1 local0.notice osafamfnd[447]: NO 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' Presence State TERMINATION_FAILED => UNINSTANTIATED
>
> Note that the SU stays uninstantiated yet enabled:
>
>          88 08:45:41 02/28/2014 NO safApp=safAmfService "safComp=AmfDemo,safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1 PresenceState RESTARTING => TERMINATION_FAILED
>          89 08:45:41 02/28/2014 NO safApp=safAmfService "safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1 OperState ENABLED => DISABLED
>          90 08:45:43 02/28/2014 NO safApp=safAmfService "Admin op "REPAIRED" initiated for 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1', invocation: 73014444033
>          91 08:45:43 02/28/2014 NO safApp=safAmfService "safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1 PresenceState TERMINATION_FAILED => UNINSTANTIATED
>          92 08:45:43 02/28/2014 NO safApp=safAmfService "Admin op done for invocation: 73014444033, result 1
>          93 08:45:43 02/28/2014 NO safApp=safAmfService "safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1 OperState DISABLED => ENABLED
>
> even though the repair succeeds
>
>
> Conditions of Submission:
> -------------------------
>   Ack from reviewers
>
>
> Arch      Built     Started    Linux distro
> -------------------------------------------
> mips        n          n
> mips64      n          n
> x86         n          n
> x86_64      y          y
> powerpc     n          n
> powerpc64   n          n
>
>
> Reviewer Checklist:
> -------------------
> [Submitters: make sure that your review doesn't trigger any checkmarks!]
>
>
> Your checkin has not passed review because (see checked entries):
>
> ___ Your RR template is generally incomplete; it has too many blank entries
>      that need proper data filled in.
>
> ___ You have failed to nominate the proper persons for review and push.
>
> ___ Your patches do not have proper short+long header
>
> ___ You have grammar/spelling in your header that is unacceptable.
>
> ___ You have exceeded a sensible line length in your headers/comments/text.
>
> ___ You have failed to put in a proper Trac Ticket # into your commits.
>
> ___ You have incorrectly put/left internal data in your comments/files
>      (i.e. internal bug tracking tool IDs, product names etc)
>
> ___ You have not given any evidence of testing beyond basic build tests.
>      Demonstrate some level of runtime or other sanity testing.
>
> ___ You have ^M present in some of your files. These have to be removed.
>
> ___ You have needlessly changed whitespace or added whitespace crimes
>      like trailing spaces, or spaces before tabs.
>
> ___ You have mixed real technical changes with whitespace and other
>      cosmetic code cleanup changes. These have to be separate commits.
>
> ___ You need to refactor your submission into logical chunks; there is
>      too much content in a single commit.
>
> ___ You have extraneous garbage in your review (merge commits etc)
>
> ___ You have giant attachments which should never have been sent;
>      Instead you should place your content in a public tree to be pulled.
>
> ___ You have too many commits attached to an e-mail; resend as threaded
>      commits, or place in a public tree for a pull.
>
> ___ You have resent this content multiple times without a clear indication
>      of what has changed between each re-send.
>
> ___ You have failed to adequately and individually address all of the
>      comments and change requests that were proposed in the initial review.
>
> ___ You have a misconfigured ~/.hgrc file (i.e. username, email etc)
>
> ___ Your computer has a badly configured date and time, confusing
>      the threaded patch review.
>
> ___ Your changes affect IPC mechanism, and you don't present any results
>      for in-service upgradability test.
>
> ___ Your changes affect user manual and documentation, your patch series
>      do not contain the patch that updates the Doxygen manual.
>


_______________________________________________
Opensaf-devel mailing list
Opensaf-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-devel