Anders Björnerstedt wrote:
> The basic issue as I see it is that the AMF (AMFD or AMFND) must not put
> itself into a position where it must acutely read from the IMM as part of
> a time-critical task (such as failover or switchover). The IMMND can crash
> and is restartable.
>
> While rare, when it does happen, local reads from IMM will be postponed
> until the local IMMND has been restarted and synced. This can take up to
> (but not more than) 60 seconds.
>
> For switch-over, if any reads are needed, they should be done before
> accepting the switchover, in a "prepare phase". Once the reads are done
> (by the relevant AMFND), the switchover is OK'ed and executed. During the
> switchover no reads should be needed, since they have already been done.
>   
If it was not obvious: if the read fails in the prepare phase, the
switch-over request is rejected with TRY_AGAIN or NO_RESOURCES and an
error string.
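
A minimal sketch of that prepare/execute split, to make the idea concrete.
Everything here is hypothetical and not taken from the amfnd code: the
result codes, the reader callback, the context struct and the choice of
attributes read are illustrative assumptions only.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical result codes, named after the codes mentioned above. */
typedef enum { RC_OK, RC_TRY_AGAIN, RC_NO_RESOURCES } rc_t;

/* Hypothetical reader: returns true and fills *value on success. */
typedef bool (*imm_read_fn)(const char *attr, int *value);

/* Values the switchover needs, cached during the prepare phase. */
typedef struct {
    int si_rank;
    int su_failover;
    bool prepared;
} switchover_ctx_t;

/* Prepare phase: do every read up front; reject if any read fails. */
rc_t switchover_prepare(switchover_ctx_t *ctx, imm_read_fn read_attr,
                        const char **err)
{
    ctx->prepared = false;
    if (!read_attr("si_rank", &ctx->si_rank)) {
        *err = "prepare phase: could not read si_rank";
        return RC_TRY_AGAIN;
    }
    if (!read_attr("su_failover", &ctx->su_failover)) {
        *err = "prepare phase: could not read su_failover";
        return RC_TRY_AGAIN;
    }
    ctx->prepared = true;
    *err = NULL;
    return RC_OK;
}

/* Execute phase: uses only the cached values, never reads again. */
rc_t switchover_execute(const switchover_ctx_t *ctx)
{
    return ctx->prepared ? RC_OK : RC_TRY_AGAIN;
}

/* Stub readers standing in for a healthy and an unavailable IMMND. */
bool stub_read_ok(const char *attr, int *value)
{
    (void)attr;
    *value = 1;
    return true;
}

bool stub_read_fail(const char *attr, int *value)
{
    (void)attr;
    (void)value;
    return false;
}
```

With stub_read_fail the request is rejected before anything has been
changed, so the caller can simply retry later; with stub_read_ok the
execute step has no remaining dependency on the IMMND.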

> For failover the above strategy is not possible. But at least SC failover
> cannot happen if the IMMND goes down *also* on the remaining SC, since then
> you will get a cluster restart anyway.
>
> The remaining issue is SI/SU failover, which could be delayed if an IMMND
> crashes and is restarted at the same time.
>
> We have also seen some cases recently where the IMMND has not crashed but
> is simply frozen/hung. The heartbeat timeout is quite long since this should
> *never* *ever* happen. We still don't know why it happened in these cases,
> but I am pretty sure the cause is not inside the IMMND, but in a (modified)
> kernel.
>   
It could possibly also have something to do with OpenSAF having TIPC
execute with low importance while some applications execute with higher
TIPC importance levels, with no load regulation of applications and no
overload protection in OpenSAF...
But I would expect the symptoms there to be loss of MDS messages rather
than a frozen IMMND.

/AndersBj

> /AndersBj
>
>
> -----Original Message-----
> From: Hans Feldt [mailto:hans.fe...@ericsson.com] 
> Sent: 4 October 2013 16:10
> To: praveen.malv...@oracle.com
> Cc: opensaf-devel@lists.sourceforge.net
> Subject: Re: [devel] [PATCH 0 of 4] Review Request for amf #574
>
> Could I please have some feedback on this?
> /Hans
>
> On 09/25/2013 04:55 PM, Hans Feldt wrote:
>   
>> Summary: reduce IMM reads by amfnd
>> Review request for Trac Ticket(s): 574
>> Peer Reviewer(s): Praveen, Nags, Hans N
>> Pull request to: <<LIST THE PERSON WITH PUSH ACCESS HERE>>
>> Affected branch(es): all
>> Development branch: <<IF ANY GIVE THE REPO URL>>
>>
>> --------------------------------
>> Impacted area       Impact y/n
>> --------------------------------
>>   Docs                    n
>>   Build system            n
>>   RPM/packaging           n
>>   Configuration files     n
>>   Startup scripts         n
>>   SAF services            y
>>   OpenSAF services        n
>>   Core libraries          n
>>   Samples                 n
>>   Tests                   n
>>   Other                   n
>>
>>
>> Comments (indicate scope for each "y" above):
>> ---------------------------------------------
>>
>> changeset 78b3d50901d6a7e08a465cee0bc59325a788f923
>> Author:      Hans Feldt <hans.fe...@ericsson.com>
>> Date:        Wed, 25 Sep 2013 16:09:01 +0200
>>
>>      amf: cleanup dnd edu [#574]
>>
>>      Cosmetic cleanup of the dnd edu. Some comments and one (unused) offset
>>      were wrong. No functional impact.
>>
>> changeset e414e4ed8da51899cbc7e5ad2af6b5fa4e71520a
>> Author:      Hans Feldt <hans.fe...@ericsson.com>
>> Date:        Wed, 25 Sep 2013 16:10:12 +0200
>>
>>      amf: include and use sirank in SUSI msg [#574]
>>
>>      When the AMF node director receives the SUSI ASGN message, it reads the
>>      SI rank from IMM. If IMM does not respond in a timely manner the amfnd
>>      process will be aborted and the node restarted.
>>
>>      By including SI rank (an int) in the SUSI ASGN msg, the read from IMM
>>      can be skipped.
>>
>>      In service upgrade:
>>
>>      The AMF msg protocol version is bumped to 5. The AMF director supports
>>      all current (old) versions and now also version 5. The node director
>>      now only supports version 5 since it requires the SUSI message to
>>      contain the information required. This means the AMF director has to be
>>      upgraded first to support version 5 and then the AMF node directors can
>>      be upgraded.
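
The version handling described above could look roughly like the sketch
below. It is an illustration under stated assumptions, not the actual EDU
encoding in avsv_d2nedu.c: the message layout, the field names, the helper
functions and the big-endian wire format are all hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MSG_FMT_VER_4 4u
#define MSG_FMT_VER_5 5u

/* Hypothetical, heavily reduced SUSI assignment message. */
typedef struct {
    uint32_t su_id;   /* illustrative field present in all versions */
    uint32_t si_rank; /* new in version 5 */
} susi_asgn_t;

/* Write a 32-bit value in big-endian order; returns bytes written. */
static size_t put_u32(uint8_t *buf, uint32_t v)
{
    buf[0] = (uint8_t)(v >> 24);
    buf[1] = (uint8_t)(v >> 16);
    buf[2] = (uint8_t)(v >> 8);
    buf[3] = (uint8_t)v;
    return 4;
}

/* Read a 32-bit big-endian value. */
static uint32_t get_u32(const uint8_t *buf)
{
    return ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
           ((uint32_t)buf[2] << 8) | (uint32_t)buf[3];
}

/* Director side: supports old peers, adds si_rank only for version 5. */
size_t susi_encode(const susi_asgn_t *m, uint32_t peer_ver, uint8_t *buf)
{
    size_t n = put_u32(buf, m->su_id);
    if (peer_ver >= MSG_FMT_VER_5)
        n += put_u32(buf + n, m->si_rank);
    return n;
}

/* Node director side: accepts version 5 only, since it needs si_rank. */
bool susi_decode(const uint8_t *buf, size_t len, uint32_t msg_ver,
                 susi_asgn_t *out)
{
    if (msg_ver < MSG_FMT_VER_5 || len < 8)
        return false;
    out->su_id = get_u32(buf);
    out->si_rank = get_u32(buf + 4);
    return true;
}
```

The asymmetry is the point: the encoder adapts to the peer's version, the
decoder refuses anything older than 5, which is why the director has to be
upgraded before the node directors.
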
>>
>> changeset a0d77c82e59e3e281f1831a81f99a32d8ee6581d
>> Author:      Hans Feldt <hans.fe...@ericsson.com>
>> Date:        Wed, 25 Sep 2013 16:11:50 +0200
>>
>>      amf: include and use comp capability in SUSI msg [#574]
>>
>>      When the AMF node director receives the SUSI ASGN message, for each CSI
>>      it reads the saAmfCSType attribute from IMM.
>>
>>      When the AMF node director receives the SUSI MOD message, for each CSI
>>      it reads the saAmfCtCompCapability attribute from the SaAmfCtCsType
>>      instance associated with the comp and csi.
>>
>>      In any of these cases if IMM does not respond in a timely manner the
>>      amfnd process will be aborted and the node restarted.
>>
>>      By including component capability (an int) in the SUSI ASGN msg, these
>>      reads from IMM can be completely skipped in the node director.
>>
>>      In service upgrade: uses version 5 of the protocol
>>
>> changeset 05279a1132c2d150a2452f781090f3e2a15ecf62
>> Author:      Hans Feldt <hans.fe...@ericsson.com>
>> Date:        Wed, 25 Sep 2013 16:12:21 +0200
>>
>>      amf: include and use su_failover in REG_SU msg [#574]
>>
>>      When the AMF node director receives the REG_SU message, it reads the
>>      saAmfSUFailover attribute from IMM. If IMM does not respond in a timely
>>      manner the amfnd process will be aborted and the node restarted.
>>
>>      By including su_failover (an int) in the REG_SU msg, the read from IMM
>>      can be skipped.
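
On the receiving side, the effect is that the node director keeps the value
it was sent and consults only that copy when a fault occurs. A rough sketch,
with hypothetical struct and function names; the escalation rule in
on_comp_fault (su_failover set means a component fault is handled as an SU
failover) is my reading of the attribute and should be treated as an
assumption.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical, reduced REG_SU message carrying the flag. */
typedef struct {
    char name[64];
    bool su_failover; /* sent by the director instead of read from IMM */
} reg_su_msg_t;

/* Hypothetical local SU record in the node director. */
typedef struct {
    char name[64];
    bool su_failover;
    bool registered;
} su_rec_t;

typedef enum { RECOVER_COMP_FAILOVER, RECOVER_SU_FAILOVER } recovery_t;

/* Message handler: copy everything needed later into the local record. */
void on_reg_su(su_rec_t *su, const reg_su_msg_t *msg)
{
    strncpy(su->name, msg->name, sizeof(su->name) - 1);
    su->name[sizeof(su->name) - 1] = '\0';
    su->su_failover = msg->su_failover;
    su->registered = true;
}

/* Fault path: decides from the cached flag only, no IMM access. */
recovery_t on_comp_fault(const su_rec_t *su)
{
    return su->su_failover ? RECOVER_SU_FAILOVER : RECOVER_COMP_FAILOVER;
}
```
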
>>
>>
>> Complete diffstat:
>> ------------------
>>   osaf/libs/common/avsv/avsv_d2nedu.c             |  117 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--------------------------------------------------
>>   osaf/libs/common/avsv/include/avsv_d2nmsg.h     |    6 +++++-
>>   osaf/services/saf/avsv/avd/avd_mds.c            |    8 +++++---
>>   osaf/services/saf/avsv/avd/avd_util.c           |   22 ++++++++++++++++++++++
>>   osaf/services/saf/avsv/avd/include/avd_mds.h    |    2 +-
>>   osaf/services/saf/avsv/avnd/avnd.c              |    1 -
>>   osaf/services/saf/avsv/avnd/avnd_comp.c         |   53 ++---------------------------------------------------
>>   osaf/services/saf/avsv/avnd/avnd_mds.c          |    9 +++++----
>>   osaf/services/saf/avsv/avnd/avnd_sidb.c         |   66 ++----------------------------------------------------------------
>>   osaf/services/saf/avsv/avnd/avnd_sudb.c         |    1 +
>>   osaf/services/saf/avsv/avnd/include/avnd_comp.h |    2 +-
>>   osaf/services/saf/avsv/avnd/include/avnd_mds.h  |    4 ++--
>>   osaf/services/saf/avsv/avnd/include/avnd_su.h   |    1 +
>>   13 files changed, 114 insertions(+), 178 deletions(-)
>>
>>
>> Testing Commands:
>> -----------------
>>   cluster start, all nodes of same version (5)
>>   add application dynamically and unlock it
>>
>>
>> Testing, Expected Results:
>> --------------------------
>>   works fine
>>
>>   (have not tested an old amfnd v4 with a new amfd v5)
>>
>>
>> Conditions of Submission:
>> -------------------------
>>   ack from reviewer
>>
>>
>> Arch      Built     Started    Linux distro
>> -------------------------------------------
>> mips        n          n
>> mips64      n          n
>> x86         n          n
>> x86_64      y          y
>> powerpc     n          n
>> powerpc64   n          n
>>
>>
>> Reviewer Checklist:
>> -------------------
>> [Submitters: make sure that your review doesn't trigger any checkmarks!]
>>
>>
>> Your checkin has not passed review because (see checked entries):
>>
>> ___ Your RR template is generally incomplete; it has too many blank entries
>>      that need proper data filled in.
>>
>> ___ You have failed to nominate the proper persons for review and push.
>>
>> ___ Your patches do not have proper short+long header
>>
>> ___ You have grammar/spelling in your header that is unacceptable.
>>
>> ___ You have exceeded a sensible line length in your headers/comments/text.
>>
>> ___ You have failed to put in a proper Trac Ticket # into your commits.
>>
>> ___ You have incorrectly put/left internal data in your comments/files
>>      (i.e. internal bug tracking tool IDs, product names etc)
>>
>> ___ You have not given any evidence of testing beyond basic build tests.
>>      Demonstrate some level of runtime or other sanity testing.
>>
>> ___ You have ^M present in some of your files. These have to be removed.
>>
>> ___ You have needlessly changed whitespace or added whitespace crimes
>>      like trailing spaces, or spaces before tabs.
>>
>> ___ You have mixed real technical changes with whitespace and other
>>      cosmetic code cleanup changes. These have to be separate commits.
>>
>> ___ You need to refactor your submission into logical chunks; there is
>>      too much content into a single commit.
>>
>> ___ You have extraneous garbage in your review (merge commits etc)
>>
>> ___ You have giant attachments which should never have been sent;
>>      Instead you should place your content in a public tree to be pulled.
>>
>> ___ You have too many commits attached to an e-mail; resend as threaded
>>      commits, or place in a public tree for a pull.
>>
>> ___ You have resent this content multiple times without a clear indication
>>      of what has changed between each re-send.
>>
>> ___ You have failed to adequately and individually address all of the
>>      comments and change requests that were proposed in the initial review.
>>
>> ___ You have a misconfigured ~/.hgrc file (i.e. username, email etc)
>>
>> ___ Your computer has a badly configured date and time, confusing
>>      the threaded patch review.
>>
>> ___ Your changes affect IPC mechanism, and you don't present any results
>>      for in-service upgradability test.
>>
>> ___ Your changes affect user manual and documentation, your patch series
>>      do not contain the patch that updates the Doxygen manual.
>>
>>
>> ------------------------------------------------------------------------------
>> October Webinars: Code for Performance
>> Free Intel webinars can help you accelerate application performance.
>> Explore tips for MPI, OpenMP, advanced profiling, and more. Get the most from
>> the latest Intel processors and coprocessors. See abstracts and register >
>> http://pubads.g.doubleclick.net/gampad/clk?id=60133471&iu=/4140/ostg.clktrk
>> _______________________________________________
>> Opensaf-devel mailing list
>> Opensaf-devel@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/opensaf-devel
>>
>>
>>     
>
>   



