Hi Thuan,

Please see comments inline.

Thanks

Minh


On 10/07/18 18:16, Tran Thuan wrote:

Hi Nagu,

I just updated the method to reproduce the issue reported by #2891.

It happens after a TIPC disconnect/connect from PL-3.

Minh raised an issue where, after the SCs come back from headless, a PL reboots after its info has been synced to the SC.

@Minh: about your amfd trace. I would like to confirm some points:

  * Why does the CLM callback report MDS down at 14:54:28, but amfd later
    receives amfnd up? And why is amfnd down not received until 14:58:28?

Also, I don't see the trace “avd_mds_avnd_down_evh: 2040f, (nil)” that the analysis of the ticket's issue refers to.

  * Did you reproduce this with official OpenSAF code (no hacked code), just
    rebooting PL-4 after the SCs came back from headless?

[Minh]: I added a return in avd_mds_avnd_down_evh() because I wanted to see how amfd reacts if the CLM callback comes first (I have never been able to reproduce this problem). But does it matter? The point is that, if a PL is rebooted, both the CLM callback and MDS down can arrive before the PL joins the AMF cluster, and those events can also arrive after amfd has processed the synced assignments, because of the sync window.



  * Do you have any suggestion for a solution?

Current code:
  static void clm_track_cb(
...
          } else if (node->node_state == AVD_AVND_STATE_ABSENT) {
            LOG_IN("%s: CLM node '%s' is not an AMF cluster member; MDS down received",
                   __FUNCTION__, node_name.c_str());
            avd_node_delete_nodeid(node);
            goto done;
          }
...
The *if* is not strong enough to establish that MDS down has actually been received. I think you can use @amfnd_svc_db (or something else) to indicate that the MDS down really did come. And you need to delete the nodeId from node_id_db too.
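
Roughly what I mean, as a sketch only (untested). The helper amfnd_mds_down_received() is a name I am making up here; it would have to be backed by @amfnd_svc_db or by a flag set from the MDS down path:

          } else if (node->node_state == AVD_AVND_STATE_ABSENT) {
            /* Only treat this as "MDS down received" when the avnd MDS down
               has really arrived, e.g. the node is no longer present in
               @amfnd_svc_db (hypothetical helper below). */
            if (amfnd_mds_down_received(node)) {
              LOG_IN("%s: CLM node '%s' is not an AMF cluster member; MDS down received",
                     __FUNCTION__, node_name.c_str());
              avd_node_delete_nodeid(node);
              /* make sure the nodeId is removed from node_id_db as well */
              goto done;
            }
            /* Otherwise leave the node as is and let avd_mds_avnd_down_evh()
               do the full cleanup when the MDS down event arrives. */
          }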


<143>1 2018-07-10T14:54:27.369871+10:00 SC-1 osafamfd 297 osafamfd [meta sequenceId="11063"] 297:amf/amfd/siass.cc:1127 >> avd_susi_recreate: msg_id: 0 node_id: 132111 num_sisu: 4

<143>1 2018-07-10T14:54:27.377689+10:00 SC-1 osafamfd 297 osafamfd [meta sequenceId="11320"] 297:amf/amfd/siass.cc:1195 << avd_susi_recreate

<142>1 2018-07-10T14:54:28.270464+10:00 SC-1 osafamfd 297 osafamfd [meta sequenceId="12725"] 297:amf/amfd/clm.cc:320 IN clm_track_cb: CLM node 'safNode=PL-4,safCluster=myClmCluster' is not an AMF cluster member; MDS down received

<143>1 2018-07-10T14:54:31.16261+10:00 SC-1 osafamfd 297 osafamfd [meta sequenceId="17815"] 297:amf/amfd/ndfsm.cc:752 TR amfnd on 2040f is up

<143>1 2018-07-10T14:54:31.422545+10:00 SC-1 osafamfd 297 osafamfd [meta sequenceId="17886"] 297:amf/amfd/ndfsm.cc:296 >> avd_node_up_evh: from 2040f, safAmfNode=PL-4,safAmfCluster=myAmfCluster

<143>1 2018-07-10T14:54:31.422868+10:00 SC-1 osafamfd 297 osafamfd [meta sequenceId="17899"] 297:amf/amfd/node.cc:461 >> avd_node_state_set: 'safAmfNode=PL-4,safAmfCluster=myAmfCluster' ABSENT => NO_CONFIG

<143>1 2018-07-10T14:54:31.42297+10:00 SC-1 osafamfd 297 osafamfd [meta sequenceId="17904"] 297:amf/amfd/node.cc:466 << avd_node_state_set

<143>1 2018-07-10T14:54:31.423114+10:00 SC-1 osafamfd 297 osafamfd [meta sequenceId="17910"] 297:amf/amfd/ndfsm.cc:536 << avd_node_up_evh

<143>1 2018-07-10T14:54:31.576633+10:00 SC-1 osafamfd 297 osafamfd [meta sequenceId="18023"] 297:amf/amfd/node.cc:461 >> avd_node_state_set: 'safAmfNode=PL-4,safAmfCluster=myAmfCluster' NO_CONFIG => NCS_INIT

<143>1 2018-07-10T14:54:31.576689+10:00 SC-1 osafamfd 297 osafamfd [meta sequenceId="18028"] 297:amf/amfd/node.cc:466 << avd_node_state_set

<143>1 2018-07-10T14:58:28.032461+10:00 SC-1 osafamfd 297 osafamfd [meta sequenceId="21836"] 300:amf/amfd/mds.cc:463 TR avnd 2040fe497957d down

<143>1 2018-07-10T14:58:28.032592+10:00 SC-1 osafamfd 297 osafamfd [meta sequenceId="21837"] 297:amf/amfd/main.cc:766 >> process_event: evt->rcv_evt 21

<143>1 2018-07-10T14:58:28.032613+10:00 SC-1 osafamfd 297 osafamfd [meta sequenceId="21838"] 297:amf/amfd/ndfsm.cc:774 >> avd_mds_avnd_down_evh: 2040f, 0x55b62f9dcff0

<143>1 2018-07-10T14:58:28.032641+10:00 SC-1 osafamfd 297 osafamfd [meta sequenceId="21839"] 297:amf/amfd/ndproc.cc:1219 >> avd_node_failover: 'safAmfNode=PL-4,safAmfCluster=myAmfCluster'

<143>1 2018-07-10T14:58:28.032656+10:00 SC-1 osafamfd 297 osafamfd [meta sequenceId="21840"] 297:amf/amfd/ndfsm.cc:1113 >> avd_node_mark_absent

<143>1 2018-07-10T14:58:28.032679+10:00 SC-1 osafamfd 297 osafamfd [meta sequenceId="21841"] 297:amf/amfd/imm.cc:1887 >> avd_saImmOiRtObjectUpdate: 'safAmfNode=PL-4,safAmfCluster=myAmfCluster' saAmfNodeOperState

<143>1 2018-07-10T14:58:28.032722+10:00 SC-1 osafamfd 297 osafamfd [meta sequenceId="21842"] 297:amf/amfd/imm.cc:1917 << avd_saImmOiRtObjectUpdate

<143>1 2018-07-10T14:58:28.032738+10:00 SC-1 osafamfd 297 osafamfd [meta sequenceId="21843"] 297:amf/amfd/node.cc:461 >> avd_node_state_set: 'safAmfNode=PL-4,safAmfCluster=myAmfCluster' NCS_INIT => ABSENT

<143>1 2018-07-10T14:58:28.032822+10:00 SC-1 osafamfd 297 osafamfd [meta sequenceId="21848"] 297:amf/amfd/node.cc:466 << avd_node_state_set

<143>1 2018-07-10T14:58:28.033059+10:00 SC-1 osafamfd 297 osafamfd [meta sequenceId="21858"] 297:amf/amfd/ndfsm.cc:1134 << avd_node_mark_absent

<143>1 2018-07-10T14:58:28.033102+10:00 SC-1 osafamfd 297 osafamfd [meta sequenceId="21861"] 297:amf/amfd/sgproc.cc:2177 >> avd_node_down_mw_susi_failover: 'safAmfNode=PL-4,safAmfCluster=myAmfCluster'

<143>1 2018-07-10T14:58:28.060869+10:00 SC-1 osafamfd 297 osafamfd [meta sequenceId="22222"] 297:amf/amfd/sgproc.cc:2223 << avd_node_down_mw_susi_failover

<143>1 2018-07-10T14:58:28.06088+10:00 SC-1 osafamfd 297 osafamfd [meta sequenceId="22223"] 297:amf/amfd/sgproc.cc:2238 >> avd_node_down_appl_susi_failover: 'safAmfNode=PL-4,safAmfCluster=myAmfCluster'

<143>1 2018-07-10T14:58:28.076943+10:00 SC-1 osafamfd 297 osafamfd [meta sequenceId="22554"] 297:amf/amfd/sgproc.cc:2311 << avd_node_down_appl_susi_failover

<143>1 2018-07-10T14:58:28.077075+10:00 SC-1 osafamfd 297 osafamfd [meta sequenceId="22562"] 297:amf/amfd/ndproc.cc:1234 << avd_node_failover

*Best Regards,*

*Thuan*

*From:* nagen...@hasolutions.in <nagen...@hasolutions.in>
*Sent:* Tuesday, July 10, 2018 1:36 PM
*To:* Minh Hon Chau <minh.c...@dektech.com.au>; Tran Thuan <thuan.t...@dektech.com.au>; hans.nordeb...@ericsson.com; gary....@dektech.com.au
*Cc:* opensaf-devel@lists.sourceforge.net
*Subject:* RE: Re: [PATCH 1/1] amf: amfd should reset msg_id counter to avoid message ID mismatch [#2891]

Hi Thuan,

I agree with Minh. We need to know why the node-left event came first and the MDS down came later, and about the other flows. It would be great if you could reproduce the scenario and upload the traces of the payload and the controller (please keep the time in sync) in the ticket.

Thanks,

Nagendra, 91-9866424860

www.hasolutions.in <http://www.hasolutions.in>

https://www.linkedin.com/company/hasolutions/

High Availability Solutions Pvt. Ltd.

- OpenSAF support and services

    --------- Original Message ---------

    Subject: Re: [PATCH 1/1] amf: amfd should reset msg_id counter to
    avoid message ID mismatch [#2891]
    From: "Minh Hon Chau" <minh.c...@dektech.com.au
    <mailto:minh.c...@dektech.com.au>>
    Date: 7/10/18 10:49 am
    To: "Tran Thuan" <thuan.t...@dektech.com.au
    <mailto:thuan.t...@dektech.com.au>>, nagen...@hasolutions.in
    <mailto:nagen...@hasolutions.in>, hans.nordeb...@ericsson.com
    <mailto:hans.nordeb...@ericsson.com>, gary....@dektech.com.au
    <mailto:gary....@dektech.com.au>
    Cc: opensaf-devel@lists.sourceforge.net
    <mailto:opensaf-devel@lists.sourceforge.net>

    Hi again,

    I am not sure how you replicate this scenario. It looks to me like you
    deliberately rebooted PL-3 before amfd processed the synced
    assignments.

    The node_up sequence has many branches; you are replicating one of
    them.

    What I mean is that, at the time amfd calls
    avd_process_state_info_queue(), the PL has not joined the AMF cluster
    and keeps resending the node_up message every second.

    When amfd finishes creating the synced assignments, the node sync
    window is still open and cb->init_state has not moved to APP_STATE,
    so the node_up of the PL is still acceptable. At this time a PL can be
    rebooted by someone.

    For your reference, I attached the full amfd trace to the ticket. I had
    to turn off PL-3 to keep the node sync window open for 10 seconds, then
    I rebooted PL-4. Please note that this node sync timeout scenario did
    happen at application sites.

    Resetting the counters in the CLM callback is the easy way, but it does
    not fully fix the issue or make us safe from it.

    Thanks

    Minh


    On 10/07/18 14:00, Tran Thuan wrote:
    > Hi Minh,
    >
    > I checked your concern about the case where the PL goes down before
    sending node_up but after the assignments are synced (recovery after
    headless).
    > The current code still works correctly, because the active AMFD doesn't
    recreate the assignments.
    > So my point of view stays the same: just reset the 3 fields as in the
    current patch.
    >
    > Jul 10 13:40:05 PL-2-3 osafamfnd[5971]: NO Synced
    SISU:safSi=NoRed2,safApp=OpenSAF,safSu=PL-3,safSg=NoRed,safApp=OpenSAF
    <1, 3>
    > Jul 10 13:40:05 PL-2-3 osafamfnd[5971]: NO Synced
    SU:safSu=PL-3,safSg=NoRed,safApp=OpenSAF <0, 1, 3>
    > Jul 10 13:40:05 PL-2-3 osafamfnd[5971]: NO 8 CSICOMP states sent
    > Jul 10 13:40:05 PL-2-3 osafamfnd[5971]: NO 8 COMP states sent
    > Jul 10 13:40:05 PL-2-3 osafamfnd[5971]: NO Sending node up due
    to NCSMDS_NEW_ACTIVE
    > Jul 10 13:40:05 PL-2-3 osafamfnd[5971]: NO
    ====================delay to power off this node ====================
    > Jul 10 13:40:05 PL-2-3 osafckptnd[5984]: NO CLM selection object
    was updated. (17)
    > Jul 10 13:40:06 PL-2-3 osafamfnd[5971]: NO
    ====================delay to power off this node ====================
    > Jul 10 13:40:07 PL-2-3 osafamfnd[5971]: NO
    ====================delay to power off this node ====================
    > Jul 10 13:40:08 PL-2-3 osafamfnd[5971]: NO
    ====================delay to power off this node ====================
    >
    > <143>1 2018-07-10T13:40:15.365252+10:00 SC-2-1 osafamfd 4827
    osafamfd [meta sequenceId="13233"] 4827:amf/amfd/ndfsm.cc:296 >>
    avd_node_up_evh: from 2010f,
    safAmfNode=SC-1,safAmfCluster=myAmfCluster
    > <143>1 2018-07-10T13:40:15.365259+10:00 SC-2-1 osafamfd 4827
    osafamfd [meta sequenceId="13234"] 4827:amf/amfd/node.cc:231 >>
    avd_node_get: safAmfNode=SC-1,safAmfCluster=myAmfCluster
    > <143>1 2018-07-10T13:40:15.365263+10:00 SC-2-1 osafamfd 4827
    osafamfd [meta sequenceId="13235"] 4827:amf/amfd/node.cc:232 <<
    avd_node_get
    > <143>1 2018-07-10T13:40:15.365314+10:00 SC-2-1 osafamfd 4827
    osafamfd [meta sequenceId="13236"] 4827:amf/amfd/ndfsm.cc:216 >>
    avd_count_node_up
    > <143>1 2018-07-10T13:40:15.36532+10:00 SC-2-1 osafamfd 4827
    osafamfd [meta sequenceId="13237"] 4827:amf/amfd/ndfsm.cc:226 TR
    Number of node director(s) that director received node_up msg:0
    > <143>1 2018-07-10T13:40:15.365325+10:00 SC-2-1 osafamfd 4827
    osafamfd [meta sequenceId="13238"] 4827:amf/amfd/ndfsm.cc:228 <<
    avd_count_node_up
    > <143>1 2018-07-10T13:40:15.365329+10:00 SC-2-1 osafamfd 4827
    osafamfd [meta sequenceId="13239"] 4827:amf/amfd/ndfsm.cc:353 TR
    NodeSync timer is active, ignore this node_up msg (nodeid:2010f)
    > <143>1 2018-07-10T13:40:15.365334+10:00 SC-2-1 osafamfd 4827
    osafamfd [meta sequenceId="13240"] 4827:amf/amfd/ndfsm.cc:534 <<
    avd_node_up_evh
    >
    > <143>1 2018-07-10T13:40:16.462939+10:00 SC-2-1 osafamfd 4827
    osafamfd [meta sequenceId="13280"] 4827:amf/amfd/ndfsm.cc:296 >>
    avd_node_up_evh: from 2010f,
    safAmfNode=SC-1,safAmfCluster=myAmfCluster
    > <141>1 2018-07-10T13:40:16.462975+10:00 SC-2-1 osafamfd 4827
    osafamfd [meta sequenceId="13281"] 4827:amf/amfd/ndfsm.cc:254 NO
    Received node_up from 2010f: msg_id 1
    > <143>1 2018-07-10T13:40:16.462981+10:00 SC-2-1 osafamfd 4827
    osafamfd [meta sequenceId="13282"] 4827:amf/amfd/ndfsm.cc:63 >>
    avd_process_state_info_queue
    > <143>1 2018-07-10T13:40:16.462986+10:00 SC-2-1 osafamfd 4827
    osafamfd [meta sequenceId="13283"] 4827:amf/amfd/ndfsm.cc:65 TR
    queue_size before processing: 2
    > <143>1 2018-07-10T13:40:16.463162+10:00 SC-2-1 osafamfd 4827
    osafamfd [meta sequenceId="13284"] 4827:amf/amfd/ndfsm.cc:73 TR
    rcv_evt: 12
    > <143>1 2018-07-10T13:40:16.463175+10:00 SC-2-1 osafamfd 4827
    osafamfd [meta sequenceId="13285"] 4827:amf/amfd/ndfsm.cc:79 TR
    msg_type: 31
    > <143>1 2018-07-10T13:40:16.463203+10:00 SC-2-1 osafamfd 4827
    osafamfd [meta sequenceId="13286"] 4827:amf/amfd/siass.cc:1132 >>
    avd_susi_recreate: msg_id: 0 node_id: 131855 num_sisu: 3
    > <139>1 2018-07-10T13:40:16.463224+10:00 SC-2-1 osafamfd 4827
    osafamfd [meta sequenceId="13287"] 4827:amf/amfd/siass.cc:1140 ER
    Node 2030f has left the cluster
    > <143>1 2018-07-10T13:40:16.463258+10:00 SC-2-1 osafamfd 4827
    osafamfd [meta sequenceId="13288"] 4827:amf/amfd/siass.cc:0 <<
    avd_susi_recreate
    > <143>1 2018-07-10T13:40:16.463273+10:00 SC-2-1 osafamfd 4827
    osafamfd [meta sequenceId="13289"] 4827:amf/common/d2nmsg.c:931 >>
    avsv_free_n2d_nd_sisu_state_info
    > <143>1 2018-07-10T13:40:16.463278+10:00 SC-2-1 osafamfd 4827
    osafamfd [meta sequenceId="13290"] 4827:amf/common/d2nmsg.c:953 TR
    3 SISU records to free
    >
    > Best Regards,
    > Thuan
    >
    > -----Original Message-----
    > From: Minh Hon Chau <minh.c...@dektech.com.au
    <mailto:minh.c...@dektech.com.au>>
    > Sent: Monday, July 9, 2018 4:04 PM
    > To: Tran Thuan <thuan.t...@dektech.com.au
    <mailto:thuan.t...@dektech.com.au>>; nagen...@hasolutions.in
    <mailto:nagen...@hasolutions.in>; hans.nordeb...@ericsson.com
    <mailto:hans.nordeb...@ericsson.com>; gary....@dektech.com.au
    <mailto:gary....@dektech.com.au>
    > Cc: opensaf-devel@lists.sourceforge.net
    <mailto:opensaf-devel@lists.sourceforge.net>
    > Subject: Re: [PATCH 1/1] amf: amfd should reset msg_id counter
    to avoid message ID mismatch [#2891]
    >
    > Hi,
    >
    > How about the @nodes_exit_cnt? The active amfd checkpoints this
    @nodes_exit_cnt to the standby amfd inside mds down event. You can
    say it's not used but it's what's happening for this counter.
    >
    > "AMFND not yet up", what you mean is node_state
    AVD_AVND_STATE_ABSENT?
    > You reset counter if node state as AVD_AVND_STATE_ABSENT in clm
    callback.
    > If a payload is coming from headless, it syncs the assignments
    to the active amfd first and enters node_up sequence. This node_up
    sequence can take up to 10 seconds before this payload becomes
    AVD_AVND_STATE_PRESENT.
    > If this issue happens during this 10 seconds, the node_state is
    still AVD_AVND_STATE_ABSENT and active amfd does not process mds
    down event properly, how could amfd remove the synced assignment?
    >
    > As mentioned, the problem is not that the active amfd is missing
    1, 2 or
    > 3 counters to be reset, it misses to process the mds down event
    properly.
    >
    > Are you able to reproduce this problem?
    >
    > Thanks
    > Minh
    >
    > On 09/07/18 16:33, Tran Thuan wrote:
    >> Hi Minh,
    >>
    >> Since AMFND was not up before, the other fields are not updated yet.
    >> I think only 3 fields are changed in this scenario, that's why my
    patch resets only them.
    >> Do you think so?
    >>
    >> Best Regards,
    >> Thuan
    >>
    >> -----Original Message-----
    >> From: Minh Hon Chau <minh.c...@dektech.com.au
    <mailto:minh.c...@dektech.com.au>>
    >> Sent: Monday, July 9, 2018 1:22 PM
    >> To: Tran Thuan <thuan.t...@dektech.com.au
    <mailto:thuan.t...@dektech.com.au>>; nagen...@hasolutions.in
    <mailto:nagen...@hasolutions.in>;
    >> hans.nordeb...@ericsson.com
    <mailto:hans.nordeb...@ericsson.com>; gary....@dektech.com.au
    <mailto:gary....@dektech.com.au>
    >> Cc: opensaf-devel@lists.sourceforge.net
    <mailto:opensaf-devel@lists.sourceforge.net>
    >> Subject: Re: [PATCH 1/1] amf: amfd should reset msg_id counter to
    >> avoid message ID mismatch [#2891]
    >>
    >> Hi,
    >>
    >> Ok, so we can ignore the "AMFND already up" case.
    >>
    >> "- AMFND not yet up before (node reboot up/first up): this is
    our issue.
    >> + current only delete node without care msg_id counter."
    >>
    >> Are you able to reproduce this problem?
    >> You only reset the msg_id counter of a node, how about the others:
    >> adest, recvr_fail_sw, node_info.initialViewNumber,
    nodes_exit_cnt,..
    >> they all are checked in the other places.
    >> Also, when the node goes down, the standby resets all of above
    variables directly, while the active doesn't as of this issue or
    only resets msg_id counters within your patch. The active and
    standby may not be consistent.
    >>
    >> Another thing you may consider, if any new/old
    attributes/variables of AVD_AVND is going to be added/removed, we
    only look at one place - event of MDS down, don't have to look
    at/reset those ones in the CLM callback.
    >>
    >> Thanks
    >> Minh
    >>
    >> On 09/07/18 12:12, Tran Thuan wrote:
    >>> Hi Minh,
    >>>
    >>> OK, so CLM callback can handle before/after AMFND down.
    >>> In case AMFND down then CLM callback, current flow works
    correctly.
    >>> In case CLM callback then AMFND down:
    >>> - AMFND already up before: current flow will do
    clm_node_exit_complete(), still works correctly.
    >>> - AMFND not yet up before (node reboot up/first up): this is
    our issue.
    >>> + current only delete node without care msg_id counter.
    >>>
    >>> Best Regards,
    >>> Thuan
    >>>
    >>> -----Original Message-----
    >>> From: Minh Hon Chau <minh.c...@dektech.com.au
    <mailto:minh.c...@dektech.com.au>>
    >>> Sent: Monday, July 9, 2018 8:52 AM
    >>> To: Tran Thuan <thuan.t...@dektech.com.au
    <mailto:thuan.t...@dektech.com.au>>; nagen...@hasolutions.in
    <mailto:nagen...@hasolutions.in>;
    >>> hans.nordeb...@ericsson.com
    <mailto:hans.nordeb...@ericsson.com>; gary....@dektech.com.au
    <mailto:gary....@dektech.com.au>
    >>> Cc: opensaf-devel@lists.sourceforge.net
    <mailto:opensaf-devel@lists.sourceforge.net>
    >>> Subject: Re: [PATCH 1/1] amf: amfd should reset msg_id counter to
    >>> avoid message ID mismatch [#2891]
    >>>
    >>> This problem can possibly happen in a normal situation, not only in
    the "AMFND not yet up" case, so seeing it as a race can also fix the
    problem in the "AMFND already up" case.
    >>>
    >>>
    >>> On 09/07/18 11:41, Minh Hon Chau wrote:
    >>>> Hi Thuan,
    >>>>
    >>>> I often see mds down coming before clm callback in my test, the
    >>>> problem in this ticket has those events coming reverse order.
    >>>>
    >>>> <143>1 2018-07-09T11:19:22.185279+10:00 SC-1 osafamfd 372
    osafamfd
    >>>> [meta sequenceId="23028"] 372:amf/amfd/ndfsm.cc:772 >>
    >>>> avd_mds_avnd_down_evh: 2030f, 0x55895024e410
    >>>> <143>1 2018-07-09T11:19:22.185303+10:00 SC-1 osafamfd 372
    osafamfd
    >>>> [meta sequenceId="23029"] 372:amf/amfd/ndproc.cc:1219 >>
    >>>> avd_node_failover: 'safAmfNode=PL-3,safAmfCluster=myAmfCluster'
    >>>> <143>1 2018-07-09T11:19:22.185324+10:00 SC-1 osafamfd 372
    osafamfd
    >>>> [meta sequenceId="23030"] 372:amf/amfd/ndfsm.cc:1111 >>
    >>>> avd_node_mark_absent ...
    >>>> <143>1 2018-07-09T11:19:22.253699+10:00 SC-1 osafamfd 372
    osafamfd
    >>>> [meta sequenceId="23890"] 372:amf/amfd/ndfsm.cc:821 <<
    >>>> avd_mds_avnd_down_evh
    >>>>
    >>>> <143>1 2018-07-09T11:19:22.272188+10:00 SC-1 osafamfd 372
    osafamfd
    >>>> [meta sequenceId="23917"] 372:clm/agent/clma_api.cc:786 >>
    >>>> saClmDispatch ...
    >>>> <143>1 2018-07-09T11:19:22.272268+10:00 SC-1 osafamfd 372
    osafamfd
    >>>> [meta sequenceId="23926"] 372:amf/amfd/clm.cc:223 >>
    clm_track_cb: '0'
    >>>> '4' '1'
    >>>> <143>1 2018-07-09T11:19:22.272277+10:00 SC-1 osafamfd 372
    osafamfd
    >>>> [meta sequenceId="23927"] 372:amf/amfd/clm.cc:239 TR
    >>>> numberOfMembers:'4', numberOfItems:'1'
    >>>> <143>1 2018-07-09T11:19:22.272285+10:00 SC-1 osafamfd 372
    osafamfd
    >>>> [meta sequenceId="23928"] 372:amf/amfd/clm.cc:245 TR i = 0,
    >>>> node:'safNode=PL-3,safCluster=myClmCluster', clusterChange:3
    >>>> <142>1 2018-07-09T11:19:22.272294+10:00 SC-1 osafamfd 372
    osafamfd
    >>>> [meta sequenceId="23929"] 372:amf/amfd/clm.cc:320 IN
    clm_track_cb:
    >>>> CLM node 'safNode=PL-3,safCluster=myClmCluster' is not an AMF
    >>>> cluster member; MDS down received
    >>>> <143>1 2018-07-09T11:19:22.272303+10:00 SC-1 osafamfd 372
    osafamfd
    >>>> [meta sequenceId="23930"] 372:amf/amfd/clm.cc:443 <<
    clm_track_cb ...
    >>>> <143>1 2018-07-09T11:19:22.272352+10:00 SC-1 osafamfd 372
    osafamfd
    >>>> [meta sequenceId="23936"] 372:clm/agent/clma_api.cc:834 <<
    >>>> saClmDispatch
    >>>>
    >>>>
    >>>> Reset counter can avoid the msg mismatch, but I'm not sure it's
    >>>> completely fixed since the mds down event is not handled
    properly.
    >>>> If you have mds down event handled properly in both cases, will
    >>>> there be this msg mismatch problem?
    >>>>
    >>>> Thanks,
    >>>> Minh
    >>>>
    >>>> On 09/07/18 11:11, Tran Thuan wrote:
    >>>>> Hi Minh,
    >>>>>
    >>>>> If I understand correctly, CLM callback always before AMFND
    down/up
    >>>>> event.
    >>>>> The problem is just because AMFND not yet up and CLM callback
    >>>>> receive MDS down of node and delete node.
    >>>>> After that AMFND down event come cannot find the node,
    cannot find
    >>>>> the node is expected I guess.
    >>>>> So, we only need to reset counter to make thing right, what
    do you
    >>>>> think?
    >>>>>
    >>>>> <143>1 2018-07-03T18:40:27.239707+10:00 SC-2-1 osafamfd 4815
    >>>>> osafamfd [meta sequenceId="2448"] 4815:amf/amfd/clm.cc:223
    >> clm_track_cb:
    >>>>> '0' '4' '1'
    >>>>> <142>1 2018-07-03T18:40:27.239778+10:00 SC-2-1 osafamfd 4815
    >>>>> osafamfd [meta sequenceId="2451"] 4815:amf/amfd/clm.cc:299
    IN clm_track_cb:
    >>>>> CLM node 'safNode=PL-2-3,safCluster=myClmCluster' is not an AMF
    >>>>> cluster member; MDS down received
    >>>>> <143>1 2018-07-03T18:40:27.239792+10:00 SC-2-1 osafamfd 4815
    >>>>> osafamfd [meta sequenceId="2452"] 4815:amf/amfd/clm.cc:418 <<
    >>>>> clm_track_cb
    >>>>> <143>1 2018-07-03T18:40:27.243287+10:00 SC-2-1 osafamfd 4815
    >>>>> osafamfd [meta sequenceId="2465"] 4815:amf/amfd/ndfsm.cc:772 >>
    >>>>> avd_mds_avnd_down_evh: 2030f, (nil)
    >>>>> <143>1 2018-07-03T18:40:27.243307+10:00 SC-2-1 osafamfd 4815
    >>>>> osafamfd [meta sequenceId="2466"] 4815:amf/amfd/ndfsm.cc:821 <<
    >>>>> avd_mds_avnd_down_evh
    >>>>>
    >>>>> <143>1 2018-07-03T18:41:37.965003+10:00 SC-2-1 osafamfd 4815
    >>>>> osafamfd [meta sequenceId="2596"] 4815:amf/amfd/clm.cc:223
    >> clm_track_cb:
    >>>>> '0' '4' '1'
    >>>>> <143>1 2018-07-03T18:41:37.965104+10:00 SC-2-1 osafamfd 4815
    >>>>> osafamfd [meta sequenceId="2605"] 4815:amf/amfd/clm.cc:418 <<
    >>>>> clm_track_cb
    >>>>> <143>1 2018-07-03T18:41:45.016379+10:00 SC-2-1 osafamfd 4815
    >>>>> osafamfd [meta sequenceId="2631"] 4815:amf/amfd/ndfsm.cc:750 TR
    >>>>> amfnd on 2030f is up
    >>>>> <143>1 2018-07-03T18:41:45.139262+10:00 SC-2-1 osafamfd 4815
    >>>>> osafamfd [meta sequenceId="2686"] 4815:amf/amfd/ndfsm.cc:772 >>
    >>>>> avd_mds_avnd_down_evh: 2030f, 0x7ff9ac3dc940
    >>>>> <143>1 2018-07-03T18:41:45.141883+10:00 SC-2-1 osafamfd 4815
    >>>>> osafamfd [meta sequenceId="3008"] 4815:amf/amfd/ndfsm.cc:821 <<
    >>>>> avd_mds_avnd_down_evh
    >>>>>
    >>>>> Best Regards,
    >>>>> Thuan
    >>>>>
    >>>>>
    >>>>> -----Original Message-----
    >>>>> From: Minh Hon Chau <minh.c...@dektech.com.au
    <mailto:minh.c...@dektech.com.au>>
    >>>>> Sent: Monday, July 9, 2018 7:15 AM
    >>>>> To: thuan.tran <thuan.t...@dektech.com.au
    <mailto:thuan.t...@dektech.com.au>>;
    >>>>> nagen...@hasolutions.in <mailto:nagen...@hasolutions.in>;
    hans.nordeb...@ericsson.com <mailto:hans.nordeb...@ericsson.com>;
    >>>>> gary....@dektech.com.au <mailto:gary....@dektech.com.au>
    >>>>> Cc: opensaf-devel@lists.sourceforge.net
    <mailto:opensaf-devel@lists.sourceforge.net>
    >>>>> Subject: Re: [PATCH 1/1] amf: amfd should reset msg_id
    counter to
    >>>>> avoid message ID mismatch [#2891]
    >>>>>
    >>>>> Hi,
    >>>>>
    >>>>> In normal situation when I stop Opensaf or reboot the node,
    I often
    >>>>> get MDS down first, then CLM track callback. [1]
    >>>>>
    >>>>> The problem in this ticket is because AMFD has CLM track
    callback
    >>>>> coming before MDS down. [2]
    >>>>>
    >>>>> We need to see it as a race condition, not because of
    missing the
    >>>>> counters reset, so that the amfd's function to deal with MDS
    down
    >>>>> should be called. That would be equivalent to [1], which is
    >>>>> currently working.
    >>>>>
    >>>>> With this patch, we still don't have avd_mds_avnd_down_evh()
    to be
    >>>>> called, that function processes many things in the event of
    mds down.
    >>>>> (ie. @nodes_exit_cnt, ...)
    >>>>>
    >>>>> I think the node_info.member should be set TRUE or FALSE in the
    >>>>> event of CLM callback since it means CLM membership. We can use
    >>>>> both @member and one of variables indicating MDS down, to
    trap each
    >>>>> other in the events of MDS down and CLM callback, so that in the
    >>>>> end we have
    >>>>> avd_mds_avnd_down_evh() and nodeid is removed out of
    node_id_db in
    >>>>> both [1] and [2].
    >>>>>
    >>>>> Thanks,
    >>>>> Minh
    >>>>> On 06/07/18 20:36, thuan.tran wrote:
    >>>>>> There is a case that after AMFD send reboot order due to
    “out of
    >>>>>> sync window”.
    >>>>>> AMFD receive CLM track callback but node is not member yet and
    >>>>>> delete node.
    >>>>>> Later AMFND MDS down will not reset msg_id counter since it
    cannot
    >>>>>> find node.
    >>>>>> When node reboot up, AMFD continue use current msg_id
    counter send
    >>>>>> to AMFND cause messasge ID mismatch in AMFND then AMFND order
    >>>>>> reboot itself node.
    >>>>>> ---
    >>>>>> src/amf/amfd/clm.cc | 7 ++++++-
    >>>>>> 1 file changed, 6 insertions(+), 1 deletion(-)
    >>>>>>
    >>>>>> diff --git a/src/amf/amfd/clm.cc b/src/amf/amfd/clm.cc
    >>>>>> index e113a65f9..4a15d5ad7 100644
    >>>>>> --- a/src/amf/amfd/clm.cc
    >>>>>> +++ b/src/amf/amfd/clm.cc
    >>>>>> @@ -316,9 +316,14 @@ static void clm_track_cb(
    >>>>>>                    __FUNCTION__, node_name.c_str());
    >>>>>>             goto done;
    >>>>>>           } else if (node->node_state == AVD_AVND_STATE_ABSENT) {
    >>>>>> -           LOG_IN("%s: CLM node '%s' is not an AMF cluster member; MDS down received",
    >>>>>> +           LOG_IN("%s: CLM node '%s' is ABSENT; MDS down received",
    >>>>>>                    __FUNCTION__, node_name.c_str());
    >>>>>>             avd_node_delete_nodeid(node);
    >>>>>> +           /* Reset msg_id because AVND MDS down may come later
    >>>>>> +              and cannot find node to reset these, cause message ID
    >>>>>> +              mismatch. */
    >>>>>> +           node->rcv_msg_id = 0;
    >>>>>> +           node->snd_msg_id = 0;
    >>>>>> +           node->node_info.member = SA_FALSE;
    >>>>>>             goto done;
    >>>>>>           }
    >>>>>> TRACE(" Node Left: rootCauseEntity %s for node %u",
    >>
    >
    >

