You need to make sure that @node is also properly removed from node_id_db.

amfd calls avd_node_find_nodeid() in many places; something could go wrong if @node is duplicated in node_id_db.
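
To illustrate the kind of cleanup I mean, a rough sketch only (the helper name is hypothetical, and where it would be called from, e.g. avd_mds_avnd_down_evh()/avd_node_mark_absent(), would need to be checked against the code):

  /* Sketch only: whichever path finally tears the node down should also
     drop the node id entry, otherwise avd_node_find_nodeid() callers could
     later pick up a stale or duplicated record for the same node id. */
  static void node_down_cleanup(AVD_AVND *node) {  // hypothetical helper name
    node->rcv_msg_id = 0;               // counters restart on the next join
    node->snd_msg_id = 0;
    node->node_info.member = SA_FALSE;  // no longer a CLM member
    avd_node_delete_nodeid(node);       // drops the node id from node_id_db
  }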


On 10/07/18 18:46, Tran Thuan wrote:

Hi Minh,

What do you think about this?

  static void clm_track_cb(
...
          } else if (node->node_state == AVD_AVND_STATE_ABSENT) {
            LOG_IN("%s: CLM node '%s' is not an AMF cluster member; MDS down received",
                   __FUNCTION__, node_name.c_str());
            //avd_node_delete_nodeid(node);    [Thuan]: Don't delete the node; let avnd_down() do the proper cleanup
            goto done;
          }
...

Best Regards,

Thuan

From: Minh Hon Chau <minh.c...@dektech.com.au>
Sent: Tuesday, July 10, 2018 3:39 PM
To: Tran Thuan <thuan.t...@dektech.com.au>; nagen...@hasolutions.in; hans.nordeb...@ericsson.com; gary....@dektech.com.au
Cc: opensaf-devel@lists.sourceforge.net
Subject: Re: [PATCH 1/1] amf: amfd should reset msg_id counter to avoid message ID mismatch [#2891]

Hi Thuan,

Please see comments inline.

Thanks

Minh

On 10/07/18 18:16, Tran Thuan wrote:

    Hi Nagu,

    I just updated the method to reproduce the issue reported in #2891.

    It happens after a TIPC disconnect/connect from PL-3.

    Minh raised the issue where the SC comes back after headless and then
    the PL reboots after its info has been synced to the SC.

    @Minh: about your amfd trace. I would like to confirm some points:

      * Why does the CLM callback receive MDS down at 14:54:28, yet amfnd up
        is received later, and amfnd down not until 14:58:28?

    Also, I don't see the trace "avd_mds_avnd_down_evh: 2040f, (nil)" that
    the ticket's analysis mentions.

      * Did you reproduce this with official OpenSAF code (no hacked code),
        just rebooting PL-4 after the SC came back from headless?

[Minh]: I added a return in avd_mds_avnd_down_evh() because I wanted to see how amfd reacts if the CLM callback comes first (I have never been able to reproduce this problem). But does it matter? The point is that, if a PL is rebooted, both the CLM callback and the MDS down can come before the PL joins the AMF cluster, and those events can also come after amfd processes the synced assignments, because of the sync window.

      * Do you have any suggestion for a solution?

Current code:
  static void clm_track_cb(
...
          } else if (node->node_state == AVD_AVND_STATE_ABSENT) {
            LOG_IN("%s: CLM node '%s' is not an AMF cluster member; MDS down received",
                   __FUNCTION__, node_name.c_str());
            avd_node_delete_nodeid(node);
            goto done;
          }
...
The *if* is not strong enough to show that MDS down has actually been received. I think you can use @amfnd_svc_db (or something else) to indicate that MDS down really did come. And you need to delete the nodeId from node_id_db too.
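
For example, something along these lines, as a rough sketch only (not a tested patch; 'mds_down_received' here is a hypothetical per-node flag that the MDS down handling would set, and @amfnd_svc_db or an existing field could play that role instead):

  static void clm_track_cb(
...
          } else if (node->node_state == AVD_AVND_STATE_ABSENT) {
            if (node->mds_down_received) {  // hypothetical flag, set when amfnd MDS down is handled
              LOG_IN("%s: CLM node '%s' is not an AMF cluster member; MDS down received",
                     __FUNCTION__, node_name.c_str());
              avd_node_delete_nodeid(node);  // also drops the entry from node_id_db
            } else {
              // CLM left before the amfnd MDS down arrived: keep the node so
              // avd_mds_avnd_down_evh() can still do the full down processing
              // (counter reset, node_id_db removal, failover) later.
              LOG_IN("%s: CLM node '%s' left CLM before amfnd MDS down",
                     __FUNCTION__, node_name.c_str());
            }
            goto done;
          }
...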


    <143>1 2018-07-10T14:54:27.369871+10:00 SC-1 osafamfd 297 osafamfd
    [meta sequenceId="11063"] 297:amf/amfd/siass.cc:1127 >>
    avd_susi_recreate: msg_id: 0 node_id: 132111 num_sisu: 4

    <143>1 2018-07-10T14:54:27.377689+10:00 SC-1 osafamfd 297 osafamfd
    [meta sequenceId="11320"] 297:amf/amfd/siass.cc:1195 <<
    avd_susi_recreate

    <142>1 2018-07-10T14:54:28.270464+10:00 SC-1 osafamfd 297 osafamfd
    [meta sequenceId="12725"] 297:amf/amfd/clm.cc:320 IN clm_track_cb:
    CLM node 'safNode=PL-4,safCluster=myClmCluster' is not an AMF
    cluster member; MDS down received

    <143>1 2018-07-10T14:54:31.16261+10:00 SC-1 osafamfd 297 osafamfd
    [meta sequenceId="17815"] 297:amf/amfd/ndfsm.cc:752 TR amfnd on
    2040f is up

    <143>1 2018-07-10T14:54:31.422545+10:00 SC-1 osafamfd 297 osafamfd
    [meta sequenceId="17886"] 297:amf/amfd/ndfsm.cc:296 >>
    avd_node_up_evh: from 2040f,
    safAmfNode=PL-4,safAmfCluster=myAmfCluster

    <143>1 2018-07-10T14:54:31.422868+10:00 SC-1 osafamfd 297 osafamfd
    [meta sequenceId="17899"] 297:amf/amfd/node.cc:461 >>
    avd_node_state_set: 'safAmfNode=PL-4,safAmfCluster=myAmfCluster'
    ABSENT => NO_CONFIG

    <143>1 2018-07-10T14:54:31.42297+10:00 SC-1 osafamfd 297 osafamfd
    [meta sequenceId="17904"] 297:amf/amfd/node.cc:466 <<
    avd_node_state_set

    <143>1 2018-07-10T14:54:31.423114+10:00 SC-1 osafamfd 297 osafamfd
    [meta sequenceId="17910"] 297:amf/amfd/ndfsm.cc:536 <<
    avd_node_up_evh

    <143>1 2018-07-10T14:54:31.576633+10:00 SC-1 osafamfd 297 osafamfd
    [meta sequenceId="18023"] 297:amf/amfd/node.cc:461 >>
    avd_node_state_set: 'safAmfNode=PL-4,safAmfCluster=myAmfCluster'
    NO_CONFIG => NCS_INIT

    <143>1 2018-07-10T14:54:31.576689+10:00 SC-1 osafamfd 297 osafamfd
    [meta sequenceId="18028"] 297:amf/amfd/node.cc:466 <<
    avd_node_state_set

    <143>1 2018-07-10T14:58:28.032461+10:00 SC-1 osafamfd 297 osafamfd
    [meta sequenceId="21836"] 300:amf/amfd/mds.cc:463 TR avnd
    2040fe497957d down

    <143>1 2018-07-10T14:58:28.032592+10:00 SC-1 osafamfd 297 osafamfd
    [meta sequenceId="21837"] 297:amf/amfd/main.cc:766 >>
    process_event: evt->rcv_evt 21

    <143>1 2018-07-10T14:58:28.032613+10:00 SC-1 osafamfd 297 osafamfd
    [meta sequenceId="21838"] 297:amf/amfd/ndfsm.cc:774 >>
    avd_mds_avnd_down_evh: 2040f, 0x55b62f9dcff0

    <143>1 2018-07-10T14:58:28.032641+10:00 SC-1 osafamfd 297 osafamfd
    [meta sequenceId="21839"] 297:amf/amfd/ndproc.cc:1219 >>
    avd_node_failover: 'safAmfNode=PL-4,safAmfCluster=myAmfCluster'

    <143>1 2018-07-10T14:58:28.032656+10:00 SC-1 osafamfd 297 osafamfd
    [meta sequenceId="21840"] 297:amf/amfd/ndfsm.cc:1113 >>
    avd_node_mark_absent

    <143>1 2018-07-10T14:58:28.032679+10:00 SC-1 osafamfd 297 osafamfd
    [meta sequenceId="21841"] 297:amf/amfd/imm.cc:1887 >>
    avd_saImmOiRtObjectUpdate:
    'safAmfNode=PL-4,safAmfCluster=myAmfCluster' saAmfNodeOperState

    <143>1 2018-07-10T14:58:28.032722+10:00 SC-1 osafamfd 297 osafamfd
    [meta sequenceId="21842"] 297:amf/amfd/imm.cc:1917 <<
    avd_saImmOiRtObjectUpdate

    <143>1 2018-07-10T14:58:28.032738+10:00 SC-1 osafamfd 297 osafamfd
    [meta sequenceId="21843"] 297:amf/amfd/node.cc:461 >>
    avd_node_state_set: 'safAmfNode=PL-4,safAmfCluster=myAmfCluster'
    NCS_INIT => ABSENT

    <143>1 2018-07-10T14:58:28.032822+10:00 SC-1 osafamfd 297 osafamfd
    [meta sequenceId="21848"] 297:amf/amfd/node.cc:466 <<
    avd_node_state_set

    <143>1 2018-07-10T14:58:28.033059+10:00 SC-1 osafamfd 297 osafamfd
    [meta sequenceId="21858"] 297:amf/amfd/ndfsm.cc:1134 <<
    avd_node_mark_absent

    <143>1 2018-07-10T14:58:28.033102+10:00 SC-1 osafamfd 297 osafamfd
    [meta sequenceId="21861"] 297:amf/amfd/sgproc.cc:2177 >>
    avd_node_down_mw_susi_failover:
    'safAmfNode=PL-4,safAmfCluster=myAmfCluster'

    <143>1 2018-07-10T14:58:28.060869+10:00 SC-1 osafamfd 297 osafamfd
    [meta sequenceId="22222"] 297:amf/amfd/sgproc.cc:2223 <<
    avd_node_down_mw_susi_failover

    <143>1 2018-07-10T14:58:28.06088+10:00 SC-1 osafamfd 297 osafamfd
    [meta sequenceId="22223"] 297:amf/amfd/sgproc.cc:2238 >>
    avd_node_down_appl_susi_failover:
    'safAmfNode=PL-4,safAmfCluster=myAmfCluster'

    <143>1 2018-07-10T14:58:28.076943+10:00 SC-1 osafamfd 297 osafamfd
    [meta sequenceId="22554"] 297:amf/amfd/sgproc.cc:2311 <<
    avd_node_down_appl_susi_failover

    <143>1 2018-07-10T14:58:28.077075+10:00 SC-1 osafamfd 297 osafamfd
    [meta sequenceId="22562"] 297:amf/amfd/ndproc.cc:1234 <<
    avd_node_failover

    Best Regards,

    Thuan

    From: nagen...@hasolutions.in
    Sent: Tuesday, July 10, 2018 1:36 PM
    To: Minh Hon Chau <minh.c...@dektech.com.au>; Tran Thuan <thuan.t...@dektech.com.au>;
    hans.nordeb...@ericsson.com; gary....@dektech.com.au
    Cc: opensaf-devel@lists.sourceforge.net
    Subject: RE: Re: [PATCH 1/1] amf: amfd should reset msg_id
    counter to avoid message ID mismatch [#2891]

    Hi Thuan,

    I agree with Minh. We need to know why the node-left event came first and
    the MDS down came later, and about the other flows. It would be great if
    you could reproduce the scenario and upload the traces of the payload and
    the controller (please keep the time in sync) to the ticket.

    Thanks,

    Nagendra, 91-9866424860

    www.hasolutions.in

    https://www.linkedin.com/company/hasolutions/

    High Availability Solutions Pvt. Ltd.

    - OpenSAF support and services

        --------- Original Message ---------

        Subject: Re: [PATCH 1/1] amf: amfd should reset msg_id counter
        to avoid message ID mismatch [#2891]
        From: "Minh Hon Chau" <minh.c...@dektech.com.au
        <mailto:minh.c...@dektech.com.au>>
        Date: 7/10/18 10:49 am
        To: "Tran Thuan" <thuan.t...@dektech.com.au
        <mailto:thuan.t...@dektech.com.au>>, nagen...@hasolutions.in
        <mailto:nagen...@hasolutions.in>, hans.nordeb...@ericsson.com
        <mailto:hans.nordeb...@ericsson.com>, gary....@dektech.com.au
        <mailto:gary....@dektech.com.au>
        Cc: opensaf-devel@lists.sourceforge.net
        <mailto:opensaf-devel@lists.sourceforge.net>

        Hi again,

        I am not sure how you replicated this scenario. It looks to me like
        you deliberately rebooted PL-3 before amfd processed the synced
        assignments.

        The node_up sequence has many branches; you are replicating one of
        them.

        What I mean is that, at the time amfd calls
        avd_process_state_info_queue(), the PL has not joined the AMF cluster
        and keeps resending the node_up message every second.

        When amfd finishes creating the synced assignments, the node sync
        window is still open and cb->init_state has not moved to APP_STATE,
        so the node_up from the PL is still acceptable. At this point someone
        can reboot the PL.

        For your reference, I attached the full amfd trace to the ticket. I
        had to turn off PL-3 so that the node sync window stays open for 10
        seconds, and then I rebooted PL-4. Please note that this scenario of
        a node sync timeout did happen at application sites.

        Resetting the counters in the CLM callback is the easy way, but it
        does not fully fix the issue or make us safe from it.

        Thanks

        Minh


        On 10/07/18 14:00, Tran Thuan wrote:
        > Hi Minh,
        >
        > I checked your concern about the PL going down before sending
        > node_up but after the assignments were synced (recovery after
        > headless).
        > The current code still works correctly, because the active AMFD
        > doesn't recreate the assignments.
        > So my point of view still stands: just reset the 3 fields as in
        > the current patch.
        >
        > Jul 10 13:40:05 PL-2-3 osafamfnd[5971]: NO Synced
        SISU:safSi=NoRed2,safApp=OpenSAF,safSu=PL-3,safSg=NoRed,safApp=OpenSAF
        <1, 3>
        > Jul 10 13:40:05 PL-2-3 osafamfnd[5971]: NO Synced
        SU:safSu=PL-3,safSg=NoRed,safApp=OpenSAF <0, 1, 3>
        > Jul 10 13:40:05 PL-2-3 osafamfnd[5971]: NO 8 CSICOMP states sent
        > Jul 10 13:40:05 PL-2-3 osafamfnd[5971]: NO 8 COMP states sent
        > Jul 10 13:40:05 PL-2-3 osafamfnd[5971]: NO Sending node up
        due to NCSMDS_NEW_ACTIVE
        > Jul 10 13:40:05 PL-2-3 osafamfnd[5971]: NO
        ====================delay to power off this node
        ====================
        > Jul 10 13:40:05 PL-2-3 osafckptnd[5984]: NO CLM selection
        object was updated. (17)
        > Jul 10 13:40:06 PL-2-3 osafamfnd[5971]: NO
        ====================delay to power off this node
        ====================
        > Jul 10 13:40:07 PL-2-3 osafamfnd[5971]: NO
        ====================delay to power off this node
        ====================
        > Jul 10 13:40:08 PL-2-3 osafamfnd[5971]: NO
        ====================delay to power off this node
        ====================
        >
        > <143>1 2018-07-10T13:40:15.365252+10:00 SC-2-1 osafamfd 4827
        osafamfd [meta sequenceId="13233"] 4827:amf/amfd/ndfsm.cc:296
        >> avd_node_up_evh: from 2010f,
        safAmfNode=SC-1,safAmfCluster=myAmfCluster
        > <143>1 2018-07-10T13:40:15.365259+10:00 SC-2-1 osafamfd 4827
        osafamfd [meta sequenceId="13234"] 4827:amf/amfd/node.cc:231
        >> avd_node_get: safAmfNode=SC-1,safAmfCluster=myAmfCluster
        > <143>1 2018-07-10T13:40:15.365263+10:00 SC-2-1 osafamfd 4827
        osafamfd [meta sequenceId="13235"] 4827:amf/amfd/node.cc:232
        << avd_node_get
        > <143>1 2018-07-10T13:40:15.365314+10:00 SC-2-1 osafamfd 4827
        osafamfd [meta sequenceId="13236"] 4827:amf/amfd/ndfsm.cc:216
        >> avd_count_node_up
        > <143>1 2018-07-10T13:40:15.36532+10:00 SC-2-1 osafamfd 4827
        osafamfd [meta sequenceId="13237"] 4827:amf/amfd/ndfsm.cc:226
        TR Number of node director(s) that director received node_up msg:0
        > <143>1 2018-07-10T13:40:15.365325+10:00 SC-2-1 osafamfd 4827
        osafamfd [meta sequenceId="13238"] 4827:amf/amfd/ndfsm.cc:228
        << avd_count_node_up
        > <143>1 2018-07-10T13:40:15.365329+10:00 SC-2-1 osafamfd 4827
        osafamfd [meta sequenceId="13239"] 4827:amf/amfd/ndfsm.cc:353
        TR NodeSync timer is active, ignore this node_up msg
        (nodeid:2010f)
        > <143>1 2018-07-10T13:40:15.365334+10:00 SC-2-1 osafamfd 4827
        osafamfd [meta sequenceId="13240"] 4827:amf/amfd/ndfsm.cc:534
        << avd_node_up_evh
        >
        > <143>1 2018-07-10T13:40:16.462939+10:00 SC-2-1 osafamfd 4827
        osafamfd [meta sequenceId="13280"] 4827:amf/amfd/ndfsm.cc:296
        >> avd_node_up_evh: from 2010f,
        safAmfNode=SC-1,safAmfCluster=myAmfCluster
        > <141>1 2018-07-10T13:40:16.462975+10:00 SC-2-1 osafamfd 4827
        osafamfd [meta sequenceId="13281"] 4827:amf/amfd/ndfsm.cc:254
        NO Received node_up from 2010f: msg_id 1
        > <143>1 2018-07-10T13:40:16.462981+10:00 SC-2-1 osafamfd 4827
        osafamfd [meta sequenceId="13282"] 4827:amf/amfd/ndfsm.cc:63
        >> avd_process_state_info_queue
        > <143>1 2018-07-10T13:40:16.462986+10:00 SC-2-1 osafamfd 4827
        osafamfd [meta sequenceId="13283"] 4827:amf/amfd/ndfsm.cc:65
        TR queue_size before processing: 2
        > <143>1 2018-07-10T13:40:16.463162+10:00 SC-2-1 osafamfd 4827
        osafamfd [meta sequenceId="13284"] 4827:amf/amfd/ndfsm.cc:73
        TR rcv_evt: 12
        > <143>1 2018-07-10T13:40:16.463175+10:00 SC-2-1 osafamfd 4827
        osafamfd [meta sequenceId="13285"] 4827:amf/amfd/ndfsm.cc:79
        TR msg_type: 31
        > <143>1 2018-07-10T13:40:16.463203+10:00 SC-2-1 osafamfd 4827
        osafamfd [meta sequenceId="13286"] 4827:amf/amfd/siass.cc:1132
        >> avd_susi_recreate: msg_id: 0 node_id: 131855 num_sisu: 3
        > <139>1 2018-07-10T13:40:16.463224+10:00 SC-2-1 osafamfd 4827
        osafamfd [meta sequenceId="13287"] 4827:amf/amfd/siass.cc:1140
        ER Node 2030f has left the cluster
        > <143>1 2018-07-10T13:40:16.463258+10:00 SC-2-1 osafamfd 4827
        osafamfd [meta sequenceId="13288"] 4827:amf/amfd/siass.cc:0 <<
        avd_susi_recreate
        > <143>1 2018-07-10T13:40:16.463273+10:00 SC-2-1 osafamfd 4827
        osafamfd [meta sequenceId="13289"]
        4827:amf/common/d2nmsg.c:931 >> avsv_free_n2d_nd_sisu_state_info
        > <143>1 2018-07-10T13:40:16.463278+10:00 SC-2-1 osafamfd 4827
        osafamfd [meta sequenceId="13290"]
        4827:amf/common/d2nmsg.c:953 TR 3 SISU records to free
        >
        > Best Regards,
        > Thuan
        >
        > -----Original Message-----
        > From: Minh Hon Chau <minh.c...@dektech.com.au>
        > Sent: Monday, July 9, 2018 4:04 PM
        > To: Tran Thuan <thuan.t...@dektech.com.au>; nagen...@hasolutions.in;
        > hans.nordeb...@ericsson.com; gary....@dektech.com.au
        > Cc: opensaf-devel@lists.sourceforge.net
        > Subject: Re: [PATCH 1/1] amf: amfd should reset msg_id
        counter to avoid message ID mismatch [#2891]
        >
        > Hi,
        >
        > How about @nodes_exit_cnt? The active amfd checkpoints
        > @nodes_exit_cnt to the standby amfd inside the MDS down event. You
        > can say it's not used, but that is what happens to this counter.
        >
        > "AMFND not yet up", what you mean is node_state
        AVD_AVND_STATE_ABSENT?
        > You reset counter if node state as AVD_AVND_STATE_ABSENT in
        clm callback.
        > If a payload is coming from headless, it syncs the
        assignments to the active amfd first and enters node_up
        sequence. This node_up sequence can take up to 10 seconds
        before this payload becomes AVD_AVND_STATE_PRESENT.
        > If this issue happens during this 10 seconds, the node_state
        is still AVD_AVND_STATE_ABSENT and active amfd does not
        process mds down event properly, how could amfd remove the
        synced assignment?
        >
        > As mentioned, the problem is not that the active amfd misses
        > resetting 1, 2 or 3 counters; it fails to process the MDS down
        > event properly.
        >
        > Are you able to reproduce this problem?
        >
        > Thanks
        > Minh
        >
        > On 09/07/18 16:33, Tran Thuan wrote:
        >> Hi Minh,
        >>
        >> Since AMFND was not up before, the other fields have not been
        >> updated yet.
        >> I think only 3 fields are changed in this scenario; that's why my
        >> patch resets only them.
        >> Do you agree?
        >>
        >> Best Regards,
        >> Thuan
        >>
        >> -----Original Message-----
        >> From: Minh Hon Chau <minh.c...@dektech.com.au>
        >> Sent: Monday, July 9, 2018 1:22 PM
        >> To: Tran Thuan <thuan.t...@dektech.com.au>; nagen...@hasolutions.in;
        >> hans.nordeb...@ericsson.com; gary....@dektech.com.au
        >> Cc: opensaf-devel@lists.sourceforge.net
        >> Subject: Re: [PATCH 1/1] amf: amfd should reset msg_id
        counter to
        >> avoid message ID mismatch [#2891]
        >>
        >> Hi,
        >>
        >> Ok, so we can ignore the "AMFND already up" case.
        >>
        >> "- AMFND not yet up before (node reboot up/first up): this
        is our issue.
        >> + current only delete node without care msg_id counter."
        >>
        >> Are you able to reproduce this problem?
        >> You only reset the msg_id counters of a node; how about the others:
        >> adest, recvr_fail_sw, node_info.initialViewNumber, nodes_exit_cnt, ...
        >> They are all checked in other places.
        >> Also, when the node goes down, the standby resets all of the above
        >> variables directly, while the active either doesn't (as in this
        >> issue) or only resets the msg_id counters with your patch. The
        >> active and standby may end up inconsistent.
        >>
        >> Another thing you may consider: if any attributes/variables of
        >> AVD_AVND are going to be added or removed, we only have to look at
        >> one place, the MDS down event, and don't have to look at/reset
        >> them in the CLM callback as well.
        >>
        >> Thanks
        >> Minh
        >>
        >> On 09/07/18 12:12, Tran Thuan wrote:
        >>> Hi Minh,
        >>>
        >>> OK, so the CLM callback can arrive before or after AMFND down.
        >>> In the case of AMFND down and then the CLM callback, the current
        >>> flow works correctly.
        >>> In the case of the CLM callback and then AMFND down:
        >>> - AMFND was already up before: the current flow will do
        >>> clm_node_exit_complete() and still works correctly.
        >>> - AMFND was not yet up before (node reboots up/first up): this is
        >>> our issue.
        >>> + currently we only delete the node without taking care of the
        >>> msg_id counters.
        >>>
        >>> Best Regards,
        >>> Thuan
        >>>
        >>> -----Original Message-----
        >>> From: Minh Hon Chau <minh.c...@dektech.com.au>
        >>> Sent: Monday, July 9, 2018 8:52 AM
        >>> To: Tran Thuan <thuan.t...@dektech.com.au>; nagen...@hasolutions.in;
        >>> hans.nordeb...@ericsson.com; gary....@dektech.com.au
        >>> Cc: opensaf-devel@lists.sourceforge.net
        >>> Subject: Re: [PATCH 1/1] amf: amfd should reset msg_id
        counter to
        >>> avoid message ID mismatch [#2891]
        >>>
        >>> This problem can possibly happen in a normal situation, not only
        >>> in the "AMFND not yet up" case, so seeing it as a race also fixes
        >>> the problem in the "AMFND already up" case.
        >>>
        >>>
        >>> On 09/07/18 11:41, Minh Hon Chau wrote:
        >>>> Hi Thuan,
        >>>>
        >>>> I often see the MDS down coming before the CLM callback in my
        >>>> tests; the problem in this ticket has those events coming in
        >>>> reverse order.
        >>>>
        >>>> <143>1 2018-07-09T11:19:22.185279+10:00 SC-1 osafamfd 372
        osafamfd
        >>>> [meta sequenceId="23028"] 372:amf/amfd/ndfsm.cc:772 >>
        >>>> avd_mds_avnd_down_evh: 2030f, 0x55895024e410
        >>>> <143>1 2018-07-09T11:19:22.185303+10:00 SC-1 osafamfd 372
        osafamfd
        >>>> [meta sequenceId="23029"] 372:amf/amfd/ndproc.cc:1219 >>
        >>>> avd_node_failover:
        'safAmfNode=PL-3,safAmfCluster=myAmfCluster'
        >>>> <143>1 2018-07-09T11:19:22.185324+10:00 SC-1 osafamfd 372
        osafamfd
        >>>> [meta sequenceId="23030"] 372:amf/amfd/ndfsm.cc:1111 >>
        >>>> avd_node_mark_absent ...
        >>>> <143>1 2018-07-09T11:19:22.253699+10:00 SC-1 osafamfd 372
        osafamfd
        >>>> [meta sequenceId="23890"] 372:amf/amfd/ndfsm.cc:821 <<
        >>>> avd_mds_avnd_down_evh
        >>>>
        >>>> <143>1 2018-07-09T11:19:22.272188+10:00 SC-1 osafamfd 372
        osafamfd
        >>>> [meta sequenceId="23917"] 372:clm/agent/clma_api.cc:786 >>
        >>>> saClmDispatch ...
        >>>> <143>1 2018-07-09T11:19:22.272268+10:00 SC-1 osafamfd 372
        osafamfd
        >>>> [meta sequenceId="23926"] 372:amf/amfd/clm.cc:223 >>
        clm_track_cb: '0'
        >>>> '4' '1'
        >>>> <143>1 2018-07-09T11:19:22.272277+10:00 SC-1 osafamfd 372
        osafamfd
        >>>> [meta sequenceId="23927"] 372:amf/amfd/clm.cc:239 TR
        >>>> numberOfMembers:'4', numberOfItems:'1'
        >>>> <143>1 2018-07-09T11:19:22.272285+10:00 SC-1 osafamfd 372
        osafamfd
        >>>> [meta sequenceId="23928"] 372:amf/amfd/clm.cc:245 TR i = 0,
        >>>> node:'safNode=PL-3,safCluster=myClmCluster', clusterChange:3
        >>>> <142>1 2018-07-09T11:19:22.272294+10:00 SC-1 osafamfd 372
        osafamfd
        >>>> [meta sequenceId="23929"] 372:amf/amfd/clm.cc:320 IN
        clm_track_cb:
        >>>> CLM node 'safNode=PL-3,safCluster=myClmCluster' is not an AMF
        >>>> cluster member; MDS down received
        >>>> <143>1 2018-07-09T11:19:22.272303+10:00 SC-1 osafamfd 372
        osafamfd
        >>>> [meta sequenceId="23930"] 372:amf/amfd/clm.cc:443 <<
        clm_track_cb ...
        >>>> <143>1 2018-07-09T11:19:22.272352+10:00 SC-1 osafamfd 372
        osafamfd
        >>>> [meta sequenceId="23936"] 372:clm/agent/clma_api.cc:834 <<
        >>>> saClmDispatch
        >>>>
        >>>>
        >>>> Resetting the counters can avoid the message mismatch, but I'm
        >>>> not sure it is a complete fix, since the MDS down event is still
        >>>> not handled properly.
        >>>> If the MDS down event were handled properly in both cases, would
        >>>> this message mismatch problem still exist?
        >>>>
        >>>> Thanks,
        >>>> Minh
        >>>>
        >>>> On 09/07/18 11:11, Tran Thuan wrote:
        >>>>> Hi Minh,
        >>>>>
        >>>>> If I understand correctly, the CLM callback always comes before
        >>>>> the AMFND down/up event.
        >>>>> The problem is just that AMFND is not up yet, so the CLM
        >>>>> callback receives the MDS down of the node and deletes the node.
        >>>>> After that, the AMFND down event comes and cannot find the node;
        >>>>> not finding the node is expected, I guess.
        >>>>> So we only need to reset the counters to make things right.
        >>>>> What do you think?
        >>>>>
        >>>>> <143>1 2018-07-03T18:40:27.239707+10:00 SC-2-1 osafamfd 4815
        >>>>> osafamfd [meta sequenceId="2448"]
        4815:amf/amfd/clm.cc:223 >> clm_track_cb:
        >>>>> '0' '4' '1'
        >>>>> <142>1 2018-07-03T18:40:27.239778+10:00 SC-2-1 osafamfd 4815
        >>>>> osafamfd [meta sequenceId="2451"]
        4815:amf/amfd/clm.cc:299 IN clm_track_cb:
        >>>>> CLM node 'safNode=PL-2-3,safCluster=myClmCluster' is not
        an AMF
        >>>>> cluster member; MDS down received
        >>>>> <143>1 2018-07-03T18:40:27.239792+10:00 SC-2-1 osafamfd 4815
        >>>>> osafamfd [meta sequenceId="2452"]
        4815:amf/amfd/clm.cc:418 <<
        >>>>> clm_track_cb
        >>>>> <143>1 2018-07-03T18:40:27.243287+10:00 SC-2-1 osafamfd 4815
        >>>>> osafamfd [meta sequenceId="2465"]
        4815:amf/amfd/ndfsm.cc:772 >>
        >>>>> avd_mds_avnd_down_evh: 2030f, (nil)
        >>>>> <143>1 2018-07-03T18:40:27.243307+10:00 SC-2-1 osafamfd 4815
        >>>>> osafamfd [meta sequenceId="2466"]
        4815:amf/amfd/ndfsm.cc:821 <<
        >>>>> avd_mds_avnd_down_evh
        >>>>>
        >>>>> <143>1 2018-07-03T18:41:37.965003+10:00 SC-2-1 osafamfd 4815
        >>>>> osafamfd [meta sequenceId="2596"]
        4815:amf/amfd/clm.cc:223 >> clm_track_cb:
        >>>>> '0' '4' '1'
        >>>>> <143>1 2018-07-03T18:41:37.965104+10:00 SC-2-1 osafamfd 4815
        >>>>> osafamfd [meta sequenceId="2605"]
        4815:amf/amfd/clm.cc:418 <<
        >>>>> clm_track_cb
        >>>>> <143>1 2018-07-03T18:41:45.016379+10:00 SC-2-1 osafamfd 4815
        >>>>> osafamfd [meta sequenceId="2631"]
        4815:amf/amfd/ndfsm.cc:750 TR
        >>>>> amfnd on 2030f is up
        >>>>> <143>1 2018-07-03T18:41:45.139262+10:00 SC-2-1 osafamfd 4815
        >>>>> osafamfd [meta sequenceId="2686"]
        4815:amf/amfd/ndfsm.cc:772 >>
        >>>>> avd_mds_avnd_down_evh: 2030f, 0x7ff9ac3dc940
        >>>>> <143>1 2018-07-03T18:41:45.141883+10:00 SC-2-1 osafamfd 4815
        >>>>> osafamfd [meta sequenceId="3008"]
        4815:amf/amfd/ndfsm.cc:821 <<
        >>>>> avd_mds_avnd_down_evh
        >>>>>
        >>>>> Best Regards,
        >>>>> Thuan
        >>>>>
        >>>>>
        >>>>> -----Original Message-----
        >>>>> From: Minh Hon Chau <minh.c...@dektech.com.au>
        >>>>> Sent: Monday, July 9, 2018 7:15 AM
        >>>>> To: thuan.tran <thuan.t...@dektech.com.au>;
        >>>>> nagen...@hasolutions.in; hans.nordeb...@ericsson.com;
        >>>>> gary....@dektech.com.au
        >>>>> Cc: opensaf-devel@lists.sourceforge.net
        >>>>> Subject: Re: [PATCH 1/1] amf: amfd should reset msg_id
        counter to
        >>>>> avoid message ID mismatch [#2891]
        >>>>>
        >>>>> Hi,
        >>>>>
        >>>>> In a normal situation, when I stop OpenSAF or reboot the node,
        >>>>> I often get the MDS down first and then the CLM track callback. [1]
        >>>>>
        >>>>> The problem in this ticket is that AMFD gets the CLM track
        >>>>> callback before the MDS down. [2]
        >>>>>
        >>>>> We need to see it as a race condition, not as a matter of
        >>>>> missing counter resets, so that amfd's function for dealing
        >>>>> with MDS down is still called. That would be equivalent to [1],
        >>>>> which currently works.
        >>>>>
        >>>>> With this patch, avd_mds_avnd_down_evh() still does not get
        >>>>> called, and that function processes many things in the MDS down
        >>>>> event (e.g. @nodes_exit_cnt, ...).
        >>>>>
        >>>>> I think node_info.member should be set to TRUE or FALSE in the
        >>>>> CLM callback event, since it means CLM membership. We can use
        >>>>> both @member and one of the variables indicating MDS down to
        >>>>> trap each other in the MDS down and CLM callback events, so
        >>>>> that in the end avd_mds_avnd_down_evh() runs and the nodeid is
        >>>>> removed from node_id_db in both [1] and [2].
        >>>>>
        >>>>> Thanks,
        >>>>> Minh
        >>>>> On 06/07/18 20:36, thuan.tran wrote:
        >>>>>> There is a case where, after AMFD sends a reboot order due to
        >>>>>> "out of sync window",
        >>>>>> AMFD receives the CLM track callback while the node is not a
        >>>>>> member yet, and deletes the node.
        >>>>>> The later AMFND MDS down will then not reset the msg_id
        >>>>>> counters, since it cannot find the node.
        >>>>>> When the node reboots up, AMFD continues using the current
        >>>>>> msg_id counter when sending to AMFND, causing a message ID
        >>>>>> mismatch in AMFND, and AMFND then orders a reboot of its own
        >>>>>> node.
        >>>>>> ---
        >>>>>> src/amf/amfd/clm.cc | 7 ++++++-
        >>>>>> 1 file changed, 6 insertions(+), 1 deletion(-)
        >>>>>>
        >>>>>> diff --git a/src/amf/amfd/clm.cc b/src/amf/amfd/clm.cc
        >>>>>> index e113a65f9..4a15d5ad7 100644
        >>>>>> --- a/src/amf/amfd/clm.cc
        >>>>>> +++ b/src/amf/amfd/clm.cc
        >>>>>> @@ -316,9 +316,14 @@ static void clm_track_cb(
        >>>>>>              __FUNCTION__, node_name.c_str());
        >>>>>>       goto done;
        >>>>>>     } else if (node->node_state == AVD_AVND_STATE_ABSENT) {
        >>>>>> -     LOG_IN("%s: CLM node '%s' is not an AMF cluster member; MDS down received",
        >>>>>> +     LOG_IN("%s: CLM node '%s' is ABSENT; MDS down received",
        >>>>>>              __FUNCTION__, node_name.c_str());
        >>>>>>       avd_node_delete_nodeid(node);
        >>>>>> +     /* Reset msg_id because AVND MDS down may come later and
        >>>>>> +        cannot find node to reset these, cause message ID mismatch. */
        >>>>>> +     node->rcv_msg_id = 0;
        >>>>>> +     node->snd_msg_id = 0;
        >>>>>> +     node->node_info.member = SA_FALSE;
        >>>>>>       goto done;
        >>>>>>     }
        >>>>>>     TRACE(" Node Left: rootCauseEntity %s for node %u",
        >>
        >
        >

