Hi again,
I am not sure how you replicated this scenario. It looks to me like you
deliberately rebooted PL-3 before amfd processed the synced assignments.
The node_up sequence has many branches; you are replicating only one of them.
What I mean is that at the time amfd calls avd_process_state_info_queue(),
the PL has not joined the AMF cluster yet and keeps resending the node_up
message every second.
When amfd finishes creating the synced assignments, the node sync window is
still open and cb->init_state has not moved to APP_STATE, so the node_up
from the PL is still acceptable. At this point a PL can be rebooted by
someone.
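For illustration, a minimal sketch of that window check (the enum values and
field names here are assumed for illustration, this is not the actual amfd
code): a node_up stays acceptable while the node-sync timer runs and
init_state has not yet moved to the application state.

// Minimal sketch only, names assumed for illustration.
#include <iostream>

enum InitState { AVD_INIT_DONE, AVD_APP_STATE };  // hypothetical values

struct Cb {
  InitState init_state;
  bool node_sync_timer_active;
};

// True while the node-sync window is still open and a node_up can be accepted.
bool node_up_acceptable(const Cb& cb) {
  return cb.node_sync_timer_active && cb.init_state != AVD_APP_STATE;
}

int main() {
  Cb cb{AVD_INIT_DONE, true};
  std::cout << std::boolalpha << node_up_acceptable(cb) << '\n';  // true: window open
  cb.init_state = AVD_APP_STATE;
  cb.node_sync_timer_active = false;
  std::cout << node_up_acceptable(cb) << '\n';                    // false: window closed
}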
For your reference, I have attached the full amfd trace to the ticket. I had
to turn off PL-3 to keep the node sync window open for 10 seconds, and then
I rebooted PL-4. Please note that this node sync timeout scenario did happen
on the application side.
Resetting the counters in the CLM callback is the easy way, but it does not
fully fix the issue and does not make us safe from it.
Thanks
Minh
On 10/07/18 14:00, Tran Thuan wrote:
Hi Minh,
I checked your concern about the PL going down before sending node_up but after
the assignments are synced (recovery after headless).
The current code still works correctly, because the active AMFD does not
recreate the assignments.
So my point of view stays the same: just reset the 3 fields as in the current patch.
Jul 10 13:40:05 PL-2-3 osafamfnd[5971]: NO Synced
SISU:safSi=NoRed2,safApp=OpenSAF,safSu=PL-3,safSg=NoRed,safApp=OpenSAF <1, 3>
Jul 10 13:40:05 PL-2-3 osafamfnd[5971]: NO Synced
SU:safSu=PL-3,safSg=NoRed,safApp=OpenSAF <0, 1, 3>
Jul 10 13:40:05 PL-2-3 osafamfnd[5971]: NO 8 CSICOMP states sent
Jul 10 13:40:05 PL-2-3 osafamfnd[5971]: NO 8 COMP states sent
Jul 10 13:40:05 PL-2-3 osafamfnd[5971]: NO Sending node up due to
NCSMDS_NEW_ACTIVE
Jul 10 13:40:05 PL-2-3 osafamfnd[5971]: NO ====================delay to power
off this node ====================
Jul 10 13:40:05 PL-2-3 osafckptnd[5984]: NO CLM selection object was updated.
(17)
Jul 10 13:40:06 PL-2-3 osafamfnd[5971]: NO ====================delay to power
off this node ====================
Jul 10 13:40:07 PL-2-3 osafamfnd[5971]: NO ====================delay to power
off this node ====================
Jul 10 13:40:08 PL-2-3 osafamfnd[5971]: NO ====================delay to power
off this node ====================
<143>1 2018-07-10T13:40:15.365252+10:00 SC-2-1 osafamfd 4827 osafamfd [meta
sequenceId="13233"] 4827:amf/amfd/ndfsm.cc:296 >> avd_node_up_evh: from 2010f,
safAmfNode=SC-1,safAmfCluster=myAmfCluster
<143>1 2018-07-10T13:40:15.365259+10:00 SC-2-1 osafamfd 4827 osafamfd [meta
sequenceId="13234"] 4827:amf/amfd/node.cc:231 >> avd_node_get:
safAmfNode=SC-1,safAmfCluster=myAmfCluster
<143>1 2018-07-10T13:40:15.365263+10:00 SC-2-1 osafamfd 4827 osafamfd [meta
sequenceId="13235"] 4827:amf/amfd/node.cc:232 << avd_node_get
<143>1 2018-07-10T13:40:15.365314+10:00 SC-2-1 osafamfd 4827 osafamfd [meta
sequenceId="13236"] 4827:amf/amfd/ndfsm.cc:216 >> avd_count_node_up
<143>1 2018-07-10T13:40:15.36532+10:00 SC-2-1 osafamfd 4827 osafamfd [meta
sequenceId="13237"] 4827:amf/amfd/ndfsm.cc:226 TR Number of node director(s) that
director received node_up msg:0
<143>1 2018-07-10T13:40:15.365325+10:00 SC-2-1 osafamfd 4827 osafamfd [meta
sequenceId="13238"] 4827:amf/amfd/ndfsm.cc:228 << avd_count_node_up
<143>1 2018-07-10T13:40:15.365329+10:00 SC-2-1 osafamfd 4827 osafamfd [meta
sequenceId="13239"] 4827:amf/amfd/ndfsm.cc:353 TR NodeSync timer is active, ignore
this node_up msg (nodeid:2010f)
<143>1 2018-07-10T13:40:15.365334+10:00 SC-2-1 osafamfd 4827 osafamfd [meta
sequenceId="13240"] 4827:amf/amfd/ndfsm.cc:534 << avd_node_up_evh
<143>1 2018-07-10T13:40:16.462939+10:00 SC-2-1 osafamfd 4827 osafamfd [meta
sequenceId="13280"] 4827:amf/amfd/ndfsm.cc:296 >> avd_node_up_evh: from 2010f,
safAmfNode=SC-1,safAmfCluster=myAmfCluster
<141>1 2018-07-10T13:40:16.462975+10:00 SC-2-1 osafamfd 4827 osafamfd [meta
sequenceId="13281"] 4827:amf/amfd/ndfsm.cc:254 NO Received node_up from 2010f: msg_id
1
<143>1 2018-07-10T13:40:16.462981+10:00 SC-2-1 osafamfd 4827 osafamfd [meta
sequenceId="13282"] 4827:amf/amfd/ndfsm.cc:63 >> avd_process_state_info_queue
<143>1 2018-07-10T13:40:16.462986+10:00 SC-2-1 osafamfd 4827 osafamfd [meta
sequenceId="13283"] 4827:amf/amfd/ndfsm.cc:65 TR queue_size before processing: 2
<143>1 2018-07-10T13:40:16.463162+10:00 SC-2-1 osafamfd 4827 osafamfd [meta
sequenceId="13284"] 4827:amf/amfd/ndfsm.cc:73 TR rcv_evt: 12
<143>1 2018-07-10T13:40:16.463175+10:00 SC-2-1 osafamfd 4827 osafamfd [meta
sequenceId="13285"] 4827:amf/amfd/ndfsm.cc:79 TR msg_type: 31
<143>1 2018-07-10T13:40:16.463203+10:00 SC-2-1 osafamfd 4827 osafamfd [meta
sequenceId="13286"] 4827:amf/amfd/siass.cc:1132 >> avd_susi_recreate: msg_id: 0
node_id: 131855 num_sisu: 3
<139>1 2018-07-10T13:40:16.463224+10:00 SC-2-1 osafamfd 4827 osafamfd [meta
sequenceId="13287"] 4827:amf/amfd/siass.cc:1140 ER Node 2030f has left the cluster
<143>1 2018-07-10T13:40:16.463258+10:00 SC-2-1 osafamfd 4827 osafamfd [meta
sequenceId="13288"] 4827:amf/amfd/siass.cc:0 << avd_susi_recreate
<143>1 2018-07-10T13:40:16.463273+10:00 SC-2-1 osafamfd 4827 osafamfd [meta
sequenceId="13289"] 4827:amf/common/d2nmsg.c:931 >> avsv_free_n2d_nd_sisu_state_info
<143>1 2018-07-10T13:40:16.463278+10:00 SC-2-1 osafamfd 4827 osafamfd [meta
sequenceId="13290"] 4827:amf/common/d2nmsg.c:953 TR 3 SISU records to free
Best Regards,
Thuan
-----Original Message-----
From: Minh Hon Chau <minh.c...@dektech.com.au>
Sent: Monday, July 9, 2018 4:04 PM
To: Tran Thuan <thuan.t...@dektech.com.au>; nagen...@hasolutions.in;
hans.nordeb...@ericsson.com; gary....@dektech.com.au
Cc: opensaf-devel@lists.sourceforge.net
Subject: Re: [PATCH 1/1] amf: amfd should reset msg_id counter to avoid message
ID mismatch [#2891]
Hi,
How about @nodes_exit_cnt? The active amfd checkpoints @nodes_exit_cnt
to the standby amfd inside the mds down event. You can say it is not used,
but that is what happens to this counter.
"AMFND not yet up", what you mean is node_state AVD_AVND_STATE_ABSENT?
You reset counter if node state as AVD_AVND_STATE_ABSENT in clm callback.
If a payload comes back from headless, it syncs the assignments to the active
amfd first and then enters the node_up sequence. This node_up sequence can take
up to 10 seconds before the payload becomes AVD_AVND_STATE_PRESENT.
If this issue happens during those 10 seconds, the node_state is still
AVD_AVND_STATE_ABSENT and the active amfd does not process the mds down event
properly, so how could amfd remove the synced assignments?
As mentioned, the problem is not that the active amfd misses resetting 1, 2 or
3 counters; it fails to process the mds down event properly.
Are you able to reproduce this problem?
Thanks
Minh
On 09/07/18 16:33, Tran Thuan wrote:
Hi Minh,
Since AMFND was not up before, the other fields have not been updated yet.
I think only these 3 fields change in this scenario, which is why my patch
resets only them.
Do you agree?
Best Regards,
Thuan
-----Original Message-----
From: Minh Hon Chau <minh.c...@dektech.com.au>
Sent: Monday, July 9, 2018 1:22 PM
To: Tran Thuan <thuan.t...@dektech.com.au>; nagen...@hasolutions.in;
hans.nordeb...@ericsson.com; gary....@dektech.com.au
Cc: opensaf-devel@lists.sourceforge.net
Subject: Re: [PATCH 1/1] amf: amfd should reset msg_id counter to
avoid message ID mismatch [#2891]
Hi,
Ok, so we can ignore the "AMFND already up" case.
"- AMFND not yet up before (node reboot up/first up): this is our issue.
+ current only delete node without care msg_id counter."
Are you able to reproduce this problem?
You only reset the msg_id counters of a node; how about the others:
adest, recvr_fail_sw, node_info.initialViewNumber, nodes_exit_cnt, ...?
They are all checked in other places.
they all are checked in the other places.
Also, when the node goes down, the standby resets all of the above variables
directly, while the active does not (as in this issue) or only resets the msg_id
counters with your patch. The active and standby may end up inconsistent.
Another thing you may consider: if any attributes/variables of AVD_AVND are
going to be added or removed, we then only have to look at one place - the MDS
down event - and do not have to look at or reset them in the CLM callback as well.
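A rough sketch of that single-place idea (the field names below are taken from
the variables mentioned in this thread, but the real AVD_AVND layout and types
may differ, and cluster-wide items such as @nodes_exit_cnt would also be owned
by the same MDS-down path):

// Illustrative sketch only, not the actual amfd code.
#include <cstdint>

struct NodeInfo {
  bool member = false;
  uint64_t initialViewNumber = 0;
};

struct AVD_AVND {
  uint32_t rcv_msg_id = 0;
  uint32_t snd_msg_id = 0;
  uint64_t adest = 0;          // MDS address, cleared when the director is gone
  bool recvr_fail_sw = false;
  NodeInfo node_info;
};

// Hypothetical helper: everything that must be cleared when an AMF node
// director disappears lives in one place, so a new field only has to be
// added here and not also in the CLM callback.
void reset_node_runtime_state(AVD_AVND* node) {
  node->rcv_msg_id = 0;
  node->snd_msg_id = 0;
  node->adest = 0;
  node->recvr_fail_sw = false;
  node->node_info.member = false;
  node->node_info.initialViewNumber = 0;
}

int main() {
  AVD_AVND node;
  node.snd_msg_id = 42;
  reset_node_runtime_state(&node);  // both event orderings end up here
  return node.snd_msg_id;           // 0 after the reset
}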
Thanks
Minh
On 09/07/18 12:12, Tran Thuan wrote:
Hi Minh,
OK, so the CLM callback can come either before or after AMFND down.
In case AMFND down comes first and then the CLM callback: the current flow works correctly.
In case the CLM callback comes first and then AMFND down:
- AMFND already up before: the current flow will do clm_node_exit_complete() and
still works correctly.
- AMFND not yet up before (node reboots up / first up): this is our issue.
  + currently we only delete the node without taking care of the msg_id counters.
Best Regards,
Thuan
-----Original Message-----
From: Minh Hon Chau <minh.c...@dektech.com.au>
Sent: Monday, July 9, 2018 8:52 AM
To: Tran Thuan <thuan.t...@dektech.com.au>; nagen...@hasolutions.in;
hans.nordeb...@ericsson.com; gary....@dektech.com.au
Cc: opensaf-devel@lists.sourceforge.net
Subject: Re: [PATCH 1/1] amf: amfd should reset msg_id counter to
avoid message ID mismatch [#2891]
This problem can possibly happen in a normal situation as well, not only in the
"AMFND not yet up" case, so seeing it as a race also fixes the problem for the
"AMFND already up" case.
On 09/07/18 11:41, Minh Hon Chau wrote:
Hi Thuan,
I often see the mds down coming before the clm callback in my tests; in the
problem in this ticket those events come in the reverse order.
<143>1 2018-07-09T11:19:22.185279+10:00 SC-1 osafamfd 372 osafamfd
[meta sequenceId="23028"] 372:amf/amfd/ndfsm.cc:772 >>
avd_mds_avnd_down_evh: 2030f, 0x55895024e410
<143>1 2018-07-09T11:19:22.185303+10:00 SC-1 osafamfd 372 osafamfd
[meta sequenceId="23029"] 372:amf/amfd/ndproc.cc:1219 >>
avd_node_failover: 'safAmfNode=PL-3,safAmfCluster=myAmfCluster'
<143>1 2018-07-09T11:19:22.185324+10:00 SC-1 osafamfd 372 osafamfd
[meta sequenceId="23030"] 372:amf/amfd/ndfsm.cc:1111 >>
avd_node_mark_absent ...
<143>1 2018-07-09T11:19:22.253699+10:00 SC-1 osafamfd 372 osafamfd
[meta sequenceId="23890"] 372:amf/amfd/ndfsm.cc:821 <<
avd_mds_avnd_down_evh
<143>1 2018-07-09T11:19:22.272188+10:00 SC-1 osafamfd 372 osafamfd
[meta sequenceId="23917"] 372:clm/agent/clma_api.cc:786 >>
saClmDispatch ...
<143>1 2018-07-09T11:19:22.272268+10:00 SC-1 osafamfd 372 osafamfd
[meta sequenceId="23926"] 372:amf/amfd/clm.cc:223 >> clm_track_cb: '0'
'4' '1'
<143>1 2018-07-09T11:19:22.272277+10:00 SC-1 osafamfd 372 osafamfd
[meta sequenceId="23927"] 372:amf/amfd/clm.cc:239 TR
numberOfMembers:'4', numberOfItems:'1'
<143>1 2018-07-09T11:19:22.272285+10:00 SC-1 osafamfd 372 osafamfd
[meta sequenceId="23928"] 372:amf/amfd/clm.cc:245 TR i = 0,
node:'safNode=PL-3,safCluster=myClmCluster', clusterChange:3
<142>1 2018-07-09T11:19:22.272294+10:00 SC-1 osafamfd 372 osafamfd
[meta sequenceId="23929"] 372:amf/amfd/clm.cc:320 IN clm_track_cb:
CLM node 'safNode=PL-3,safCluster=myClmCluster' is not an AMF
cluster member; MDS down received
<143>1 2018-07-09T11:19:22.272303+10:00 SC-1 osafamfd 372 osafamfd
[meta sequenceId="23930"] 372:amf/amfd/clm.cc:443 << clm_track_cb ...
<143>1 2018-07-09T11:19:22.272352+10:00 SC-1 osafamfd 372 osafamfd
[meta sequenceId="23936"] 372:clm/agent/clma_api.cc:834 <<
saClmDispatch
Resetting the counters can avoid the message ID mismatch, but I'm not sure the
issue is completely fixed, since the mds down event is still not handled properly.
If the mds down event is handled properly in both cases, will this msg
mismatch problem still occur?
Thanks,
Minh
On 09/07/18 11:11, Tran Thuan wrote:
Hi Minh,
If I understand correctly, the CLM callback always comes before the AMFND
down/up event.
The problem is just that AMFND is not yet up when the CLM callback sees the
MDS down of the node and deletes the node.
After that, the AMFND down event comes and cannot find the node; not finding
the node is expected, I guess.
So we only need to reset the counters to make things right. What do you
think?
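For what it's worth, a small sketch of why the stale counter leads to the
mismatch (simplified, not the real AMFD/AMFND protocol code): after the node
reboots, AMFND expects the director's message IDs to start from 1 again, while
a snd_msg_id that was never reset on the director side keeps counting from its
old value.

// Simplified sketch of the message ID check, names are illustrative only.
#include <cstdint>
#include <iostream>

struct Director     { uint32_t snd_msg_id = 0; };  // kept per node on the AMFD side
struct NodeDirector { uint32_t rcv_msg_id = 0; };  // restarts from 0 after reboot

bool deliver(Director& d, NodeDirector& nd) {
  uint32_t id = ++d.snd_msg_id;               // director stamps the next ID
  if (id != nd.rcv_msg_id + 1) return false;  // mismatch -> AMFND reboots its node
  nd.rcv_msg_id = id;
  return true;
}

int main() {
  Director amfd;
  amfd.snd_msg_id = 57;     // left over from before the node went down, never reset
  NodeDirector amfnd;       // fresh after reboot, expects ID 1
  std::cout << std::boolalpha << deliver(amfd, amfnd) << '\n';  // false: mismatch

  amfd.snd_msg_id = 0;      // with the reset, the sequence lines up again
  std::cout << deliver(amfd, amfnd) << '\n';                    // true
}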
<143>1 2018-07-03T18:40:27.239707+10:00 SC-2-1 osafamfd 4815
osafamfd [meta sequenceId="2448"] 4815:amf/amfd/clm.cc:223 >> clm_track_cb:
'0' '4' '1'
<142>1 2018-07-03T18:40:27.239778+10:00 SC-2-1 osafamfd 4815
osafamfd [meta sequenceId="2451"] 4815:amf/amfd/clm.cc:299 IN clm_track_cb:
CLM node 'safNode=PL-2-3,safCluster=myClmCluster' is not an AMF
cluster member; MDS down received
<143>1 2018-07-03T18:40:27.239792+10:00 SC-2-1 osafamfd 4815
osafamfd [meta sequenceId="2452"] 4815:amf/amfd/clm.cc:418 <<
clm_track_cb
<143>1 2018-07-03T18:40:27.243287+10:00 SC-2-1 osafamfd 4815
osafamfd [meta sequenceId="2465"] 4815:amf/amfd/ndfsm.cc:772 >>
avd_mds_avnd_down_evh: 2030f, (nil)
<143>1 2018-07-03T18:40:27.243307+10:00 SC-2-1 osafamfd 4815
osafamfd [meta sequenceId="2466"] 4815:amf/amfd/ndfsm.cc:821 <<
avd_mds_avnd_down_evh
<143>1 2018-07-03T18:41:37.965003+10:00 SC-2-1 osafamfd 4815
osafamfd [meta sequenceId="2596"] 4815:amf/amfd/clm.cc:223 >> clm_track_cb:
'0' '4' '1'
<143>1 2018-07-03T18:41:37.965104+10:00 SC-2-1 osafamfd 4815
osafamfd [meta sequenceId="2605"] 4815:amf/amfd/clm.cc:418 <<
clm_track_cb
<143>1 2018-07-03T18:41:45.016379+10:00 SC-2-1 osafamfd 4815
osafamfd [meta sequenceId="2631"] 4815:amf/amfd/ndfsm.cc:750 TR
amfnd on 2030f is up
<143>1 2018-07-03T18:41:45.139262+10:00 SC-2-1 osafamfd 4815
osafamfd [meta sequenceId="2686"] 4815:amf/amfd/ndfsm.cc:772 >>
avd_mds_avnd_down_evh: 2030f, 0x7ff9ac3dc940
<143>1 2018-07-03T18:41:45.141883+10:00 SC-2-1 osafamfd 4815
osafamfd [meta sequenceId="3008"] 4815:amf/amfd/ndfsm.cc:821 <<
avd_mds_avnd_down_evh
Best Regards,
Thuan
-----Original Message-----
From: Minh Hon Chau <minh.c...@dektech.com.au>
Sent: Monday, July 9, 2018 7:15 AM
To: thuan.tran <thuan.t...@dektech.com.au>;
nagen...@hasolutions.in; hans.nordeb...@ericsson.com;
gary....@dektech.com.au
Cc: opensaf-devel@lists.sourceforge.net
Subject: Re: [PATCH 1/1] amf: amfd should reset msg_id counter to
avoid message ID mismatch [#2891]
Hi,
In a normal situation, when I stop OpenSAF or reboot the node, I often get MDS
down first, then the CLM track callback. [1]
The problem in this ticket is that AMFD gets the CLM track callback before MDS
down. [2]
We need to see it as a race condition, not as a matter of missing counter
resets, so that amfd's function that deals with MDS down still gets called.
That would be equivalent to [1], which currently works.
With this patch, avd_mds_avnd_down_evh() is still not called; that function
processes many things in the mds down event (e.g. @nodes_exit_cnt, ...).
I think node_info.member should be set to TRUE or FALSE in the CLM callback
event, since it represents CLM membership. We can use both @member and one of
the variables indicating MDS down to trap each other in the MDS down and CLM
callback events, so that in the end avd_mds_avnd_down_evh() runs and the
nodeid is removed from node_id_db in both [1] and [2], as sketched below.
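A rough sketch of that "trap each other" idea (names are illustrative, not the
actual amfd code): whichever of the two events arrives second triggers the full
MDS-down handling, so both orderings [1] and [2] end up in the same cleanup path.

// Illustrative sketch only.
#include <iostream>

struct AVD_AVND {
  bool clm_member = true;     // stands in for node_info.member
  bool mds_down_seen = false; // stands in for a variable set on MDS down
};

// Placeholder for avd_mds_avnd_down_evh(): full cleanup, counter resets,
// removal of the nodeid from node_id_db.
void full_mds_down_handling(AVD_AVND* node) {
  std::cout << "full MDS-down handling runs once\n";
}

void on_mds_down(AVD_AVND* node) {
  node->mds_down_seen = true;
  if (!node->clm_member)        // CLM callback already saw the node leave
    full_mds_down_handling(node);
}

void on_clm_node_left(AVD_AVND* node) {
  node->clm_member = false;
  if (node->mds_down_seen)      // MDS down already arrived
    full_mds_down_handling(node);
}

int main() {
  AVD_AVND a;                   // ordering [1]: MDS down first, then CLM callback
  on_mds_down(&a);
  on_clm_node_left(&a);

  AVD_AVND b;                   // ordering [2]: CLM callback first, then MDS down
  on_clm_node_left(&b);
  on_mds_down(&b);
}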
Thanks,
Minh
On 06/07/18 20:36, thuan.tran wrote:
There is a case where AMFD sends a reboot order because a node is "out of
the sync window".
AMFD receives the CLM track callback while the node is not a member yet and
deletes the node.
Later, the AMFND MDS down event will not reset the msg_id counters since it
cannot find the node.
When the node reboots up, AMFD keeps using the current msg_id counter when
sending to AMFND, causing a message ID mismatch in AMFND, and AMFND then
orders a reboot of its own node.
---
src/amf/amfd/clm.cc | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/src/amf/amfd/clm.cc b/src/amf/amfd/clm.cc
index e113a65f9..4a15d5ad7 100644
--- a/src/amf/amfd/clm.cc
+++ b/src/amf/amfd/clm.cc
@@ -316,9 +316,14 @@ static void clm_track_cb(
              __FUNCTION__, node_name.c_str());
       goto done;
     } else if (node->node_state == AVD_AVND_STATE_ABSENT) {
-      LOG_IN("%s: CLM node '%s' is not an AMF cluster member; MDS down received",
+      LOG_IN("%s: CLM node '%s' is ABSENT; MDS down received",
              __FUNCTION__, node_name.c_str());
       avd_node_delete_nodeid(node);
+      /* Reset msg_id because AVND MDS down may come later and cannot find
+         node to reset these, cause message ID mismatch. */
+      node->rcv_msg_id = 0;
+      node->snd_msg_id = 0;
+      node->node_info.member = SA_FALSE;
       goto done;
     }
     TRACE(" Node Left: rootCauseEntity %s for node %u",