Hi again,
I am not sure how you replicated this scenario. It looks to me like you
deliberately rebooted PL-3 before amfd processed the synced assignments.
The node_up sequence has many branches; you are replicating only one of them.
What I mean is that at the time amfd calls avd_process_state_info_queue(),
the PL has not joined the AMF cluster yet and keeps resending the node_up
message every second.
When amfd finishes creating the synced assignments, the node sync window is
still open and cb->init_state has not moved to APP_STATE, so the node_up
from the PL is still acceptable. At this point a PL can be rebooted by
someone.
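For illustration, a minimal sketch of that window check (the enum values and
field names here are assumed for illustration, this is not the actual amfd
code): a node_up stays acceptable while the node-sync timer runs and
init_state has not yet moved to the application state.

// Minimal sketch only, names assumed for illustration.
#include <iostream>

enum InitState { AVD_INIT_DONE, AVD_APP_STATE };  // hypothetical values

struct Cb {
  InitState init_state;
  bool node_sync_timer_active;
};

// True while the node-sync window is still open and a node_up can be accepted.
bool node_up_acceptable(const Cb& cb) {
  return cb.node_sync_timer_active && cb.init_state != AVD_APP_STATE;
}

int main() {
  Cb cb{AVD_INIT_DONE, true};
  std::cout << std::boolalpha << node_up_acceptable(cb) << '\n';  // true: window open
  cb.init_state = AVD_APP_STATE;
  cb.node_sync_timer_active = false;
  std::cout << node_up_acceptable(cb) << '\n';                    // false: window closed
}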
For your reference, I have attached the full amfd trace to the ticket. I had
to turn off PL-3 to keep the node sync window open for 10 seconds, and then
I rebooted PL-4. Please note that this node sync timeout scenario did happen
on the application side.
Resetting the counters in the CLM callback is the easy way, but it does not
fully fix the issue and does not make us safe from it.
Thanks
Minh
On 10/07/18 14:00, Tran Thuan wrote:
Hi Minh,
I checked your concern about the PL going down before sending node_up but after
the assignments are synced (recovery after headless).
The current code still works correctly, because the active AMFD does not
recreate the assignments.
So my point of view stays the same: just reset the 3 fields as in the current patch.
Jul 10 13:40:05 PL-2-3 osafamfnd[5971]: NO Synced
SISU:safSi=NoRed2,safApp=OpenSAF,safSu=PL-3,safSg=NoRed,safApp=OpenSAF <1, 3>
Jul 10 13:40:05 PL-2-3 osafamfnd[5971]: NO Synced
SU:safSu=PL-3,safSg=NoRed,safApp=OpenSAF <0, 1, 3>
Jul 10 13:40:05 PL-2-3 osafamfnd[5971]: NO 8 CSICOMP states sent
Jul 10 13:40:05 PL-2-3 osafamfnd[5971]: NO 8 COMP states sent
Jul 10 13:40:05 PL-2-3 osafamfnd[5971]: NO Sending node up due to
NCSMDS_NEW_ACTIVE
Jul 10 13:40:05 PL-2-3 osafamfnd[5971]: NO ====================delay to power
off this node ====================
Jul 10 13:40:05 PL-2-3 osafckptnd[5984]: NO CLM selection object was updated.
(17)
Jul 10 13:40:06 PL-2-3 osafamfnd[5971]: NO ====================delay to power
off this node ====================
Jul 10 13:40:07 PL-2-3 osafamfnd[5971]: NO ====================delay to power
off this node ====================
Jul 10 13:40:08 PL-2-3 osafamfnd[5971]: NO ====================delay to power
off this node ====================
<143>1 2018-07-10T13:40:15.365252+10:00 SC-2-1 osafamfd 4827 osafamfd [meta
sequenceId="13233"] 4827:amf/amfd/ndfsm.cc:296 >> avd_node_up_evh: from 2010f,
safAmfNode=SC-1,safAmfCluster=myAmfCluster
<143>1 2018-07-10T13:40:15.365259+10:00 SC-2-1 osafamfd 4827 osafamfd [meta
sequenceId="13234"] 4827:amf/amfd/node.cc:231 >> avd_node_get:
safAmfNode=SC-1,safAmfCluster=myAmfCluster
<143>1 2018-07-10T13:40:15.365263+10:00 SC-2-1 osafamfd 4827 osafamfd [meta
sequenceId="13235"] 4827:amf/amfd/node.cc:232 << avd_node_get
<143>1 2018-07-10T13:40:15.365314+10:00 SC-2-1 osafamfd 4827 osafamfd [meta
sequenceId="13236"] 4827:amf/amfd/ndfsm.cc:216 >> avd_count_node_up
<143>1 2018-07-10T13:40:15.36532+10:00 SC-2-1 osafamfd 4827 osafamfd [meta
sequenceId="13237"] 4827:amf/amfd/ndfsm.cc:226 TR Number of node director(s) that
director received node_up msg:0
<143>1 2018-07-10T13:40:15.365325+10:00 SC-2-1 osafamfd 4827 osafamfd [meta
sequenceId="13238"] 4827:amf/amfd/ndfsm.cc:228 << avd_count_node_up
<143>1 2018-07-10T13:40:15.365329+10:00 SC-2-1 osafamfd 4827 osafamfd [meta
sequenceId="13239"] 4827:amf/amfd/ndfsm.cc:353 TR NodeSync timer is active, ignore
this node_up msg (nodeid:2010f)
<143>1 2018-07-10T13:40:15.365334+10:00 SC-2-1 osafamfd 4827 osafamfd [meta
sequenceId="13240"] 4827:amf/amfd/ndfsm.cc:534 << avd_node_up_evh
<143>1 2018-07-10T13:40:16.462939+10:00 SC-2-1 osafamfd 4827 osafamfd [meta
sequenceId="13280"] 4827:amf/amfd/ndfsm.cc:296 >> avd_node_up_evh: from 2010f,
safAmfNode=SC-1,safAmfCluster=myAmfCluster
<141>1 2018-07-10T13:40:16.462975+10:00 SC-2-1 osafamfd 4827 osafamfd [meta
sequenceId="13281"] 4827:amf/amfd/ndfsm.cc:254 NO Received node_up from 2010f: msg_id
1
<143>1 2018-07-10T13:40:16.462981+10:00 SC-2-1 osafamfd 4827 osafamfd [meta
sequenceId="13282"] 4827:amf/amfd/ndfsm.cc:63 >> avd_process_state_info_queue
<143>1 2018-07-10T13:40:16.462986+10:00 SC-2-1 osafamfd 4827 osafamfd [meta
sequenceId="13283"] 4827:amf/amfd/ndfsm.cc:65 TR queue_size before processing: 2
<143>1 2018-07-10T13:40:16.463162+10:00 SC-2-1 osafamfd 4827 osafamfd [meta
sequenceId="13284"] 4827:amf/amfd/ndfsm.cc:73 TR rcv_evt: 12
<143>1 2018-07-10T13:40:16.463175+10:00 SC-2-1 osafamfd 4827 osafamfd [meta
sequenceId="13285"] 4827:amf/amfd/ndfsm.cc:79 TR msg_type: 31
<143>1 2018-07-10T13:40:16.463203+10:00 SC-2-1 osafamfd 4827 osafamfd [meta
sequenceId="13286"] 4827:amf/amfd/siass.cc:1132 >> avd_susi_recreate: msg_id: 0
node_id: 131855 num_sisu: 3
<139>1 2018-07-10T13:40:16.463224+10:00 SC-2-1 osafamfd 4827 osafamfd [meta
sequenceId="13287"] 4827:amf/amfd/siass.cc:1140 ER Node 2030f has left the cluster
<143>1 2018-07-10T13:40:16.463258+10:00 SC-2-1 osafamfd 4827 osafamfd [meta
sequenceId="13288"] 4827:amf/amfd/siass.cc:0 << avd_susi_recreate
<143>1 2018-07-10T13:40:16.463273+10:00 SC-2-1 osafamfd 4827 osafamfd [meta
sequenceId="13289"] 4827:amf/common/d2nmsg.c:931 >> avsv_free_n2d_nd_sisu_state_info
<143>1 2018-07-10T13:40:16.463278+10:00 SC-2-1 osafamfd 4827 osafamfd [meta
sequenceId="13290"] 4827:amf/common/d2nmsg.c:953 TR 3 SISU records to free
Best Regards,
Thuan
-----Original Message-----
From: Minh Hon Chau <minh.c...@dektech.com.au>
Sent: Monday, July 9, 2018 4:04 PM
To: Tran Thuan <thuan.t...@dektech.com.au>; nagen...@hasolutions.in;
hans.nordeb...@ericsson.com; gary....@dektech.com.au
Cc: opensaf-devel@lists.sourceforge.net
Subject: Re: [PATCH 1/1] amf: amfd should reset msg_id counter to avoid message
ID mismatch [#2891]
Hi,
How about @nodes_exit_cnt? The active amfd checkpoints @nodes_exit_cnt
to the standby amfd inside the mds down event. You can say it is not used,
but that is what happens to this counter.
"AMFND not yet up", what you mean is node_state AVD_AVND_STATE_ABSENT?
You reset counter if node state as AVD_AVND_STATE_ABSENT in clm callback.
If a payload comes back from headless, it syncs the assignments to the active
amfd first and then enters the node_up sequence. This node_up sequence can take
up to 10 seconds before the payload becomes AVD_AVND_STATE_PRESENT.
If this issue happens during those 10 seconds, the node_state is still
AVD_AVND_STATE_ABSENT and the active amfd does not process the mds down event
properly, so how could amfd remove the synced assignments?
As mentioned, the problem is not that the active amfd misses resetting 1, 2 or
3 counters; it fails to process the mds down event properly.
Are you able to reproduce this problem?
Thanks
Minh
On 09/07/18 16:33, Tran Thuan wrote:
Hi Minh,
Since AMFND was not up before, the other fields have not been updated yet.
I think only these 3 fields change in this scenario, which is why my patch
resets only them.
Do you agree?
Best Regards,
Thuan
-----Original Message-----
From: Minh Hon Chau <minh.c...@dektech.com.au>
Sent: Monday, July 9, 2018 1:22 PM
To: Tran Thuan <thuan.t...@dektech.com.au>; nagen...@hasolutions.in;
hans.nordeb...@ericsson.com; gary....@dektech.com.au
Cc: opensaf-devel@lists.sourceforge.net
Subject: Re: [PATCH 1/1] amf: amfd should reset msg_id counter to
avoid message ID mismatch [#2891]
Hi,
Ok, so we can ignore the "AMFND already up" case.
"- AMFND not yet up before (node reboot up/first up): this is our issue.
+ current only delete node without care msg_id counter."
Are you able to reproduce this problem?
You only reset the msg_id counters of a node; how about the others:
adest, recvr_fail_sw, node_info.initialViewNumber, nodes_exit_cnt, ...?
They are all checked in other places.
they all are checked in the other places.
Also, when the node goes down, the standby resets all of the above variables
directly, while the active does not (as in this issue) or only resets the msg_id
counters with your patch. The active and standby may end up inconsistent.
Another thing you may consider: if any attributes/variables of AVD_AVND are
going to be added or removed, we then only have to look at one place - the MDS
down event - and do not have to look at or reset them in the CLM callback as well.
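A rough sketch of that single-place idea (the field names below are taken from
the variables mentioned in this thread, but the real AVD_AVND layout and types
may differ, and cluster-wide items such as @nodes_exit_cnt would also be owned
by the same MDS-down path):

// Illustrative sketch only, not the actual amfd code.
#include <cstdint>

struct NodeInfo {
  bool member = false;
  uint64_t initialViewNumber = 0;
};

struct AVD_AVND {
  uint32_t rcv_msg_id = 0;
  uint32_t snd_msg_id = 0;
  uint64_t adest = 0;          // MDS address, cleared when the director is gone
  bool recvr_fail_sw = false;
  NodeInfo node_info;
};

// Hypothetical helper: everything that must be cleared when an AMF node
// director disappears lives in one place, so a new field only has to be
// added here and not also in the CLM callback.
void reset_node_runtime_state(AVD_AVND* node) {
  node->rcv_msg_id = 0;
  node->snd_msg_id = 0;
  node->adest = 0;
  node->recvr_fail_sw = false;
  node->node_info.member = false;
  node->node_info.initialViewNumber = 0;
}

int main() {
  AVD_AVND node;
  node.snd_msg_id = 42;
  reset_node_runtime_state(&node);  // both event orderings end up here
  return node.snd_msg_id;           // 0 after the reset
}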
Thanks
Minh
On 09/07/18 12:12, Tran Thuan wrote:
Hi Minh,
OK, so the CLM callback can come either before or after AMFND down.
In case AMFND down comes first and then the CLM callback: the current flow works correctly.
In case the CLM callback comes first and then AMFND down:
- AMFND already up before: the current flow will do clm_node_exit_complete() and
still works correctly.
- AMFND not yet up before (node reboots up / first up): this is our issue.
  + currently we only delete the node without taking care of the msg_id counters.
Best Regards,
Thuan
-----Original Message-----
From: Minh Hon Chau <minh.c...@dektech.com.au>
Sent: Monday, July 9, 2018 8:52 AM
To: Tran Thuan <thuan.t...@dektech.com.au>; nagen...@hasolutions.in;
hans.nordeb...@ericsson.com; gary....@dektech.com.au
Cc: opensaf-devel@lists.sourceforge.net
Subject: Re: [PATCH 1/1] amf: amfd should reset msg_id counter to
avoid message ID mismatch [#2891]
This problem can possibly happen in a normal situation as well, not only in the
"AMFND not yet up" case, so seeing it as a race also fixes the problem for the
"AMFND already up" case.
On 09/07/18 11:41, Minh Hon Chau wrote:
Hi Thuan,
I often see the mds down coming before the clm callback in my tests; in the
problem in this ticket those events come in the reverse order.
<143>1 2018-07-09T11:19:22.185279+10:00 SC-1 osafamfd 372 osafamfd
[meta sequenceId="23028"] 372:amf/amfd/ndfsm.cc:772 >>
avd_mds_avnd_down_evh: 2030f, 0x55895024e410
<143>1 2018-07-09T11:19:22.185303+10:00 SC-1 osafamfd 372 osafamfd
[meta sequenceId="23029"] 372:amf/amfd/ndproc.cc:1219 >>
avd_node_failover: 'safAmfNode=PL-3,safAmfCluster=myAmfCluster'
<143>1 2018-07-09T11:19:22.185324+10:00 SC-1 osafamfd 372 osafamfd
[meta sequenceId="23030"] 372:amf/amfd/ndfsm.cc:1111 >>
avd_node_mark_absent ...
<143>1 2018-07-09T11:19:22.253699+10:00 SC-1 osafamfd 372 osafamfd
[meta sequenceId="23890"] 372:amf/amfd/ndfsm.cc:821 <<
avd_mds_avnd_down_evh
<143>1 2018-07-09T11:19:22.272188+10:00 SC-1 osafamfd 372 osafamfd
[meta sequenceId="23917"] 372:clm/agent/clma_api.cc:786 >>
saClmDispatch ...
<143>1 2018-07-09T11:19:22.272268+10:00 SC-1 osafamfd 372 osafamfd
[meta sequenceId="23926"] 372:amf/amfd/clm.cc:223 >> clm_track_cb: '0'
'4' '1'
<143>1 2018-07-09T11:19:22.272277+10:00 SC-1 osafamfd 372 osafamfd
[meta sequenceId="23927"] 372:amf/amfd/clm.cc:239 TR
numberOfMembers:'4', numberOfItems:'1'
<143>1 2018-07-09T11:19:22.272285+10:00 SC-1 osafamfd 372 osafamfd
[meta sequenceId="23928"] 372:amf/amfd/clm.cc:245 TR i = 0,
node:'safNode=PL-3,safCluster=myClmCluster', clusterChange:3
<142>1 2018-07-09T11:19:22.272294+10:00 SC-1 osafamfd 372 osafamfd
[meta sequenceId="23929"] 372:amf/amfd/clm.cc:320 IN clm_track_cb:
CLM node 'safNode=PL-3,safCluster=myClmCluster' is not an AMF
cluster member; MDS down received
<143>1 2018-07-09T11:19:22.272303+10:00 SC-1 osafamfd 372 osafamfd
[meta sequenceId="23930"] 372:amf/amfd/clm.cc:443 << clm_track_cb ...
<143>1 2018-07-09T11:19:22.272352+10:00 SC-1 osafamfd 372 osafamfd
[meta sequenceId="23936"] 372:clm/agent/clma_api.cc:834 <<
saClmDispatch
Resetting the counters can avoid the message ID mismatch, but I'm not sure the
issue is completely fixed, since the mds down event is still not handled properly.
If the mds down event is handled properly in both cases, will this msg
mismatch problem still occur?
Thanks,
Minh
On 09/07/18 11:11, Tran Thuan wrote:
Hi Minh,
If I understand correctly, the CLM callback always comes before the AMFND
down/up event.
The problem is just that AMFND is not yet up when the CLM callback sees the
MDS down of the node and deletes the node.
After that, the AMFND down event comes and cannot find the node; not finding
the node is expected, I guess.
So we only need to reset the counters to make things right. What do you
think?
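For what it's worth, a small sketch of why the stale counter leads to the
mismatch (simplified, not the real AMFD/AMFND protocol code): after the node
reboots, AMFND expects the director's message IDs to start from 1 again, while
a snd_msg_id that was never reset on the director side keeps counting from its
old value.

// Simplified sketch of the message ID check, names are illustrative only.
#include <cstdint>
#include <iostream>

struct Director     { uint32_t snd_msg_id = 0; };  // kept per node on the AMFD side
struct NodeDirector { uint32_t rcv_msg_id = 0; };  // restarts from 0 after reboot

bool deliver(Director& d, NodeDirector& nd) {
  uint32_t id = ++d.snd_msg_id;               // director stamps the next ID
  if (id != nd.rcv_msg_id + 1) return false;  // mismatch -> AMFND reboots its node
  nd.rcv_msg_id = id;
  return true;
}

int main() {
  Director amfd;
  amfd.snd_msg_id = 57;     // left over from before the node went down, never reset
  NodeDirector amfnd;       // fresh after reboot, expects ID 1
  std::cout << std::boolalpha << deliver(amfd, amfnd) << '\n';  // false: mismatch

  amfd.snd_msg_id = 0;      // with the reset, the sequence lines up again
  std::cout << deliver(amfd, amfnd) << '\n';                    // true
}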
<143>1 2018-07-03T18:40:27.239707+10:00 SC-2-1 osafamfd 4815
osafamfd [meta sequenceId="2448"] 4815:amf/amfd/clm.cc:223 >> clm_track_cb:
'0' '4' '1'
<142>1 2018-07-03T18:40:27.239778+10:00 SC-2-1 osafamfd 4815
osafamfd [meta sequenceId="2451"] 4815:amf/amfd/clm.cc:299 IN clm_track_cb:
CLM node 'safNode=PL-2-3,safCluster=myClmCluster' is not an AMF
cluster member; MDS down received
<143>1 2018-07-03T18:40:27.239792+10:00 SC-2-1 osafamfd 4815
osafamfd [meta sequenceId="2452"] 4815:amf/amfd/clm.cc:418 <<
clm_track_cb
<143>1 2018-07-03T18:40:27.243287+10:00 SC-2-1 osafamfd 4815
osafamfd [meta sequenceId="2465"] 4815:amf/amfd/ndfsm.cc:772 >>
avd_mds_avnd_down_evh: 2030f, (nil)
<143>1 2018-07-03T18:40:27.243307+10:00 SC-2-1 osafamfd 4815
osafamfd [meta sequenceId="2466"] 4815:amf/amfd/ndfsm.cc:821 <<
avd_mds_avnd_down_evh
<143>1 2018-07-03T18:41:37.965003+10:00 SC-2-1 osafamfd 4815
osafamfd [meta sequenceId="2596"] 4815:amf/amfd/clm.cc:223 >> clm_track_cb:
'0' '4' '1'
<143>1 2018-07-03T18:41:37.965104+10:00 SC-2-1 osafamfd 4815
osafamfd [meta sequenceId="2605"] 4815:amf/amfd/clm.cc:418 <<
clm_track_cb
<143>1 2018-07-03T18:41:45.016379+10:00 SC-2-1 osafamfd 4815
osafamfd [meta sequenceId="2631"] 4815:amf/amfd/ndfsm.cc:750 TR
amfnd on 2030f is up
<143>1 2018-07-03T18:41:45.139262+10:00 SC-2-1 osafamfd 4815
osafamfd [meta sequenceId="2686"] 4815:amf/amfd/ndfsm.cc:772 >>
avd_mds_avnd_down_evh: 2030f, 0x7ff9ac3dc940
<143>1 2018-07-03T18:41:45.141883+10:00 SC-2-1 osafamfd 4815
osafamfd [meta sequenceId="3008"] 4815:amf/amfd/ndfsm.cc:821 <<
avd_mds_avnd_down_evh
Best Regards,
Thuan
-----Original Message-----
From: Minh Hon Chau <minh.c...@dektech.com.au>
Sent: Monday, July 9, 2018 7:15 AM
To: thuan.tran <thuan.t...@dektech.com.au>;
nagen...@hasolutions.in; hans.nordeb...@ericsson.com;
gary....@dektech.com.au
Cc: opensaf-devel@lists.sourceforge.net
Subject: Re: [PATCH 1/1] amf: amfd should reset msg_id counter to
avoid message ID mismatch [#2891]
Hi,
In a normal situation, when I stop OpenSAF or reboot the node, I often get MDS
down first, then the CLM track callback. [1]
The problem in this ticket is that AMFD gets the CLM track callback before MDS
down. [2]
We need to see it as a race condition, not as a matter of missing counter
resets, so that amfd's function that deals with MDS down still gets called.
That would be equivalent to [1], which currently works.
With this patch, avd_mds_avnd_down_evh() is still not called; that function
processes many things in the mds down event (e.g. @nodes_exit_cnt, ...).
I think node_info.member should be set to TRUE or FALSE in the CLM callback
event, since it represents CLM membership. We can use both @member and one of
the variables indicating MDS down to trap each other in the MDS down and CLM
callback events, so that in the end avd_mds_avnd_down_evh() runs and the
nodeid is removed from node_id_db in both [1] and [2], as sketched below.
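A rough sketch of that "trap each other" idea (names are illustrative, not the
actual amfd code): whichever of the two events arrives second triggers the full
MDS-down handling, so both orderings [1] and [2] end up in the same cleanup path.

// Illustrative sketch only.
#include <iostream>

struct AVD_AVND {
  bool clm_member = true;     // stands in for node_info.member
  bool mds_down_seen = false; // stands in for a variable set on MDS down
};

// Placeholder for avd_mds_avnd_down_evh(): full cleanup, counter resets,
// removal of the nodeid from node_id_db.
void full_mds_down_handling(AVD_AVND* node) {
  std::cout << "full MDS-down handling runs once\n";
}

void on_mds_down(AVD_AVND* node) {
  node->mds_down_seen = true;
  if (!node->clm_member)        // CLM callback already saw the node leave
    full_mds_down_handling(node);
}

void on_clm_node_left(AVD_AVND* node) {
  node->clm_member = false;
  if (node->mds_down_seen)      // MDS down already arrived
    full_mds_down_handling(node);
}

int main() {
  AVD_AVND a;                   // ordering [1]: MDS down first, then CLM callback
  on_mds_down(&a);
  on_clm_node_left(&a);

  AVD_AVND b;                   // ordering [2]: CLM callback first, then MDS down
  on_clm_node_left(&b);
  on_mds_down(&b);
}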
Thanks,
Minh
On 06/07/18 20:36, thuan.tran wrote:
There is a case where AMFD sends a reboot order because a node is "out of
the sync window".
AMFD receives the CLM track callback while the node is not a member yet and
deletes the node.
Later, the AMFND MDS down event will not reset the msg_id counters since it
cannot find the node.
When the node reboots up, AMFD keeps using the current msg_id counter when
sending to AMFND, causing a message ID mismatch in AMFND, and AMFND then
orders a reboot of its own node.
---
src/amf/amfd/clm.cc | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/src/amf/amfd/clm.cc b/src/amf/amfd/clm.cc
index e113a65f9..4a15d5ad7 100644
--- a/src/amf/amfd/clm.cc
+++ b/src/amf/amfd/clm.cc
@@ -316,9 +316,14 @@ static void clm_track_cb(
              __FUNCTION__, node_name.c_str());
       goto done;
     } else if (node->node_state == AVD_AVND_STATE_ABSENT) {
-      LOG_IN("%s: CLM node '%s' is not an AMF cluster member; MDS down received",
+      LOG_IN("%s: CLM node '%s' is ABSENT; MDS down received",
              __FUNCTION__, node_name.c_str());
       avd_node_delete_nodeid(node);
+      /* Reset msg_id because AVND MDS down may come later and cannot find
+         node to reset these, cause message ID mismatch. */
+      node->rcv_msg_id = 0;
+      node->snd_msg_id = 0;
+      node->node_info.member = SA_FALSE;
       goto done;
     }
     TRACE(" Node Left: rootCauseEntity %s for node %u",