"With this patch, we still don't have avd_mds_avnd_down_evh() to be
called, that function processes many things in the event of mds down.
(ie. @nodes_exit_cnt, ...)"
I mean that in [2] we do receive the MDS down event and
avd_mds_avnd_down_evh() is called, but we fail to handle the node going
down because we cannot find the node by its nodeid.
On 09/07/18 10:14, Minh Hon Chau wrote:
Hi,
In the normal situation, when I stop OpenSAF or reboot the node, I often
get MDS down first and then the CLM track callback. [1]
The problem in this ticket occurs because AMFD receives the CLM track
callback before MDS down. [2]
We need to treat this as a race condition rather than as a missing
counter reset, so that amfd's function for handling MDS down still gets
called. That would be equivalent to [1], which currently works.
With this patch, we still don't have avd_mds_avnd_down_evh() to be
called, that function processes many things in the event of mds down.
(ie. @nodes_exit_cnt, ...)
I think node_info.member should be set to TRUE or FALSE in the CLM
callback, since it represents CLM membership. We can use both @member
and a variable indicating MDS down so that each event checks for the
other: whichever of MDS down and the CLM callback arrives second does
the final cleanup. In the end avd_mds_avnd_down_evh() runs and the
nodeid is removed from node_id_db in both [1] and [2].
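
To make the suggestion concrete, here is a minimal sketch of the two-flag idea. It is not AMFD code: Node, Director, the mds_down flag and all function names are hypothetical stand-ins, and handle_node_down() only models a small part of what avd_mds_avnd_down_evh() does (resetting the message counters and bumping an exit counter).

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical, simplified stand-in for AMFD's per-node state.
struct Node {
  std::uint32_t node_id = 0;
  bool member = false;    // models node_info.member (CLM membership)
  bool mds_down = false;  // assumed flag: MDS down already seen
  std::uint32_t rcv_msg_id = 0;
  std::uint32_t snd_msg_id = 0;
};

// Hypothetical stand-in for the director's bookkeeping.
struct Director {
  std::map<std::uint32_t, Node*> node_id_db;  // models node_id_db
  int nodes_exit_cnt = 0;

  void add_node(Node& n) {
    n.member = true;
    n.mds_down = false;
    node_id_db[n.node_id] = &n;
  }

  // Models a small part of the MDS down handling.
  void handle_node_down(Node& n) {
    n.rcv_msg_id = 0;
    n.snd_msg_id = 0;
    ++nodes_exit_cnt;
  }

  // MDS down: look the node up by nodeid, as the real handler must.
  // Returns false if the node is no longer in the map, which is the
  // failure described in this ticket.
  bool on_mds_down(std::uint32_t node_id) {
    auto it = node_id_db.find(node_id);
    if (it == node_id_db.end()) return false;
    Node& n = *it->second;
    handle_node_down(n);
    n.mds_down = true;
    // CLM callback already seen: this event is the second one, clean up.
    if (!n.member) node_id_db.erase(it);
    return true;
  }

  // CLM track callback reporting that the node left the cluster.
  void on_clm_left(Node& n) {
    n.member = false;
    // Only drop the nodeid if MDS down was already processed; otherwise
    // keep it so that a later MDS down can still find the node.
    if (n.mds_down) node_id_db.erase(n.node_id);
  }
};
```

With this arrangement the nodeid stays in the map until both events have been seen, so the down handling runs exactly once in either ordering. A real change would also need to clear the flag when the node rejoins, which add_node() does here.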
Thanks,
Minh
On 06/07/18 20:36, thuan.tran wrote:
There is a case where AMFD sends a reboot order due to “out of sync
window”.
AMFD then receives the CLM track callback, but the node is not a member
yet, so AMFD deletes the node.
Later, the AMFND MDS down event does not reset the msg_id counters
because the node cannot be found.
When the node comes back up, AMFD continues to use the current msg_id
counter when sending to AMFND. This causes a message ID mismatch in
AMFND, and AMFND then orders a reboot of its own node.
---
src/amf/amfd/clm.cc | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/src/amf/amfd/clm.cc b/src/amf/amfd/clm.cc
index e113a65f9..4a15d5ad7 100644
--- a/src/amf/amfd/clm.cc
+++ b/src/amf/amfd/clm.cc
@@ -316,9 +316,14 @@ static void clm_track_cb(
__FUNCTION__, node_name.c_str());
goto done;
} else if (node->node_state == AVD_AVND_STATE_ABSENT) {
- LOG_IN("%s: CLM node '%s' is not an AMF cluster member; MDS down received",
+ LOG_IN("%s: CLM node '%s' is ABSENT; MDS down received",
__FUNCTION__, node_name.c_str());
avd_node_delete_nodeid(node);
+ /* Reset msg_id because AVND MDS down may come later
+ and cannot find node to reset these, cause message ID mismatch. */
+ node->rcv_msg_id = 0;
+ node->snd_msg_id = 0;
+ node->node_info.member = SA_FALSE;
goto done;
}
TRACE(" Node Left: rootCauseEntity %s for node %u",