On receiving a node_up whose node id is other than the active one, if that node hosts a 2N OpenSAF SU, we can mark it once as to-be-standby and exclude it from the count. But that does not look nice and complicates the code.
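
The idea above could be sketched roughly as follows. This is an illustration only, not code from amfd: the `SketchNode` struct, its `hosts_2n_opensaf_su` and `to_be_standby` fields, and both helper functions are hypothetical names assumed for the example.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical, simplified node record; not the real AVD_AVND.
struct SketchNode {
  uint32_t node_id = 0;
  bool hosts_2n_opensaf_su = false;  // assumption: known from configuration
  bool to_be_standby = false;        // hypothetical flag, set on node_up
  uint32_t node_up_msg_count = 0;
};

// Record a node_up. A non-active node that hosts a 2N OpenSAF SU is
// marked as to-be-standby so it can be skipped when counting later.
void record_node_up(SketchNode &node, uint32_t active_node_id) {
  ++node.node_up_msg_count;
  if (node.node_id != active_node_id && node.hosts_2n_opensaf_su) {
    node.to_be_standby = true;
  }
}

// Count node_up senders, excluding the active SC and any node that was
// marked as to-be-standby.
uint32_t count_node_up_excluding_scs(const std::vector<SketchNode> &nodes,
                                     uint32_t active_node_id) {
  uint32_t received = 0;
  for (const auto &node : nodes) {
    if (node.node_up_msg_count > 0 && node.node_id != active_node_id &&
        !node.to_be_standby) {
      ++received;
    }
  }
  return received;
}
```

The marking does not depend on mds up having been seen for the standby, which is the case the old `node_id_avd_other` check could miss.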

But I doubt it would be a problem as it's rare ... unless SC-2 is really slow to start and the expectation of the whole *startup* is the same.

Anyway you will have a better idea :) if it comes up as a real problem.


On 30/10/18 6:42 pm, Gary Lee wrote:
I don't have any good ideas right now. The old check isn't reliable, but we can 
think about it for later.

It should be a rare case anyway, and 10s is probably not that noticeable when 
you consider start up times of other apps.

On 30 Oct 2018, at 6:29 pm, Minh Hon Chau <[email protected]> wrote:

Hi Nagu,

It's fine with me too, unless Gary knows how to avoid the wait for SC-2.

Thanks

Minh

On 30/10/18 6:14 pm, Nagendra Kumar wrote:
Hi Minh,
          I thought it would be rare. But if you find that it breaks your
existing functionality (backward compatibility), i.e. it delays the cluster
startup (say it used to take 2 seconds and now takes 10 seconds with one
controller), then Gary can retain the older code. That is fine with me.

Thanks
-Nagu

-----Original Message-----
From: Minh Hon Chau [mailto:[email protected]]
Sent: 30 October 2018 12:40
To: Nagendra Kumar; 'Gary Lee'; [email protected]
Cc: [email protected]
Subject: Re: [PATCH 1/1] amfd: ensure node_sync_window_closed is set [#2946]

Hi Nagu,

The subsequent procedures, including the assignments, will be delayed
while waiting for SC2 to join; that is nearly 10 secs, I guess. The headless
sync does not need to wait for the standby amfd, so the code here was not
expecting to wait for SC-2. I think this scenario is rare?

Thanks

Minh


On 30/10/18 4:43 pm, Nagendra Kumar wrote:
Hi Minh,
          I had noticed that point during review. But if both SCs have gone
down, then the expectation is that both should join.
If only one SC starts, then yes, the timeout will happen. Do you see any major
implications other than the assignment delay? I think that should be fine,
because the expected delay comes from waiting for SC-2 to join.

Thanks
-Nagu

-----Original Message-----
From: Minh Hon Chau [mailto:[email protected]]
Sent: 30 October 2018 02:41
To: Nagendra Kumar; 'Gary Lee'; [email protected]
Cc: [email protected]
Subject: Re: [PATCH 1/1] amfd: ensure node_sync_window_closed is set [#2946]

Hi Gary, Nagu

One thing to note about the patch, which you may already know.

If we have a two-SC cluster that goes headless and then only SC1 is started,
the headless sync will now always time out waiting for SC2 to come up.
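
To make the scenario concrete, here is a minimal sketch of the two counts after the patch. It is not the amfd code: the `Node` struct, the map and the helper names are simplified stand-ins assumed for illustration, and only the arithmetic (`size() - 1` expected, active SC excluded from the received count) is taken from the diff.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical, simplified stand-in for an entry in node_name_db.
struct Node {
  uint32_t node_id;
  uint32_t node_up_msg_count;
};

using NodeDb = std::map<std::string, Node>;

// Same arithmetic as the patched avd_count_sync_node_size():
// every configured node except the active SC is expected to sync.
uint32_t expected_sync_count(const NodeDb &db) {
  assert(!db.empty());  // guard against unsigned underflow in this sketch
  return static_cast<uint32_t>(db.size()) - 1;
}

// Same condition as the patched avd_count_node_up():
// count nodes that sent node_up, excluding only the active SC.
uint32_t received_node_up_count(const NodeDb &db, uint32_t active_id) {
  uint32_t received = 0;
  for (const auto &entry : db) {
    const Node &node = entry.second;
    if (node.node_up_msg_count > 0 && node.node_id != active_id) {
      ++received;
    }
  }
  return received;
}

// True once every expected node has sent node_up; until then the
// sync window stays open and only the timer can close it.
bool all_nodes_synced(const NodeDb &db, uint32_t active_id) {
  return received_node_up_count(db, active_id) >= expected_sync_count(db);
}
```

With SC-1 active and SC-2, PL-3, PL-4 configured, the expected count is 3. If SC-2 never starts, at most 2 node_up messages are counted, so the sync can only end when the NodeSync timer expires.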

Thanks

Minh

On 29/10/18 7:19 pm, Nagendra Kumar wrote:
Hi Gary,
             Great simplification! Ack.

Thanks
-Nagu

-----Original Message-----
From: Gary Lee [mailto:[email protected]]
Sent: 29 October 2018 12:36
To: [email protected]; [email protected]; Nagendra Kumar
Cc: [email protected]; Gary Lee
Subject: [PATCH 1/1] amfd: ensure node_sync_window_closed is set [#2946]

If all nodes are synced after headless, the timer is stopped
but node_sync_window_closed is never set to true.

Later on, if a node becomes split from the main network and
rejoins, it will send a headless sync to amfd.

amfd will go into a never ending loop of processing the message,
putting back into the queue, etc.

When the node sync timer is stopped, ensure node_sync_window_closed
is set.

Also modify avd_count_node_up() not to count standby SC.
Sometimes a node_up from the standby SC arrives before mds up,
and the standby SC is incorrectly included in the node sync
count. Then a legitimate node_up from a PL is not accepted
because node_sync_window_closed is prematurely set.
---
 src/amf/amfd/ndfsm.cc | 28 +++-------------------------
 1 file changed, 3 insertions(+), 25 deletions(-)

diff --git a/src/amf/amfd/ndfsm.cc b/src/amf/amfd/ndfsm.cc
index edc993988..375c5c7b1 100644
--- a/src/amf/amfd/ndfsm.cc
+++ b/src/amf/amfd/ndfsm.cc
@@ -165,34 +165,12 @@ done:
  *
  **************************************************************************/
 uint32_t avd_count_sync_node_size(AVD_CL_CB *cb) {
-  uint32_t twon_ncs_su_count = 0;
   uint32_t count = 0;
   TRACE_ENTER();
-  for (const auto &value : *node_name_db) {
-    AVD_AVND *avnd = value.second;
-    osafassert(avnd);
-    for (const auto &su : avnd->list_of_ncs_su) {
-      if (su->sg_of_su->sg_redundancy_model == SA_AMF_2N_REDUNDANCY_MODEL) {
-        twon_ncs_su_count++;
-        continue;
-      }
-    }
-  }
-  // cluster can have 1 SC or more SCs which hosting 2N Opensaf SU
-  // so twon_ncs_su_count at least is 1
-  osafassert(twon_ncs_su_count > 0);
-
-  if (twon_ncs_su_count == 1) {
-    // 1 SC, the rest of nodes could be in sync from headless
-    count = node_name_db->size() - 1;
-  } else {
-    // >=2 SCs, the rest of nodes could be in sync except active/standby SC
-    count = node_name_db->size() - 2;
-  }
+  count = node_name_db->size() - 1;
   TRACE("sync node size:%d", count);
-  TRACE_LEAVE();
   return count;
 }
 /*****************************************************************************
@@ -218,8 +196,7 @@ uint32_t avd_count_node_up(AVD_CL_CB *cb) {
   for (const auto &value : *node_name_db) {
     node = value.second;
     if (node->node_up_msg_count > 0 &&
-        node->node_info.nodeId != cb->node_id_avd &&
-        node->node_info.nodeId != cb->node_id_avd_other)
+        node->node_info.nodeId != cb->node_id_avd)
       ++received_count;
   }
   TRACE("Number of node director(s) that director received node_up msg:%u",
@@ -329,6 +306,7 @@ void avd_node_up_evh(AVD_CL_CB *cb, AVD_EVT *evt) {
       if (cb->node_sync_tmr.is_active) {
         avd_stop_tmr(cb, &cb->node_sync_tmr);
         TRACE("stop NodeSync timer");
+        cb->node_sync_window_closed = true;
       }
       cb->all_nodes_synced = true;
       LOG_NO("Received node_up_msg from all nodes");



_______________________________________________
Opensaf-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/opensaf-devel
