Hi Mathi, I added the patch below and I can still reproduce the fault.

/Regards HansN
-----Original Message-----
From: Mathivanan Naickan Palanivelu [mailto:[email protected]]
Sent: den 23 september 2014 09:13
To: Hans Feldt
Cc: [email protected]; Hans Nordebäck
Subject: Re: [devel] [PATCH 1 of 1] clm: avoid stale node down processing and unexpected track callback [#1120]

If the node restarted that quickly, then the fix for this is as below:

diff --git a/osaf/services/saf/clmsv/clms/clms_evt.c b/osaf/services/saf/clmsv/clms/clms_evt.c
--- a/osaf/services/saf/clmsv/clms/clms_evt.c
+++ b/osaf/services/saf/clmsv/clms/clms_evt.c
@@ -539,6 +539,12 @@ static uint32_t proc_rda_evt(CLMSV_CLMS_
 			/* fail over, become implementer */
 			clms_imm_impl_set(clms_cb);
+
+			/* Process agent downs first. It is quite possible that the agent
+			 * downs are for agents that were running on the same node
+			 * which went down and which is processed on the next line.
+			 */
+			clms_process_clma_down_list();

 			/* Process node downs during failover */
 			proc_downs_during_rolechange();
@@ -546,7 +552,6 @@ static uint32_t proc_rda_evt(CLMSV_CLMS_
 		}
 	}

-	clms_process_clma_down_list();
 done:
 	TRACE_LEAVE();
 	return rc;

HansN, could you please try this. In the meantime, I too will try using your rand() logic.

Thanks,
Mathi.

----- [email protected] wrote:
> Isn't that a race because restart in UML goes so unreasonably quickly
> that we cannot handle it?
> When CLM is handling the node down of PL4, it is already up again, or
> something.
> /HansF
>
> > -----Original Message-----
> > From: Hans Nordebäck
> > Sent: den 23 september 2014 08:59
> > To: Mathivanan Naickan Palanivelu
> > Cc: Hans Feldt; [email protected]
> > Subject: RE: [devel] [PATCH 1 of 1] clm: avoid stale node down processing and unexpected track callback [#1120]
> >
> > Yes, it was sent after the node came up, in this case payload 4.
> > It is easily reproduced using the power-off code and stopping and
> > starting a couple of payloads.
> >
> > /Regards HansN
> >
> > -----Original Message-----
> > From: Mathivanan Naickan Palanivelu [mailto:[email protected]]
> > Sent: den 23 september 2014 08:54
> > To: Hans Nordebäck
> > Cc: Hans Feldt; [email protected]
> > Subject: Re: [devel] [PATCH 1 of 1] clm: avoid stale node down processing and unexpected track callback [#1120]
> >
> > So, payload 4 and payload 3 were rebooted (stopped and started).
> > And it appears that the track callback was sent to 4 after that node
> > came up again (post reboot).
> > Is that what is happening (I mean, looking at the timestamp)?
> >
> > Thanks,
> > Mathi.
> >
> > ----- [email protected] wrote:
> >
> > > I added the following:
> > >
> > > void clms_track_send_node_down(CLMS_CLUSTER_NODE *node)
> > > :
> > >
> > > 	if (clms_cb->ha_state == SA_AMF_HA_ACTIVE) {
> > > 		if (!(rand() % 7)) {
> > > 			reboot(0x4321fedc /* LINUX_REBOOT_CMD_POWER_OFF */);
> > > 		}
> > > 	:
> > >
> > > to simulate a real case where SC-1 was powered off, with the PC
> > > probably at this address.
> > >
> > > ./opensaf nodestop 4
> > > ./opensaf nodestart 4
> > > ./opensaf nodestop 3
> > > ./opensaf nodestart 3
> > >
> > > /HansN
> > >
> > >
> > > On 09/23/14 08:13, Hans Feldt wrote:
> > > > Which node originally left the cluster, since you ended up in
> > > > clms_track_send_node_down?
> > > >
> > > > You must have killed one node first (PLx?) and then the active SC
> > > > was rebooted in the middle of handling this.
> > > >
> > > > /HansF
> > > >
> > > >> -----Original Message-----
> > > >> From: Hans Nordebäck [mailto:[email protected]]
> > > >> Sent: den 23 september 2014 08:06
> > > >> To: [email protected]
> > > >> Cc: [email protected]
> > > >> Subject: Re: [devel] [PATCH 1 of 1] clm: avoid stale node down processing and unexpected track callback [#1120]
> > > >>
> > > >> Hi Mathi, I tested the patch and SA_CLM_NODE_LEFT is sent to the
> > > >> active node:
> > > >>
> > > >> ./rootfs/var/PL-4/log/messages:Sep 23 07:31:23 PL-4 local0.notice
> > > >> osafamfnd[391]: NO This node has exited the cluster
> > > >>
> > > >> if run in UML and the power-off is done in clms_track_send_node_down
> > > >> before sending checkpoint data.
> > > >>
> > > >> /Regards HansN
> > > >>
> > > >> On 09/23/14 02:12, [email protected] wrote:
> > > >>>  osaf/services/saf/clmsv/clms/clms_cb.h  |   6 ++++
> > > >>>  osaf/services/saf/clmsv/clms/clms_evt.c |  48 ++++++++++++++++++++++++++++----
> > > >>>  2 files changed, 47 insertions(+), 7 deletions(-)
> > > >>>
> > > >>>
> > > >>> There is a possibility that the checkpointing message for a
> > > >>> NODE_DOWN reaches the STANDBY first, i.e.
> > > >>> before MDS delivers the NODE_DOWN event to the standby.
> > > >>> This can result in a stale node_down record getting stored in the
> > > >>> node_down list, which is a designated list
> > > >>> for processing node downs that occur during the role change from
> > > >>> standby to active.
> > > >>> The patch introduces a variable that records whether the checkpoint
> > > >>> event for node_down has
> > > >>> arrived first, followed by a check during role change to ignore
> > > >>> such stale events.
> > > >>>
> > > >>> diff --git a/osaf/services/saf/clmsv/clms/clms_cb.h b/osaf/services/saf/clmsv/clms/clms_cb.h
> > > >>> --- a/osaf/services/saf/clmsv/clms/clms_cb.h
> > > >>> +++ b/osaf/services/saf/clmsv/clms/clms_cb.h
> > > >>> @@ -37,6 +37,11 @@ typedef enum {
> > > >>>  	IMM_RECONFIGURED = 5
> > > >>>  } ADMIN_OP;
> > > >>>
> > > >>> +typedef enum {
> > > >>> +	CHECKPOINT_PROCESSED = 1,
> > > >>> +	MDS_DOWN_PROCESSED
> > > >>> +} NODE_DOWN_STATUS;
> > > >>> +
> > > >>>  /* Cluster Properties */
> > > >>>  typedef struct cluster_db_t {
> > > >>>  	SaNameT name;
> > > >>> @@ -124,6 +129,7 @@ typedef struct clma_down_list_tag {
> > > >>>
> > > >>>  typedef struct node_down_list_tag {
> > > >>>  	SaClmNodeIdT node_id;
> > > >>> +	NODE_DOWN_STATUS ndown_status;
> > > >>>  	struct node_down_list_tag *next;
> > > >>>  } NODE_DOWN_LIST;
> > > >>>
> > > >>> diff --git a/osaf/services/saf/clmsv/clms/clms_evt.c b/osaf/services/saf/clmsv/clms/clms_evt.c
> > > >>> --- a/osaf/services/saf/clmsv/clms/clms_evt.c
> > > >>> +++ b/osaf/services/saf/clmsv/clms/clms_evt.c
> > > >>> @@ -471,6 +471,13 @@ void clms_track_send_node_down(CLMS_CLUS
> > > >>>  	node->stat_change = SA_TRUE;
> > > >>>  	node->change = SA_CLM_NODE_LEFT;
> > > >>>  	++(clms_cb->cluster_view_num);
> > > >>> +
> > > >>> +	if (clms_cb->ha_state == SA_AMF_HA_ACTIVE) {
> > > >>> +		ckpt_node_rec(node);
> > > >>> +		ckpt_node_down_rec(node);
> > > >>> +		ckpt_cluster_rec();
> > > >>> +	}
> > > >>> +
> > > >>>  	clms_send_track(clms_cb, node, SA_CLM_CHANGE_COMPLETED);
> > > >>>  	/* Clear node->stat_change after sending the callback to its clients */
> > > >>>  	node->stat_change = SA_FALSE;
> > > >>> @@ -480,12 +487,6 @@ void clms_track_send_node_down(CLMS_CLUS
> > > >>>  	clms_node_update_rattr(node);
> > > >>>  	clms_cluster_update_rattr(osaf_cluster);
> > > >>>
> > > >>> -	if (clms_cb->ha_state == SA_AMF_HA_ACTIVE) {
> > > >>> -		ckpt_node_rec(node);
> > > >>> -		ckpt_node_down_rec(node);
> > > >>> -		ckpt_cluster_rec();
> > > >>> -	}
> > > >>> -
> > > >>>  	/* For the NODE DOWN, boottimestamp will not be updated */
> > > >>>
> > > >>>  	/* Delete the node reference from the nodeid database */
> > > >>> @@ -592,6 +593,7 @@ static uint32_t proc_mds_node_evt(CLMSV_
> > > >>>  			clms_cb->node_down_list_tail->next = node_down_rec;
> > > >>>  		}
> > > >>>  		clms_cb->node_down_list_tail = node_down_rec;
> > > >>> +		node_down_rec->ndown_status = MDS_DOWN_PROCESSED;
> > > >>>  	}
> > > >>>
> > > >>>  done:
> > > >>> @@ -1613,6 +1615,7 @@ void clms_remove_node_down_rec(SaClmNode
> > > >>>  {
> > > >>>  	NODE_DOWN_LIST *node_down_rec = clms_cb->node_down_list_head;
> > > >>>  	NODE_DOWN_LIST *prev_rec = NULL;
> > > >>> +	bool record_found = false;
> > > >>>
> > > >>>  	while (node_down_rec) {
> > > >>>  		if (node_down_rec->node_id == node_id) {
> > > >>> @@ -1638,11 +1641,36 @@ void clms_remove_node_down_rec(SaClmNode
> > > >>>  			/* Free the NODE_DOWN_REC */
> > > >>>  			free(node_down_rec);
> > > >>>  			node_down_rec = NULL;
> > > >>> +			record_found = true;
> > > >>>  			break;
> > > >>>  		}
> > > >>>  		prev_rec = node_down_rec;	/* Remember address of this entry */
> > > >>>  		node_down_rec = node_down_rec->next;	/* Go to next entry */
> > > >>>  	}
> > > >>> +
> > > >>> +	if (!record_found) {
> > > >>> +		/* MDS node_down has not yet reached the STANDBY,
> > > >>> +		 * so just add this checkpoint update record to the list;
> > > >>> +		 * MDS_DOWN processing will delete it.
> > > >>> +		 * If a role change happens before MDS_DOWN is received,
> > > >>> +		 * then role change processing just ignores the record and removes it
> > > >>> +		 * from the list.
> > > >>> +		 */
> > > >>> +		node_down_rec = NULL;
> > > >>> +		if ((node_down_rec = (NODE_DOWN_LIST *)malloc(sizeof(NODE_DOWN_LIST))) == NULL) {
> > > >>> +			LOG_CR("Memory allocation for NODE_DOWN_LIST failed");
> > > >>> +			return;
> > > >>> +		}
> > > >>> +		memset(node_down_rec, 0, sizeof(NODE_DOWN_LIST));
> > > >>> +		node_down_rec->node_id = node_id;
> > > >>> +		if (clms_cb->node_down_list_head == NULL) {
> > > >>> +			clms_cb->node_down_list_head = node_down_rec;
> > > >>> +		} else {
> > > >>> +			if (clms_cb->node_down_list_tail)
> > > >>> +				clms_cb->node_down_list_tail->next = node_down_rec;
> > > >>> +		}
> > > >>> +		clms_cb->node_down_list_tail = node_down_rec;
> > > >>> +		node_down_rec->ndown_status = CHECKPOINT_PROCESSED;
> > > >>> +	}
> > > >>>  }
> > > >>>
> > > >>>  /**
> > > >>> @@ -1696,7 +1724,13 @@ void proc_downs_during_rolechange (void)
> > > >>>  		/* Remove NODE_DOWN_REC from the NODE_DOWN_LIST */
> > > >>>  		node = clms_node_get_by_id(node_down_rec->node_id);
> > > >>>  		temp_node_down_rec = node_down_rec;
> > > >>> -		if (node != NULL)
> > > >>> +		/* If the nodedown status is CHECKPOINT_PROCESSED, it means that
> > > >>> +		 * a checkpoint update was received when this node was STANDBY, but
> > > >>> +		 * the MDS node_down did not reach the STANDBY. An extremely rare case,
> > > >>> +		 * but good to have protection for it, by ignoring the record
> > > >>> +		 * if it is in the CHECKPOINT_PROCESSED state.
> > > >>> +		 */
> > > >>> +		if ((node != NULL) && (temp_node_down_rec->ndown_status != CHECKPOINT_PROCESSED))
> > > >>>  			clms_track_send_node_down(node);
> > > >>>  		node_down_rec = node_down_rec->next;
> > > >>>  		/* Free the NODE_DOWN_REC */
> > > >>
> > > >>
> > > >> _______________________________________________
> > > >> Opensaf-devel mailing list
> > > >> [email protected]
> > > >> https://lists.sourceforge.net/lists/listinfo/opensaf-devel
