Hans, ACK after fixing Anders' and two of my comments inlined as [Mathi]. Thanks, Mathi.
> -----Original Message----- > From: Hans Nordebäck [mailto:hans.nordeb...@ericsson.com] > Sent: Friday, September 30, 2016 3:38 PM > To: Mathivanan Naickan Palanivelu; anders.wid...@ericsson.com > Cc: opensaf-devel@lists.sourceforge.net > Subject: Re: [PATCH 1 of 1] clm: add support for cluster reboot V3 [#2053] > > Hi Mathi, > > see comments inline below. Do you think we can push this with AndersW > comments incorporated? > > /Thanks HansN > > > On 09/30/2016 11:26 AM, Mathivanan Naickan Palanivelu wrote: > > Thanks for the details and I agree. > > I was actually enquiring about > > - why a new "opensaf_safe_reboot" option in the opensaf_reboot script > > and > > - Why a osaf_safe_reboot() is added in osaf_utility.c instead of using the > opensaf_reboot() function itself. > [HansN] after CLM has requested safe_reboot, processing should continue > normally, i.e. not stop in e.g. > the use_fallback loop in opensaf_reboot, as then AMF probably will start > error handling/recovery etc. > > > > For the first question, I think the obvious guess would be that we want to > use 'shutdown -r' such that rc scripts are invoked when doing the reboot. > [HansN] yes, we have to separate between shutdown and reboot -f. > > > > Thanks, > > Mathi. > > > >> -----Original Message----- > >> From: Hans Nordebäck [mailto:hans.nordeb...@ericsson.com] > >> Sent: Friday, September 30, 2016 2:14 PM > >> To: Mathivanan Naickan Palanivelu; anders.wid...@ericsson.com > >> Cc: opensaf-devel@lists.sourceforge.net > >> Subject: Re: [PATCH 1 of 1] clm: add support for cluster reboot V3 > >> [#2053] > >> > >> Hi Mathi, > >> > >> ordering a cluster reboots are done at several places, both outside > >> OpenSAF and inside OpenSAF e.g. SMF. > >> > >> There are several different solutions to order a cluster reboot and > >> in many cases they don't work well, so it would be > >> > >> good if OpenSAF could provide one way to perform a cluster reboot. > >> AMF may also use this for implementing SA_AMF_CLUSTER_RESET. > >> > >> One common use case is upgrade with a following cluster reboot. The > >> payloads are normally PXE booted and the DHCP > >> > >> is stopped at the controllers before ordering a cluster reboot so the > >> payloads will not start until the controllers has rebooted and DHCP is > started. > >> > >> Then each node is rebooted by in sequence do ssh to each node and > >> order a reboot. This may take time and if CLM can broadcast this > >> reboot request the > >> > >> reboot will be considerable faster. So therefor in the first version > >> of the cluster reboot support I don't think we need to consider > >> implementing phases, this can > >> > >> be added later, with e.g. a flag in CLM as AndersW suggested. > >> > >> /Thanks HansN > >> > >> > >> On 09/30/2016 09:43 AM, Mathivanan Naickan Palanivelu wrote: > >>> Hi Hans, > >>> > >>> Could you provide some background on the need for > >> opensaf_safe_reboot(). > >>> What would be the need for this? > >>> > >>> Thanks, > >>> Mathi. > >>> > >>> > >>>> -----Original Message----- > >>>> From: Hans Nordeback [mailto:hans.nordeb...@ericsson.com] > >>>> Sent: Wednesday, September 28, 2016 5:26 PM > >>>> To: anders.wid...@ericsson.com; Mathivanan Naickan Palanivelu > >>>> Cc: opensaf-devel@lists.sourceforge.net > >>>> Subject: [PATCH 1 of 1] clm: add support for cluster reboot V3 > >>>> [#2053] > >>>> > >>>> osaf/libs/common/clmsv/include/clmsv_msg.h | 6 +++ > >>>> osaf/libs/core/common/include/osaf_utility.h | 5 +++ > >>>> osaf/libs/core/common/osaf_utility.c | 22 +++++++++++++ > >>>> osaf/services/saf/clmsv/clms/clms.h | 3 +- > >>>> osaf/services/saf/clmsv/clms/clms_imm.c | 18 ++++++++++ > >>>> osaf/services/saf/clmsv/clms/clms_mds.c | 46 > >>>> +++++++++++++++++++++++++++- > >>>> osaf/services/saf/clmsv/clms/clms_util.c | 12 +++++++ > >>>> osaf/services/saf/clmsv/nodeagent/main.c | 12 +++++++ > >>>> scripts/opensaf_reboot | 22 ++++++++++--- > >>>> 9 files changed, 139 insertions(+), 7 deletions(-) > >>>> > >>>> > >>>> Admin command to request cluster reboot: > >>>> immadm -o 1 safCluster=myClmCluster > >>>> > >>>> diff --git a/osaf/libs/common/clmsv/include/clmsv_msg.h > >>>> b/osaf/libs/common/clmsv/include/clmsv_msg.h > >>>> --- a/osaf/libs/common/clmsv/include/clmsv_msg.h > >>>> +++ b/osaf/libs/common/clmsv/include/clmsv_msg.h > >>>> @@ -23,6 +23,7 @@ typedef enum clms_msg_type { > >>>> CLMSV_CLMS_TO_CLMA_CBK_MSG, > >>>> CLMSV_CLMS_TO_CLMA_API_RESP_MSG, > >>>> CLMSV_CLMS_TO_CLMA_IS_MEMBER_MSG, > >>>> + CLMSV_CLMS_TO_CLMNA_REBOOT_MSG, > >>>> CLMSV_MSG_MAX > >>>> } CLMSV_MSG_TYPE; > >>>> > >>>> @@ -174,6 +175,10 @@ typedef struct clmsv_is_member_info_t { > >>>> SaUint32T client_id; > >>>> }CLMSV_IS_MEMBER_INFO; > >>>> > >>>> +typedef struct clmsv_reboot_info_t { > >>>> + SaClmNodeIdT node_id; > >>>> +} CLMSV_REBOOT_INFO; > >>>> + > >>>> /* Top Level CLMSv MDS message structure for use between CLMS-> > >>>> CLMA && CLMA -> CLMS */ typedef struct clmsv_msg_t { > >>>> struct clmsv_msg_t *next; /* Mailbox processing */ > >>>> @@ -183,6 +188,7 @@ typedef struct clmsv_msg_t { > >>>> CLMSV_CBK_INFO cbk_info; /* Callback Messages from CLMS to > >> CLA > >>>> */ > >>>> CLMSV_API_RESP_INFO api_resp_info; /* Response Messages > from > >>>> CLMS to CLA */ > >>>> CLMSV_IS_MEMBER_INFO is_member_info; /*Is node member > or > >> not > >>>> Message from CLMS to CLA*/ > >>>> + CLMSV_REBOOT_INFO reboot_info; /* Reboot request from > >>>> CLMS to CLMNA */ > >>>> } info; > >>>> } CLMSV_MSG; > >>>> > >>>> diff --git a/osaf/libs/core/common/include/osaf_utility.h > >>>> b/osaf/libs/core/common/include/osaf_utility.h > >>>> --- a/osaf/libs/core/common/include/osaf_utility.h > >>>> +++ b/osaf/libs/core/common/include/osaf_utility.h > >>>> @@ -24,6 +24,8 @@ > >>>> #ifndef OPENSAF_CORE_OSAF_UTILITY_H_ > >>>> #define OPENSAF_CORE_OSAF_UTILITY_H_ > >>>> > >>>> +#define USE_SAFE_REBOOT 1 > >>>> + [Mathi] I think this should be OSAF_USE_SAFE_REBOOT > >>>> #include <pthread.h> > >>>> > >>>> #ifdef __cplusplus > >>>> @@ -68,6 +70,9 @@ extern void osaf_abort(long i_cause) #endif > >>>> nothrow, noreturn)); > >>>> > >>>> +extern void osaf_safe_reboot() > >>>> + __attribute__ ((nothrow)); > >>>> + > >>>> static inline void osaf_mutex_lock_ordie(pthread_mutex_t* > io_mutex) { > >>>> int result = pthread_mutex_lock(io_mutex); > >>>> if (result != 0) osaf_abort(result); diff --git > >>>> a/osaf/libs/core/common/osaf_utility.c > >>>> b/osaf/libs/core/common/osaf_utility.c > >>>> --- a/osaf/libs/core/common/osaf_utility.c > >>>> +++ b/osaf/libs/core/common/osaf_utility.c > >>>> @@ -16,9 +16,12 @@ > >>>> */ > >>>> > >>>> #include "osaf_utility.h" > >>>> +#include "ncssysf_def.h" > >>>> +#include "configmake.h" > >>>> #include <stdlib.h> > >>>> #include <errno.h> > >>>> #include <syslog.h> > >>>> +#include <stdio.h> > >>>> > >>>> void osaf_abort(long i_cause) > >>>> { > >>>> @@ -26,3 +29,22 @@ void osaf_abort(long i_cause) > >>>> i_cause, __builtin_return_address(0), errno); > >>>> abort(); > >>>> } > >>>> + > >>>> +void osaf_safe_reboot() > >>>> +{ > >>>> + char str[256]; > >>>> + > >>>> + snprintf(str, sizeof(str), PKGLIBDIR "/opensaf_reboot %u %s %u", > >>>> +0, > >>>> "not_used", USE_SAFE_REBOOT); > >>>> + syslog(LOG_NOTICE, "Reboot ordered using command: %s", str); > >>>> + > >>>> + int rc = system(str); > >>>> + if (rc < 0) { > >>>> + syslog(LOG_CRIT, "Node reboot failure: exit code %d", > >>>> WEXITSTATUS(rc)); > >>>> + } else { > >>>> + if (WIFEXITED(rc) && WEXITSTATUS(rc) == 0) { > >>>> + syslog(LOG_NOTICE, "Command: %s successfully > >>>> executed", str); > >>>> + } else { > >>>> + syslog(LOG_CRIT, "Command: %s failed with exit > >>>> code %d", str, WEXITSTATUS(rc)); > >>>> + } > >>>> + } > >>>> +} > >>>> diff --git a/osaf/services/saf/clmsv/clms/clms.h > >>>> b/osaf/services/saf/clmsv/clms/clms.h > >>>> --- a/osaf/services/saf/clmsv/clms/clms.h > >>>> +++ b/osaf/services/saf/clmsv/clms/clms.h > >>>> @@ -99,6 +99,7 @@ extern uint32_t clms_mds_msg_send(CLMS_C > >>>> MDS_DEST *dest, > >>>> MDS_SYNC_SND_CTXT *mds_ctxt, > >>>> MDS_SEND_PRIORITY_TYPE prio, NCSMDS_SVC_ID svc_id); > >>>> > >>>> +extern uint32_t clms_mds_msg_bcast(CLMS_CB *cb, CLMSV_MSG > >>>> *bcast_msg); > >>>> extern SaAisErrorT clms_imm_activate(CLMS_CB * cb); extern > >>>> uint32_t clms_node_trackresplist_empty(CLMS_CLUSTER_NODE * > >> op_node); > >>>> extern uint32_t clms_send_cbk_start_sub(CLMS_CB * cb, > >>>> CLMS_CLUSTER_NODE * node); @@ -125,5 +126,5 @@ extern void > >>>> clms_cb_dump(void); extern uint32_t > >> clms_send_is_member_info(CLMS_CB > >>>> * cb, SaClmNodeIdT node_id, SaBoolT member, SaBoolT > >>>> is_configured); extern void clm_imm_reinit_bg(CLMS_CB * cb); > >>>> extern void proc_downs_during_rolechange (void); > >>>> - > >>>> +extern void clms_cluster_reboot(); > >>>> #endif /* ifndef CLMS_H */ > >>>> diff --git a/osaf/services/saf/clmsv/clms/clms_imm.c > >>>> b/osaf/services/saf/clmsv/clms/clms_imm.c > >>>> --- a/osaf/services/saf/clmsv/clms/clms_imm.c > >>>> +++ b/osaf/services/saf/clmsv/clms/clms_imm.c > >>>> @@ -19,6 +19,7 @@ > >>>> > >>>> #include "clms.h" > >>>> #include "osaf_extended_name.h" > >>>> +#include "osaf_utility.h" > >>>> > >>>> extern struct ImmutilWrapperProfile immutilWrapperProfile; > >>>> > >>>> @@ -886,6 +887,23 @@ static void clms_imm_admin_op_callback(S > >>>> > >>>> TRACE_ENTER2("Admin callback for nodename:%s, opId:%llu", > >>>> objectName->value, opId); > >>>> > >>>> + // E.g. immadm -o 1 safCluster=myClmCluster > >>>> + if (strncmp(osaf_extended_name_borrow(objectName), > >>>> + osaf_extended_name_borrow(&osaf_cluster->name), > >>>> + osaf_extended_name_length(objectName)) == 0) { > >>>> + if (opId == 1) { > >>>> + LOG_WA("Cluster reboot requested. Ordering > >>>> cluster reboot"); > >>>> + // MDS broadcast/multi cast call is synchronous > >>>> + clms_cluster_reboot(); > >>>> + sleep(1); > >>>> + osaf_safe_reboot(); > >>>> + } else { > >>>> + LOG_ER("Admin Operation not supported for %s", > >>>> osaf_extended_name_borrow(objectName)); > >>>> + > >>>> immutil_saImmOiAdminOperationResult(immOiHandle, invocation, > >>>> SA_AIS_ERR_INVALID_PARAM); > >>>> + } > >>>> + goto done; > >>>> + } > >>>> + > >>>> /*Lookup by the node_name and get the cluster node for CLM > >> Admin > >>>> oper */ > >>>> nodeop = clms_node_get_by_name(objectName); > >>>> if (nodeop == NULL) { > >>>> diff --git a/osaf/services/saf/clmsv/clms/clms_mds.c > >>>> b/osaf/services/saf/clmsv/clms/clms_mds.c > >>>> --- a/osaf/services/saf/clmsv/clms/clms_mds.c > >>>> +++ b/osaf/services/saf/clmsv/clms/clms_mds.c > >>>> @@ -659,7 +659,17 @@ uint32_t clms_mds_enc(struct ncsmds_call > >>>> ncs_enc_claim_space(uba, 4); > >>>> total_bytes += 4; > >>>> > >>>> - if (CLMSV_CLMS_TO_CLMA_API_RESP_MSG == msg->evt_type) { > >>>> + if (CLMSV_CLMS_TO_CLMNA_REBOOT_MSG == msg->evt_type) { > >>>> + /* encode the reboot msg **/ > >>>> + p8 = ncs_enc_reserve_space(uba, 4); > >>>> + if (!p8) { > >>>> + TRACE("ncs_enc_reserve_space failed"); > >>>> + goto err; > >>>> + } > >>>> + ncs_encode_32bit(&p8, msg->info.reboot_info.node_id); > >>>> + ncs_enc_claim_space(uba, 4); > >>>> + total_bytes += 4; > >>>> + } else if (CLMSV_CLMS_TO_CLMA_API_RESP_MSG == msg- > >>>>> evt_type) { > >>>> /** encode the API RSP msg subtype **/ > >>>> p8 = ncs_enc_reserve_space(uba, 4); > >>>> if (!p8) { > >>>> @@ -1517,3 +1527,37 @@ uint32_t clms_mds_msg_send(CLMS_CB * > cb, > >>>> TRACE_LEAVE(); > >>>> return rc; > >>>> } > >>>> + > >>>> > >> > +/********************************************************* > >>>> ******************* > >>>> + Name : clms_mds_msg_bcast > >>>> + > >>>> + Description : This routine sends a broadcast message to CLMNA. > >>>> + > >>>> + Arguments : cb - ptr to the CLMA CB > >>>> + bcast_msg - ptr to the CLMSv broadcast message > >>>> + > >>>> + Return Values : NCSCC_RC_SUCCESS/NCSCC_RC_FAILURE > >>>> + > >>>> + Notes : None. > >>>> > >> > +********************************************************* > >>>> ************** > >>>> +*******/ uint32_t clms_mds_msg_bcast(CLMS_CB *cb, > CLMSV_MSG > >>>> *bcast_msg) > >>>> +{ > >>>> + NCSMDS_INFO snd_mds = {0}; > >>>> + uint32_t rc; > >>>> + > >>>> + snd_mds.i_mds_hdl = cb->mds_hdl; > >>>> + snd_mds.i_svc_id = NCSMDS_SVC_ID_CLMS; > >>>> + snd_mds.i_op = MDS_SEND; > >>>> + snd_mds.info.svc_send.i_msg = (NCSCONTEXT)bcast_msg; > >>>> + snd_mds.info.svc_send.i_to_svc = NCSMDS_SVC_ID_CLMNA; > >>>> + snd_mds.info.svc_send.i_priority = MDS_SEND_PRIORITY_HIGH; > >>>> + snd_mds.info.svc_send.i_sendtype = MDS_SENDTYPE_BCAST; > >>>> + snd_mds.info.svc_send.info.bcast.i_bcast_scope = > >>>> NCSMDS_SCOPE_NONE; > >>>> + > >>>> + if ((rc = ncsmds_api(&snd_mds)) != NCSCC_RC_SUCCESS) { > >>>> + LOG_ER("%s: ncsmds_api MDS_SEND failed %u", > >>>> __FUNCTION__ ,rc); > >>>> + return rc; > >>>> + } > >>>> + > >>>> + return NCSCC_RC_SUCCESS; > >>>> +} > >>>> \ No newline at end of file > >>>> diff --git a/osaf/services/saf/clmsv/clms/clms_util.c > >>>> b/osaf/services/saf/clmsv/clms/clms_util.c > >>>> --- a/osaf/services/saf/clmsv/clms/clms_util.c > >>>> +++ b/osaf/services/saf/clmsv/clms/clms_util.c > >>>> @@ -1200,3 +1200,15 @@ bool ip_matched(uint16_t family1, uint8_ > >>>> return true; > >>>> } > >>>> > >>>> +// > >>>> +void clms_cluster_reboot() > >>>> +{ > >>>> + CLMSV_MSG bcast_msg; > >>>> + bcast_msg.evt_type = CLMSV_CLMS_TO_CLMNA_REBOOT_MSG; > >>>> + bcast_msg.info.reboot_info.node_id = clms_cb->node_id; [Mathi] Iam okay with sending the node_id instead of a dummy message, but what is the need for using node_id here? I think perhaps cluster_id should also be sent along with this? > >>>> + if (clms_mds_msg_bcast(clms_cb, &bcast_msg) == > >>>> NCSCC_RC_SUCCESS) { > >>>> + LOG_NO("Sending cluster reboot broadcast message > >>>> succeeded"); > >>>> + } else { > >>>> + LOG_ER("Sending cluster reboot broadcast message > >>>> failed"); > >>>> + } > >>>> +} > >>>> diff --git a/osaf/services/saf/clmsv/nodeagent/main.c > >>>> b/osaf/services/saf/clmsv/nodeagent/main.c > >>>> --- a/osaf/services/saf/clmsv/nodeagent/main.c > >>>> +++ b/osaf/services/saf/clmsv/nodeagent/main.c > >>>> @@ -114,6 +114,18 @@ static uint32_t clmna_mds_dec(struct ncs > >>>> total_bytes += 4; > >>>> > >>>> switch (msg->evt_type) { > >>>> + case CLMSV_CLMS_TO_CLMNA_REBOOT_MSG: > >>>> + { > >>>> + p8 = ncs_dec_flatten_space(uba, local_data, 4); > >>>> + msg->info.reboot_info.node_id = > >>>> ncs_decode_32bit(&p8); > >>>> + ncs_dec_skip_space(uba, 4); > >>>> + total_bytes += 4; > >>>> + // Reboot will be performed by CLMS for this > >>>> node. > >>>> + if (clmna_cb->node_info.node_id != msg- > >>>>> info.reboot_info.node_id) { > >>>> + osaf_safe_reboot(); > >>>> + } > >>>> + break; > >>>> + } > >>>> case CLMSV_CLMS_TO_CLMA_API_RESP_MSG: > >>>> { > >>>> p8 = ncs_dec_flatten_space(uba, local_data, 8); > >>>> diff - > >> -git > >>>> a/scripts/opensaf_reboot b/scripts/opensaf_reboot > >>>> --- a/scripts/opensaf_reboot > >>>> +++ b/scripts/opensaf_reboot > >>>> @@ -40,10 +40,17 @@ NODE_ID_FILE=$pkglocalstatedir/node_id > >>>> > >>>> node_id=$1 > >>>> ee_name=$2 > >>>> +safe_reboot=$3 > >>>> > >>>> # Run commands through sudo when not superuser test $(id -u) > >>>> -ne 0 && icmd=$(which sudo 2> /dev/null) > >>>> > >>>> +opensaf_safe_reboot() > >>>> +{ > >>>> + logger -t "opensaf_reboot" "Rebooting local node using shutdown" > >>>> + $icmd /sbin/shutdown -r now > >>>> +} > >>>> + > >>>> ## Use stonith for remote fencing > >>>> opensaf_reboot_with_remote_fencing() > >>>> { > >>>> @@ -91,8 +98,12 @@ temp_node_id=`cat "$NODE_ID_FILE"` > >>>> temp_node_id=`echo "$temp_node_id" |sed -e 's:^0[bBxX]::'| sed -e > >>>> 's:^:0x:'` self_node_id=`printf "%d" $temp_node_id` > >>>> > >>>> -# A node ID of zero(0) means an order to reboot the local node -if > >>>> [ "$self_node_id" = "$node_id" ] || [ $node_id = 0 ]; then > >>>> + > >>>> +if [ "$safe_reboot" = 1 ]; then > >>>> + opensaf_safe_reboot > >>>> +else > >>>> + # A node ID of zero(0) means an order to reboot the local node > >>>> + if [ "$self_node_id" = "$node_id" ] || [ $node_id = 0 ]; then > >>>> # uncomment the following line if debugging errors that keep > >>>> restarting the node > >>>> # exit 0 > >>>> > >>>> @@ -114,8 +125,8 @@ if [ "$self_node_id" = "$node_id" ] || [ > >>>> > >>>> # Reboot (not shutdown) system WITH file system sync > >>>> $icmd /sbin/reboot -f > >>>> -else > >>>> - if [ "$FMS_USE_REMOTE_FENCING" = "1" ]; then > >>>> + else > >>>> + if [ "$FMS_USE_REMOTE_FENCING" = 1 ]; then > >>>> opensaf_reboot_with_remote_fencing > >>>> else > >>>> if [ ":$ee_name" != ":" ]; then @@ -133,4 +144,5 @@ else > >>>> logger -t "opensaf_reboot" "Rebooting remote > >>>> node > >> in the > >>>> absence of PLM is outside the scope of OpenSAF" > >>>> fi > >>>> fi > >>>> -fi > >>>> + fi > >>>> +fi > >>>> \ No newline at end of file > > ------------------------------------------------------------------------------ Check out the vibrant tech community on one of the world's most engaging tech sites, SlashDot.org! http://sdm.link/slashdot _______________________________________________ Opensaf-devel mailing list Opensaf-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/opensaf-devel