Okay, thanks.
I will get back to you; I am going through the documentation of cluster-glue.
I am wondering whether a resource-agent-based approach might be more generic or
extendable?
- Mathi.


> -----Original Message-----
> From: Hans Nordebäck [mailto:hans.nordeb...@ericsson.com]
> Sent: Thursday, June 30, 2016 2:24 PM
> To: Mathivanan Naickan Palanivelu; Ramesh Babu Betham; Praveen Malviya;
> Anders Widell
> Cc: opensaf-devel@lists.sourceforge.net; hans Nordebäck
> Subject: Re: [devel] [PATCH 2 of 2] fm: Add support for remote fencing using
> STONITH [#1859]
> 
> Hi Mathi,
> 
> I ran the tests on Xubuntu 14.04 with KVM, and mainly used the man page for
> stonith.
> 
> To install stonith on each virtual machine:
> 
> sudo apt-get install cluster-glue
> 
> I tested using both ssh and tcp. TCP is easier to deploy; if a firewall is
> used, add TCP port 16509 to the firewall rules on the host.
> If using ssh, run ssh-keygen and copy the keys from each virtual machine to
> the host.
> 
> libvirt has to be installed on the host and on the virtual machines, and
> virsh can be used to verify the setup, e.g.:
> 
> virsh --connect=qemu+tcp://192.168.122.1/system list --all
> 
> The IP address is the address of the host running the hypervisor. In my
> setup, for example, the SCs use two interfaces: one on the 192.168.122.1
> net for stonith (backplane management) and one for the OpenSAF cluster
> using TIPC. The payloads have only one interface, using TIPC.
> 
> /Thanks HansN
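
The verification step described above can be sketched as a small shell helper. This is only an illustration under stated assumptions: the function names libvirt_uri and verify_cmd are hypothetical (they are not part of OpenSAF, libvirt or cluster-glue), and the helper only prints the virsh command from the message instead of running it.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of OpenSAF or libvirt.

# Build a libvirt connection URI for the hypervisor host.
# $1 = transport ("tcp" or "ssh"), $2 = address of the host running the hypervisor
libvirt_uri() {
    case "$1" in
        tcp|ssh) printf 'qemu+%s://%s/system\n' "$1" "$2" ;;
        *) echo "unsupported transport: $1" >&2; return 1 ;;
    esac
}

# Print (do not run) the virsh command used to verify the setup.
verify_cmd() {
    uri=$(libvirt_uri "$1" "$2") || return 1
    printf 'virsh --connect=%s list --all\n' "$uri"
}
```

For example, "verify_cmd tcp 192.168.122.1" prints the same virsh command line as shown in the message; running that command on a virtual node should list the domains if the TCP port (or the ssh keys) are set up correctly.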
> 
> On 06/30/2016 10:26 AM, Mathivanan Naickan Palanivelu wrote:
> > Hi Hans,
> >
> > Could you please give a pointer to the web page of the stonith agents
> > (and/or daemons?) that you used to test these changes?
> >
> > Thanks,
> > Mathi.
> >
> >
> >> -----Original Message-----
> >> From: Hans Nordebäck [mailto:hans.nordeb...@ericsson.com]
> >> Sent: Thursday, June 30, 2016 1:08 PM
> >> To: Hans Nordebäck; Mathivanan Naickan Palanivelu; Ramesh Babu
> >> Betham; Praveen Malviya; Anders Widell
> >> Cc: opensaf-devel@lists.sourceforge.net
> >> Subject: RE: [devel] [PATCH 2 of 2] fm: Add support for remote
> >> fencing using STONITH [#1859]
> >>
> >> Hi, has anyone had time to look at this patch? It would be good to
> >> get some early feedback, as it may need some further changes. I
> >> considered whether e.g. PLM should be used for the configuration, but
> >> it seems to be more work. What do you say?
> >>
> >> /Thanks HansN
> >>
> >> -----Original Message-----
> >> From: Hans Nordeback [mailto:hans.nordeb...@ericsson.com]
> >> Sent: den 21 juni 2016 20:49
> >> To: mathi.naic...@oracle.com; ramesh.bet...@oracle.com;
> >> praveen.malv...@oracle.com; Anders Widell
> >> <anders.wid...@ericsson.com>
> >> Cc: opensaf-devel@lists.sourceforge.net
> >> Subject: [devel] [PATCH 2 of 2] fm: Add support for remote fencing
> >> using STONITH [#1859]
> >>
> >>   00-README.conf                                  |   47 +++++++++
> >>   osaf/services/infrastructure/fm/config/fmd.conf |    9 +-
> >>   osaf/services/infrastructure/fm/fms/Makefile.am |    3 +-
> >>   osaf/services/infrastructure/fm/fms/fm_cb.h     |    4 +
> >>   osaf/services/infrastructure/fm/fms/fm_main.c   |  118 +++++++++++++++++++++++-
> >>   scripts/opensaf_reboot                          |   47 +++++++--
> >>   6 files changed, 210 insertions(+), 18 deletions(-)
> >>
> >>
> >> diff --git a/00-README.conf b/00-README.conf
> >> --- a/00-README.conf
> >> +++ b/00-README.conf
> >> @@ -530,3 +530,50 @@ and not access any of its members direct
> >>   saAisNameBorrow() access functions shall be used. The
> >>   SA_MAX_UNEXTENDED_NAME_LENGTH constant can be used to refer to the maximum
> >>   string length that can be stored in the unextended SaNameT type.
> >> +
> >> +Configuring remote fencing support using STONITH
> >> +================================================
> >> +
> >> +In a virtualized environment, STONITH can be used for remote
> >> +fencing of the other system controller in case of "link loss" or when
> >> +the peer system controller is "live hanging", in order to avoid
> >> +split-brain. Node self-fencing will also be used if e.g. the active
> >> +controller loses connectivity to all other nodes in the cluster.
> >> +
> >> +Example installation on Ubuntu 14.04:
> >> +
> >> +On each virtual node, install the stonith package:
> >> +
> >> +  sudo apt-get install cluster-glue
> >> +
> >> +The name of each virtual node should be the same as the CLM node
> >> +name, e.g. for safNode=SC-2,safCluster=myClmCluster the virtual node
> >> +name should be SC-2.
> >> +
> >> +If a firewall is used on the "hypervisor" host, TCP port 16509
> >> +has to be added to the firewall rules. If ssh is used, use ssh-keygen
> >> +to generate ssh keys for each virtual node.
> >> +
> >> +To verify the installation, virsh can be used, e.g.:
> >> +virsh --connect=qemu+tcp://192.168.122.1/system list --all
> >> +
> >> +Example of output:
> >> +Id    Name                           State
> >> +----------------------------------------------------
> >> + 2     SC-1                           running
> >> + 3     SC-2                           running
> >> + 4     PL-3                           running
> >> +
> >> +Update the fmd.conf file:
> >> +
> >> +# The Promote active timer is set to delay the Standby controller's reboot request,
> >> +# as the Active controller is probably also requesting a reboot of the standby.
> >> +# The resolution is in 10 ms units.
> >> +export FMS_PROMOTE_ACTIVE_TIMER=300
> >> +
> >> +# Uncomment the next 5 lines and update accordingly to enable remote fencing
> >> +# See also documentation for STONITH
> >> +export FMS_USE_REMOTE_FENCING=1
> >> +export FMS_FENCE_CMD="stonith"
> >> +export FMS_DEVICE_TYPE="external/libvirt"
> >> +export FMS_HYPERVISOR_URI="qemu+tcp://192.168.122.1/system"
> >> +export FMS_FENCE_ACTION="reset"
> >> diff --git a/osaf/services/infrastructure/fm/config/fmd.conf
> >> b/osaf/services/infrastructure/fm/config/fmd.conf
> >> --- a/osaf/services/infrastructure/fm/config/fmd.conf
> >> +++ b/osaf/services/infrastructure/fm/config/fmd.conf
> >> @@ -17,7 +17,14 @@ export FM_CONTROLLER2_SUBSLOT=15
> >>   export FMS_HA_ENV_HEALTHCHECK_KEY="Default"
> >>
> >>   # Promote active timer
> >> -export FMS_PROMOTE_ACTIVE_TIMER=0
> >> +export FMS_PROMOTE_ACTIVE_TIMER=500
> >> +
> >> +# Uncomment the next 5 lines and update accordingly to enable remote fencing
> >> +export FMS_USE_REMOTE_FENCING=1
> >> +export FMS_FENCE_CMD="stonith"
> >> +export FMS_DEVICE_TYPE="external/libvirt"
> >> +export FMS_HYPERVISOR_URI="qemu+tcp://192.168.122.1/system"
> >> +export FMS_FENCE_ACTION="reset"
> >>
> >>   # FM will supervise transitions to the ACTIVE role when this variable is set to
> >>   # a non-zero value. The value is the time in the unit of 10 ms to wait for a
> >> diff --git a/osaf/services/infrastructure/fm/fms/Makefile.am b/osaf/services/infrastructure/fm/fms/Makefile.am
> >> --- a/osaf/services/infrastructure/fm/fms/Makefile.am
> >> +++ b/osaf/services/infrastructure/fm/fms/Makefile.am
> >> @@ -46,4 +46,5 @@ osaffmd_SOURCES = \
> >>   osaffmd_LDADD = \
> >>    $(top_builddir)/osaf/libs/core/libopensaf_core.la \
> >>    $(top_builddir)/osaf/libs/saf/libSaAmf/libSaAmf.la \
> >> -  $(top_builddir)/osaf/libs/agents/infrastructure/rda/librda.la
> >> +  $(top_builddir)/osaf/libs/agents/infrastructure/rda/librda.la \
> >> +  $(top_builddir)/osaf/libs/saf/libSaClm/libSaClm.la
> >> diff --git a/osaf/services/infrastructure/fm/fms/fm_cb.h
> >> b/osaf/services/infrastructure/fm/fms/fm_cb.h
> >> --- a/osaf/services/infrastructure/fm/fms/fm_cb.h
> >> +++ b/osaf/services/infrastructure/fm/fms/fm_cb.h
> >> @@ -26,6 +26,7 @@
> >>   #include "mds_papi.h"
> >>   #include "rda_papi.h"
> >>   #include "fm_amf.h"
> >> +#include "saClm.h"
> >>
> >>   #include <stdbool.h>
> >>   #include <stdint.h>
> >> @@ -102,6 +103,9 @@ typedef struct fm_cb {
> >>    uint64_t cluster_size;
> >>    struct timespec last_well_connected;
> >>    struct timespec node_isolation_timeout;
> >> +  SaClmHandleT clm_hdl;
> >> +  bool use_remote_fencing;
> >> +  SaNameT peer_clm_node_name;
> >>   } FM_CB;
> >>
> >>   extern char *role_string[];
> >> diff --git a/osaf/services/infrastructure/fm/fms/fm_main.c
> >> b/osaf/services/infrastructure/fm/fms/fm_main.c
> >> --- a/osaf/services/infrastructure/fm/fms/fm_main.c
> >> +++ b/osaf/services/infrastructure/fm/fms/fm_main.c
> >> @@ -32,6 +32,13 @@ This file contains the main() routine fo
> >>   #include "fm.h"
> >>   #include "osaf_time.h"
> >>
> >> +#define FM_CLM_API_TIMEOUT 10000000000LL
> >> +
> >> +static    SaVersionT clm_version = { 'B', 4, 1 };
> >> +static const SaClmCallbacksT_4 clm_callbacks = {
> >> +  0, 0
> >> +};
> >> +
> >>   enum {
> >>    FD_TERM = 0,
> >>    FD_AMF = 1,
> >> @@ -54,6 +61,8 @@ static uint32_t fm_get_args(FM_CB *);
> >>   static uint32_t fms_fms_exchange_node_info(FM_CB *);
> >>   static uint32_t fm_nid_notify(uint32_t);
> >>   static uint32_t fm_tmr_start(FM_TMR *, SaTimeT);
> >> +static SaAisErrorT get_peer_clm_node_name(NODE_ID);
> >> +static SaAisErrorT fm_clm_init();
> >>   static void fm_mbx_msg_handler(FM_CB *, FM_EVT *);
> >>   static void fm_evt_proc_rda_callback(FM_CB*, FM_EVT*);
> >>   static void fm_tmr_exp(void *);
> >> @@ -313,6 +322,8 @@ uint32_t initialize_for_assignment(FM_CB
> >>            LOG_ER("immd_mds_register FAILED %d", rc);
> >>            goto done;
> >>    }
> >> +
> >> +  cb->clm_hdl = 0;
> >>    cb->fully_initialized = true;
> >>   done:
> >>    TRACE_LEAVE2("rc = %u", rc);
> >> @@ -383,8 +394,17 @@ static uint32_t fm_agents_startup(void)
> >>   *****************************************************************************/
> >>   static uint32_t fm_get_args(FM_CB *fm_cb)
> >>   {
> >> +  char *use_remote_fencing = NULL;
> >>    char *value;
> >>    TRACE_ENTER();
> >> +
> >> +  fm_cb->use_remote_fencing = false;
> >> +  use_remote_fencing = getenv("FMS_USE_REMOTE_FENCING");
> >> +  if (use_remote_fencing != NULL) {
> >> +          fm_cb->use_remote_fencing = true;
> >> +          LOG_NO("Remote fencing is enabled");
> >> +  }
> >> +
> >>    value = getenv("EE_ID");
> >>    if (value != NULL) {
> >>            fm_cb->node_name.length = strlen(value);
> >> @@ -474,6 +494,81 @@ void fm_proc_svc_down(FM_CB *cb, FM_EVT
> >>   }
> >>
> >>   /****************************************************************************
> >> +* Name          : get_peer_clm_node_name
> >> +*
> >> +* Description   : Get the CLM node name of the peer controller.
> >> +*
> >> +* Arguments     : node_id - node id of the peer.
> >> +*
> >> +* Return Values : SA_AIS_OK on success, otherwise an SaAisErrorT error code.
> >> +*
> >> +* Notes         : None.
> >> +*****************************************************************************/
> >> +static SaAisErrorT get_peer_clm_node_name(NODE_ID node_id)
> >> +{
> >> +  SaAisErrorT rc = SA_AIS_OK;
> >> +  char *node;
> >> +  SaClmClusterNodeT_4 cluster_node;
> >> +
> >> +  if ((rc = fm_clm_init()) != SA_AIS_OK) {
> >> +          LOG_ER("clm init FAILED %d", rc);
> >> +          /* Without a valid CLM handle the node lookup below cannot succeed */
> >> +          return rc;
> >> +  }
> >> +  LOG_NO("clm init OK");
> >> +
> >> +  if ((rc = saClmClusterNodeGet_4(fm_cb->clm_hdl, node_id, FM_CLM_API_TIMEOUT, &cluster_node)) == SA_AIS_OK) {
> >> +          // Extract peer clm node name, e.g SC-2 from "safNode=SC-2,safCluster=myClmCluster"
> >> +          // The peer clm node name will be passed to opensaf_reboot script to support remote fencing.
> >> +          // The peer clm node name should correspond to the name of the virtual machine for that node.
> >> +
> >> +          node = strtok((char*) cluster_node.nodeName.value, "=");
> >> +          node = strtok(NULL, ",");
> >> +          if (node == NULL) {
> >> +                  /* DN did not have the expected "safNode=<name>,..." form */
> >> +                  LOG_WA("Unexpected clm node name format");
> >> +                  return SA_AIS_ERR_INVALID_PARAM;
> >> +          }
> >> +          strncpy((char*) fm_cb->peer_clm_node_name.value, node, cluster_node.nodeName.length);
> >> +          LOG_NO("Peer clm node name: %s", fm_cb->peer_clm_node_name.value);
> >> +  } else {
> >> +          LOG_WA("saClmClusterNodeGet_4 returned %u", (unsigned) rc);
> >> +  }
> >> +  return rc;
> >> +}
> >> +
> >> +/****************************************************************************
> >> +* Name          : fm_clm_init
> >> +*
> >> +* Description   : Initialize CLM.
> >> +*
> >> +* Arguments     : None.
> >> +*
> >> +* Return Values : SA_AIS_OK on success, otherwise an SaAisErrorT error code.
> >> +*
> >> +* Notes         : None.
> >> +*****************************************************************************/
> >> +static SaAisErrorT fm_clm_init()
> >> +{
> >> +  SaAisErrorT rc = SA_AIS_OK;
> >> +
> >> +  for (;;) {
> >> +          rc = saClmInitialize_4(&fm_cb->clm_hdl, &clm_callbacks, &clm_version);
> >> +          if (rc == SA_AIS_ERR_TRY_AGAIN ||
> >> +                  rc == SA_AIS_ERR_TIMEOUT ||
> >> +                  rc == SA_AIS_ERR_UNAVAILABLE) {
> >> +                  /* Log only the unexpected retry causes, and only once */
> >> +                  if (rc != SA_AIS_ERR_TRY_AGAIN) {
> >> +                          LOG_WA("saClmInitialize_4 returned %u",
> >> +                                  (unsigned) rc);
> >> +                  }
> >> +                  osaf_nanosleep(&kHundredMilliseconds);
> >> +                  continue;
> >> +          }
> >> +          if (rc == SA_AIS_OK) break;
> >> +          LOG_ER("Failed to Initialize with CLM: %u", rc);
> >> +          goto done;
> >> +  }
> >> +done:
> >> +  return rc;
> >> +}
> >> +
> >> +/****************************************************************************
> >>   * Name          : fm_mbx_msg_handler
> >>   *
> >>   * Description   : Processes Mail box messages between FM.
> >> @@ -517,8 +612,13 @@ static void fm_mbx_msg_handler(FM_CB *fm
> >>                                     * but just that failover has been trigerred quicker than the
> >>                                     * node_down event has been received.
> >>                                     */
> >> -                          opensaf_reboot(fm_cb->peer_node_id, (char *)fm_cb->peer_node_name.value,
> >> -                                          "Received Node Down for peer controller");
> >> +                          if (fm_cb->use_remote_fencing) {
> >> +                                  opensaf_reboot(fm_cb->peer_node_id, (char *)fm_cb->peer_clm_node_name.value,
> >> +                                                  "Received Node Down for peer controller");
> >> +                          } else {
> >> +                                  opensaf_reboot(fm_cb->peer_node_id, (char *)fm_cb->peer_node_name.value,
> >> +                                                  "Received Node Down for peer controller");
> >> +                          }
> >>                            if (!((fm_cb->role == PCS_RDA_ACTIVE) && (fm_cb->amf_state == (SaAmfHAStateT)PCS_RDA_ACTIVE))) {
> >>                                    fm_cb->role = PCS_RDA_ACTIVE;
> >>                                    LOG_NO("Controller Failover: Setting role to ACTIVE");
> >> @@ -534,6 +634,8 @@ static void fm_mbx_msg_handler(FM_CB *fm
> >>   /* Peer fm came up so sending ee_id of this node */
> >>            if (fm_cb->node_name.length != 0)
> >>                    fms_fms_exchange_node_info(fm_cb);
> >> +
> >> +          get_peer_clm_node_name(fm_mbx_evt->node_id);
> >>            break;
> >>    case FM_EVT_TMR_EXP:
> >>   /* Timer Expiry event posted */
> >> @@ -547,8 +649,16 @@ static void fm_mbx_msg_handler(FM_CB *fm
> >>                    fm_cb->role = PCS_RDA_ACTIVE;
> >>
> >>                    LOG_NO("Reseting peer controller node id: %x", fm_cb->peer_node_id);
> >> -                  opensaf_reboot(fm_cb->peer_node_id, (char *)fm_cb->peer_node_name.value,
> >> -                                 "Received Node Down for Active peer");
> >> +                  if (fm_cb->use_remote_fencing) {
> >> +                          LOG_NO("saClmClusterNodeGet succeeded node_id 0x%X, clm peer node name %s",
> >> +                                  fm_mbx_evt->node_id, fm_cb->peer_clm_node_name.value);
> >> +
> >> +                          opensaf_reboot(fm_cb->peer_node_id, (char *)fm_cb->peer_clm_node_name.value,
> >> +                                          "Received Node Down for peer controller");
> >> +                  } else {
> >> +                          opensaf_reboot(fm_cb->peer_node_id, (char *)fm_cb->peer_node_name.value,
> >> +                                         "Received Node Down for Active peer");
> >> +                  }
> >>                    fm_rda_set_role(fm_cb, PCS_RDA_ACTIVE);
> >>            } else if (fm_mbx_evt->info.fm_tmr->type == FM_TMR_ACTIVATION_SUPERVISION) {
> >>                    opensaf_reboot(0, NULL, "Activation timer supervision "
> >> diff --git a/scripts/opensaf_reboot b/scripts/opensaf_reboot
> >> --- a/scripts/opensaf_reboot
> >> +++ b/scripts/opensaf_reboot
> >> @@ -26,13 +26,31 @@
> >>   # through proprietary mechanisms, i.e. not through PLM. Node_id is (the only
> >>   # entity) at the disposal of such a mechanism.
> >>
> >> +if [ -f "$pkgsysconfdir/fmd.conf" ]; then
> >> +  . "$pkgsysconfdir/fmd.conf"
> >> +fi
> >> +
> >>   NODE_ID_FILE=$pkglocalstatedir/node_id
> >> +
> >>   node_id=$1
> >>   ee_name=$2
> >>
> >>   # Run commands through sudo when not superuser
> >>   test $(id -u) -ne 0 && icmd=$(which sudo 2> /dev/null)
> >>
> >> +## Use stonith for remote fencing
> >> +opensaf_reboot_with_remote_fencing()
> >> +{
> >> +  "$FMS_FENCE_CMD" -t "$FMS_DEVICE_TYPE" hostlist="node:$ee_name" hypervisor_uri="$FMS_HYPERVISOR_URI" -T "$FMS_FENCE_ACTION" node
> >> +
> >> +  retval=$?
> >> +  if [ $retval != 0 ]; then
> >> +          logger -t "opensaf_reboot" "Rebooting remote node $ee_name using $FMS_FENCE_CMD failed, rc: $retval"
> >> +          exit 1
> >> +  fi
> >> +}
> >> +
> >> +
> >>   #if plm exists in the system,then the reboot is performed using the eename.
> >>   opensaf_reboot_with_plm()
> >>   {
> >> @@ -86,17 +104,22 @@ if [ "$self_node_id" = "$node_id" ] || [
> >>    # Reboot (not shutdown) system WITH file system sync
> >>    $icmd /sbin/reboot -f
> >>   else
> >> -  if [ ":$ee_name" != ":" ]; then
> >> -          plm_node_presence_state=`immlist $ee_name |grep saPlmEEPresenceState|awk '{print $3}'`
> >> -          plm_node_state=`immlist $ee_name |grep saPlmEEAdminState|awk '{print $3}'`
> >> -          if [ "$plm_node_presence_state" != 3 ] ; then
> >> -                  logger -t "opensaf_reboot" "Not rebooting remote node $ee_name as it is not in INSTANTIATED state"
> >> -          elif [ $plm_node_state != 2 ]; then
> >> -                  opensaf_reboot_with_plm
> >> -          else
> >> -                  logger -t "opensaf_reboot" "Not rebooting remote node $ee_name as it is already in locked state"
> >> +  if [ "$FMS_USE_REMOTE_FENCING" = "1" ]; then
> >> +          opensaf_reboot_with_remote_fencing
> >> +  else
> >> +          if [ ":$ee_name" != ":" ]; then
> >> +
> >> +                  plm_node_presence_state=`immlist $ee_name |grep saPlmEEPresenceState|awk '{print $3}'`
> >> +                  plm_node_state=`immlist $ee_name |grep saPlmEEAdminState|awk '{print $3}'`
> >> +                  if [ "$plm_node_presence_state" != 3 ] ; then
> >> +                          logger -t "opensaf_reboot" "Not rebooting remote node $ee_name as it is not in INSTANTIATED state"
> >> +                  elif [ $plm_node_state != 2 ]; then
> >> +                          opensaf_reboot_with_plm
> >> +                  else
> >> +                          logger -t "opensaf_reboot" "Not rebooting remote node $ee_name as it is already in locked state"
> >> +                  fi
> >> +          else
> >> +                  logger -t "opensaf_reboot" "Rebooting remote node in the absence of PLM is outside the scope of OpenSAF"
> >>            fi
> >> -  else
> >> -          logger -t "opensaf_reboot" "Rebooting remote node in the absence of PLM is outside the scope of OpenSAF"
> >> -  fi
> >> +  fi
> >>   fi
> >>   fi
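
To make the fencing path in the patch easier to follow, here is a minimal dry-run sketch of its two steps: deriving the virtual machine name from the CLM node DN, and assembling the stonith command line from the FMS_* variables. It is an illustration only. The function names clm_node_name and fence_cmd are hypothetical and not part of the patch, the DN parsing assumes the "safNode=<name>,..." form used in the README text, and the command is printed rather than executed.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical; the FMS_* variables are the
# ones introduced by the patch in fmd.conf.

# Extract the node name from a CLM DN, e.g.
# "safNode=SC-2,safCluster=myClmCluster" -> "SC-2".
clm_node_name() {
    rdn=${1%%,*}                     # first RDN, e.g. "safNode=SC-2"
    case "$rdn" in
        safNode=?*) printf '%s\n' "${rdn#safNode=}" ;;
        *) return 1 ;;               # not a CLM node DN
    esac
}

# Print the fencing command line instead of executing it (dry run).
# $1 = name of the virtual machine to fence
fence_cmd() {
    printf '%s -t %s hostlist=node:%s hypervisor_uri=%s -T %s node\n' \
        "$FMS_FENCE_CMD" "$FMS_DEVICE_TYPE" "$1" \
        "$FMS_HYPERVISOR_URI" "$FMS_FENCE_ACTION"
}
```

With the values from the fmd.conf example, "fence_cmd SC-2" prints the same command that opensaf_reboot_with_remote_fencing would run for SC-2. Printing the command first can be a convenient way to check the configuration by hand before enabling remote fencing.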
> >>
> >> ---------------------------------------------------------------------
> >> --------- Attend Shape: An AT&T Tech Expo July 15-16. Meet us at AT&T
> >> Park in San Francisco, CA to explore cutting-edge tech and listen to
> >> tech luminaries present their vision of the future. This family event
> >> has something for everyone, including kids. Get more information and
> >> register today.
> >> http://sdm.link/attshape
> >> _______________________________________________
> >> Opensaf-devel mailing list
> >> Opensaf-devel@lists.sourceforge.net
> >> https://lists.sourceforge.net/lists/listinfo/opensaf-devel
> 
> 
