Re: [devel] [PATCH 2/5] rded: add split brain prevention support [#64]

2018-01-23 Thread Gary Lee

Hi Anders

Will change according to your comments, one comment below:

On 24/01/18 01:53, Anders Widell wrote:

Ack for this patch with comments, marked AndersW>

regards,

Anders Widell



+   case RDE_MSG_NEW_ACTIVE_CALLBACK:
+  {
+    const std::string my_node = base::Conf::NodeName();
+    rde_cb->monitor_lock_thread_running = false;
+
+    // get current active controller
+    Consensus consensus_service;
AndersW> Shouldn't the Consensus instance be created once, instead of 
creating a new instance each time you receive this callback? The 
Consensus constructor even logs to syslog (at INFO level).


[Gary] I will remove the syslog calls in the constructor, but I'd like 
to keep it as a local variable and only instantiate when needed. It's 
fairly light weight and only constructed when there is a controller 
failover / switchover. Is that OK?
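
For illustration, the two options under discussion can be sketched as below. ConsensusStub is a hypothetical stand-in that only counts constructions; it is not the real Consensus class from service.h, and the returned node name is made up.

```cpp
#include <string>

// Hypothetical stand-in for the Consensus class: counts how often it is built.
class ConsensusStub {
 public:
  ConsensusStub() { ++constructions; }
  std::string CurrentActive() const { return "SC-1"; }  // made-up node name
  static int constructions;
};
int ConsensusStub::constructions = 0;

// Option A: a local instance, constructed each time the callback is handled.
std::string HandleCallbackLocal() {
  ConsensusStub consensus_service;
  return consensus_service.CurrentActive();
}

// Option B: a single instance, constructed on first use and then reused.
std::string HandleCallbackShared() {
  static ConsensusStub consensus_service;
  return consensus_service.CurrentActive();
}
```

With option A the constructor cost (and any logging in it) is paid on every callback; with option B it is paid once, but the object then lives for the rest of the process.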


Thanks
Gary


--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
Opensaf-devel mailing list
Opensaf-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-devel


Re: [devel] [PATCH 1/5] osaf: add consensus API [#64]

2018-01-23 Thread Gary Lee

Hi Anders

Thanks for the feedback. Will change before pushing.

Gary


On 24/01/18 01:22, Anders Widell wrote:
Ack for this patch with comments below, marked AndersW>. An additional 
comment is that Makefiles should be updated in the same patch that 
adds the new files, e.g. under osaf/consensus. Currently, the files 
are added in this patch but Makefile.am is updated in patch number 5, 
so the Makefile.am updates should be moved to this patch.


regards,

Anders Widell


On 01/23/2018 09:06 AM, Gary Lee wrote:

---
  src/osaf/consensus/Makefile              |  18 +++
  src/osaf/consensus/keyvalue.cc           | 221 ++
  src/osaf/consensus/keyvalue.h            |  66
  src/osaf/consensus/plugins/etcd.plugin   | 253 ++
  src/osaf/consensus/plugins/sample.plugin | 171
  src/osaf/consensus/service.cc            | 258 +++
  src/osaf/consensus/service.h             |  71 +
  7 files changed, 1058 insertions(+)
  create mode 100644 src/osaf/consensus/Makefile
  create mode 100644 src/osaf/consensus/keyvalue.cc
  create mode 100644 src/osaf/consensus/keyvalue.h
  create mode 100644 src/osaf/consensus/plugins/etcd.plugin
  create mode 100644 src/osaf/consensus/plugins/sample.plugin
  create mode 100644 src/osaf/consensus/service.cc
  create mode 100644 src/osaf/consensus/service.h

diff --git a/src/osaf/consensus/Makefile b/src/osaf/consensus/Makefile
new file mode 100644
index 0..a2c8bc9dd
--- /dev/null
+++ b/src/osaf/consensus/Makefile
@@ -0,0 +1,18 @@
+#  -*- OpenSAF  -*-
+#
+# (C) Copyright 2018 The OpenSAF Foundation

AndersW> Should be Copyright Ericsson AB 2018 - All Rights Reserved.

+#
+# This program is distributed in the hope that it will be useful, but
+# WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
+# or FITNESS FOR A PARTICULAR PURPOSE. This file and program are licensed
+# under the GNU Lesser General Public License Version 2.1, February 1999.
+# The complete license can be accessed from the following location:
+# http://opensource.org/licenses/lgpl-license.php
+# See the Copying file included with the OpenSAF distribution for full
+# licensing terms.
+#
+# Author(s): Ericsson AB

AndersW> Remove the line above.

+#
+
+all:
+    $(MAKE) -C ../.. lib/libconsensus.la
diff --git a/src/osaf/consensus/keyvalue.cc b/src/osaf/consensus/keyvalue.cc
new file mode 100644
index 0..eea518585
--- /dev/null
+++ b/src/osaf/consensus/keyvalue.cc
@@ -0,0 +1,221 @@
+/*  -*- OpenSAF  -*-
+ *
+ * (C) Copyright 2018 The OpenSAF Foundation

AndersW> Should be Copyright Ericsson AB 2018 - All Rights Reserved.


+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
+ * or FITNESS FOR A PARTICULAR PURPOSE. This file and program are licensed
+ * under the GNU Lesser General Public License Version 2.1, February 1999.
+ * The complete license can be accessed from the following location:
+ * http://opensource.org/licenses/lgpl-license.php
+ * See the Copying file included with the OpenSAF distribution for full
+ * licensing terms.
+ *
+ * Author(s): Ericsson AB

AndersW> Remove the line above.

+ *
+ */
+#include "osaf/consensus/keyvalue.h"
+#include 
+#include "base/logtrace.h"
+#include "base/getenv.h"
+#include "base/conf.h"

AndersW> The three last lines should be sorted alphabetically.

+
+int KeyValue::Execute(const std::string& command, std::string& output) {
+  TRACE_ENTER();
+  constexpr size_t buf_size = 128;
+  std::array<char, buf_size> buffer;
+  FILE* pipe = popen(command.c_str(), "r");
+  if (pipe == nullptr) {
+    return 1;
+  }
+  output = "";

AndersW> Maybe output.clear() is slightly better?

+  while (feof(pipe) == 0) {
+    if (fgets(buffer.data(), buf_size, pipe) != nullptr) {
+  output += buffer.data();
+    }
+  }
+  int exit_code = pclose(pipe);
+  exit_code = WEXITSTATUS(exit_code);
+  if (output.empty() == false && isspace(output.back()) != 0) {
+    // remove newline at end of output
+    output.pop_back();
+  }
+  TRACE("Executed '%s', returning %d", command.c_str(), exit_code);
+  return exit_code;
+}
+
+SaAisErrorT KeyValue::Get(const std::string& key, std::string& value) {
+  TRACE_ENTER();
+
+  const std::string kv_store_cmd = base::GetEnv(
+    "FMS_KEYVALUE_STORE_PLUGIN_CMD", "");
+  const std::string command(kv_store_cmd + " get " + key);
+  int rc = KeyValue::Execute(command, value);
+  TRACE("Read '%s'", value.c_str());
+
+  if (rc == 0) {
+    return SA_AIS_OK;
+  } else {
+    return SA_AIS_ERR_FAILED_OPERATION;
+  }
+}
+
+SaAisErrorT KeyValue::Set(const std::string& key, const std::string& value) {
+  TRACE_ENTER();
+
+  const std::string kv_store_cmd = base::GetEnv(
+    "FMS_KEYVALUE_STORE_PLUGIN_CMD", "");
+  const std::string command(kv_store_cmd + " set " + key + " " + value);
+  

Re: [devel] [PATCH 0/3] Review Request for ntf: Checkpoint and cold sync reader information [#2757]

2018-01-23 Thread minh . chau
Hi Lennart,

I tested the APIs between versions with/without the changes. I will send
out for review the README and PR change after the code review is done. One
limitation is that both active and standby require the patches to work.

Thanks,
Minh

> Hi Minh
>
> Ack. I have not tested much
>
> Have you tested using the reader API while running old version on standby
> and new version on active and vice versa (upgrade case)? Limitations?
> PR documentation update?
>
> Thanks
> Lennart
>
>> -Original Message-
>> From: Minh Hon Chau
>> Sent: den 22 januari 2018 05:19
>> To: Lennart Lund ;
>> srinivas.mangip...@oracle.com; Canh Van Truong
>> 
>> Cc: opensaf-devel@lists.sourceforge.net; Minh Hon Chau
>> 
>> Subject: [PATCH 0/3] Review Request for ntf: Checkpoint and cold sync
>> reader information [#2757]
>>
>> Summary: ntfd: Checkpoint reader to the standby when processes reader
>> API requests [#2757]
>> Review request for Ticket(s): 2757
>> Peer Reviewer(s): Lennart, Srinivas, Canh
>> Pull request to: *** LIST THE PERSON WITH PUSH ACCESS HERE ***
>> Affected branch(es): develop
>> Development branch: ticket-2757
>> Base revision: ee105cb3bf44eee4e8785e3de7d24f907641e2ab
>> Personal repository: git://git.code.sf.net/u/minh-chau/review
>>
>> --------------------------------
>> Impacted area       Impact y/n
>> --------------------------------
>>  Docs                    n
>>  Build system            n
>>  RPM/packaging           n
>>  Configuration files     n
>>  Startup scripts         n
>>  SAF services            y
>>  OpenSAF services        n
>>  Core libraries          n
>>  Samples                 n
>>  Tests                   n
>>  Other                   n
>>
>> NOTE: Patch(es) contain lines longer than 80 characters
>>
>> Comments (indicate scope for each "y" above):
>> -
>> *** EXPLAIN/COMMENT THE PATCH SERIES HERE ***
>>
>> revision 74da3370accfa44a34a7abf9830ceaeae3ab5d4f
>> Author:Minh Chau 
>> Date:Mon, 22 Jan 2018 15:08:59 +1100
>>
>> ntftest: Add new test cases of suite 41 for cold sync and checkpoint of
>> reader APIs [#2757]
>>
>>
>>
>> revision ad38745b1c411bc52905725281c84c69e4147fef
>> Author:Minh Chau 
>> Date:Mon, 22 Jan 2018 15:03:42 +1100
>>
>> ntfd: Cold sync reader to the standby ntfd after rebooting the standby
>> controller [#2757]
>>
>> Assume that the reader information is updated to the standby ntfd via
>> checkpoint upon reception of reader API requests. However, if the standby
>> controller reboots and comes up again, the standby ntfd still has none of
>> the reader information that is available at the active ntfd. If a
>> switchover happens now, the new active will not be able to process
>> reader API requests that use existing reader handles.
>>
>> This patch adds reader information as part of cold sync
>>
>>
>>
>> revision 47cf18850e6819c2db4642eb1e639aff5f0d8282
>> Author:Minh Chau 
>> Date:Mon, 22 Jan 2018 14:12:00 +1100
>>
>> ntfd: Checkpoint reader to the standby when processes reader API
>> requests [#2757]
>>
>> When the active ntfd receives reader API requests (ReaderInitialize,
>> ReadNext, ReadFinalize), it does not update the reader information to
>> the standby. Thus, if either a switchover or a failover happens, the
>> client cannot continue to use the reader APIs, because the reader
>> information is not available in the new active.
>>
>> The patch checkpoints reader information to the standby when the active
>> ntfd completes processing a reader API request.
>>
>>
>>
>> Complete diffstat:
>> --
>>  src/ntf/agent/ntfa_mds.c       |  51 +--
>>  src/ntf/apitest/tet_coldsync.c | 690 -
>>  src/ntf/common/ntfsv_enc_dec.c |  88 +-
>>  src/ntf/common/ntfsv_enc_dec.h |  12 +-
>>  src/ntf/ntfd/NtfAdmin.cc       | 145 +++--
>>  src/ntf/ntfd/NtfAdmin.h        |  17 +-
>>  src/ntf/ntfd/NtfClient.cc      |  68 +++-
>>  src/ntf/ntfd/NtfClient.h       |  11 +-
>>  src/ntf/ntfd/NtfLogger.cc      |   2 +-
>>  src/ntf/ntfd/NtfReader.cc      |  84 +++--
>>  src/ntf/ntfd/NtfReader.h       |  13 +-
>>  src/ntf/ntfd/ntfs_com.c        | 105 +++
>>  src/ntf/ntfd/ntfs_com.h        |  25 +-
>>  src/ntf/ntfd/ntfs_evt.c        |  14 +-
>>  src/ntf/ntfd/ntfs_mbcsv.c      | 287 ++---
>>  src/ntf/ntfd/ntfs_mbcsv.h      |  16 +
>>  src/ntf/ntfd/ntfs_mds.c        |  42 +--
>>  17 files changed, 1430 insertions(+), 240 deletions(-)
>>
>>
>> Testing Commands:
>> -
>> Run all test cases of suite 41, and legacy suites
>>
>>
>> Testing, Expected Results:
>> --
>> All pass
>>
>>
>> Conditions of Submission:
>> -
>> ack from reviewers
>>
>>
>> Arch  Built StartedLinux distro
>> 

Re: [devel] [PATCH 2/5] rded: add split brain prevention support [#64]

2018-01-23 Thread Anders Widell

Ack for this patch with comments, marked AndersW>

regards,

Anders Widell


On 01/23/2018 09:06 AM, Gary Lee wrote:

* consult with consensus service before promoting node to active
* add watch thread and self-fence if it detects active controller
   has been changed (if remote fencing is disabled)
---
  src/rde/Makefile.am       |  3 ++-
  src/rde/rded/osaf-rded.in |  4
  src/rde/rded/rde_cb.h     |  4 +++-
  src/rde/rded/rde_main.cc  | 38 +-
  src/rde/rded/role.cc      | 45 -
  src/rde/rded/role.h       |  3 +++
  6 files changed, 89 insertions(+), 8 deletions(-)

diff --git a/src/rde/Makefile.am b/src/rde/Makefile.am
index c967f9fc4..182f347ab 100644
--- a/src/rde/Makefile.am
+++ b/src/rde/Makefile.am
@@ -58,7 +58,8 @@ bin_osafrded_SOURCES = \
  
  bin_osafrded_LDADD = \

lib/libSaAmf.la \
-   lib/libopensaf_core.la
+   lib/libopensaf_core.la \
+   lib/libosaf_common.la
  
  bin_rdegetrole_CPPFLAGS = \

$(AM_CPPFLAGS)
diff --git a/src/rde/rded/osaf-rded.in b/src/rde/rded/osaf-rded.in
index 1c1786c8d..1697936a7 100644
--- a/src/rde/rded/osaf-rded.in
+++ b/src/rde/rded/osaf-rded.in
@@ -28,6 +28,10 @@ else
. $pkgsysconfdir/rde.conf
  fi
  
+if [ -f "$pkgsysconfdir/fmd.conf" ]; then

+  . "$pkgsysconfdir/fmd.conf"
+fi
+
  binary=$pkglibdir/$osafprog
  pidfile=$pkgpiddir/$osafprog.pid
  tracefile=$pkglogdir/$osafprog.log
diff --git a/src/rde/rded/rde_cb.h b/src/rde/rded/rde_cb.h
index d2a3d46b2..fc100849a 100644
--- a/src/rde/rded/rde_cb.h
+++ b/src/rde/rded/rde_cb.h
@@ -39,13 +39,15 @@ struct RDE_CONTROL_BLOCK {
bool task_terminate;
RDE_RDA_CB rde_rda_cb;
RDE_AMF_CB rde_amf_cb;
+  bool monitor_lock_thread_running;
  };
  
  enum RDE_MSG_TYPE {

RDE_MSG_PEER_UP = 1,
RDE_MSG_PEER_DOWN = 2,
RDE_MSG_PEER_INFO_REQ = 3,
-  RDE_MSG_PEER_INFO_RESP = 4
+  RDE_MSG_PEER_INFO_RESP = 4,
+  RDE_MSG_NEW_ACTIVE_CALLBACK = 5
  };
  
  struct rde_peer_info {

diff --git a/src/rde/rded/rde_main.cc b/src/rde/rded/rde_main.cc
index 0298bf3ff..082c1c040 100644
--- a/src/rde/rded/rde_main.cc
+++ b/src/rde/rded/rde_main.cc
@@ -28,6 +28,7 @@
  #include 
  #include 
  #include 
+#include "osaf/consensus/service.h"
  #include "base/daemon.h"
  #include "base/logtrace.h"
  #include "base/osaf_poll.h"
@@ -37,6 +38,7 @@
  #include 
  #include "rde/rded/rde_cb.h"
  #include "rde/rded/role.h"
+#include "base/conf.h"

AndersW> Sort project include files alphabetically.
  
  #define RDA_MAX_CLIENTS 32
  
@@ -92,10 +94,6 @@ static void handle_mbx_event() {

TRACE_ENTER();
  
msg = reinterpret_cast(ncs_ipc_non_blk_recv(_cb->mbx));

-  TRACE("Received %s from node 0x%x with state %s. My state is %s",
-rde_msg_name[msg->type], msg->fr_node_id,
-Role::to_string(msg->info.peer_info.ha_role),
-Role::to_string(role->role()));
  
switch (msg->type) {

  case RDE_MSG_PEER_INFO_REQ:
@@ -118,6 +116,34 @@ static void handle_mbx_event() {
  case RDE_MSG_PEER_DOWN:
LOG_NO("Peer down on node 0x%x", msg->fr_node_id);
break;
+   case RDE_MSG_NEW_ACTIVE_CALLBACK:
+  {
+const std::string my_node = base::Conf::NodeName();
+rde_cb->monitor_lock_thread_running = false;
+
+// get current active controller
+Consensus consensus_service;
AndersW> Shouldn't the Consensus instance be created once, instead of 
creating a new instance each time you receive this callback? The 
Consensus constructor even logs to syslog (at INFO level).

+std::string active_controller = consensus_service.CurrentActive();
+
+LOG_NO("New active controller notification from consensus service");
+
+if (role->role() == PCS_RDA_ACTIVE) {
+  if (my_node.compare(active_controller) != 0) {
+// we are meant to be active, but consensus service doesn't think so
+LOG_WA("Role does not match consensus service. New controller: %s",
+  active_controller.c_str());
+if (consensus_service.IsRemoteFencingEnabled() == false ) {
+  LOG_ER("Probable split-brain. Rebooting this node");
+  opensaf_reboot(0, nullptr, "Split-brain detected by consensus service");
+}
+  }
+
+  // register for callback
+  rde_cb->monitor_lock_thread_running = true;
+  consensus_service.MonitorLock(Role::MonitorCallback, rde_cb->mbx);
+}
+  }
+  break;
  default:
LOG_ER("%s: discarding unknown message type %u", __FUNCTION__, msg->type);
break;
@@ -192,6 +218,7 @@ static int initialize_rde() {
  goto init_failed;
}
  
+  rde_cb->monitor_lock_thread_running = false;

rc = NCSCC_RC_SUCCESS;
  
  init_failed:

@@ -205,11 +232,12 @@ int main(int argc, char *argv[]) {
NCS_SEL_OBJ mbx_sel_obj;
RDE_RDA_CB *rde_rda_cb = &rde_cb->rde_rda_cb;
int term_fd;
-
opensaf_reboot_prepare();
  

Re: [devel] [PATCH 1/5] osaf: add consensus API [#64]

2018-01-23 Thread Anders Widell
Ack for this patch with comments below, marked AndersW>. An additional 
comment is that Makefiles should be updated in the same patch that adds 
the new files, e.g. under osaf/consensus. Currently, the files are added 
in this patch but Makefile.am is updated in patch number 5, so the 
Makefile.am updates should be moved to this patch.


regards,

Anders Widell


On 01/23/2018 09:06 AM, Gary Lee wrote:

---
  src/osaf/consensus/Makefile              |  18 +++
  src/osaf/consensus/keyvalue.cc           | 221 ++
  src/osaf/consensus/keyvalue.h            |  66
  src/osaf/consensus/plugins/etcd.plugin   | 253 ++
  src/osaf/consensus/plugins/sample.plugin | 171
  src/osaf/consensus/service.cc            | 258 +++
  src/osaf/consensus/service.h             |  71 +
  7 files changed, 1058 insertions(+)
  create mode 100644 src/osaf/consensus/Makefile
  create mode 100644 src/osaf/consensus/keyvalue.cc
  create mode 100644 src/osaf/consensus/keyvalue.h
  create mode 100644 src/osaf/consensus/plugins/etcd.plugin
  create mode 100644 src/osaf/consensus/plugins/sample.plugin
  create mode 100644 src/osaf/consensus/service.cc
  create mode 100644 src/osaf/consensus/service.h

diff --git a/src/osaf/consensus/Makefile b/src/osaf/consensus/Makefile
new file mode 100644
index 0..a2c8bc9dd
--- /dev/null
+++ b/src/osaf/consensus/Makefile
@@ -0,0 +1,18 @@
+#  -*- OpenSAF  -*-
+#
+# (C) Copyright 2018 The OpenSAF Foundation

AndersW> Should be Copyright Ericsson AB 2018 - All Rights Reserved.

+#
+# This program is distributed in the hope that it will be useful, but
+# WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
+# or FITNESS FOR A PARTICULAR PURPOSE. This file and program are licensed
+# under the GNU Lesser General Public License Version 2.1, February 1999.
+# The complete license can be accessed from the following location:
+# http://opensource.org/licenses/lgpl-license.php
+# See the Copying file included with the OpenSAF distribution for full
+# licensing terms.
+#
+# Author(s): Ericsson AB

AndersW> Remove the line above.

+#
+
+all:
+   $(MAKE) -C ../.. lib/libconsensus.la
diff --git a/src/osaf/consensus/keyvalue.cc b/src/osaf/consensus/keyvalue.cc
new file mode 100644
index 0..eea518585
--- /dev/null
+++ b/src/osaf/consensus/keyvalue.cc
@@ -0,0 +1,221 @@
+/*  -*- OpenSAF  -*-
+ *
+ * (C) Copyright 2018 The OpenSAF Foundation

AndersW> Should be Copyright Ericsson AB 2018 - All Rights Reserved.


+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
+ * or FITNESS FOR A PARTICULAR PURPOSE. This file and program are licensed
+ * under the GNU Lesser General Public License Version 2.1, February 1999.
+ * The complete license can be accessed from the following location:
+ * http://opensource.org/licenses/lgpl-license.php
+ * See the Copying file included with the OpenSAF distribution for full
+ * licensing terms.
+ *
+ * Author(s): Ericsson AB

AndersW> Remove the line above.

+ *
+ */
+#include "osaf/consensus/keyvalue.h"
+#include 
+#include "base/logtrace.h"
+#include "base/getenv.h"
+#include "base/conf.h"

AndersW> The three last lines should be sorted alphabetically.

+
+int KeyValue::Execute(const std::string& command, std::string& output) {
+  TRACE_ENTER();
+  constexpr size_t buf_size = 128;
+  std::array<char, buf_size> buffer;
+  FILE* pipe = popen(command.c_str(), "r");
+  if (pipe == nullptr) {
+return 1;
+  }
+  output = "";

AndersW> Maybe output.clear() is slightly better?
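
A minimal sketch of a popen-based helper that uses output.clear(), for illustration only. It is a variant, not the code from the patch: the name RunCommand is hypothetical, the loop tests the fgets() result directly instead of feof(), only a trailing newline is stripped, and the WIFEXITED check is an addition. It assumes a POSIX system.

```cpp
#include <array>
#include <cstddef>
#include <cstdio>
#include <string>
#include <sys/wait.h>

// Runs `command` via popen(), stores its standard output in `output`
// (without one trailing newline) and returns the command's exit status.
// Returns 1 if the command could not be started or did not exit normally.
int RunCommand(const std::string& command, std::string& output) {
  constexpr std::size_t kBufSize = 128;
  std::array<char, kBufSize> buffer;
  output.clear();
  FILE* pipe = popen(command.c_str(), "r");
  if (pipe == nullptr) {
    return 1;
  }
  // fgets() returns nullptr on end-of-file or error, which ends the loop.
  while (fgets(buffer.data(), static_cast<int>(buffer.size()), pipe) !=
         nullptr) {
    output += buffer.data();
  }
  const int status = pclose(pipe);
  if (!output.empty() && output.back() == '\n') {
    output.pop_back();
  }
  if (status == -1 || !WIFEXITED(status)) {
    return 1;
  }
  return WEXITSTATUS(status);
}
```

Calling RunCommand("echo hello", out) leaves "hello" in out and returns 0; a command that exits with status 3 makes the helper return 3.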

+  while (feof(pipe) == 0) {
+if (fgets(buffer.data(), buf_size, pipe) != nullptr) {
+  output += buffer.data();
+}
+  }
+  int exit_code = pclose(pipe);
+  exit_code = WEXITSTATUS(exit_code);
+  if (output.empty() == false && isspace(output.back()) != 0) {
+// remove newline at end of output
+output.pop_back();
+  }
+  TRACE("Executed '%s', returning %d", command.c_str(), exit_code);
+  return exit_code;
+}
+
+SaAisErrorT KeyValue::Get(const std::string& key, std::string& value) {
+  TRACE_ENTER();
+
+  const std::string kv_store_cmd = base::GetEnv(
+"FMS_KEYVALUE_STORE_PLUGIN_CMD", "");
+  const std::string command(kv_store_cmd + " get " + key);
+  int rc = KeyValue::Execute(command, value);
+  TRACE("Read '%s'", value.c_str());
+
+  if (rc == 0) {
+return SA_AIS_OK;
+  } else {
+return SA_AIS_ERR_FAILED_OPERATION;
+  }
+}
+
+SaAisErrorT KeyValue::Set(const std::string& key, const std::string& value) {
+  TRACE_ENTER();
+
+  const std::string kv_store_cmd = base::GetEnv(
+"FMS_KEYVALUE_STORE_PLUGIN_CMD", "");
+  const std::string command(kv_store_cmd + " set " + key + " " + value);
+  std::string output;
+  int rc = KeyValue::Execute(command, output);
+
+  if (rc == 0) {
+return SA_AIS_OK;
+  } else {
+return 

Re: [devel] [PATCH 5/5] doc: update README and makefiles [#64]

2018-01-23 Thread Gary Lee

Hi

One comment from Quyen about the readme. Revised paragraph below, please 
see [Gary].


On 23/01/18 19:06, Gary Lee wrote:

---
  00-README.conf   | 56 
  Makefile.am  |  4 +++-
  src/osaf/Makefile.am |  8 ++--
  3 files changed, 65 insertions(+), 3 deletions(-)

diff --git a/00-README.conf b/00-README.conf
index a8848e632..6c3cff1dd 100644
--- a/00-README.conf
+++ b/00-README.conf
@@ -662,3 +662,59 @@ on each node, except on the active node. This file indicates that a cluster
  reboot is in progress and all nodes needs to delay their start, this to give
  the active a lead.
  
+Split-Brain Prevention with Consensus Service
+=============================================
+
+OpenSAF implements split-brain prevention by utilizing a consensus service that
+implements a replicated state machine. The consensus service uses quorum to
+prevent state changes in network partitions that don't include more than half
+of the nodes in the cluster. In network partitions containing
+half of the nodes or less, the state is either read-only or unavailable.
+Thus, it is important to keep in mind that the consensus service by itself
+does not prevent the presence of multiple active system
+controller nodes. In the case when the network has been split up into partitions
+and the current active system controller no longer has write access to the
+state machine, OpenSAF relies on some additional mechanism like fencing to
+ensure that the current active system controller disappears before a new
+active system controller can be chosen among the nodes that do have write
+access to the replicated state machine. If fencing is not available, the old
+active system controller can detect that it has lost write
+access and step down from its active role.


[Gary] The paragraph above should be changed as we don't currently check 
write access or partition sizes when fencing.


In the case where the network has been split up into partitions,
OpenSAF relies on some additional mechanism like fencing to
ensure that only one active controller exists among the network partitions.
If fencing is not available, the old active system controller can detect
that it has lost write access and step down from its active role.
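
A minimal sketch of the fencing decision discussed in this thread, seen from a node that currently holds the active role. The enum, the function name and the node names are hypothetical; in the patches the check is done in rded/amfd and self-fencing is a reboot via opensaf_reboot().

```cpp
#include <string>

// What a node that believes it is active should do after asking the
// consensus service who the active controller is.
enum class FenceAction {
  kStayActive,        // consensus service agrees that this node is active
  kAwaitRemoteFence,  // mismatch; remote fencing is expected to handle it
  kSelfFence          // mismatch and no remote fencing: reboot this node
};

// Hypothetical helper, not part of the OpenSAF code base.
FenceAction DecideFenceAction(const std::string& my_node,
                              const std::string& consensus_active,
                              bool remote_fencing_enabled) {
  if (my_node == consensus_active) {
    return FenceAction::kStayActive;
  }
  return remote_fencing_enabled ? FenceAction::kAwaitRemoteFence
                                : FenceAction::kSelfFence;
}
```

The decision depends only on the node name stored in the consensus service and on the remote fencing setting; it does not look at write access or partition sizes.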


+
+The consensus service can be implemented, for example, using the RAFT algorithm.
+When using RAFT, there are mainly three possibilities:
+
+1. The RAFT servers run on the same nodes as OpenSAF
+2. The RAFT servers run on a subset of the OpenSAF nodes
+3. The RAFT servers run on an external set of nodes, outside of the
+   OpenSAF cluster
+
+The consensus service relies on a plugin to communicate with a distributed
+key-value store database. This plugin must still function according to the
+API when the network has split up into partitions.
+The plugin interface is defined in src/osaf/consensus/plugins/sample.plugin
+
+An implementation for etcdv2 is provided. It assumes etcd is installed
+and configured on all system controllers. In clusters where
+there are only two system controllers, it is highly recommended to
+configure etcd so it runs on at least three nodes to facilitate
+a majority vote with failure tolerance.
+
+Other implementations of a distributed key-value store service
+can be used, provided it implements the interface documented in sample.plugin
+
+To enable split-brain prevention, edit fmd.conf and update accordingly:
+
+export FMS_SPLIT_BRAIN_PREVENTION=1
+export FMS_KEYVALUE_STORE_PLUGIN_CMD=/usr/local/lib/opensaf/etcd.plugin
+
+As discussed, the key-value store does not need to reside on the same nodes
+as OpenSAF. In such a configuration, an appropriate plugin that handles
+the communication with a remotely located key-value store, must be provided.
+
+If remote fencing is enabled, then it will be used to fence a node that the
+consensus service believes should not be active. Otherwise, rded/amfd will
+initiate a 'self-fencing' by rebooting the node, if it determines the node
+should no longer be active according to the consensus service, to prevent
+a split-brain situation.
+
diff --git a/Makefile.am b/Makefile.am
index bcfd844cd..57c2585a8 100644
--- a/Makefile.am
+++ b/Makefile.am
@@ -159,7 +159,9 @@ dist_osaf_execbin_SCRIPTS += \
$(top_srcdir)/scripts/opensaf_reboot \
$(top_srcdir)/scripts/opensaf_sc_active \
$(top_srcdir)/scripts/opensaf_scale_out \
-   $(top_srcdir)/scripts/plm_scale_out
+   $(top_srcdir)/scripts/plm_scale_out \
+   $(top_srcdir)/src/osaf/consensus/plugins/etcd.plugin
+# TODO remove above line before pushing
  
  include $(top_srcdir)/src/ais/Makefile.am

  include $(top_srcdir)/src/base/Makefile.am
diff --git a/src/osaf/Makefile.am b/src/osaf/Makefile.am
index 05b78c988..10bbe427b 100644
--- a/src/osaf/Makefile.am
+++ b/src/osaf/Makefile.am
@@ -16,7 +16,9 @@
  
  noinst_HEADERS += \

src/osaf/immutil/immutil.h \
-   src/osaf/saflog/saflog.h
+   

[devel] [PATCH 5/5] doc: update README and makefiles [#64]

2018-01-23 Thread Gary Lee
---
 00-README.conf   | 56 
 Makefile.am  |  4 +++-
 src/osaf/Makefile.am |  8 ++--
 3 files changed, 65 insertions(+), 3 deletions(-)

diff --git a/00-README.conf b/00-README.conf
index a8848e632..6c3cff1dd 100644
--- a/00-README.conf
+++ b/00-README.conf
@@ -662,3 +662,59 @@ on each node, except on the active node. This file indicates that a cluster
 reboot is in progress and all nodes needs to delay their start, this to give
 the active a lead.
 
+Split-Brain Prevention with Consensus Service
+=============================================
+
+OpenSAF implements split-brain prevention by utilizing a consensus service that
+implements a replicated state machine. The consensus service uses quorum to
+prevent state changes in network partitions that don't include more than half
+of the nodes in the cluster. In network partitions containing
+half of the nodes or less, the state is either read-only or unavailable.
+Thus, it is important to keep in mind that the consensus service by itself
+does not prevent the presence of multiple active system
+controller nodes. In the case when the network has been split up into partitions
+and the current active system controller no longer has write access to the
+state machine, OpenSAF relies on some additional mechanism like fencing to
+ensure that the current active system controller disappears before a new
+active system controller can be chosen among the nodes that do have write
+access to the replicated state machine. If fencing is not available, the old
+active system controller can detect that it has lost write
+access and step down from its active role.
+
+The consensus service can be implemented, for example, using the RAFT algorithm.
+When using RAFT, there are mainly three possibilities:
+
+1. The RAFT servers run on the same nodes as OpenSAF
+2. The RAFT servers run on a subset of the OpenSAF nodes
+3. The RAFT servers run on an external set of nodes, outside of the
+   OpenSAF cluster
+
+The consensus service relies on a plugin to communicate with a distributed
+key-value store database. This plugin must still function according to the
+API when the network has split up into partitions.
+The plugin interface is defined in src/osaf/consensus/plugins/sample.plugin
+
+An implementation for etcdv2 is provided. It assumes etcd is installed
+and configured on all system controllers. In clusters where
+there are only two system controllers, it is highly recommended to
+configure etcd so it runs on at least three nodes to facilitate
+a majority vote with failure tolerance.
+
+Other implementations of a distributed key-value store service
+can be used, provided it implements the interface documented in sample.plugin
+
+To enable split-brain prevention, edit fmd.conf and update accordingly:
+
+export FMS_SPLIT_BRAIN_PREVENTION=1
+export FMS_KEYVALUE_STORE_PLUGIN_CMD=/usr/local/lib/opensaf/etcd.plugin
+
+As discussed, the key-value store does not need to reside on the same nodes
+as OpenSAF. In such a configuration, an appropriate plugin that handles
+the communication with a remotely located key-value store, must be provided.
+
+If remote fencing is enabled, then it will be used to fence a node that the
+consensus service believes should not be active. Otherwise, rded/amfd will
+initiate a 'self-fencing' by rebooting the node, if it determines the node
+should no longer be active according to the consensus service, to prevent
+a split-brain situation.
+
diff --git a/Makefile.am b/Makefile.am
index bcfd844cd..57c2585a8 100644
--- a/Makefile.am
+++ b/Makefile.am
@@ -159,7 +159,9 @@ dist_osaf_execbin_SCRIPTS += \
$(top_srcdir)/scripts/opensaf_reboot \
$(top_srcdir)/scripts/opensaf_sc_active \
$(top_srcdir)/scripts/opensaf_scale_out \
-   $(top_srcdir)/scripts/plm_scale_out
+   $(top_srcdir)/scripts/plm_scale_out \
+   $(top_srcdir)/src/osaf/consensus/plugins/etcd.plugin
+# TODO remove above line before pushing
 
 include $(top_srcdir)/src/ais/Makefile.am
 include $(top_srcdir)/src/base/Makefile.am
diff --git a/src/osaf/Makefile.am b/src/osaf/Makefile.am
index 05b78c988..10bbe427b 100644
--- a/src/osaf/Makefile.am
+++ b/src/osaf/Makefile.am
@@ -16,7 +16,9 @@
 
 noinst_HEADERS += \
src/osaf/immutil/immutil.h \
-   src/osaf/saflog/saflog.h
+   src/osaf/saflog/saflog.h \
+   src/osaf/consensus/keyvalue.h \
+   src/osaf/consensus/service.h
 
 pkglib_LTLIBRARIES += lib/libosaf_common.la
 
@@ -33,7 +35,9 @@ lib_libosaf_common_la_LDFLAGS = \
 
 lib_libosaf_common_la_SOURCES = \
src/osaf/immutil/immutil.c \
-   src/osaf/saflog/saflog.c
+   src/osaf/saflog/saflog.c \
+   src/osaf/consensus/keyvalue.cc \
+   src/osaf/consensus/service.cc
 
 nodist_EXTRA_lib_libosaf_common_la_SOURCES = dummy.cc
 
-- 
2.14.1
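
The README text in this patch recommends running etcd on at least three nodes. The arithmetic behind that recommendation can be sketched as follows, assuming the usual majority rule of quorum-based stores; the function names are hypothetical.

```cpp
// Smallest number of members that forms a majority of the cluster.
constexpr int QuorumSize(int cluster_size) { return cluster_size / 2 + 1; }

// Number of members that can fail while a majority is still reachable.
constexpr int FailureTolerance(int cluster_size) {
  return cluster_size - QuorumSize(cluster_size);
}

// Two members need both to agree, so a single failure blocks all writes.
static_assert(QuorumSize(2) == 2 && FailureTolerance(2) == 0, "two members");
// Three members still have a majority after one failure.
static_assert(QuorumSize(3) == 2 && FailureTolerance(3) == 1, "three members");
```

With only two members there is no failure tolerance at all, which is why a third member is suggested even when the cluster has only two system controllers.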



[devel] [PATCH 4/5] fmd: update consensus service during controller failover [#64]

2018-01-23 Thread Gary Lee
---
 src/fm/Makefile.am    |  1 +
 src/fm/fmd/fm_main.cc | 37 +++--
 src/fm/fmd/fm_rda.cc  | 13 +
 src/fm/fmd/fmd.conf   |  6 ++
 4 files changed, 51 insertions(+), 6 deletions(-)

diff --git a/src/fm/Makefile.am b/src/fm/Makefile.am
index d48a9146c..0f254b94f 100644
--- a/src/fm/Makefile.am
+++ b/src/fm/Makefile.am
@@ -49,4 +49,5 @@ bin_osaffmd_SOURCES = \
 bin_osaffmd_LDADD = \
lib/libSaAmf.la \
lib/libSaClm.la \
+   lib/libosaf_common.la \
lib/libopensaf_core.la
diff --git a/src/fm/fmd/fm_main.cc b/src/fm/fmd/fm_main.cc
index db8395ee7..74517b3b5 100644
--- a/src/fm/fmd/fm_main.cc
+++ b/src/fm/fmd/fm_main.cc
@@ -28,7 +28,8 @@ This file contains the main() routine for FM.
 #include 
 #include "base/daemon.h"
 #include "base/logtrace.h"
-
+#include "base/osaf_extended_name.h"
+#include "osaf/consensus/service.h"
 #include "nid/agent/nid_api.h"
 #include "fm.h"
 #include "base/osaf_time.h"
@@ -553,6 +554,8 @@ static void fm_mbx_msg_handler(FM_CB *fm_cb, FM_EVT 
*fm_mbx_evt)
TRACE_ENTER();
switch (fm_mbx_evt->evt_code) {
case FM_EVT_NODE_DOWN:
+   {
+   Consensus consensus_service;
LOG_NO("Current role: %s", role_string[fm_cb->role]);
if ((fm_mbx_evt->node_id == fm_cb->peer_node_id)) {
/* Check whether node(AMF) initialization is done */
@@ -593,15 +596,27 @@ static void fm_mbx_msg_handler(FM_CB *fm_cb, FM_EVT 
*fm_mbx_evt)
 * trigerred quicker than the node_down event
 * has been received.
 */
+   if (fm_cb->role == PCS_RDA_STANDBY) {
+   const std::string current_active = consensus_service.CurrentActive();
+   if (current_active.compare(
+   osaf_extended_name_borrow(&fm_cb->peer_node_name)) == 0) {
+   // update consensus service, before fencing old active controller
+   consensus_service.DemoteCurrentActive();
+   }
+   }
+
if (fm_cb->use_remote_fencing) {
if (fm_cb->peer_node_terminated ==
false) {
+   // if peer_sc_up is true then
+   // the node has come up already
+   if (fm_cb->peer_sc_up == false && fm_cb->immnd_down == true) {
opensaf_reboot(
-   fm_cb->peer_node_id,
-   (char *)fm_cb
-   ->peer_clm_node_name
-   .value,
-   "Received Node Down for peer controller");
+   fm_cb->peer_node_id,
+   (char *)fm_cb
+   ->peer_clm_node_name.value,
+   "Received Node Down for peer controller");
+   }
} else {
LOG_NO(
"Peer node %s is terminated, fencing will not be performed",
@@ -624,6 +639,7 @@ static void fm_mbx_msg_handler(FM_CB *fm_cb, FM_EVT *fm_mbx_evt)
}
}
}
+   }
break;
 
case FM_EVT_PEER_UP:
@@ -659,6 +675,15 @@ static void fm_mbx_msg_handler(FM_CB *fm_cb, FM_EVT *fm_mbx_evt)
0, NULL,
"Failover occurred, but this node is not yet ready");
}
+
+   Consensus consensus_service;
+   const std::string current_active = consensus_service.CurrentActive();
+   if (current_active.compare(
+   osaf_extended_name_borrow(&fm_cb->peer_node_name)) == 0) {
+   // update consensus service, before fencing old active controller
+   consensus_service.DemoteCurrentActive();
+   }
+
/* Now. Try resetting other blade */
fm_cb->role = PCS_RDA_ACTIVE;
 
diff --git a/src/fm/fmd/fm_rda.cc b/src/fm/fmd/fm_rda.cc
index 

[devel] [PATCH 1/5] osaf: add consensus API [#64]

2018-01-23 Thread Gary Lee
---
 src/osaf/consensus/Makefile              |  18 +++
 src/osaf/consensus/keyvalue.cc           | 221 ++
 src/osaf/consensus/keyvalue.h            |  66
 src/osaf/consensus/plugins/etcd.plugin   | 253 ++
 src/osaf/consensus/plugins/sample.plugin | 171
 src/osaf/consensus/service.cc            | 258 +++
 src/osaf/consensus/service.h             |  71 +
 7 files changed, 1058 insertions(+)
 create mode 100644 src/osaf/consensus/Makefile
 create mode 100644 src/osaf/consensus/keyvalue.cc
 create mode 100644 src/osaf/consensus/keyvalue.h
 create mode 100644 src/osaf/consensus/plugins/etcd.plugin
 create mode 100644 src/osaf/consensus/plugins/sample.plugin
 create mode 100644 src/osaf/consensus/service.cc
 create mode 100644 src/osaf/consensus/service.h

diff --git a/src/osaf/consensus/Makefile b/src/osaf/consensus/Makefile
new file mode 100644
index 0..a2c8bc9dd
--- /dev/null
+++ b/src/osaf/consensus/Makefile
@@ -0,0 +1,18 @@
+#  -*- OpenSAF  -*-
+#
+# (C) Copyright 2018 The OpenSAF Foundation
+#
+# This program is distributed in the hope that it will be useful, but
+# WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
+# or FITNESS FOR A PARTICULAR PURPOSE. This file and program are licensed
+# under the GNU Lesser General Public License Version 2.1, February 1999.
+# The complete license can be accessed from the following location:
+# http://opensource.org/licenses/lgpl-license.php
+# See the Copying file included with the OpenSAF distribution for full
+# licensing terms.
+#
+# Author(s): Ericsson AB
+#
+
+all:
+   $(MAKE) -C ../.. lib/libconsensus.la
diff --git a/src/osaf/consensus/keyvalue.cc b/src/osaf/consensus/keyvalue.cc
new file mode 100644
index 0..eea518585
--- /dev/null
+++ b/src/osaf/consensus/keyvalue.cc
@@ -0,0 +1,221 @@
+/*  -*- OpenSAF  -*-
+ *
+ * (C) Copyright 2018 The OpenSAF Foundation
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
+ * or FITNESS FOR A PARTICULAR PURPOSE. This file and program are licensed
+ * under the GNU Lesser General Public License Version 2.1, February 1999.
+ * The complete license can be accessed from the following location:
+ * http://opensource.org/licenses/lgpl-license.php
+ * See the Copying file included with the OpenSAF distribution for full
+ * licensing terms.
+ *
+ * Author(s): Ericsson AB
+ *
+ */
+#include "osaf/consensus/keyvalue.h"
+#include 
+#include "base/logtrace.h"
+#include "base/getenv.h"
+#include "base/conf.h"
+
+int KeyValue::Execute(const std::string& command, std::string& output) {
+  TRACE_ENTER();
+  constexpr size_t buf_size = 128;
+  std::array<char, buf_size> buffer;
+  FILE* pipe = popen(command.c_str(), "r");
+  if (pipe == nullptr) {
+return 1;
+  }
+  output = "";
+  while (feof(pipe) == 0) {
+if (fgets(buffer.data(), buf_size, pipe) != nullptr) {
+  output += buffer.data();
+}
+  }
+  int exit_code = pclose(pipe);
+  exit_code = WEXITSTATUS(exit_code);
+  if (output.empty() == false && isspace(output.back()) != 0) {
+// remove newline at end of output
+output.pop_back();
+  }
+  TRACE("Executed '%s', returning %d", command.c_str(), exit_code);
+  return exit_code;
+}
+
+SaAisErrorT KeyValue::Get(const std::string& key, std::string& value) {
+  TRACE_ENTER();
+
+  const std::string kv_store_cmd = base::GetEnv(
+"FMS_KEYVALUE_STORE_PLUGIN_CMD", "");
+  const std::string command(kv_store_cmd + " get " + key);
+  int rc = KeyValue::Execute(command, value);
+  TRACE("Read '%s'", value.c_str());
+
+  if (rc == 0) {
+return SA_AIS_OK;
+  } else {
+return SA_AIS_ERR_FAILED_OPERATION;
+  }
+}
+
+SaAisErrorT KeyValue::Set(const std::string& key, const std::string& value) {
+  TRACE_ENTER();
+
+  const std::string kv_store_cmd = base::GetEnv(
+"FMS_KEYVALUE_STORE_PLUGIN_CMD", "");
+  const std::string command(kv_store_cmd + " set " + key + " " + value);
+  std::string output;
+  int rc = KeyValue::Execute(command, output);
+
+  if (rc == 0) {
+return SA_AIS_OK;
+  } else {
+return SA_AIS_ERR_FAILED_OPERATION;
+  }
+}
+
+SaAisErrorT KeyValue::Erase(const std::string& key) {
+  TRACE_ENTER();
+
+  const std::string kv_store_cmd = base::GetEnv(
+"FMS_KEYVALUE_STORE_PLUGIN_CMD", "");
+  const std::string command(kv_store_cmd + " erase " + key);
+  std::string output;
+  int rc = KeyValue::Execute(command, output);
+
+  if (rc == 0) {
+return SA_AIS_OK;
+  } else {
+return SA_AIS_ERR_FAILED_OPERATION;
+  }
+}
+
+SaAisErrorT KeyValue::Lock(const std::string& owner,
+ const unsigned int timeout) {
+  TRACE_ENTER();
+
+  const std::string kv_store_cmd = base::GetEnv(
+"FMS_KEYVALUE_STORE_PLUGIN_CMD", "");
+  const std::string command(kv_store_cmd + " lock " + owner + " " +
+
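
For readers following the truncated patch above: KeyValue::Execute() runs the plugin command through popen() and collects its standard output. A minimal standalone sketch of that pattern is given below. RunCommand is a hypothetical name, not part of the patch, and the sketch deliberately differs from the patch in two places: it loops on the fgets() result instead of feof(), and it checks WIFEXITED before reading the exit status. POSIX is assumed.

```cpp
#include <array>
#include <cctype>
#include <cstdio>
#include <string>
#include <sys/wait.h>

// Hypothetical helper (not from the patch): run `command` via the shell,
// capture its stdout in `output`, and return the exit status, or -1 if the
// command could not be started or did not exit normally.
int RunCommand(const std::string& command, std::string& output) {
  std::array<char, 128> buffer;
  output.clear();

  FILE* pipe = popen(command.c_str(), "r");
  if (pipe == nullptr) {
    return -1;
  }

  // fgets() returns nullptr on EOF or on a read error, which ends the loop.
  while (fgets(buffer.data(), static_cast<int>(buffer.size()), pipe) !=
         nullptr) {
    output += buffer.data();
  }

  const int status = pclose(pipe);

  // Strip trailing whitespace (typically the final newline).
  while (!output.empty() &&
         std::isspace(static_cast<unsigned char>(output.back())) != 0) {
    output.pop_back();
  }

  if (status == -1 || !WIFEXITED(status)) {
    return -1;
  }
  return WEXITSTATUS(status);
}
```

A caller would treat a return value of 0 as success, in the same way Get(), Set() and Erase() map rc == 0 to SA_AIS_OK.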

[devel] [PATCH 2/5] rded: add split brain prevention support [#64]

2018-01-23 Thread Gary Lee
* consult with consensus service before promoting node to active
* add watch thread and self-fence if it detects active controller
  has been changed (if remote fencing is disabled)
---
 src/rde/Makefile.am   |  3 ++-
 src/rde/rded/osaf-rded.in |  4 
 src/rde/rded/rde_cb.h |  4 +++-
 src/rde/rded/rde_main.cc  | 38 +-
 src/rde/rded/role.cc  | 45 -
 src/rde/rded/role.h   |  3 +++
 6 files changed, 89 insertions(+), 8 deletions(-)

diff --git a/src/rde/Makefile.am b/src/rde/Makefile.am
index c967f9fc4..182f347ab 100644
--- a/src/rde/Makefile.am
+++ b/src/rde/Makefile.am
@@ -58,7 +58,8 @@ bin_osafrded_SOURCES = \
 
 bin_osafrded_LDADD = \
lib/libSaAmf.la \
-   lib/libopensaf_core.la
+   lib/libopensaf_core.la \
+   lib/libosaf_common.la
 
 bin_rdegetrole_CPPFLAGS = \
$(AM_CPPFLAGS)
diff --git a/src/rde/rded/osaf-rded.in b/src/rde/rded/osaf-rded.in
index 1c1786c8d..1697936a7 100644
--- a/src/rde/rded/osaf-rded.in
+++ b/src/rde/rded/osaf-rded.in
@@ -28,6 +28,10 @@ else
. $pkgsysconfdir/rde.conf
 fi 
 
+if [ -f "$pkgsysconfdir/fmd.conf" ]; then
+  . "$pkgsysconfdir/fmd.conf"
+fi
+
 binary=$pkglibdir/$osafprog
 pidfile=$pkgpiddir/$osafprog.pid
 tracefile=$pkglogdir/$osafprog.log
diff --git a/src/rde/rded/rde_cb.h b/src/rde/rded/rde_cb.h
index d2a3d46b2..fc100849a 100644
--- a/src/rde/rded/rde_cb.h
+++ b/src/rde/rded/rde_cb.h
@@ -39,13 +39,15 @@ struct RDE_CONTROL_BLOCK {
   bool task_terminate;
   RDE_RDA_CB rde_rda_cb;
   RDE_AMF_CB rde_amf_cb;
+  bool monitor_lock_thread_running;
 };
 
 enum RDE_MSG_TYPE {
   RDE_MSG_PEER_UP = 1,
   RDE_MSG_PEER_DOWN = 2,
   RDE_MSG_PEER_INFO_REQ = 3,
-  RDE_MSG_PEER_INFO_RESP = 4
+  RDE_MSG_PEER_INFO_RESP = 4,
+  RDE_MSG_NEW_ACTIVE_CALLBACK = 5
 };
 
 struct rde_peer_info {
diff --git a/src/rde/rded/rde_main.cc b/src/rde/rded/rde_main.cc
index 0298bf3ff..082c1c040 100644
--- a/src/rde/rded/rde_main.cc
+++ b/src/rde/rded/rde_main.cc
@@ -28,6 +28,7 @@
 #include 
 #include 
 #include 
+#include "osaf/consensus/service.h"
 #include "base/daemon.h"
 #include "base/logtrace.h"
 #include "base/osaf_poll.h"
@@ -37,6 +38,7 @@
 #include 
 #include "rde/rded/rde_cb.h"
 #include "rde/rded/role.h"
+#include "base/conf.h"
 
 #define RDA_MAX_CLIENTS 32
 
@@ -92,10 +94,6 @@ static void handle_mbx_event() {
   TRACE_ENTER();
 
   msg = reinterpret_cast<struct rde_msg *>(ncs_ipc_non_blk_recv(&rde_cb->mbx));
-  TRACE("Received %s from node 0x%x with state %s. My state is %s",
-rde_msg_name[msg->type], msg->fr_node_id,
-Role::to_string(msg->info.peer_info.ha_role),
-Role::to_string(role->role()));
 
   switch (msg->type) {
 case RDE_MSG_PEER_INFO_REQ:
@@ -118,6 +116,34 @@ static void handle_mbx_event() {
 case RDE_MSG_PEER_DOWN:
   LOG_NO("Peer down on node 0x%x", msg->fr_node_id);
   break;
+   case RDE_MSG_NEW_ACTIVE_CALLBACK:
+  {
+const std::string my_node = base::Conf::NodeName();
+rde_cb->monitor_lock_thread_running = false;
+
+// get current active controller
+Consensus consensus_service;
+std::string active_controller = consensus_service.CurrentActive();
+
+LOG_NO("New active controller notification from consensus service");
+
+if (role->role() == PCS_RDA_ACTIVE) {
+  if (my_node.compare(active_controller) != 0) {
+// we are meant to be active, but consensus service doesn't think so
+LOG_WA("Role does not match consensus service. New controller: %s",
+  active_controller.c_str());
+if (consensus_service.IsRemoteFencingEnabled() == false ) {
+  LOG_ER("Probable split-brain. Rebooting this node");
+  opensaf_reboot(0, nullptr, "Split-brain detected by consensus service");
+}
+  }
+
+  // register for callback
+  rde_cb->monitor_lock_thread_running = true;
+  consensus_service.MonitorLock(Role::MonitorCallback, rde_cb->mbx);
+}
+  }
+  break;
 default:
   LOG_ER("%s: discarding unknown message type %u", __FUNCTION__, msg->type);
   break;
@@ -192,6 +218,7 @@ static int initialize_rde() {
 goto init_failed;
   }
 
+  rde_cb->monitor_lock_thread_running = false;
   rc = NCSCC_RC_SUCCESS;
 
 init_failed:
@@ -205,11 +232,12 @@ int main(int argc, char *argv[]) {
   NCS_SEL_OBJ mbx_sel_obj;
   RDE_RDA_CB *rde_rda_cb = &rde_cb->rde_rda_cb;
   int term_fd;
-
   opensaf_reboot_prepare();
 
   daemonize(argc, argv);
 
+  base::Conf::InitNodeName();
+
   if (initialize_rde() != NCSCC_RC_SUCCESS) goto init_failed;
 
   mbx_sel_obj = ncs_ipc_get_sel_obj(&rde_cb->mbx);
diff --git a/src/rde/rded/role.cc b/src/rde/rded/role.cc
index f7511f0d8..c821aeb33 100644
--- a/src/rde/rded/role.cc
+++ b/src/rde/rded/role.cc
@@ -27,7 +27,9 @@
 #include "base/process.h"
 #include "base/time.h"
 #include "base/ncs_main_papi.h"
+#include "base/ncssysf_def.h"
 

[devel] [PATCH 0/5] Review Request for Add support for split brain prevention V3 [#64]

2018-01-23 Thread Gary Lee
Summary: Add support for split brain prevention V3 [#64] 
Review request for Ticket(s): 64
Peer Reviewer(s): Anders, Hans 
Pull request to: *** LIST THE PERSON WITH PUSH ACCESS HERE ***
Affected branch(es): develop
Development branch: ticket-64
Base revision: c5db5d0352af060cb94028b3b9b95e54d87cffbd
Personal repository: git://git.code.sf.net/u/userid-2226215/review


Impacted area       Impact y/n
--------------------------------
 Docs                    y
 Build system            y
 RPM/packaging           n
 Configuration files     y
 Startup scripts         n
 SAF services            y
 OpenSAF services        y
 Core libraries          y
 Samples                 n
 Tests                   n
 Other                   n


Comments (indicate scope for each "y" above):
-

Changes from V2:

osaf/consensus: remove opensaf_active_controller key
plugin: single persistent key used for lock
plugin: etcd plugin changed to put all opensaf related data under /opensaf
plugin: watch_lock function added to etcd plugin / sample plugin
rded: flag added to rded to ensure we don't launch multiple monitor threads
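
To illustrate the last item: the rded patch keeps a boolean (monitor_lock_thread_running) that is set before a monitor is registered and cleared when the callback message arrives. A minimal sketch of that guard follows. MonitorState, TryStartMonitor and OnMonitorCallback are hypothetical names, and the use of std::atomic is an assumption of this sketch; the patch itself uses a plain bool in the control block.

```cpp
#include <atomic>
#include <functional>

// Hypothetical stand-in for the flag kept in the rded control block.
struct MonitorState {
  std::atomic<bool> running{false};
};

// Calls `launch` only if no monitor is currently registered.
// Returns true if this call started a monitor, false if one was running.
bool TryStartMonitor(MonitorState& state,
                     const std::function<void()>& launch) {
  bool expected = false;
  // Atomically flip false -> true; fails if a monitor is already running.
  if (!state.running.compare_exchange_strong(expected, true)) {
    return false;
  }
  launch();
  return true;
}

// Called when the monitor reports back; allows a new monitor to be started.
void OnMonitorCallback(MonitorState& state) {
  state.running = false;
}
```

With this shape, a second registration attempt made while a monitor is outstanding is a no-op, and a new one is possible only after the callback has been handled.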


revision 39f15d483557e6b6623d9ccdab1bb5f02599d95d
Author: Gary Lee 
Date:   Tue, 23 Jan 2018 18:58:50 +1100

doc: update README and makefiles [#64]



revision 9d31819b0c3e831372e58b56a311509b76c68634
Author: Gary Lee 
Date:   Tue, 23 Jan 2018 18:58:37 +1100

fmd: update consensus service during controller failover [#64]
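
The fmd change boils down to: before fencing the old active controller, check whether the consensus service still names the peer as active and, if so, demote it. A rough sketch under stated assumptions is below. FakeConsensus is a hypothetical in-memory stand-in, not the Consensus class from service.h, and the assumption that demotion leaves the active entry empty is made only for illustration.

```cpp
#include <string>
#include <utility>

// Hypothetical in-memory stand-in for the consensus service.
class FakeConsensus {
 public:
  explicit FakeConsensus(std::string active) : active_(std::move(active)) {}

  std::string CurrentActive() const { return active_; }

  // Assumption for this sketch: demotion clears the active entry.
  void DemoteCurrentActive() { active_.clear(); }

 private:
  std::string active_;
};

// Demotes the peer in the consensus service if it is still recorded as the
// active controller. Returns true if a demotion was performed.
bool DemotePeerIfActive(FakeConsensus& consensus,
                        const std::string& peer_node_name) {
  if (consensus.CurrentActive() == peer_node_name) {
    consensus.DemoteCurrentActive();
    return true;
  }
  return false;
}
```

Doing the check first means a node that is not recorded as active is never demoted by mistake.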



revision f03a578e94f964fbbcf5a217f95122e17d474dcd
Author: Gary Lee 
Date:   Tue, 23 Jan 2018 18:58:08 +1100

amfd: update consensus service when performing SI swap [#64]

When a node goes down and split-brain prevention is enabled,
check that we still have write access to the consensus service.
If not and fencing is disabled, reboot the node to prevent
split brain.
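
The rule described above can be written as a single predicate. The sketch below uses a hypothetical function name; in the patch the two inputs come from Consensus::IsRemoteFencingEnabled() and Consensus::IsWritable().

```cpp
// Hypothetical helper: decide whether this node should reboot itself after
// a node failover. Self-fencing applies only when nobody else can fence us
// (remote fencing disabled) and we can no longer write to the consensus
// service (quorum presumed lost).
bool ShouldRebootOnNodeFailover(bool remote_fencing_enabled,
                                bool consensus_writable) {
  return !remote_fencing_enabled && !consensus_writable;
}
```

Only one of the four input combinations leads to a reboot; if remote fencing is enabled, the decision is left to the remote fencing mechanism.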



revision c18b731daa26a5296468ad81cc30d1107e54c13f
Author: Gary Lee 
Date:   Tue, 23 Jan 2018 18:57:46 +1100

rded: add split brain prevention support [#64]

* consult with consensus service before promoting node to active
* add watch thread and self-fence if it detects active controller
  has been changed (if remote fencing is disabled)
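
The handling of a "new active" notification in rded can be sketched as a pure decision function. All names below (Role, Action, Decision, OnNewActive) are hypothetical; the real handler logs and calls opensaf_reboot() directly instead of returning a value.

```cpp
#include <string>

enum class Role { kActive, kStandby };
enum class Action { kNone, kWarnOnly, kSelfFence };

struct Decision {
  Action action;       // what to do about a role mismatch, if any
  bool rearm_monitor;  // whether to register for the next notification
};

// Hypothetical sketch of the notification handling: only an active node
// reacts. If the consensus service names a different active controller,
// the node warns, and self-fences when remote fencing is disabled.
Decision OnNewActive(Role my_role, const std::string& my_node,
                     const std::string& active_controller,
                     bool remote_fencing_enabled) {
  Decision decision{Action::kNone, false};
  if (my_role != Role::kActive) {
    return decision;
  }
  decision.rearm_monitor = true;
  if (my_node != active_controller) {
    decision.action =
        remote_fencing_enabled ? Action::kWarnOnly : Action::kSelfFence;
  }
  return decision;
}
```

In the actual daemon a self-fence reboots the node, so re-arming the monitor matters only for the other outcomes.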



revision a72528ea820ea0229b5034c45c22f8bb93b88986
Author: Gary Lee 
Date:   Tue, 23 Jan 2018 18:57:27 +1100

osaf: add consensus API [#64]



Added Files:

 src/osaf/consensus/keyvalue.cc
 src/osaf/consensus/keyvalue.h
 src/osaf/consensus/Makefile
 src/osaf/consensus/plugins/etcd.plugin
 src/osaf/consensus/plugins/sample.plugin
 src/osaf/consensus/service.cc
 src/osaf/consensus/service.h


Complete diffstat:
--
 00-README.conf                           |  56 +++
 Makefile.am                              |   4 +-
 src/amf/amfd/ndproc.cc                   |  12 +-
 src/amf/amfd/osaf-amfd.in                |   4 +
 src/amf/amfd/role.cc                     |  30 +++-
 src/fm/Makefile.am                       |   1 +
 src/fm/fmd/fm_main.cc                    |  37 -
 src/fm/fmd/fm_rda.cc                     |  13 ++
 src/fm/fmd/fmd.conf                      |   6 +
 src/osaf/Makefile.am                     |   8 +-
 src/osaf/consensus/Makefile              |  18 +++
 src/osaf/consensus/keyvalue.cc           | 221 ++
 src/osaf/consensus/keyvalue.h            |  66
 src/osaf/consensus/plugins/etcd.plugin   | 253 ++
 src/osaf/consensus/plugins/sample.plugin | 171
 src/osaf/consensus/service.cc            | 258 +++
 src/osaf/consensus/service.h             |  71 +
 src/rde/Makefile.am                      |   3 +-
 src/rde/rded/osaf-rded.in                |   4 +
 src/rde/rded/rde_cb.h                    |   4 +-
 src/rde/rded/rde_main.cc                 |  38 -
 src/rde/rded/role.cc                     |  45 +-
 src/rde/rded/role.h                      |   3 +
 23 files changed, 1303 insertions(+), 23 deletions(-)


Testing Commands:
-
*** LIST THE COMMAND LINE TOOLS/STEPS TO TEST YOUR CHANGES ***


Testing, Expected Results:
--
*** PASTE COMMAND OUTPUTS / TEST RESULTS ***


Conditions of Submission:
-
*** HOW MANY DAYS BEFORE PUSHING, CONSENSUS ETC ***


Arch      Built     Started    Linux distro
-------------------------------------------
mips        n          n
mips64      n          n
x86         n          n
x86_64      y          y
powerpc     n          n
powerpc64   n          n


Reviewer Checklist:
---
[Submitters: make sure that your review doesn't trigger any checkmarks!]


Your checkin has not passed review because (see checked entries):

___ Your RR template is generally incomplete; it has too many blank entries
that need proper data filled 

[devel] [PATCH 3/5] amfd: update consensus service when performing SI swap [#64]

2018-01-23 Thread Gary Lee
When a node goes down and split-brain prevention is enabled,
check that we still have write access to the consensus service.
If not and fencing is disabled, reboot the node to prevent
split brain.
---
 src/amf/amfd/ndproc.cc| 12 +++-
 src/amf/amfd/osaf-amfd.in |  4 
 src/amf/amfd/role.cc  | 30 +-
 3 files changed, 40 insertions(+), 6 deletions(-)

diff --git a/src/amf/amfd/ndproc.cc b/src/amf/amfd/ndproc.cc
index 0c6316627..df68b3dbf 100644
--- a/src/amf/amfd/ndproc.cc
+++ b/src/amf/amfd/ndproc.cc
@@ -32,8 +32,8 @@
  */
 
 #include "osaf/immutil/immutil.h"
+#include "osaf/consensus/service.h"
 #include "base/logtrace.h"
-
 #include "amf/amfd/amfd.h"
 #include "amf/amfd/imm.h"
 #include "amf/amfd/cluster.h"
@@ -1221,5 +1221,15 @@ void avd_node_failover(AVD_AVND *node) {
   avd_pg_node_csi_del_all(avd_cb, node);
   avd_node_down_mw_susi_failover(avd_cb, node);
   avd_node_down_appl_susi_failover(avd_cb, node);
+
+  Consensus consensus_service;
+  if (consensus_service.IsRemoteFencingEnabled() == false &&
+  consensus_service.IsWritable() == false) {
+// remote fencing is disabled and we have lost write access
+// reboot this node to prevent split brain
+opensaf_reboot(0, nullptr,
+  "Quorum lost. Rebooting this node to prevent split-brain");
+  }
+
   TRACE_LEAVE();
 }
diff --git a/src/amf/amfd/osaf-amfd.in b/src/amf/amfd/osaf-amfd.in
index 45c5ab9e4..26a77ef52 100644
--- a/src/amf/amfd/osaf-amfd.in
+++ b/src/amf/amfd/osaf-amfd.in
@@ -28,6 +28,10 @@ else
. $pkgsysconfdir/amfd.conf
 fi 
 
+if [ -f "$pkgsysconfdir/fmd.conf" ]; then
+  . "$pkgsysconfdir/fmd.conf"
+fi
+
 binary=$pkglibdir/$osafprog
 pidfile=$pkgpiddir/$osafprog.pid
 lockfile=$lockdir/$initscript
diff --git a/src/amf/amfd/role.cc b/src/amf/amfd/role.cc
index 865d89d94..862ac3653 100644
--- a/src/amf/amfd/role.cc
+++ b/src/amf/amfd/role.cc
@@ -38,6 +38,7 @@
 #include "osaf/immutil/immutil.h"
 #include "base/logtrace.h"
 #include "rde/agent/rda_papi.h"
+#include "osaf/consensus/service.h"
 
 #include "amf/amfd/amfd.h"
 #include "amf/amfd/imm.h"
@@ -1085,6 +1086,12 @@ uint32_t amfd_switch_actv_qsd(AVD_CL_CB *cb) {
 avd_d2n_msg_dequeue(cb);
   }
 
+  Consensus consensus_service;
+  rc = consensus_service.DemoteThisNode();
+  if (rc != SA_AIS_OK) {
+LOG_ER("Failed to demote this node from consensus service");
+  }
+
   TRACE_LEAVE();
   return NCSCC_RC_SUCCESS;
 }
@@ -1209,13 +1216,21 @@ uint32_t amfd_switch_stdby_actv(AVD_CL_CB *cb) {
   cb->avail_state_avd = SA_AMF_HA_ACTIVE;
   osaf_mutex_unlock_ordie(&imm_reinit_mutex);
 
+  Consensus consensus_service;
+  rc = consensus_service.PromoteThisNode();
+  if (rc != SA_AIS_OK) {
+LOG_ER("Unable to set active controller in consensus service");
+osafassert(false);
+  }
+
   /* Declare this standby as Active. Set Vdest role role */
   if (NCSCC_RC_SUCCESS !=
   (status = avd_mds_set_vdest_role(cb, SA_AMF_HA_ACTIVE))) {
 LOG_ER("Switch Standby --> Active FAILED, MDS role set failed");
 cb->swap_switch = false;
 avd_d2d_chg_role_rsp(cb, NCSCC_RC_FAILURE, SA_AMF_HA_ACTIVE);
-return NCSCC_RC_FAILURE;
+status = NCSCC_RC_FAILURE;
+goto done;
   }
 
   /* Time to send fail-over messages to all the AVND's */
@@ -1240,7 +1255,8 @@ uint32_t amfd_switch_stdby_actv(AVD_CL_CB *cb) {
 } else {
   cb->swap_switch = false;
   avd_d2d_chg_role_rsp(cb, NCSCC_RC_FAILURE, SA_AMF_HA_ACTIVE);
-  return NCSCC_RC_FAILURE;
+  status = NCSCC_RC_FAILURE;
+  goto done;
 }
   }
 
@@ -1259,7 +1275,8 @@ uint32_t amfd_switch_stdby_actv(AVD_CL_CB *cb) {
  in avd_imm_reinit_bg_thread.*/
 } else {
   avd_d2d_chg_role_rsp(cb, NCSCC_RC_FAILURE, SA_AMF_HA_ACTIVE);
-  return NCSCC_RC_FAILURE;
+  status = NCSCC_RC_FAILURE;
+  goto done;
 }
   } else
 osaf_mutex_unlock_ordie(&imm_reinit_mutex);
@@ -1274,7 +1291,8 @@ uint32_t amfd_switch_stdby_actv(AVD_CL_CB *cb) {
 LOG_ER("Switch Standby --> Active, clm track start failed");
 Fifo::queue(new ClmTrackStart());
 avd_d2d_chg_role_rsp(cb, NCSCC_RC_FAILURE, SA_AMF_HA_ACTIVE);
-return NCSCC_RC_FAILURE;
+status = NCSCC_RC_FAILURE;
+goto done;
   }
 
   /* Send the message to other avd for role change rsp as success */
@@ -1291,8 +1309,10 @@ uint32_t amfd_switch_stdby_actv(AVD_CL_CB *cb) {
 }
   }
 
+  status = NCSCC_RC_SUCCESS;
+done:
   TRACE_LEAVE();
-  return NCSCC_RC_SUCCESS;
+  return status;
 }
 
 /\
-- 
2.14.1

