[devel] [PATCH 0/2] Review Request for osaf: ensure takeover_requests have a lease [#2954]
Summary: osaf: ensure takeover_requests have a lease [#2954] Review request for Ticket(s): 2954 Peer Reviewer(s): Canh, Hans, Nagu Pull request to: *** LIST THE PERSON WITH PUSH ACCESS HERE *** Affected branch(es): develop Development branch: ticket-2954 Base revision: 5bb2174a323a97f626ce354d553a1dc4d1673899 Personal repository: git://git.code.sf.net/u/userid-2226215/review Impacted area Impact y/n Docsn Build systemn RPM/packaging n Configuration files n Startup scripts n SAF servicesn OpenSAF servicesn Core libraries y Samples n Tests n Other n Comments (indicate scope for each "y" above): - Acknowledgement: Canh noticed this problem and did the initial patches revision 81650ab440f6b140fd4d5d485d13e573b4e070c0 Author: Gary Lee Date: Tue, 13 Nov 2018 07:30:45 + osaf: update etcd2 and sample plugins [#2954] add timeout parameter to set and set_if_prev revision f3a64996eeb739904c7d8e1abcf4f469eeea78b4 Author: Gary Lee Date: Tue, 13 Nov 2018 06:38:27 + osaf: ensure takeover_requests have a lease [#2954] In CreateTakeoverRequest(), if the initial attempt fails, then the takeover_request is created without a lease. Furthermore, when the takeover_request result is set, it is being set without a lease, and the takeover_request is not automatically removed. Add parameter to KeyValue::Set, and remove default value for the parameter to KeyValue::Create to ensure a timeout is always set. Complete diffstat: -- src/osaf/consensus/consensus.cc | 10 --- src/osaf/consensus/key_value.cc | 11 +--- src/osaf/consensus/key_value.h | 8 -- src/osaf/consensus/plugins/etcd.plugin | 20 +++-- src/osaf/consensus/plugins/etcd3.plugin | 48 ++-- src/osaf/consensus/plugins/sample.plugin | 16 +++ 6 files changed, 79 insertions(+), 34 deletions(-) Testing Commands: - *** LIST THE COMMAND LINE TOOLS/STEPS TO TEST YOUR CHANGES *** Testing, Expected Results: -- *** PASTE COMMAND OUTPUTS / TEST RESULTS *** Conditions of Submission: - *** HOW MANY DAYS BEFORE PUSHING, CONSENSUS ETC *** Arch Built StartedLinux distro --- mipsn n mips64 n n x86 n n x86_64 y y powerpc n n powerpc64 n n Reviewer Checklist: --- [Submitters: make sure that your review doesn't trigger any checkmarks!] Your checkin has not passed review because (see checked entries): ___ Your RR template is generally incomplete; it has too many blank entries that need proper data filled in. ___ You have failed to nominate the proper persons for review and push. ___ Your patches do not have proper short+long header ___ You have grammar/spelling in your header that is unacceptable. ___ You have exceeded a sensible line length in your headers/comments/text. ___ You have failed to put in a proper Trac Ticket # into your commits. ___ You have incorrectly put/left internal data in your comments/files (i.e. internal bug tracking tool IDs, product names etc) ___ You have not given any evidence of testing beyond basic build tests. Demonstrate some level of runtime or other sanity testing. ___ You have ^M present in some of your files. These have to be removed. ___ You have needlessly changed whitespace or added whitespace crimes like trailing spaces, or spaces before tabs. ___ You have mixed real technical changes with whitespace and other cosmetic code cleanup changes. These have to be separate commits. ___ You need to refactor your submission into logical chunks; there is too much content into a single commit. ___ You have extraneous garbage in your review (merge commits etc) ___ You have giant attachments which should never have been sent; Instead you should place your content in a public tree to be pulled. ___ You have too many commits attached to an e-mail; resend as threaded commits, or place in a public tree for a pull. ___ You have resent this content multiple times without a clear indication of what has changed between each re-send. ___ You have failed to adequately and individually address all of the comments and change requests that were proposed in the initial review. ___ You have a misconfigured ~/.gitconfig file (i.e. user.name, user.email etc) ___ Your computer have a badly configured date and time; confusing the the threaded patch review. ___ Your changes affect IPC mechanism, and you don't present any results for in-service upgradability test. ___ Your changes affect user
[devel] [PATCH 1/2] osaf: ensure takeover_requests have a lease [#2954]
In CreateTakeoverRequest(), if the initial attempt fails, then the takeover_request is created without a lease. Furthermore, when the takeover_request result is set, it is being set without a lease, and the takeover_request is not automatically removed. Add parameter to KeyValue::Set, and remove default value for the parameter to KeyValue::Create to ensure a timeout is always set. --- src/osaf/consensus/consensus.cc | 10 +++--- src/osaf/consensus/key_value.cc | 11 +++--- src/osaf/consensus/key_value.h | 8 +++-- src/osaf/consensus/plugins/etcd3.plugin | 48 - 4 files changed, 57 insertions(+), 20 deletions(-) diff --git a/src/osaf/consensus/consensus.cc b/src/osaf/consensus/consensus.cc index 1136c3724..112af7df0 100644 --- a/src/osaf/consensus/consensus.cc +++ b/src/osaf/consensus/consensus.cc @@ -181,11 +181,11 @@ bool Consensus::IsWritable() const { SaAisErrorT rc; uint32_t retries = 0; constexpr uint32_t kMaxTestRetry = 3; - rc = KeyValue::Set(kTestKeyname, base::Conf::NodeName()); + rc = KeyValue::Set(kTestKeyname, base::Conf::NodeName(), 0); while (rc != SA_AIS_OK && retries < kMaxTestRetry) { ++retries; std::this_thread::sleep_for(kSleepInterval); -rc = KeyValue::Set(kTestKeyname, base::Conf::NodeName()); +rc = KeyValue::Set(kTestKeyname, base::Conf::NodeName(), 0); } if (rc == SA_AIS_OK) { @@ -338,7 +338,8 @@ SaAisErrorT Consensus::CreateTakeoverRequest(const std::string& current_owner, while (rc == SA_AIS_ERR_FAILED_OPERATION && retries < kMaxRetry) { ++retries; std::this_thread::sleep_for(kSleepInterval); -rc = KeyValue::Create(kTakeoverRequestKeyname, takeover_request); +rc = KeyValue::Create(kTakeoverRequestKeyname, takeover_request, + takeover_valid_time); } if (rc == SA_AIS_ERR_EXIST) { @@ -435,7 +436,8 @@ SaAisErrorT Consensus::WriteTakeoverResult( // previous value must match rc = - KeyValue::Set(kTakeoverRequestKeyname, takeover_result, takeover_request); + KeyValue::Set(kTakeoverRequestKeyname, takeover_result, +takeover_request, takeover_valid_time); return rc; } diff --git a/src/osaf/consensus/key_value.cc b/src/osaf/consensus/key_value.cc index cf5c02213..109cf9f69 100644 --- a/src/osaf/consensus/key_value.cc +++ b/src/osaf/consensus/key_value.cc @@ -58,13 +58,14 @@ SaAisErrorT KeyValue::Get(const std::string& key, std::string& value) { } } -SaAisErrorT KeyValue::Set(const std::string& key, const std::string& value) { +SaAisErrorT KeyValue::Set(const std::string& key, const std::string& value, + const unsigned int timeout) { TRACE_ENTER(); const std::string kv_store_cmd = base::GetEnv("FMS_KEYVALUE_STORE_PLUGIN_CMD", ""); const std::string command(kv_store_cmd + " set \"" + key + "\" \"" + value + -"\""); +"\" " + std::to_string(timeout)); std::string output; int rc = KeyValue::Execute(command, output); @@ -76,11 +77,13 @@ SaAisErrorT KeyValue::Set(const std::string& key, const std::string& value) { } SaAisErrorT KeyValue::Set(const std::string& key, const std::string& value, - const std::string& prev_value) { + const std::string& prev_value, + const unsigned int timeout) { const std::string kv_store_cmd = base::GetEnv("FMS_KEYVALUE_STORE_PLUGIN_CMD", ""); const std::string command(kv_store_cmd + " set_if_prev \"" + key + "\" \"" + -value + "\" \"" + prev_value + "\""); +value + "\" \"" + prev_value + +"\" " + std::to_string(timeout)); std::string output; int rc = KeyValue::Execute(command, output); diff --git a/src/osaf/consensus/key_value.h b/src/osaf/consensus/key_value.h index af9717595..043a964ae 100644 --- a/src/osaf/consensus/key_value.h +++ b/src/osaf/consensus/key_value.h @@ -30,15 +30,17 @@ class KeyValue { static SaAisErrorT Get(const std::string& key, std::string& value); // Set key to value - static SaAisErrorT Set(const std::string& key, const std::string& value); + static SaAisErrorT Set(const std::string& key, const std::string& value, + const unsigned int timeout); // Set key to value only if prev value matches static SaAisErrorT Set(const std::string& key, const std::string& value, - const std::string& prev_value); + const std::string& prev_value, + const unsigned int timeout); // Create key, and set to value. Fails if key already exists. static SaAisErrorT Create(const std::string& key, const std::string& value, -const unsigned int timeout = 0); +const unsigned int timeout); // Erase key static
[devel] [PATCH 2/2] osaf: update etcd2 and sample plugins [#2954]
add timeout parameter to set and set_if_prev --- src/osaf/consensus/plugins/etcd.plugin | 20 src/osaf/consensus/plugins/sample.plugin | 16 ++-- 2 files changed, 22 insertions(+), 14 deletions(-) diff --git a/src/osaf/consensus/plugins/etcd.plugin b/src/osaf/consensus/plugins/etcd.plugin index b585bb657..f62cc892b 100644 --- a/src/osaf/consensus/plugins/etcd.plugin +++ b/src/osaf/consensus/plugins/etcd.plugin @@ -45,15 +45,17 @@ get() { # params: # $1 - # $2 - +# $3 - # returns: # 0 - success # non-zero - failure setkey() { readonly key="$1" readonly value="$2" + readonly timeout="$3" if etcdctl $etcd_options --timeout $etcd_timeout set "$directory$key" \ -"$value" >/dev/null +"$value" --ttl "$timeout" >/dev/null then return 0 else @@ -98,6 +100,7 @@ create_key() { # $1 - # $2 - # $3 - +# $4 - # returns: # 0 - success # non-zero - failure @@ -105,9 +108,10 @@ setkey_match_prev() { readonly key="$1" readonly value="$2" readonly prev="$3" + readonly timeout="$4" if etcdctl $etcd_options --timeout $etcd_timeout set "$directory$key" \ -"$value" --swap-with-value "$prev" >/dev/null +"$value" --swap-with-value "$prev" --ttl "$timeout" >/dev/null then return 0 else @@ -263,19 +267,19 @@ case "$1" in exit $? ;; set) -if [ "$#" -ne 3 ]; then - echo "Usage: $0 set " +if [ "$#" -ne 4 ]; then + echo "Usage: $0 set " exit 1 fi -setkey "$2" "$3" +setkey "$2" "$3" "$4" exit $? ;; set_if_prev) -if [ "$#" -ne 4 ]; then - echo "Usage: $0 set " +if [ "$#" -ne 5 ]; then + echo "Usage: $0 set" exit 1 fi -setkey_match_prev "$2" "$3" "$4" +setkey_match_prev "$2" "$3" "$4" "$5" exit $? ;; create) diff --git a/src/osaf/consensus/plugins/sample.plugin b/src/osaf/consensus/plugins/sample.plugin index 6f6c71f6f..fc4c54c17 100644 --- a/src/osaf/consensus/plugins/sample.plugin +++ b/src/osaf/consensus/plugins/sample.plugin @@ -35,12 +35,14 @@ get() { # params: # $1 - # $2 - +# $3 - # returns: # 0 - success # non-zero - failure setkey() { readonly key="$1" readonly value="$2" + readonly timeout="$3" ... } @@ -69,6 +71,7 @@ create_key() { # $1 - # $2 - # $3 - +# $4 - # returns: # 0 - success # non-zero - failure @@ -76,6 +79,7 @@ setkey_match_prev() { readonly key="$1" readonly value="$2" readonly prev="$3" + readonly timeout="$4" ... } @@ -155,19 +159,19 @@ case "$1" in exit $? ;; set) -if [ "$#" -ne 3 ]; then - echo "Usage: $0 set " +if [ "$#" -ne 4 ]; then + echo "Usage: $0 set " exit 1 fi -setkey "$2" "$3" +setkey "$2" "$3" "$4" exit $? ;; set_if_prev) -if [ "$#" -ne 4 ]; then - echo "Usage: $0 set " +if [ "$#" -ne 5 ]; then + echo "Usage: $0 set" exit 1 fi -setkey_match_prev "$2" "$3" "$4" +setkey_match_prev "$2" "$3" "$4" "$5" exit $? ;; create) -- 2.17.1 ___ Opensaf-devel mailing list Opensaf-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/opensaf-devel
[devel] [PATCH 1/1] amfd: check consensus service is writable [#2957]
A check to make sure the consensus service is writable (ie. the SC is in a partition with quorum) is present in avd_node_failover(). However, [#2918] means this function is not always being called. We need to move it. --- src/amf/amfd/ndfsm.cc | 1 + src/amf/amfd/ndproc.cc | 10 +++--- src/amf/amfd/proc.h| 1 + 3 files changed, 9 insertions(+), 3 deletions(-) diff --git a/src/amf/amfd/ndfsm.cc b/src/amf/amfd/ndfsm.cc index 301de835b..218551cdd 100644 --- a/src/amf/amfd/ndfsm.cc +++ b/src/amf/amfd/ndfsm.cc @@ -817,6 +817,7 @@ void avd_mds_avnd_down_evh(AVD_CL_CB *cb, AVD_EVT *evt) { if (cb->node_failover_delay == 0) { avd_node_failover(node); } + check_quorum(); node->node_info.member = SA_FALSE; // Update standby out of sync if standby sc goes down if (avd_cb->node_id_avd_other == node->node_info.nodeId) { diff --git a/src/amf/amfd/ndproc.cc b/src/amf/amfd/ndproc.cc index 853a68b6e..c4eebb174 100644 --- a/src/amf/amfd/ndproc.cc +++ b/src/amf/amfd/ndproc.cc @@ -1242,6 +1242,12 @@ void avd_node_failover(AVD_AVND *node, const bool mw_only) { avd_node_down_appl_susi_failover(avd_cb, node); } + TRACE_LEAVE(); +} + +void check_quorum() { + TRACE_ENTER(); + Consensus consensus_service; if (consensus_service.IsRemoteFencingEnabled() == false && consensus_service.IsWritable() == false) { @@ -1250,6 +1256,4 @@ void avd_node_failover(AVD_AVND *node, const bool mw_only) { opensaf_reboot(0, nullptr, "Quorum lost. Rebooting this node to prevent split-brain"); } - - TRACE_LEAVE(); -} +} \ No newline at end of file diff --git a/src/amf/amfd/proc.h b/src/amf/amfd/proc.h index 99d1cbfc2..a37821829 100644 --- a/src/amf/amfd/proc.h +++ b/src/amf/amfd/proc.h @@ -96,6 +96,7 @@ void avd_process_hb_event(AVD_CL_CB *cb_now, struct AVD_EVT *evt); extern void avd_node_mark_absent(AVD_AVND *node); extern void avd_tmr_snd_hb_evh(AVD_CL_CB *cb, AVD_EVT *evt); extern void avd_node_failover(AVD_AVND *node, const bool mw_only = false); +extern void check_quorum(); extern AVD_SU *get_other_su_from_oper_list(AVD_SU *su); extern void su_complete_admin_op(AVD_SU *su, SaAisErrorT result); extern void comp_complete_admin_op(AVD_COMP *comp, SaAisErrorT result); -- 2.17.1 ___ Opensaf-devel mailing list Opensaf-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/opensaf-devel
[devel] [PATCH 0/1] Review Request for amfd: check consensus service is writable [#2957]
Summary: amfd: check consensus service is writable [#2957] Review request for Ticket(s): 2957 Peer Reviewer(s): Hans, Nagu, Minh Pull request to: *** LIST THE PERSON WITH PUSH ACCESS HERE *** Affected branch(es): develop Development branch: ticket-2957 Base revision: 5bb2174a323a97f626ce354d553a1dc4d1673899 Personal repository: git://git.code.sf.net/u/userid-2226215/review Impacted area Impact y/n Docsn Build systemn RPM/packaging n Configuration files n Startup scripts n SAF servicesy OpenSAF servicesn Core libraries n Samples n Tests n Other n Comments (indicate scope for each "y" above): - revision 34dc36653385a670198acdbc66f4b53699524696 Author: Gary Lee Date: Tue, 13 Nov 2018 06:23:52 + amfd: check consensus service is writable [#2957] A check to make sure the consensus service is writable (ie. the SC is in a partition with quorum) is present in avd_node_failover(). However, [#2918] means this function is not always being called. We need to move it. Complete diffstat: -- src/amf/amfd/ndfsm.cc | 1 + src/amf/amfd/ndproc.cc | 10 +++--- src/amf/amfd/proc.h| 1 + 3 files changed, 9 insertions(+), 3 deletions(-) Testing Commands: - *** LIST THE COMMAND LINE TOOLS/STEPS TO TEST YOUR CHANGES *** Testing, Expected Results: -- *** PASTE COMMAND OUTPUTS / TEST RESULTS *** Conditions of Submission: - *** HOW MANY DAYS BEFORE PUSHING, CONSENSUS ETC *** Arch Built StartedLinux distro --- mipsn n mips64 n n x86 n n x86_64 y y powerpc n n powerpc64 n n Reviewer Checklist: --- [Submitters: make sure that your review doesn't trigger any checkmarks!] Your checkin has not passed review because (see checked entries): ___ Your RR template is generally incomplete; it has too many blank entries that need proper data filled in. ___ You have failed to nominate the proper persons for review and push. ___ Your patches do not have proper short+long header ___ You have grammar/spelling in your header that is unacceptable. ___ You have exceeded a sensible line length in your headers/comments/text. ___ You have failed to put in a proper Trac Ticket # into your commits. ___ You have incorrectly put/left internal data in your comments/files (i.e. internal bug tracking tool IDs, product names etc) ___ You have not given any evidence of testing beyond basic build tests. Demonstrate some level of runtime or other sanity testing. ___ You have ^M present in some of your files. These have to be removed. ___ You have needlessly changed whitespace or added whitespace crimes like trailing spaces, or spaces before tabs. ___ You have mixed real technical changes with whitespace and other cosmetic code cleanup changes. These have to be separate commits. ___ You need to refactor your submission into logical chunks; there is too much content into a single commit. ___ You have extraneous garbage in your review (merge commits etc) ___ You have giant attachments which should never have been sent; Instead you should place your content in a public tree to be pulled. ___ You have too many commits attached to an e-mail; resend as threaded commits, or place in a public tree for a pull. ___ You have resent this content multiple times without a clear indication of what has changed between each re-send. ___ You have failed to adequately and individually address all of the comments and change requests that were proposed in the initial review. ___ You have a misconfigured ~/.gitconfig file (i.e. user.name, user.email etc) ___ Your computer have a badly configured date and time; confusing the the threaded patch review. ___ Your changes affect IPC mechanism, and you don't present any results for in-service upgradability test. ___ Your changes affect user manual and documentation, your patch series do not contain the patch that updates the Doxygen manual. ___ Opensaf-devel mailing list Opensaf-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/opensaf-devel
Re: [devel] [PATCH 1/1] amf: active amfd should check nodes after reinit with imm [#2949]
Hi Thuan Ack with one comment (review only), I guess avd_check_nodes_after_renit_imn() and should be avd_check_nodes_after_reinit_imm()? I'll change it and push on your behalf. Thanks Gary On 12/11/18 7:25 pm, thuan.tran wrote: - When AMFD got IMM BAD_HANDLE, it will try to finalize current OI and reinit new OI, it make some callbacks are removed without execution. Try to dispatch OI before finalize it to reinit. - After reinit OI, check node db to find out node which is not exist in IMM (in case ccb apply delete node miss execution) then cleanup SU/COMP relate to the node and delete the node. --- src/amf/amfd/imm.cc | 4 +++ src/amf/amfd/node.cc | 80 src/amf/amfd/node.h | 1 + src/amf/amfd/su.cc | 79 +++ 4 files changed, 72 insertions(+), 92 deletions(-) diff --git a/src/amf/amfd/imm.cc b/src/amf/amfd/imm.cc index 82d2b13..1b14383 100644 --- a/src/amf/amfd/imm.cc +++ b/src/amf/amfd/imm.cc @@ -41,6 +41,7 @@ #include "amf/common/amf_defs.h" #include "amf/amfd/imm.h" #include "amf/amfd/cluster.h" +#include "amf/amfd/node.h" #include "amf/amfd/app.h" #include "amf/amfd/sgtype.h" #include "amf/amfd/sg.h" @@ -2132,6 +2133,8 @@ static void *avd_imm_reinit_bg_thread(void *_cb) { immutilWrapperProfile.errorsAreFatal = 0; + /* Try to dispatch imm cb if any, e.g: apply cb, before finalize OI */ + (void)saImmOiDispatch(avd_cb->immOiHandle, SA_DISPATCH_ALL); while (++no_of_retries < MAX_NO_RETRIES) { (void)saImmOiFinalize(avd_cb->immOiHandle); @@ -2167,6 +2170,7 @@ static void *avd_imm_reinit_bg_thread(void *_cb) { osaf_mutex_unlock_ordie(_reinit_mutex); exit(EXIT_FAILURE); } + avd_check_nodes_after_renit_imm(); } else { /* become applier and re-read the config */ rc = avd_imm_applier_set(); diff --git a/src/amf/amfd/node.cc b/src/amf/amfd/node.cc index 201f1fc..b6ebcc8 100644 --- a/src/amf/amfd/node.cc +++ b/src/amf/amfd/node.cc @@ -25,6 +25,7 @@ #include "amf/amfd/amfd.h" #include "amf/amfd/cluster.h" #include "amf/amfd/imm.h" +#include "amf/amfd/util.h" #include AmfDb *node_name_db = 0; /* SaNameT index */ @@ -153,57 +154,23 @@ AVD_AVND *avd_node_new(const std::string ) { void avd_node_delete(AVD_AVND *node) { TRACE_ENTER(); osafassert(node->pg_csi_list.n_nodes == 0); - if (node->node_info.nodeId) avd_node_delete_nodeid(node); - /* Check if the SUs and related objects are still left. This can - happen on Standby Amfd when it has just read the configuration - and before it becomes applier, Act Amfd deletes SUs. Those SUs - will be left out at Standby Amfd. Though this could be rare.*/ - if (avd_cb->avail_state_avd != SA_AMF_HA_ACTIVE) { -if (node->list_of_su.empty() != true) { - std::set su_list; - std::set comp_list; - for (const auto : node->list_of_su) su_list.insert(su->name); - for (std::set::const_iterator iter = su_list.begin(); - iter != su_list.end(); ++iter) { -AVD_SU *su = su_db->find(*iter); -TRACE("Standby Amfd, su '%s' not deleted", su->name.c_str()); -for (const auto : su->list_of_comp) - comp_list.insert(Amf::to_string(>comp_info.name)); -for (std::set::const_iterator iter1 = comp_list.begin(); - iter1 != comp_list.end(); ++iter1) { - AVD_COMP *comp = comp_db->find(*iter1); - TRACE("Standby Amfd, comp '%s' not deleted", -osaf_extended_name_borrow(>comp_info.name)); - - std::map::iterator it = - compcstype_db->begin(); - while (it != compcstype_db->end()) { -AVD_COMPCS_TYPE *compcstype = it->second; -if (compcstype->comp == comp) { - TRACE("Standby Amfd, compcstype '%s' not deleted", -compcstype->name.c_str()); - it = compcstype_db->erase(it); - delete compcstype; -} else - ++it; - } + if (node->node_info.nodeId) { +avd_node_delete_nodeid(node); + } - /* Delete the Comp. */ - struct CcbUtilOperationData opdata; - osaf_extended_name_alloc( - osaf_extended_name_borrow(>comp_info.name), - ); - comp_ccb_apply_delete_hdlr(); -} -comp_list.clear(); -/* Delete the SU. */ -struct CcbUtilOperationData opdata; -opdata.userData = su; -su_ccb_apply_delete_hdlr(); - } - su_list.clear(); -} + std::set su_list; + for (const auto : node->list_of_ncs_su) su_list.insert(su->name); + for (const auto : node->list_of_su) su_list.insert(su->name); + for (std::set::const_iterator iter = su_list.begin(); + iter != su_list.end(); ++iter) { +AVD_SU *su = su_db->find(*iter); +LOG_WA("su '%s' not deleted, delete it", su->name.c_str()); +struct
Re: [devel] Review Request for amf: Update PR [#2929]
Ok, thanks -Nagu -Original Message- From: Minh Hon Chau [mailto:minh.c...@dektech.com.au] Sent: 12 November 2018 17:51 To: Nagendra Kumar; 'Hans Nordeback'; 'Gary Lee' Cc: opensaf-devel@lists.sourceforge.net Subject: Re: Review Request for amf: Update PR [#2929] Hi Nagu, It's not the recovery in specification, I mean a new attribute. Thanks, Minh On 12/11/18 9:53 pm, Nagendra Kumar wrote: > Hi Minh, > Thanks for your response. >>> In future, I think we can make it as configurable recovery method, so up to >>> applications to choose from. > You mean recommended recovery option? But how will it work? > > > Thanks > -Nagendra > High Availability Solutions > www.hasolutions.in > cont...@hasolutions.in > Hyderabad, India: +91-9866424860 | Delaware, USA: +1 508-422-7725 > > -Original Message- > From: Minh Hon Chau [mailto:minh.c...@dektech.com.au] > Sent: 12 November 2018 16:16 > To: Nagendra Kumar; 'Hans Nordeback'; 'Gary Lee' > Cc: opensaf-devel@lists.sourceforge.net > Subject: Re: Review Request for amf: Update PR [#2929] > > Hi Nagu, > > Agree with you that we can do it for 2N. However the mutual active > workload has to be exclusively one at a time, so there may be some sort > of corruption to applications. But it also depends on how internal > application logics are implemented. So reboot the node is a choice of > safety for now. In future, I think we can make it as configurable > recovery method, so up to applications to choose from. > > Thanks, > > Minh > > On 12/11/18 7:49 pm, Nagendra Kumar wrote: >> Hi Minh, >> Ack from me. >> Btw, why did you opt to remove assignments and restart admin operation for >> Nway Act and No Red. >> The same could have done in 2N by removing the assignments and restart and >> then provide fresh assignments. >> >> Thanks >> -Nagendra >> High Availability Solutions >> www.hasolutions.in >> cont...@hasolutions.in >> Hyderabad, India: +91-9866424860 | Delaware, USA: +1 508-422-7725 >> >> -Original Message- >> From: Minh Hon Chau [mailto:minh.c...@dektech.com.au] >> Sent: 12 November 2018 08:04 >> To: Hans Nordeback; Nagendra Kumar; Gary Lee >> Cc: opensaf-devel@lists.sourceforge.net >> Subject: Review Request for amf: Update PR [#2929] >> >> Hi all, >> >> Document update for #2929 in item 2.2.18 to be reviewed. >> >> https://sourceforge.net/p/opensaf/tickets/2929/attachment/OpenSAF_AMF_PR_2929.odt >> >> Thanks, >> >> Minh >> >> >> > ___ Opensaf-devel mailing list Opensaf-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/opensaf-devel
Re: [devel] Review Request for amf: Update PR [#2929]
Hi Nagu, It's not the recovery in specification, I mean a new attribute. Thanks, Minh On 12/11/18 9:53 pm, Nagendra Kumar wrote: Hi Minh, Thanks for your response. In future, I think we can make it as configurable recovery method, so up to applications to choose from. You mean recommended recovery option? But how will it work? Thanks -Nagendra High Availability Solutions www.hasolutions.in cont...@hasolutions.in Hyderabad, India: +91-9866424860 | Delaware, USA: +1 508-422-7725 -Original Message- From: Minh Hon Chau [mailto:minh.c...@dektech.com.au] Sent: 12 November 2018 16:16 To: Nagendra Kumar; 'Hans Nordeback'; 'Gary Lee' Cc: opensaf-devel@lists.sourceforge.net Subject: Re: Review Request for amf: Update PR [#2929] Hi Nagu, Agree with you that we can do it for 2N. However the mutual active workload has to be exclusively one at a time, so there may be some sort of corruption to applications. But it also depends on how internal application logics are implemented. So reboot the node is a choice of safety for now. In future, I think we can make it as configurable recovery method, so up to applications to choose from. Thanks, Minh On 12/11/18 7:49 pm, Nagendra Kumar wrote: Hi Minh, Ack from me. Btw, why did you opt to remove assignments and restart admin operation for Nway Act and No Red. The same could have done in 2N by removing the assignments and restart and then provide fresh assignments. Thanks -Nagendra High Availability Solutions www.hasolutions.in cont...@hasolutions.in Hyderabad, India: +91-9866424860 | Delaware, USA: +1 508-422-7725 -Original Message- From: Minh Hon Chau [mailto:minh.c...@dektech.com.au] Sent: 12 November 2018 08:04 To: Hans Nordeback; Nagendra Kumar; Gary Lee Cc: opensaf-devel@lists.sourceforge.net Subject: Review Request for amf: Update PR [#2929] Hi all, Document update for #2929 in item 2.2.18 to be reviewed. https://sourceforge.net/p/opensaf/tickets/2929/attachment/OpenSAF_AMF_PR_2929.odt Thanks, Minh ___ Opensaf-devel mailing list Opensaf-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/opensaf-devel
Re: [devel] Review Request for amf: Update PR [#2929]
Hi Minh, Thanks for your response. >> In future, I think we can make it as configurable recovery method, so up to >> applications to choose from. You mean recommended recovery option? But how will it work? Thanks -Nagendra High Availability Solutions www.hasolutions.in cont...@hasolutions.in Hyderabad, India: +91-9866424860 | Delaware, USA: +1 508-422-7725 -Original Message- From: Minh Hon Chau [mailto:minh.c...@dektech.com.au] Sent: 12 November 2018 16:16 To: Nagendra Kumar; 'Hans Nordeback'; 'Gary Lee' Cc: opensaf-devel@lists.sourceforge.net Subject: Re: Review Request for amf: Update PR [#2929] Hi Nagu, Agree with you that we can do it for 2N. However the mutual active workload has to be exclusively one at a time, so there may be some sort of corruption to applications. But it also depends on how internal application logics are implemented. So reboot the node is a choice of safety for now. In future, I think we can make it as configurable recovery method, so up to applications to choose from. Thanks, Minh On 12/11/18 7:49 pm, Nagendra Kumar wrote: > Hi Minh, > Ack from me. > Btw, why did you opt to remove assignments and restart admin operation for > Nway Act and No Red. > The same could have done in 2N by removing the assignments and restart and > then provide fresh assignments. > > Thanks > -Nagendra > High Availability Solutions > www.hasolutions.in > cont...@hasolutions.in > Hyderabad, India: +91-9866424860 | Delaware, USA: +1 508-422-7725 > > -Original Message- > From: Minh Hon Chau [mailto:minh.c...@dektech.com.au] > Sent: 12 November 2018 08:04 > To: Hans Nordeback; Nagendra Kumar; Gary Lee > Cc: opensaf-devel@lists.sourceforge.net > Subject: Review Request for amf: Update PR [#2929] > > Hi all, > > Document update for #2929 in item 2.2.18 to be reviewed. > > https://sourceforge.net/p/opensaf/tickets/2929/attachment/OpenSAF_AMF_PR_2929.odt > > Thanks, > > Minh > > > ___ Opensaf-devel mailing list Opensaf-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/opensaf-devel
Re: [devel] Review Request for amf: Update PR [#2929]
Hi Nagu, Agree with you that we can do it for 2N. However the mutual active workload has to be exclusively one at a time, so there may be some sort of corruption to applications. But it also depends on how internal application logics are implemented. So reboot the node is a choice of safety for now. In future, I think we can make it as configurable recovery method, so up to applications to choose from. Thanks, Minh On 12/11/18 7:49 pm, Nagendra Kumar wrote: Hi Minh, Ack from me. Btw, why did you opt to remove assignments and restart admin operation for Nway Act and No Red. The same could have done in 2N by removing the assignments and restart and then provide fresh assignments. Thanks -Nagendra High Availability Solutions www.hasolutions.in cont...@hasolutions.in Hyderabad, India: +91-9866424860 | Delaware, USA: +1 508-422-7725 -Original Message- From: Minh Hon Chau [mailto:minh.c...@dektech.com.au] Sent: 12 November 2018 08:04 To: Hans Nordeback; Nagendra Kumar; Gary Lee Cc: opensaf-devel@lists.sourceforge.net Subject: Review Request for amf: Update PR [#2929] Hi all, Document update for #2929 in item 2.2.18 to be reviewed. https://sourceforge.net/p/opensaf/tickets/2929/attachment/OpenSAF_AMF_PR_2929.odt Thanks, Minh ___ Opensaf-devel mailing list Opensaf-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/opensaf-devel
Re: [devel] Review Request for amf: Update PR [#2929]
Hi Minh, Ack from me. Btw, why did you opt to remove assignments and restart admin operation for Nway Act and No Red. The same could have done in 2N by removing the assignments and restart and then provide fresh assignments. Thanks -Nagendra High Availability Solutions www.hasolutions.in cont...@hasolutions.in Hyderabad, India: +91-9866424860 | Delaware, USA: +1 508-422-7725 -Original Message- From: Minh Hon Chau [mailto:minh.c...@dektech.com.au] Sent: 12 November 2018 08:04 To: Hans Nordeback; Nagendra Kumar; Gary Lee Cc: opensaf-devel@lists.sourceforge.net Subject: Review Request for amf: Update PR [#2929] Hi all, Document update for #2929 in item 2.2.18 to be reviewed. https://sourceforge.net/p/opensaf/tickets/2929/attachment/OpenSAF_AMF_PR_2929.odt Thanks, Minh ___ Opensaf-devel mailing list Opensaf-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/opensaf-devel
[devel] [PATCH 0/1] Review Request for amf: active amfd should check nodes after reinit with imm [#2949] V4
Summary: amf: active amfd should check nodes after reinit with imm [#2949] Review request for Ticket(s): 2949 Peer Reviewer(s): Gary, Minh Pull request to: *** LIST THE PERSON WITH PUSH ACCESS HERE *** Affected branch(es): develop Development branch: ticket-2949 Base revision: f8a6848a1cdbff0b518c3db951e4689e260226c7 Personal repository: git://git.code.sf.net/u/thuantr/review Impacted area Impact y/n Docsn Build systemn RPM/packaging n Configuration files n Startup scripts n SAF servicesy OpenSAF servicesn Core libraries n Samples n Tests n Other n Comments (indicate scope for each "y" above): - N/A revision b977065a3257758e7b043d60174487504dfba70c Author: thuan.tran Date: Mon, 12 Nov 2018 15:17:54 +0700 amf: active amfd should check nodes after reinit with imm [#2949] - When AMFD got IMM BAD_HANDLE, it will try to finalize current OI and reinit new OI, it make some callbacks are removed without execution. Try to dispatch OI before finalize it to reinit. - After reinit OI, check node db to find out node which is not exist in IMM (in case ccb apply delete node miss execution) then cleanup SU/COMP relate to the node and delete the node. Complete diffstat: -- src/amf/amfd/imm.cc | 4 +++ src/amf/amfd/node.cc | 80 src/amf/amfd/node.h | 1 + src/amf/amfd/su.cc | 79 +++ 4 files changed, 72 insertions(+), 92 deletions(-) Testing Commands: - N/A Testing, Expected Results: -- N/A Conditions of Submission: - ACK by reviewers Arch Built StartedLinux distro --- mipsn n mips64 n n x86 n n x86_64 y y powerpc n n powerpc64 n n Reviewer Checklist: --- [Submitters: make sure that your review doesn't trigger any checkmarks!] Your checkin has not passed review because (see checked entries): ___ Your RR template is generally incomplete; it has too many blank entries that need proper data filled in. ___ You have failed to nominate the proper persons for review and push. ___ Your patches do not have proper short+long header ___ You have grammar/spelling in your header that is unacceptable. ___ You have exceeded a sensible line length in your headers/comments/text. ___ You have failed to put in a proper Trac Ticket # into your commits. ___ You have incorrectly put/left internal data in your comments/files (i.e. internal bug tracking tool IDs, product names etc) ___ You have not given any evidence of testing beyond basic build tests. Demonstrate some level of runtime or other sanity testing. ___ You have ^M present in some of your files. These have to be removed. ___ You have needlessly changed whitespace or added whitespace crimes like trailing spaces, or spaces before tabs. ___ You have mixed real technical changes with whitespace and other cosmetic code cleanup changes. These have to be separate commits. ___ You need to refactor your submission into logical chunks; there is too much content into a single commit. ___ You have extraneous garbage in your review (merge commits etc) ___ You have giant attachments which should never have been sent; Instead you should place your content in a public tree to be pulled. ___ You have too many commits attached to an e-mail; resend as threaded commits, or place in a public tree for a pull. ___ You have resent this content multiple times without a clear indication of what has changed between each re-send. ___ You have failed to adequately and individually address all of the comments and change requests that were proposed in the initial review. ___ You have a misconfigured ~/.gitconfig file (i.e. user.name, user.email etc) ___ Your computer have a badly configured date and time; confusing the the threaded patch review. ___ Your changes affect IPC mechanism, and you don't present any results for in-service upgradability test. ___ Your changes affect user manual and documentation, your patch series do not contain the patch that updates the Doxygen manual. ___ Opensaf-devel mailing list Opensaf-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/opensaf-devel
[devel] [PATCH 1/1] amf: active amfd should check nodes after reinit with imm [#2949]
- When AMFD got IMM BAD_HANDLE, it will try to finalize current OI and reinit new OI, it make some callbacks are removed without execution. Try to dispatch OI before finalize it to reinit. - After reinit OI, check node db to find out node which is not exist in IMM (in case ccb apply delete node miss execution) then cleanup SU/COMP relate to the node and delete the node. --- src/amf/amfd/imm.cc | 4 +++ src/amf/amfd/node.cc | 80 src/amf/amfd/node.h | 1 + src/amf/amfd/su.cc | 79 +++ 4 files changed, 72 insertions(+), 92 deletions(-) diff --git a/src/amf/amfd/imm.cc b/src/amf/amfd/imm.cc index 82d2b13..1b14383 100644 --- a/src/amf/amfd/imm.cc +++ b/src/amf/amfd/imm.cc @@ -41,6 +41,7 @@ #include "amf/common/amf_defs.h" #include "amf/amfd/imm.h" #include "amf/amfd/cluster.h" +#include "amf/amfd/node.h" #include "amf/amfd/app.h" #include "amf/amfd/sgtype.h" #include "amf/amfd/sg.h" @@ -2132,6 +2133,8 @@ static void *avd_imm_reinit_bg_thread(void *_cb) { immutilWrapperProfile.errorsAreFatal = 0; + /* Try to dispatch imm cb if any, e.g: apply cb, before finalize OI */ + (void)saImmOiDispatch(avd_cb->immOiHandle, SA_DISPATCH_ALL); while (++no_of_retries < MAX_NO_RETRIES) { (void)saImmOiFinalize(avd_cb->immOiHandle); @@ -2167,6 +2170,7 @@ static void *avd_imm_reinit_bg_thread(void *_cb) { osaf_mutex_unlock_ordie(_reinit_mutex); exit(EXIT_FAILURE); } + avd_check_nodes_after_renit_imm(); } else { /* become applier and re-read the config */ rc = avd_imm_applier_set(); diff --git a/src/amf/amfd/node.cc b/src/amf/amfd/node.cc index 201f1fc..b6ebcc8 100644 --- a/src/amf/amfd/node.cc +++ b/src/amf/amfd/node.cc @@ -25,6 +25,7 @@ #include "amf/amfd/amfd.h" #include "amf/amfd/cluster.h" #include "amf/amfd/imm.h" +#include "amf/amfd/util.h" #include AmfDb *node_name_db = 0; /* SaNameT index */ @@ -153,57 +154,23 @@ AVD_AVND *avd_node_new(const std::string ) { void avd_node_delete(AVD_AVND *node) { TRACE_ENTER(); osafassert(node->pg_csi_list.n_nodes == 0); - if (node->node_info.nodeId) avd_node_delete_nodeid(node); - /* Check if the SUs and related objects are still left. This can - happen on Standby Amfd when it has just read the configuration - and before it becomes applier, Act Amfd deletes SUs. Those SUs - will be left out at Standby Amfd. Though this could be rare.*/ - if (avd_cb->avail_state_avd != SA_AMF_HA_ACTIVE) { -if (node->list_of_su.empty() != true) { - std::set su_list; - std::set comp_list; - for (const auto : node->list_of_su) su_list.insert(su->name); - for (std::set::const_iterator iter = su_list.begin(); - iter != su_list.end(); ++iter) { -AVD_SU *su = su_db->find(*iter); -TRACE("Standby Amfd, su '%s' not deleted", su->name.c_str()); -for (const auto : su->list_of_comp) - comp_list.insert(Amf::to_string(>comp_info.name)); -for (std::set::const_iterator iter1 = comp_list.begin(); - iter1 != comp_list.end(); ++iter1) { - AVD_COMP *comp = comp_db->find(*iter1); - TRACE("Standby Amfd, comp '%s' not deleted", -osaf_extended_name_borrow(>comp_info.name)); - - std::map::iterator it = - compcstype_db->begin(); - while (it != compcstype_db->end()) { -AVD_COMPCS_TYPE *compcstype = it->second; -if (compcstype->comp == comp) { - TRACE("Standby Amfd, compcstype '%s' not deleted", -compcstype->name.c_str()); - it = compcstype_db->erase(it); - delete compcstype; -} else - ++it; - } + if (node->node_info.nodeId) { +avd_node_delete_nodeid(node); + } - /* Delete the Comp. */ - struct CcbUtilOperationData opdata; - osaf_extended_name_alloc( - osaf_extended_name_borrow(>comp_info.name), - ); - comp_ccb_apply_delete_hdlr(); -} -comp_list.clear(); -/* Delete the SU. */ -struct CcbUtilOperationData opdata; -opdata.userData = su; -su_ccb_apply_delete_hdlr(); - } - su_list.clear(); -} + std::set su_list; + for (const auto : node->list_of_ncs_su) su_list.insert(su->name); + for (const auto : node->list_of_su) su_list.insert(su->name); + for (std::set::const_iterator iter = su_list.begin(); + iter != su_list.end(); ++iter) { +AVD_SU *su = su_db->find(*iter); +LOG_WA("su '%s' not deleted, delete it", su->name.c_str()); +struct CcbUtilOperationData opdata; +opdata.userData = su; +su_ccb_apply_delete_hdlr(); } + su_list.clear(); + m_AVSV_SEND_CKPT_UPDT_ASYNC_RMV(avd_cb, node, AVSV_CKPT_AVD_NODE_CONFIG); node_name_db->erase(node->name); delete node; @@ -1678,3 +1645,18 @@ bool