Hi,

When using corosync + pcmk v1 (starting both corosync and pacemakerd)
as the quorum provider, and I think also with heartbeat or anything
other than cman, the CIB at startup will not contain a <node_state/>
entry for the nodes that are not in the cluster.

When using cman as the quorum provider, instead, there will be a
<node_state/> for every node known to cman, because
lib/common/ais.c:cman_event_callback calls crm_update_peer for every
node reported by cman_get_nodes.

Something similar will happen when using corosync+pcmkv1 if corosync is
started on N nodes but pacemakerd is started only on N-M nodes.

All of this breaks 'startup-fencing' because, as far as I understand,
the logic is this:

1) At startup, all nodes are marked as unclean (in
lib/pengine/unpack.c:unpack_node).
2) lib/pengine/unpack.c:unpack_status iterates only over the
<node_state/> entries present in the CIB status section, first resetting
each node to clean and then marking it unclean again if certain
conditions are met.
3) In pengine/allocate.c:stage6, all the unclean nodes are fenced.

In the situations above there is a <node_state/> in the CIB status
section even for nodes where pacemakerd is not running, and startup
fencing won't happen for them because no condition in unpack_status
marks them as unclean.
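
To make the sequence above concrete, here is a minimal, self-contained
sketch in C. It is not Pacemaker code: struct sketch_node and the
sketch_* functions are hypothetical names made up for illustration, and
the "conditions" of step 2 are reduced to nothing at all, which is how I
understand them to behave for a node that never started pacemakerd.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the policy engine's node data. */
struct sketch_node {
    bool has_node_state; /* a <node_state/> entry exists in the CIB status */
    bool pacemakerd_up;  /* informational only: never consulted below */
    bool unclean;        /* node is a candidate for fencing */
};

/* Step 1 (unpack_node in the description): every node starts unclean. */
static void sketch_unpack_nodes(struct sketch_node *nodes, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        nodes[i].unclean = true;
    }
}

/* Step 2 (unpack_status in the description): only nodes that have a
 * <node_state/> entry are visited, and each is reset to clean.
 * Assumption: no later condition marks such a node unclean again. */
static void sketch_unpack_status(struct sketch_node *nodes, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (nodes[i].has_node_state) {
            nodes[i].unclean = false;
        }
    }
}

/* Steps 1 to 3 together: returns how many nodes would be fenced. */
static size_t sketch_startup_fence(struct sketch_node *nodes, size_t n)
{
    size_t fenced = 0;

    assert(nodes != NULL || n == 0);
    sketch_unpack_nodes(nodes, n);
    sketch_unpack_status(nodes, n);
    for (size_t i = 0; i < n; i++) {
        if (nodes[i].unclean) {
            fenced++;
        }
    }
    return fenced;
}
```

In this model, a node with has_node_state set and pacemakerd_up unset
comes out clean and is not fenced, while a node with no <node_state/>
at all stays unclean and is fenced. That is the asymmetry described
above.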


I'm not an expert in this code. I discarded the option of registering at
startup only the active nodes, rather than all the nodes known to cman,
as it would not fix the corosync+pcmkv1 case.

Instead I tried to understand when a node that has its status in the CIB
should be startup-fenced; a possible solution is in the attached patch.
I noticed that when crm_update_peer inserts a new node, that node doesn't
have the expected attribute set. So, if startup-fencing is enabled, the
patch marks such a node as expected up.
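
The rule the patch adds can be sketched on its own as a small C
function. This is only an illustration: sketch_expected_up is a
hypothetical helper, not a function in Pacemaker, and the literal
"member" is my assumption for the value behind CRMD_JOINSTATE_MEMBER.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: decide whether a node should be treated as
 * "expected up", given the value of its expected attribute (NULL when
 * the attribute is absent) and the startup-fencing setting. */
static bool sketch_expected_up(const char *expected, bool startup_fencing)
{
    /* Existing behaviour visible in the patch context: a node whose
     * expected state is "member" (assumed literal) is expected up. */
    if (expected != NULL && strcmp(expected, "member") == 0) {
        return true;
    }

    /* Proposed addition: no expected attribute at all means the node
     * was only reported by the quorum provider; with startup-fencing
     * enabled, treat it as expected up as well. */
    if (expected == NULL && startup_fencing) {
        return true;
    }

    return false;
}
```

With startup-fencing disabled, a node without the expected attribute is
left alone, which matches the "blind faith" behaviour that option
already has for unseen nodes.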


Thanks!
Bye!

-- 
Simone Gotti

# HG changeset patch
# User Simone Gotti <simone.go...@gmail.com>
# Date 1300401753 -3600
# Node ID 30d64eaba0506e3ed85f442fd90ea3adc83c9501
# Parent  c925540f579c8b4ed0fcce1a1497346dc1f6ff86
Try to fix startup-fencing not happening when nodes without pacemakerd enabled 
have their node_state registered in the CIB.
This will happen with CMAN as quorum provider as all the nodes known to cman 
are registered at startup and also with corosync+pcmkv1
if corosync is started on N nodes but pacemakerd is started only on N-M nodes.

diff -r c925540f579c -r 30d64eaba050 lib/pengine/unpack.c
--- a/lib/pengine/unpack.c      Mon Mar 14 18:21:02 2011 +0100
+++ b/lib/pengine/unpack.c      Thu Mar 17 23:42:33 2011 +0100
@@ -693,6 +693,15 @@
     gboolean online = FALSE;
     const char *shutdown = NULL;
     const char *exp_state = crm_element_value(node_state, XML_CIB_ATTR_EXPSTATE);
+
+    gboolean unseen_are_unclean = TRUE;
+    const char *blind_faith = pe_pref(
+       data_set->config_hash, "startup-fencing");
+       
+    if(crm_is_true(blind_faith) == FALSE) {
+       unseen_are_unclean = FALSE;
+       crm_warn("Blind faith: not fencing unseen nodes");
+    }
        
     if(this_node == NULL) {
        crm_config_err("No node to check");
@@ -709,6 +718,13 @@
     } else if(safe_str_eq(exp_state, CRMD_JOINSTATE_MEMBER)) {
        this_node->details->expected_up = TRUE;
     }
+
+    /* A node can be in the status section of the cib because it was reported by the quorum provider.
+     * In this case the expected attribute isn't set.
+     * Consider the node as expected up if startup-fencing is true */
+    if(exp_state == NULL && unseen_are_unclean == TRUE) {
+        this_node->details->expected_up = TRUE;
+    }
        
     if(is_set(data_set->flags, pe_flag_stonith_enabled) == FALSE) {
        online = determine_online_status_no_fencing(