Re: [Pacemaker] [RFC PATCH] Try to fix startup-fencing not happening

Simone Gotti Fri, 18 Mar 2011 18:35:58 -0700

On 03/17/2011 11:54 PM, Simone Gotti wrote:
> Hi,
>
> When using corosync + pcmk v1 with both corosync and pacemakerd started
> (and, I think, also with heartbeat or any quorum provider other than
> cman), at startup the CIB will not contain a <node_state/> entry for
> nodes that are not in the cluster.
>
> With cman as quorum provider, instead, there will be a <node_state/>
> for every node known to cman, because
> lib/common/ais.c:cman_event_callback calls crm_update_peer for every
> node reported by cman_get_nodes.
>
> Something similar will happen when using corosync+pcmkv1 if corosync is
> started on N nodes but pacemakerd is started only on N-M nodes.
>
> All of this will break 'startup-fencing' because, from my understanding,
> the logic is this:
>
> 1) At startup all the nodes are marked as unclean (in
> lib/pengine/unpack.c:unpack_node).
> 2) lib/pengine/unpack.c:unpack_status iterates only over the
> <node_state/> entries present in the CIB status section, first
> resetting each node to clean and then marking it unclean if certain
> conditions are met.
> 3) In pengine/allocate.c:stage6 all the unclean nodes are fenced.
>
> In the situations described above, you'll have a <node_state/> in the
> CIB status section even for nodes where pacemakerd was not started, and
> startup fencing won't happen for them because no condition in
> unpack_status marks them as unclean.
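
To make the three steps above concrete, here is a toy model in C. It is only a sketch of my understanding: struct toy_node, its fields and toy_startup_fencing() are invented for illustration and are not Pacemaker code. The real logic lives in unpack_node, unpack_status and stage6, and is more involved than this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified node record; NOT Pacemaker's real node type. */
struct toy_node {
    const char *name;
    bool has_node_state; /* a <node_state/> entry exists in the CIB status section */
    bool failed;         /* stand-in for "some condition marks the node unclean" */
    bool unclean;        /* result: would this node be fenced? */
};

/* Collapses steps 1-3 into a single pass over the nodes.
 * Returns the number of nodes that would be fenced. */
static size_t toy_startup_fencing(struct toy_node *nodes, size_t n,
                                  bool startup_fencing)
{
    size_t to_fence = 0;

    assert(nodes != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        /* 1) unpack_node: every node starts out unclean
         *    (assumed here to depend on the startup-fencing option) */
        nodes[i].unclean = startup_fencing;

        /* 2) unpack_status: only nodes with a <node_state/> are reset to
         *    clean, then marked unclean again if a failure condition holds */
        if (nodes[i].has_node_state) {
            nodes[i].unclean = nodes[i].failed;
        }

        /* 3) stage6: whatever is still unclean gets fenced */
        if (nodes[i].unclean) {
            to_fence++;
        }
    }
    return to_fence;
}
```

In this model a node that has a <node_state/> but never ran pacemakerd ends up clean after step 2, so it is never counted for fencing. A node with no <node_state/> at all stays unclean and is fenced.
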
>
>
> I'm not an expert on this code. I discarded the option of registering
> at startup only the active nodes known by cman, instead of all of them,
> as it won't fix the corosync+pcmkv1 case.
>
> Instead I tried to understand when a node that has its status in the cib
> should be startup fenced and a possible solution is in the attached patch.
> I noticed that when crm_update_peer inserts a new node, that node
> doesn't have the "expected" attribute set. So, if startup-fencing is
> enabled, the patch sets the node as expected up.
Hi,


Thinking a little more about this, I believe the cman case and the
pcmkv1 case are quite different.

It's probably correct to have cman + pacemaker started on some nodes and
only cman started on other nodes.

So it would be better, as a first step, to make the cman integration
work like the other cases, and then look at some problems that I believe
are already present in all the implementations (I have some corner
cases in mind that I'd like to explain in the next few days).

The attached patch tries to add only the active nodes to the CIB status
section at startup.
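
The condition the patch adds can be read as the small predicate below. This is just a sketch: should_register_peer() is a hypothetical helper that does not exist in the patch, which inlines the test in cman_event_handle().

```c
#include <stdbool.h>

/* Hypothetical helper (not part of the patch): decide whether a node
 * reported by cman should be passed to crm_update_peer(). */
static bool should_register_peer(bool startup, bool is_member)
{
    /* Same truth table as the patch's
     *   (startup && cn_member) || !startup
     * During the initial membership query only active members are added;
     * on later membership events every reported node is updated. */
    return !startup || is_member;
}
```

Only the combination "startup and not a member" is filtered out, so a node that is down when the cluster starts gets no <node_state/> entry and stays eligible for startup fencing.
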

Thanks!
Bye!

>
> Thanks!
> Bye!
>

# HG changeset patch
# User Simone Gotti <simone.go...@gmail.com>
# Date 1300498033 -3600
# Node ID 1152982cac5558fea2faf5e344e76ac18d0b80c5
# Parent  30d64eaba0506e3ed85f442fd90ea3adc83c9501
At startup add only the active nodes. This will make the cman integration
behave like the others and let startup-fencing work.

diff -r 30d64eaba050 -r 1152982cac55 lib/common/ais.c
--- a/lib/common/ais.c  Thu Mar 17 23:42:33 2011 +0100
+++ b/lib/common/ais.c  Sat Mar 19 02:27:13 2011 +0100
@@ -636,7 +636,7 @@
 
 #define MAX_NODES 256
 
-static void cman_event_callback(cman_handle_t handle, void *privdata, int reason, int arg)
+static void cman_event_handle(cman_handle_t handle, void *privdata, int reason, int arg, int startup)
 {
     int rc = 0, lpc = 0, node_count = 0;
 
@@ -674,10 +674,13 @@
                    /* Never allow node ID 0 to be considered a member #315711 */
                    cman_nodes[lpc].cn_member = 0;
                }
-               crm_update_peer(cman_nodes[lpc].cn_nodeid, cman_nodes[lpc].cn_incarnation,
+               /* At startup add only the active nodes or startup fencing won't work */
+               if ((startup && cman_nodes[lpc].cn_member) || !startup ) {
+                   crm_update_peer(cman_nodes[lpc].cn_nodeid, cman_nodes[lpc].cn_incarnation,
                                cman_nodes[lpc].cn_member?crm_peer_seq:0, 0, 0,
                                cman_nodes[lpc].cn_name,   cman_nodes[lpc].cn_name, NULL,
                                cman_nodes[lpc].cn_member?CRM_NODE_MEMBER:CRM_NODE_LOST);
+                }
            }
 
            if(dispatch) {
@@ -696,6 +699,12 @@
            break;
     }
 }
+
+static void cman_event_callback(cman_handle_t handle, void *privdata, int reason, int arg)
+{
+    cman_event_handle(handle, privdata, reason, arg, FALSE);
+}
+
 #endif
 
 gboolean init_cman_connection(
@@ -729,8 +738,8 @@
     }
 
     /* Get the current membership state */
-    cman_event_callback(pcmk_cman_handle, dispatch, CMAN_REASON_STATECHANGE,
-                       cman_is_quorate(pcmk_cman_handle));
+    cman_event_handle(pcmk_cman_handle, dispatch, CMAN_REASON_STATECHANGE,
+                       cman_is_quorate(pcmk_cman_handle), TRUE);
 
     fd = cman_get_fd(pcmk_cman_handle);
     crm_debug("Adding fd=%d to mainloop", fd);

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
