> > > > > > }
> > > > > > handlers {
> > > > > > fence-peer "/usr/lib/drbd/rhcs_fence";
> > > > > > }
> > > > > > }
> > > > > >
> > > > > >
> > > > > rhcs_fence is wrong fence-peer utility. You should use
> > > > > /usr/lib/drbd/crm-fence-peer.sh and
> > > > > /usr/lib/drbd/crm-unfence-peer.sh instead.
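> > > > >
> > > > > For reference, a minimal sketch of the relevant drbd.conf stanza;
> > > > > the resource name "r0" and the "resource-only" fencing policy are
> > > > > assumptions here, adjust them to your setup:
> > > > >
> > > > > resource r0 {
> > > > >   net {
> > > > >     fencing resource-only;
> > > > >   }
> > > > >   handlers {
> > > > >     fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
> > > > >     after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
> > > > >   }
> > > > > }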
> > > >
> > > > But my understanding (probably wrong) was that the fence-peer handler is
> > > > meant to be called for STONITH, not for "usual" promotions/demotions
> > > > to/from Primary/Secondary.
> > > >
> > > > If I use the aforementioned pair of handlers (crm-*.sh) for
> > > > fence/unfence, do I still get STONITH behavior for "split brain cases"?
> > > >
> > > 
> > > Correct. The 'rhcs_fence' handler passes fence calls on to cman, which 
> > > you have set to redirect on to pacemaker. This isn't what it was 
> > > designed for, and hasn't been tested. It was meant to be an updated 
> > > replacement for obliterate-peer.sh in cman+rgmanager clusters directly 
> > > (no pacemaker).
> > 
> > Well, since it is a CMAN cluster after all, and rhcs_fence relies only 
> > (besides /proc/drbd) on cman_tool and fence_node (both of which should be 
> > working correctly), I thought it would be the correct fence script choice; 
> > but I will obviously accept your suggestion and use the crm-* scripts instead.
> > 
> > Anyway, I'm afraid that the real problem lurks elsewhere, since, as I 
> > stated before, a simple master/slave promotion/demotion should not lead to 
> > fencing.
> > 
> > As suggested by Nikita Staroverov, I pasted relevant (I hope) excerpts 
> > from the logs on the first node (the one surviving the STONITH) at the time 
> > of one "stonith fest" :) just after committing a CIB update with new 
> > resources.
> > 
> > http://pastebin.com/0eQqsftb
> > 
> > I recall that, seconds before being shot, the second node "lost contact" 
> > with the cluster (I was issuing "pcs status" and "crm_mon -Arf1" from an SSH 
> > session and suddenly it reported "cluster not connected" or something like 
> > that).
> 
> Yep, that's consistent with:
>  
> Jul  2 21:32:38 cluster1 pengine[16342]:  warning: pe_fence_node: Node 
> cluster2.verolengo.privatelan will be fenced because the node is no longer 
> part of the cluster
> 
> > 
> > Maybe (apart from the aforementioned improper use of rhcs_fence) there are 
> > issues with some timeout settings on cluster/DRBD operations, and almost 
> > certainly the nodes have problems with their clocks (I am still looking for 
> > a reasonable/reachable NTP source), but I do not know whether these are 
> > relevant issues.
> 
> Or, it's _because_ of the improper use of rhcs_fence... depending on how it 
> works, it could be telling corosync/cman on cluster2 to disappear.

I can confirm that changing the handler scripts to the suggested crm-*.sh 
ones has resolved all the problems.

I still think that at least this page:

http://www.drbd.org/users-guide-8.4/s-fence-peer.html#idp68019024

clearly states that no fence-peer handler should be invoked under regular 
operations, but this is more of a documentation (and drbd-users list) affair.

Many thanks again to you all for the assistance!

Regards,
Giuseppe
                                          
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
