On 4 Jul 2014, at 1:50 am, Giuseppe Ragusa <giuseppe.rag...@hotmail.com> wrote:

> > > > > }
> > > > > handlers {
> > > > >     fence-peer "/usr/lib/drbd/rhcs_fence";
> > > > > }
> > > > > }
> > > >
> > > > rhcs_fence is the wrong fence-peer utility. You should use
> > > > /usr/lib/drbd/crm-fence-peer.sh and
> > > > /usr/lib/drbd/crm-unfence-peer.sh instead.
> > >
> > > But my understanding (probably wrong) was that the fence-peer handler is
> > > meant to be called for STONITH, not for "usual" promotions/demotions
> > > to/from Primary/Secondary.
> > >
> > > If I use the aforementioned pair of handlers (crm-*.sh) for
> > > fence/unfence, do I still get STONITH behavior for "split brain" cases?
> >
> > Correct. The 'rhcs_fence' handler passes fence calls on to cman, which
> > you have set to redirect on to pacemaker. This isn't what it was
> > designed for, and hasn't been tested. It was meant to be an updated
> > replacement for obliterate-peer.sh in cman+rgmanager clusters directly
> > (no pacemaker).
>
> Well, since it is a CMAN cluster after all and rhcs_fence relies only
> (besides /proc/drbd) on cman_tool and fence_node (which should be correctly
> working), I thought it would be the correct fence script choice, but I will
> obviously accept your suggestion and use the crm-* scripts instead.
>
> Anyway, I'm afraid that the real problem lurks elsewhere, since, as I stated
> before, a simple master/slave promotion/demotion should not lead to fencing,
> I suppose.
>
> As suggested by Nikita Staroverov, I pasted relevant (I hope) excerpts from
> the logs on the first node (the one surviving the STONITH) at the time of one
> "stonith fest" :) just after committing a CIB update with new resources.
>
> http://pastebin.com/0eQqsftb
>
> I can recall that seconds before being shot, the second node "lost contact"
> with the cluster (I was issuing "pcs status" and "crm_mon -Arf1" from an SSH
> session and suddenly it went "cluster not connected" or something like that).

Yep, that's consistent with:

Jul 2 21:32:38 cluster1 pengine[16342]: warning: pe_fence_node: Node cluster2.verolengo.privatelan will be fenced because the node is no longer part of the cluster

> Maybe (apart from the aforementioned improper use of rhcs_fence) there are
> issues with some timeout settings on cluster/DRBD operations, and almost
> certainly the nodes have problems with their clock (still finding a
> reasonable/reachable NTP source), but I do not know if these can be relevant
> issues.

Or, it's _because_ of the improper use of rhcs_fence... depending on how it
works, it could be telling corosync/cman on cluster2 to disappear.

> Many thanks again for your suggestions.
>
> Regards,
> Giuseppe
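
FWIW, this is roughly how those scripts are usually wired into the DRBD
resource config (a sketch only; "r0" is a placeholder resource name, and
where the fencing option lives varies between DRBD versions, so check the
DRBD User's Guide for yours):

    resource r0 {                          # "r0" is just a placeholder name
        disk {
            fencing resource-and-stonith;  # suspend I/O and call the fence handler when the peer is lost
        }
        handlers {
            fence-peer          "/usr/lib/drbd/crm-fence-peer.sh";
            after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
        }
        # ... existing device/disk/net/"on <host>" sections unchanged ...
    }

crm-fence-peer.sh only drops a location constraint into the CIB so the
outdated peer can't be promoted, and crm-unfence-peer.sh removes it again
once resync finishes; actual node shooting stays with pacemaker's stonith
devices, which is what you want for the split-brain case.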
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org