On 2013-03-19 16:02, Fredrik Hudner wrote:
> Just wanted to change which document it's been built from. It should be
> "LINBIT DRBD 8.4 Configuration Guide: NFS on RHEL 6".
There is again that fencing constraint in your configuration ... what does
"drbdadm dump all" look like? Any chance you only specified a fence-peer
handler in your resource configuration but don't override the
after-resync-target handler you specified in your global_common.conf?
That would explain the dangling constraint that will prevent a failover.

Regards,
Andreas

--
Need help with Pacemaker? http://www.hastexo.com/now

> ---------- Forwarded message ----------
> From: Fredrik Hudner <fredrik.hud...@gmail.com>
> Date: Mon, Mar 18, 2013 at 11:06 AM
> Subject: Re: [Linux-HA] Problem promoting slave to master
> To: General Linux-HA mailing list <linux-ha@lists.linux-ha.org>
>
> On Fri, Mar 15, 2013 at 1:04 AM, Andreas Kurz <andr...@hastexo.com> wrote:
>
>> On 2013-03-14 15:52, Fredrik Hudner wrote:
>>> I set no-quorum-policy to ignore and removed the constraint you
>>> mentioned. It then managed to fail over once to the slave node, but I
>>> still have these:
>>>
>>> Failed actions:
>>>     p_exportfs_root:0_monitor_30000 (node=testclu01, call=12, rc=7,
>>>         status=complete): not running
>>>     p_exportfs_root:1_monitor_30000 (node=testclu02, call=12, rc=7,
>>>         status=complete): not running
>>
>> This only tells you that monitoring of these resources found them not
>> running once ... the logs should tell you what happened, and when.
>
> I have attached the logs from master and slave. I can see that it stops,
> but not really why (my knowledge is too limited to read the logs).
>
>>> I then stopped the new master node to see if it failed over to the
>>> other node, with no success. It remains slave.
>>
>> Hard to say without seeing the current cluster state, e.g. "crm_mon
>> -1frA", "cat /proc/drbd" and some logs ... not enough information ...
>
> I have attached the output from crm_mon, the new crm configure and
> /proc/drbd.
>
>>> I also noticed that the constraint drbd-fence-by-handler-nfs-ms_drbd_nfs
>>> was back in the crm configure. It seems cib does a replace.
>>
>> This constraint is added by the DRBD primary if it loses connection to
>> its peer and is perfectly fine if you stopped one node.
>
> It seems the cluster has a problem attaching to the cluster node IP, but
> I'm not sure why.
>
> I would like to add that I took over this configuration from a guy who
> has left, but I know it was set up using the LINBIT technical
> documentation "Highly available NFS storage with DRBD and Pacemaker".
>
>>> Mar 14 15:06:18 [1786] tdtestclu02 crmd: info:
>>>     abort_transition_graph: te_update_diff:126 - Triggered transition
>>>     abort (complete=1, tag=diff, id=(null), magic=NA, cib=0.781.1) :
>>>     Non-status change
>>> Mar 14 15:06:18 [1786] tdtestclu02 crmd: notice:
>>>     do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [
>>>     input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
>>> Mar 14 15:06:18 [1781] tdtestclu02 cib: info:
>>>     cib_replace_notify: Replaced: 0.780.39 -> 0.781.1 from tdtestclu01
>>>
>>> So I'm not sure how to remove that constraint on a permanent basis;
>>> it's gone as long as I don't stop pacemaker.
>>
>> Once the DRBD resync is finished it will be removed from the cluster
>> configuration again automatically ... you typically never need to remove
>> such drbd-fence constraints manually, only in some rare failure
>> scenarios.
>>
>> Regards,
>> Andreas
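The automatic removal described above is done by DRBD's after-resync-target
handler; the constraint itself is added by the fence-peer handler. As a
rough sketch of the handler pair this kind of setup uses in
global_common.conf (assuming the stock DRBD 8.4 helper scripts under
/usr/lib/drbd; your paths and resource files may differ):

    # global_common.conf -- sketch, assuming stock DRBD 8.4 scripts
    common {
            handlers {
                    # adds the drbd-fence-by-handler-* constraint when
                    # the primary loses its peer
                    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
                    # removes that constraint again once resync finishes;
                    # if this handler is missing or shadowed by a
                    # per-resource handlers block, the constraint dangles
                    # and blocks failover
                    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
            }
            disk {
                    # without a fencing policy the fence-peer handler is
                    # never invoked
                    fencing resource-only;
            }
    }

Since "drbdadm dump all" prints the merged configuration, a per-resource
handlers section that overrides the common one would be visible there,
which is what the question above is probing for.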
>>
>>> But it used to work, both with no-quorum-policy=freeze and with that
>>> constraint.
>>>
>>> Kind regards
>>> /Fredrik
>>>
>>> On Thu, Mar 14, 2013 at 2:49 PM, Andreas Kurz <andr...@hastexo.com>
>>> wrote:
>>>
>>>> On 2013-03-14 13:30, Fredrik Hudner wrote:
>>>>> Hi all,
>>>>>
>>>>> I have a problem after I removed a node with the force command from
>>>>> my crm config.
>>>>>
>>>>> Originally I had 2 nodes running an HA cluster (corosync 1.4.1-7.el6,
>>>>> pacemaker 1.1.7-6.el6).
>>>>>
>>>>> Then I wanted to add a third node acting as a quorum node, but was
>>>>> not able to get it to work (probably because I don't understand how
>>>>> to set it up). So I removed the 3rd node, but had to use the force
>>>>> command, as crm complained when I tried to remove it.
>>>>>
>>>>> Now when I start up Pacemaker, the resources don't look like they
>>>>> come up correctly:
>>>>>
>>>>> Online: [ testclu01 testclu02 ]
>>>>>
>>>>> Master/Slave Set: ms_drbd_nfs [p_drbd_nfs]
>>>>>     Masters: [ testclu01 ]
>>>>>     Slaves:  [ testclu02 ]
>>>>> Clone Set: cl_lsb_nfsserver [p_lsb_nfsserver]
>>>>>     Started: [ tdtestclu01 tdtestclu02 ]
>>>>> Resource Group: g_nfs
>>>>>     p_lvm_nfs    (ocf::heartbeat:LVM):        Started testclu01
>>>>>     p_fs_shared  (ocf::heartbeat:Filesystem): Started testclu01
>>>>>     p_fs_shared2 (ocf::heartbeat:Filesystem): Started testclu01
>>>>>     p_ip_nfs     (ocf::heartbeat:IPaddr2):    Started testclu01
>>>>> Clone Set: cl_exportfs_root [p_exportfs_root]
>>>>>     Started: [ testclu01 testclu02 ]
>>>>>
>>>>> Failed actions:
>>>>>     p_exportfs_root:0_monitor_30000 (node=testclu01, call=12, rc=7,
>>>>>         status=complete): not running
>>>>>     p_exportfs_root:1_monitor_30000 (node=testclu02, call=12, rc=7,
>>>>>         status=complete): not running
>>>>>
>>>>> The filesystems mount correctly on the master at this stage and can
>>>>> be written to. When I stop the services on the master node so that
>>>>> it fails over, it doesn't work; it loses cluster-IP connectivity.
>>>>
>>>> Fix your "no-quorum-policy": you want to "ignore" the quorum in a
>>>> two-node cluster to allow failover ... and if your drbd device is
>>>> already in sync, remove that drbd-fence-by-handler-nfs-ms_drbd_nfs
>>>> constraint.
>>>>
>>>> Regards,
>>>> Andreas
>>>>
>>>> --
>>>> Need help with Pacemaker? http://www.hastexo.com/now
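In crm shell syntax, the two fixes suggested here would look roughly like
this (a sketch; the constraint id is the one quoted above, and deleting it
is only safe once /proc/drbd shows UpToDate/UpToDate on both sides):

    # allow failover in a two-node cluster, where losing one node
    # always means losing quorum
    crm configure property no-quorum-policy=ignore

    # only once DRBD is fully synced (check with: cat /proc/drbd)
    crm configure delete drbd-fence-by-handler-nfs-ms_drbd_nfs

    # clear the stale "not running" monitor failures from the status
    # section so the "Failed actions" entries above go away
    crm resource cleanup cl_exportfs_root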
>>>>>
>>>>> Corosync.log from the master after I stopped pacemaker on the master
>>>>> node: see attached file.
>>>>>
>>>>> Additional files (attached): crm configure show, corosync.conf,
>>>>> global_common.conf
>>>>>
>>>>> I'm not sure how to proceed to get it back into a fair state now, so
>>>>> if anyone could help me it would be much appreciated.
>>>>>
>>>>> Kind regards
>>>>>
>>>>> /Fredrik

_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems