Re: [Pacemaker] Resource ordering/colocating question (DRBD + LVM + FS)

2013-09-06 Thread Heikki Manninen
On 6.9.2013, at 10.24, Heikki Manninen  wrote:

>> 2) Enable logging and find out which node is the DC.
>> The logs there contain a lot of information showing
>> what is going on. Hint: open a terminal session with
>> tail -f on the logfile and watch it while you issue commands.
>> You'll get used to it.
> 
> It seems that node #2 was the DC (also visible in the pcs status output). I have 
> been looking at the logs all along, I'm just not yet too familiar with the contents 
> of pacemaker logging. Here's the thing that keeps repeating every time those 
> LVM and FS resources stay in the stopped state:
> 
> Sep  3 20:01:23 pgdbsrv02 pengine[1667]:   notice: LogActions: Start   
> LVM_vgdata01#011(pgdbsrv01.cl1.local - blocked)
> Sep  3 20:01:23 pgdbsrv02 pengine[1667]:   notice: LogActions: Start   
> FS_data01#011(pgdbsrv01.cl1.local - blocked)
> Sep  3 20:01:23 pgdbsrv02 pengine[1667]:   notice: LogActions: Start   
> LVM_vgdata02#011(pgdbsrv01.cl1.local - blocked)
> Sep  3 20:01:23 pgdbsrv02 pengine[1667]:   notice: LogActions: Start   
> FS_data02#011(pgdbsrv01.cl1.local - blocked)
> 
> So what does "blocked" mean here? Is it that node #1 in this case is in 
> need of fencing/stonithing and is therefore blocked, or something else? (I have a 
> background in the RHCS/HACMP/LifeKeeper etc. world.) The no-quorum-policy is set 
> to ignore.

Some logs from the situation. Here's what I've done before these logs (the 
problem remains):

- STONITH configured and enabled (though the Fusion python agent is not 
working, but pacemaker trying to stonith would show up in the logs anyway, right?)
- LVM resources changed to "exclusive=false" to remove the LVM tagging function 
(just to minimize the moving parts, although the tagging seemed to work properly)
- migration-threshold=1 set as a cluster property (was rsc defaults), as per the 
earlier e-mail about the error in the quick start documentation
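
Roughly, the above and the standby test below correspond to commands like 
these (just a sketch, pcs 0.9.x syntax on RHEL 6.4, resource/node names as 
in this cluster; the exact commands may differ):

  pcs resource update LVM_vgdata01 exclusive=false
  pcs resource update LVM_vgdata02 exclusive=false
  pcs property set stonith-enabled=true
  pcs cluster standby pgdbsrv01.cl1.local   # later: pcs cluster unstandby ...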

Everything is running on node #1, then node #1 is put into standby:

Sep  4 00:39:27 pgdbsrv02 crmd[1858]:   notice: do_state_transition: State 
transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL 
origin=abort_transition_graph ]
Sep  4 00:39:27 pgdbsrv02 cib[1853]:   notice: cib:diff: Diff: --- 0.174.67
Sep  4 00:39:27 pgdbsrv02 cib[1853]:   notice: cib:diff: Diff: +++ 0.175.1 
14210c59954a4f036d168d60b7ed7baf
Sep  4 00:39:27 pgdbsrv02 cib[1853]:   notice: cib:diff: -- 
Sep  4 00:39:27 pgdbsrv02 cib[1853]:   notice: cib:diff: ++   
Sep  4 00:39:27 pgdbsrv02 pengine[1857]:   notice: unpack_config: On loss of 
CCM Quorum: Ignore
Sep  4 00:39:27 pgdbsrv02 pengine[1857]:   notice: LogActions: Demote  
DRBD_data01:0#011(Master -> Stopped pgdbsrv01.cl1.local)
Sep  4 00:39:27 pgdbsrv02 pengine[1857]:   notice: LogActions: Promote 
DRBD_data01:1#011(Slave -> Master pgdbsrv02.cl1.local)
Sep  4 00:39:27 pgdbsrv02 pengine[1857]:   notice: LogActions: Demote  
DRBD_data02:0#011(Master -> Stopped pgdbsrv01.cl1.local)
Sep  4 00:39:27 pgdbsrv02 pengine[1857]:   notice: LogActions: Promote 
DRBD_data02:1#011(Slave -> Master pgdbsrv02.cl1.local)
Sep  4 00:39:27 pgdbsrv02 pengine[1857]:   notice: LogActions: Stop
LVM_vgdata01#011(Started pgdbsrv02.cl1.local)
Sep  4 00:39:27 pgdbsrv02 pengine[1857]:   notice: LogActions: Stop
FS_data01#011(Started pgdbsrv02.cl1.local)
Sep  4 00:39:27 pgdbsrv02 pengine[1857]:   notice: LogActions: Stop
LVM_vgdata02#011(Started pgdbsrv02.cl1.local)
Sep  4 00:39:27 pgdbsrv02 pengine[1857]:   notice: LogActions: Stop
FS_data02#011(Started pgdbsrv02.cl1.local)
Sep  4 00:39:27 pgdbsrv02 pengine[1857]:   notice: process_pe_message: 
Calculated Transition 3: /var/lib/pacemaker/pengine/pe-input-188.bz2
Sep  4 00:39:27 pgdbsrv02 crmd[1858]:   notice: run_graph: Transition 3 
(Complete=8, Pending=0, Fired=0, Skipped=37, Incomplete=26, 
Source=/var/lib/pacemaker/pengine/pe-input-188.bz2): Stopped
Sep  4 00:39:27 pgdbsrv02 pengine[1857]:   notice: unpack_config: On loss of 
CCM Quorum: Ignore
Sep  4 00:39:27 pgdbsrv02 pengine[1857]:   notice: LogActions: Demote  
DRBD_data01:0#011(Master -> Stopped pgdbsrv01.cl1.local)
Sep  4 00:39:27 pgdbsrv02 pengine[1857]:   notice: LogActions: Promote 
DRBD_data01:1#011(Slave -> Master pgdbsrv02.cl1.local)
Sep  4 00:39:27 pgdbsrv02 pengine[1857]:   notice: LogActions: Demote  
DRBD_data02:0#011(Master -> Stopped pgdbsrv01.cl1.local)
Sep  4 00:39:27 pgdbsrv02 pengine[1857]:   notice: LogActions: Promote 
DRBD_data02:1#011(Slave -> Master pgdbsrv02.cl1.local)
Sep  4 00:39:27 pgdbsrv02 pengine[1857]:   notice: LogActions: Stop
LVM_vgdata01#011(Started pgdbsrv02.cl1.local)
Sep  4 00:39:27 pgdbsrv02 pengine[1857]:   notice: LogActions: Start   
FS_data01#011(pgdbsrv02.cl1.local - blocked)
Sep  4 00:39:27 pgdbsrv02 pengine[1857]:   notice: LogActions: Stop
LVM_vgdata02#011(Started pgdbsrv02.cl1.local)
Sep  4 00:39:27 pgdbsrv02 pengine[1857]:   notice: LogActions: Start   
FS_data02#011(pgdbsrv02.cl1.local - blocked)
Sep  4 00:39:27 pgdbsrv02 crmd[1858]:  warning: destroy_action: Cance

Re: [Pacemaker] heartbeat:anything resource not stop/monitoring after reboot

2013-09-06 Thread Lars Marowsky-Bree
On 2013-09-05T11:23:20, David Coulson  wrote:

> ocf-tester -n reload -o binfile="/usr/sbin/rndc" -o cmdline_options="reload"
> /usr/lib/ocf/resource.d/heartbeat/anything
> Beginning tests for /usr/lib/ocf/resource.d/heartbeat/anything...
> * rc=1: Monitoring an active resource should return 0
> * rc=1: Probing an active resource should return 0

Well, I can't comment on why it works on some clusters in your
environment. The above does explain quite nicely why it doesn't on the
rest, though.

If the process doesn't hang around, you really shouldn't have a periodic
monitor configured.

> Short of writing a resource that does a start and forces a rc=0 for
> stop/monitor, any ideas why this is behaving the way it is?

"stop" seemed to be working fine. Your problem is that you have a
periodic monitor configured, and that will always find this resource as
"failed". Disable the periodic monitor.

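If you use the crm shell, something along these lines should do it (just a
sketch; I'm assuming the primitive is the "reload" resource tested above):

  crm configure show reload    # check the currently configured operations
  crm configure edit reload    # ...and delete the "op monitor interval=..." line
  crm resource cleanup reload  # clear the old monitor failures afterwards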

Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde




Re: [Pacemaker] [Linux-HA] Probably a regression of the linbit drbd agent between pacemaker 1.1.8 and 1.1.10

2013-09-06 Thread Lars Ellenberg
On Tue, Aug 27, 2013 at 06:51:45AM +0200, Andreas Mock wrote:
> Hi Andrew,
> 
> as this is a real showstopper at the moment I invested some other
> hours to be sure (as far as possible) not having made an error.
> 
> Some additions:
> 1) I mirrored the whole mini drbd config to another pacemaker cluster.
> Same result: pacemaker 1.1.8 works, pacemaker 1.1.10 does not.
> 2) When I remove the target role Stopped from the drbd ms resource
> and insert the config snippet related to the drbd device via crm -f
> into a lean running pacemaker config (just the pacemaker cluster options
> and stonith resources), it seems to work. That is, one of the nodes
> gets promoted.
> 
> Then, after stopping ('crm resource stop ms_drbd_xxx') and starting it
> again, I see the same promotion error as described.
> 
> The drbd resource agent is using /usr/sbin/crm_master.
> Is there a possibility that feedback given through this client tool
> is changing the timing behaviour of pacemaker? Or the way
> transitions are scheduled?
> Any idea that may be related to a change in pacemaker?

I think that recent pacemaker allows for "start" and "promote" in the
same transition. Or at least has promote follow start much faster than before.

To be able to really tell why DRBD has problems promoting,
I'd need the drbd (kernel!) logs; the agent logs are not good enough.

Maybe you are hitting some interesting race between DRBD establishing the
connection and pacemaker trying to promote one of the nodes.
Correlating the DRBD kernel logs with the pacemaker logs should tell.

Do you have DRBD resource level fencing enabled?
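
(If unsure, something like this should show it; the config file path and
resource name are taken from the config quoted further down, and the
handler paths are the scripts usually shipped with drbd:)

  drbdadm -c /usr/local/etc/drbd.conf dump postfix | grep -E 'fencing|fence-peer'
  # with pacemaker, resource-level fencing typically looks like
  #   disk     { fencing resource-only; }
  #   handlers { fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
  #              after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh"; }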

I guess right now you can reproduce it easily by:
  crm resource stop ms_drbd
  crm resource start ms_drbd

I suspect you would not be able to reproduce by:
  crm resource stop ms_drbd
  crm resource demote ms_drbd (will only start drbd as Secondary)
... meanwhile, DRBD will establish the connection ...
  crm resource promote ms_drbd (will then promote one node)

Hth,

Lars


> -Original Message-
> From: Andrew Beekhof [mailto:and...@beekhof.net]
> Sent: Tuesday, 27 August 2013 05:02
> To: General Linux-HA mailing list
> Cc: pacemaker@oss.clusterlabs.org
> Subject: Re: [Pacemaker] [Linux-HA] Probably a regression of the linbit drbd
> agent between pacemaker 1.1.8 and 1.1.10
> 
> 
> On 27/08/2013, at 3:31 AM, Andreas Mock  wrote:
> 
> > Hi all,
> > 
> > while the linbit drbd resource agent seems to work perfectly on 
> > pacemaker 1.1.8 (standard software repository), we have problems with 
> > the latest release 1.1.10 and also with the newest head 1.1.11.xxx.
> > 
> > As using drbd is not so uncommon, I really hope to find interested 
> > people to help me out. I can provide as much debug information as you 
> > want.
> > 
> > 
> > Environment:
> > RHEL 6.4 clone (Scientific Linux 6.4) cman based cluster.
> > DRBD 8.4.3 compiled from sources.
> > 64bit
> > 
> > - A drbd resource configured following the linbit documentation.
> > - Manual start and stop (up/down) and setting primary of the drbd resource 
> > work smoothly.
> > - 2 nodes dis03-test/dis04-test
> > 
> > 
> > 
> > - Following simple config on pacemaker 1.1.8 configure
> >property no-quorum-policy=stop
> >property stonith-enabled=true
> >rsc_defaults resource-stickiness=2
> >primitive r_stonith-dis03-test stonith:fence_mock \
> >meta resource-stickiness="INFINITY" target-role="Started" \
> >op monitor interval="180" timeout="300" requires="nothing" \
> >op start interval="0" timeout="300" \
> >op stop interval="0" timeout="300" \
> >params vmname=dis03-test pcmk_host_list="dis03-test"
> >primitive r_stonith-dis04-test stonith:fence_mock \
> >meta resource-stickiness="INFINITY" target-role="Started" \
> >op monitor interval="180" timeout="300" requires="nothing" \
> >op start interval="0" timeout="300" \
> >op stop interval="0" timeout="300" \
> >params vmname=dis04-test pcmk_host_list="dis04-test"
> >location r_stonith-dis03_hates_dis03 r_stonith-dis03-test \
>    rule $id="r_stonith-dis03_hates_dis03-test_rule" -inf: #uname eq dis03-test
> >location r_stonith-dis04_hates_dis04 r_stonith-dis04-test \
>    rule $id="r_stonith-dis04_hates_dis04-test_rule" -inf: #uname eq dis04-test
> >primitive r_drbd_postfix ocf:linbit:drbd \
>    params drbd_resource="postfix" drbdconf="/usr/local/etc/drbd.conf" \
> >op monitor interval="15s"  timeout="60s" role="Master" \
> >op monitor interval="45s"  timeout="60s" role="Slave" \
> >op start timeout="240" \
> >op stop timeout="240" \
> >meta target-role="Stopped" migration-threshold="2"
> >ms ms_drbd_postfix r_drbd_postfix \
> >meta master-max="1" master-node-max="1" \
> >clone-max="2" clone-node-max="1" \
> >notify="true" \
> >meta target-role="Stopped"
> > commit
> > 
> > - Pacemaker is started from scratch
> > - Conf

Re: [Pacemaker] Need help with quickstart of pacemaker on redhat

2013-09-06 Thread Andrew Beekhof


On 29/08/2013, at 7:41 PM, Moturi Upendra  wrote:

> Please find the attachment

The problem is that migration-threshold has been set as a cluster option 
rather than as a resource default.
In RHEL 6.5 you'll be able to use this command:

   pcs resource defaults migration-threshold=1

In 6.4 it is:

   pcs resource rsc defaults migration-threshold=1
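
To double-check where the value ended up, grepping the live CIB should be 
enough (just a sketch, the surrounding XML may look slightly different):

   cibadmin -Q | grep -B3 migration-threshold
   # the nvpair should now sit under rsc_defaults, not under crm_config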

I'll update the quickstart now.

> 
> Thanks
> Upendra
> 
> 
> On Thu, Aug 29, 2013 at 2:55 PM, Andrew Beekhof  wrote:
> Pacemaker configuration. cibadmin -Ql
> 
> Sent from a mobile device
> 
> On 29/08/2013, at 7:02 PM, Moturi Upendra  wrote:
> 
>> Hi,
>> 
>> here it is
>> 
>> [the CIB XML output was stripped by the mailing list archive]
>> 
>> Just executed those steps in the document.
>> 
>> Thanks
>> Upendra
>> 
>> 
>> On Thu, Aug 29, 2013 at 2:59 AM, Andrew Beekhof  wrote:
>> 
>> On 28/08/2013, at 8:18 PM, Moturi Upendra  wrote:
>> 
>> > Thanks for the reply,
>> > but as per the doc it says that it has to move to a different node
>> 
>> Can you show your configuration please?
>> 
>> >
>> > From the document:
>> >
>> > Simulate a Service Failure
>> >
>> > We can simulate an error by telling the service to stop directly (without 
>> > telling the cluster):
>> >
>> > [ONE] # crm_resource --resource my_first_svc --force-stop
>> >
>> > If you now run crm_mon in interactive mode (the default), you should see 
>> > (within the monitor interval - 2 minutes) the cluster notice that 
>> > my_first_svc failed and move it to another node.
>> >
>> >
>> >
>> > thanks
>> >
>> > Upendra
>> >
>> >
>> >
>> > On Wed, Aug 28, 2013 at 3:51 AM, Andrew Beekhof  wrote:
>> >
>> > On 28/08/2013, at 12:12 AM, Moturi Upendra  
>> > wrote:
>> >
>> > > Hi,
>> > >
>> > > I followed your article on setting up a 2-node cluster with pacemaker on 
>> > > redhat 6.4
>> > > http://clusterlabs.org/quickstart-redhat.html
>> > >
>> > > I just executed the same steps you have mentioned in the document.
>> > > When I am trying to test the failure condition to start the dummy agent on 
>> > > node2, it throws an error saying
>> > > "my_first_svc_monitor_3 (node=node1, call=76, rc=7, 
>> > > status=complete): not running"
>> > >
>> > > Please help in understanding the error.
>> >
>> > > That's the cluster detecting that the resource was stopped - which is expected 
>> > > since you stopped it.
>> >
>> 
>> 
> 
> 



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org