On Tue, Aug 17, 2010 at 8:09 PM, <claude.duroc...@mcccf.gouv.qc.ca> wrote:
> I have a 3 node cluster running Xen resources on SLES11sp1 with HAE. The
> nodes are connected to a SAN and Pacemaker controls the start of the shared
> disk. From time to time, monitor of LVM volume groups or ocfs2 file system
> fails : this triggers a stopping of the shared disk resource but this can't
> be completed as Xen resources are running using the shared disk (I don't
> know why monitor fails as the resource seems to be running fine) :
>
> Log patterns:
> Aug 13 21:27:49 qcpvms09 crmd: [9677]: ERROR: process_lrm_event: LRM operation xen_configstore_volume1:1_monitor_120000 (32) Timed Out (timeout=50000ms)
> Aug 13 21:28:09 qcpvms09 crmd: [9677]: ERROR: process_lrm_event: LRM operation xen_configstore_volume1:1_stop_0 (55) Timed Out (timeout=20000ms)
> Aug 13 21:28:29 qcpvms09 crmd: [9677]: ERROR: process_lrm_event: LRM operation qcdtypo01_monitor_120000 (54) Timed Out (timeout=90000ms)
>
> Is there a way to have the monitor operation to retry x times before
> declaring the resource failed?

No

> Or should the monitor part of the LVM
> resource or OCFS2 resource be changed?

I'd start by increasing the timeouts.
If that doesn't work, you'll need to investigate the Filesystem agent
to see what is taking so long.

>
> My running config :
>
> node qcpvms07 \
>     attributes standby="off"
> node qcpvms08 \
>     attributes standby="off"
> node qcpvms09 \
>     attributes standby="off"
> primitive clvm ocf:lvm2:clvmd \
>     operations $id="clvm-operations" \
>     op monitor interval="120" timeout="20" start-delay="10" \
>     op start interval="0" timeout="30" \
>     params daemon_timeout="30" daemon_options="-d0"
> primitive dlm ocf:pacemaker:controld \
>     operations $id="dlm-operations" \
>     op monitor interval="120" timeout="20" start-delay="10"
> primitive o2cb ocf:ocfs2:o2cb \
>     operations $id="o2cb-operations" \
>     op monitor interval="120" timeout="20" start-delay="10"
> primitive ping-net1 ocf:pacemaker:ping \
>     operations $id="ping-net1-operations" \
>     op monitor interval="120" timeout="20" on-fail="restart" start-delay="0" \
>     params name="ping-net1" host_list="192.168.88.1 192.168.88.43" interval="15" timeout="5" attempts="5" \
>     meta target-role="started"
> primitive qcddom01 ocf:heartbeat:Xen \
>     meta target-role="started" \
>     operations $id="qcddom01-operations" \
>     op monitor interval="120" timeout="30" on-fail="restart" start-delay="60" \
>     op start interval="0" timeout="120" start-delay="0" \
>     op stop interval="0" timeout="120" \
>     op migrate_from interval="0" timeout="240" \
>     op migrate_to interval="0" timeout="240" \
>     params xmfile="/etc/xen/vm/qcddom01" allow-migrate="true"
> primitive qcdtypo01 ocf:heartbeat:Xen \
>     meta target-role="started" \
>     operations $id="qcdtypo01-operations" \
>     op monitor interval="120" timeout="30" on-fail="restart" start-delay="60" \
>     op start interval="0" timeout="120" start-delay="0" \
>     op stop interval="0" timeout="120" \
>     op migrate_from interval="0" timeout="240" \
>     op migrate_to interval="0" timeout="240" \
>     params xmfile="/etc/xen/vm/qcdtypo01" allow-migrate="true"
> primitive stonith-sbd stonith:external/sbd \
>     meta target-role="started" \
>     operations $id="stonith-sbd-operations" \
>     op monitor interval="30" timeout="15" start-delay="30" \
>     params sbd_device="/dev/mapper/mpathc"
> primitive xen_configstore_volume1 ocf:heartbeat:Filesystem \
>     operations $id="xen_configstore_volume1-operations" \
>     op monitor interval="120" timeout="40" start-delay="10" \
>     params device="/dev/xen_volume1_group/xen_configstore_volume1" directory="/etc/xen/vm" fstype="ocfs2"
> primitive xen_volume1_group ocf:heartbeat:LVM \
>     operations $id="xen_volume1_group-operations" \
>     op monitor interval="120" timeout="30" start-delay="10" \
>     params volgrpname="xen_volume1_group"
> primitive xen_volume2_group ocf:heartbeat:LVM \
>     operations $id="xen_volume2_group-operations" \
>     op monitor interval="120" timeout="30" start-delay="10" \
>     params volgrpname="xen_volume2_group"
> group shared-disk-group dlm clvm o2cb xen_volume1_group xen_volume2_group xen_configstore_volume1 \
>     meta target-role="started"
> clone ping-clone ping-net1 \
>     meta target-role="started" interleave="true" ordered="true"
> clone shared-disk-clone shared-disk-group \
>     meta target-role="stopped"
> location qcddom01-on-ping-net1 qcddom01 \
>     rule $id="qcddom01-on-ping-net1-rule" -inf: not_defined ping-net1 or ping-net1 lte 0
> location qcddom01-prefer-qcpvms08 qcddom01 500: qcpvms08
> location qcdtypo01-on-ping-net1 qcdtypo01 \
>     rule $id="qcdtypo01-on-ping-net1-rule" -inf: not_defined ping-net1 or ping-net1 lte 0
> location qcdtypo01-prefer-qcpvms07 qcdtypo01 500: qcpvms07
> colocation colocation-qcddom01-shared-disk-clone inf: qcddom01 shared-disk-clone
> colocation colocation-qcdtypo01-shared-disk-clone inf: qcdtypo01 shared-disk-clone
> order order-qcddom01 inf: shared-disk-clone qcddom01
> order order-qcdtypo01 inf: shared-disk-clone qcdtypo01
> property $id="cib-bootstrap-options" \
>     dc-version="1.1.2-2e096a41a5f9e184a1c1537c82c6da1093698eb5" \
>     cluster-infrastructure="openais" \
>     no-quorum-policy="freeze" \
>     default-resource-stickiness="500" \
>     last-lrm-refresh="1281552641" \
>     expected-quorum-votes="3" \
>     stonith-timeout="240s"
> op_defaults $id="op_defaults-options" \
>     record-pending="false"
>
> Claude
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
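
A minimal sketch of the "increase the timeouts" suggestion, in the same crm shell syntax as the quoted config. The quoted config defines no stop operation for xen_configstore_volume1, and the log shows its stop timing out after 20000ms, so it presumably ran with the cluster's default operation timeout. The timeout values below are illustrative assumptions, not tested recommendations; everything else is copied from the quoted config.

```
# Illustrative values only (assumptions): longer monitor timeout and an
# explicit stop operation with its own timeout.
primitive xen_configstore_volume1 ocf:heartbeat:Filesystem \
    operations $id="xen_configstore_volume1-operations" \
    op monitor interval="120" timeout="120" start-delay="10" \
    op stop interval="0" timeout="120" \
    params device="/dev/xen_volume1_group/xen_configstore_volume1" directory="/etc/xen/vm" fstype="ocfs2"
```

The same pattern (a larger monitor timeout plus an explicit stop operation) could be applied to the two LVM primitives if their monitors time out as well.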