Hi,

On Tue, Aug 10, 2010 at 10:16:05AM +0200, philipp.achmuel...@arz.at wrote:
> hi,
>
> following configuration:
>
> node lnx0047a
> node lnx0047b
> primitive lnx0101a ocf:heartbeat:KVM \
>     params name="lnx0101a" \
>     meta allow-migrate="1" target-role="Started" \
>     op migrate_from interval="0" timeout="3600s" \
>     op migrate_to interval="0" timeout="3600s" \
>     op monitor interval="10s" \
>     op stop interval="0" timeout="360s"
> primitive lnx0102a ocf:heartbeat:KVM \
>     params name="lnx0102a" \
>     meta allow-migrate="1" target-role="Started" \
>     op migrate_from interval="0" timeout="3600s" \
>     op migrate_to interval="0" timeout="3600s" \
>     op monitor interval="10s" \
>     op stop interval="0" timeout="360s"
> primitive pingd ocf:pacemaker:pingd \
>     params host_list="192.168.136.100" multiplier="100" \
>     op monitor interval="15s" timeout="5s"
> primitive sbd_fence stonith:external/sbd \
>     params sbd_device="/dev/hdisk-4652-38b5" stonith-timeout="60s"
> clone fence sbd_fence \
>     meta target-role="Started"

You shouldn't run sbd as a clone.
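A plain stonith primitive is enough here; something like this (a sketch
reusing your own parameters, untested):

    primitive sbd_fence stonith:external/sbd \
        params sbd_device="/dev/hdisk-4652-38b5" stonith-timeout="60s"

i.e. keep the primitive and drop the "clone fence sbd_fence" definition.
A single instance suffices: as the initiate_remote_stonith_op message in
your log below shows, stonith-ng forwards the fencing request to a node
that can execute the device anyway.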
> clone pingdclone pingd \
>     meta globally-unique="false" target-role="Started"
> location lnx0101a_ip lnx0101a \
>     rule $id="lnx0101a_ip-rule" -inf: not_defined pingd or pingd lte 0
> location lnx0102a_ip lnx0102a \
>     rule $id="lnx0102a_ip-rule" -inf: not_defined pingd or pingd lte 0
> property $id="cib-bootstrap-options" \
>     dc-version="1.1.2-2e096a41a5f9e184a1c1537c82c6da1093698eb5" \
>     cluster-infrastructure="openais" \
>     expected-quorum-votes="2" \
>     stonith-enabled="true" \
>     stonith-action="reboot" \
>     no-quorum-policy="ignore" \
>     default-resource-stickiness="1000" \
>     last-lrm-refresh="1281364675"
>
> -------------------------------
> during a cluster test i disabled the interface that pingd is listening
> on on node lnx0047a. i get "Node lnx0047a: UNCLEAN (offline)" on
> lnx0047b, and the stonith command is executed:
>
> /var/log/messages:
> ...
> Aug 9 16:25:05 lnx0047b pengine: [22211]: WARN: pe_fence_node: Node
> lnx0047a will be fenced because it is un-expectedly down
> ...
> Aug 9 16:25:05 lnx0047b pengine: [22211]: WARN: custom_action: Action
> lnx0102a_stop_0 on lnx0047a is unrunnable (offline)
> Aug 9 16:25:05 lnx0047b pengine: [22211]: WARN: custom_action: Marking
> node lnx0047a unclean
> Aug 9 16:25:05 lnx0047b pengine: [22211]: notice: RecurringOp: Start
> recurring monitor (10s) for lnx0102a on lnx0047b
> Aug 9 16:25:05 lnx0047b pengine: [22211]: WARN: custom_action: Action
> pingd:0_stop_0 on lnx0047a is unrunnable (offline)
> Aug 9 16:25:05 lnx0047b pengine: [22211]: WARN: custom_action: Marking
> node lnx0047a unclean
> Aug 9 16:25:05 lnx0047b pengine: [22211]: WARN: custom_action: Action
> sbd_fence:0_stop_0 on lnx0047a is unrunnable (offline)
> Aug 9 16:25:05 lnx0047b pengine: [22211]: WARN: custom_action: Marking
> node lnx0047a unclean
> Aug 9 16:25:05 lnx0047b pengine: [22211]: WARN: stage6: Scheduling Node
> lnx0047a for STONITH
> Aug 9 16:25:05 lnx0047b pengine: [22211]: info: native_stop_constraints:
> lnx0102a_stop_0 is implicit after lnx0047a is fenced
> Aug 9 16:25:05 lnx0047b pengine: [22211]: info: native_stop_constraints:
> pingd:0_stop_0 is implicit after lnx0047a is fenced
> ....
> Aug 9 16:25:26 lnx0047b stonith-ng: [22207]: info:
> initiate_remote_stonith_op: Initiating remote operation reboot for
> lnx0047a: ee3d0c69-067a-423b-88bc-6d661a2b3254
> Aug 9 16:25:26 lnx0047b stonith-ng: [22207]: info: log_data_element:
> stonith_query: Query <stonith_command t="stonith-ng"
> st_async_id="ee3d0c69-067a-423b-88bc-6d661a2b3254" st_op="st_query"
> st_callid="0" st_callopt="0"
> st_remote_op="ee3d0c69-067a-423b-88bc-6d661a2b3254" st_target="lnx0047a"
> st_device_action="reboot"
> st_clientid="eba960fb-ef44-4ffb-a017-d5e01177b4ec" src="lnx0047b" seq="32"
> />
> Aug 9 16:25:26 lnx0047b stonith-ng: [22207]: info:
> can_fence_host_with_device: sbd_fence:1 can fence lnx0047a: dynamic-list
> Aug 9 16:25:26 lnx0047b stonith-ng: [22207]: info: stonith_query: Found 1
> matching devices for 'lnx0047a'
> Aug 9 16:25:26 lnx0047b stonith-ng: [22207]: info: stonith_command:
> Processed st_query from lnx0047b: rc=1
> Aug 9 16:25:26 lnx0047b stonith-ng: [22207]: info: call_remote_stonith:
> Requesting that lnx0047b perform op reboot lnx0047a
> Aug 9 16:25:26 lnx0047b stonith-ng: [22207]: info: log_data_element:
> stonith_fence: Exec <stonith_command t="stonith-ng"
> st_async_id="ee3d0c69-067a-423b-88bc-6d661a2b3254" st_op="st_fence"
> st_callid="0" st_callopt="0"
> st_remote_op="ee3d0c69-067a-423b-88bc-6d661a2b3254" st_target="lnx0047a"
> st_device_action="reboot" src="lnx0047b" seq="34" />
> Aug 9 16:25:26 lnx0047b stonith-ng: [22207]: info:
> can_fence_host_with_device: sbd_fence:1 can fence lnx0047a: dynamic-list
> Aug 9 16:25:26 lnx0047b stonith-ng: [22207]: info: stonith_fence: Found 1
> matching devices for 'lnx0047a'
> Aug 9 16:25:26 lnx0047b pengine: [22211]: WARN: process_pe_message:
> Transition 6: WARNINGs found during PE processing. PEngine Input stored
> in: /var/lib/pengine/pe-warn-102.bz2
> Aug 9 16:25:26 lnx0047b pengine: [22211]: info: process_pe_message:
> Configuration WARNINGs found during PE processing. Please run "crm_verify
> -L" to identify issues.
> Aug 9 16:25:26 lnx0047b sbd: [23278]: info: reset successfully delivered
> to lnx0047a
> Aug 9 16:25:27 lnx0047b sbd: [23845]: info: lnx0047a owns slot 1
> Aug 9 16:25:27 lnx0047b sbd: [23845]: info: Writing reset to node slot
> lnx0047a
> ....
> -------
> ps -eaf:
> ...
> root 24002 24001 0 16:25 ? 00:00:00 stonith -t external/sbd
> sbd_device /dev/hdisk-4652-38b5 stonith-timeout 60s nodename lnx0047a -T
> reset lnx0047a
> root 24007 24002 0 16:25 ? 00:00:00 /bin/bash
> /usr/lib64/stonith/plugins/external/sbd reset lnx0047a
> root 24035 22192 0 16:25 ? 00:00:00
> /usr/lib64/heartbeat/stonithd
> ...

So far it looks normal.

> lnx0047a reboots successfully, but during the startup of the lnx0047a
> image several stonith commands are being executed on the online cluster
> node:
>
> $ ps -eaf|grep ston
> root 22207 22192 0 16:15 ? 00:00:00
> /usr/lib64/heartbeat/stonithd
> root 23272 23271 0 16:25 ? 00:00:00 stonith -t external/sbd
> sbd_device /dev/hdisk-4652-38b5 stonith-timeout 60s nodename lnx0047a -T
> reset lnx0047a
> root 23277 23272 0 16:25 ? 00:00:00 /bin/bash
> /usr/lib64/stonith/plugins/external/sbd reset lnx0047a
> root 23340 23339 0 16:26 ? 00:00:00 stonith -t external/sbd
> sbd_device /dev/hdisk-4652-38b5 stonith-timeout 60s nodename lnx0047a -T
> reset lnx0047a
> root 23345 23340 0 16:26 ? 00:00:00 /bin/bash
> /usr/lib64/stonith/plugins/external/sbd reset lnx0047a
> root 23438 23437 0 16:26 ? 00:00:00 stonith -t external/sbd
> sbd_device /dev/hdisk-4652-38b5 stonith-timeout 60s nodename lnx0047a -T
> reset lnx0047a
> root 23443 23438 0 16:26 ? 00:00:00 /bin/bash
> /usr/lib64/stonith/plugins/external/sbd reset lnx0047a

This looks strange.
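One thing worth checking by hand (an untested suggestion, using the
device path from your configuration): look at the message slots on the
sbd disk and clear lnx0047a's slot if a reset is still pending there.
A pending message would match the node being reset again right after it
comes up, as you describe below.

    # show all slots and any pending messages on the shared device
    sbd -d /dev/hdisk-4652-38b5 list
    # clear a message stuck in lnx0047a's slot
    sbd -d /dev/hdisk-4652-38b5 message lnx0047a clear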
> after lnx0047a is up again, it gets stonithed automatically by
> lnx0047b, although the cluster isn't up and running (autostart
> watchdog)
>
> -----------------
> so, i'm unable to start lnx0047a until i manually kill all the stonith
> processes on lnx0047b.
>
> during the reboot cycle of lnx0047a the resources aren't able to start
> on lnx0047b:
>
> $ crm_verify -LV
> crm_verify[27816]: 2010/08/09_16:25:41 WARN: pe_fence_node: Node lnx0047a
> will be fenced because it is un-expectedly down
> crm_verify[27816]: 2010/08/09_16:25:41 WARN: determine_online_status: Node
> lnx0047a is unclean
> crm_verify[27816]: 2010/08/09_16:25:41 WARN: custom_action: Action
> lnx0101a_stop_0 on lnx0047a is unrunnable (offline)
> crm_verify[27816]: 2010/08/09_16:25:41 WARN: custom_action: Marking node
> lnx0047a unclean
> crm_verify[27816]: 2010/08/09_16:25:41 WARN: custom_action: Action
> lnx0102a_stop_0 on lnx0047a is unrunnable (offline)
> crm_verify[27816]: 2010/08/09_16:25:41 WARN: custom_action: Marking node
> lnx0047a unclean
> crm_verify[27816]: 2010/08/09_16:25:41 WARN: custom_action: Action
> pingd:0_stop_0 on lnx0047a is unrunnable (offline)
> crm_verify[27816]: 2010/08/09_16:25:41 WARN: custom_action: Marking node
> lnx0047a unclean
> crm_verify[27816]: 2010/08/09_16:25:41 WARN: custom_action: Action
> sbd_fence:1_stop_0 on lnx0047a is unrunnable (offline)
> crm_verify[27816]: 2010/08/09_16:25:41 WARN: custom_action: Marking node
> lnx0047a unclean
> crm_verify[27816]: 2010/08/09_16:25:41 WARN: stage6: Scheduling Node
> lnx0047a for STONITH
>
> ###############
> any ideas on the stonith problem?

We'd need full logs. Can you please open a bugzilla and attach a report
generated by hb_report for the incident?

> any ideas on the "unrunnable" problem?

That's expected: one can't run operations on a node which is offline.

Thanks,

Dejan

> regards
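P.S. For the hb_report: an invocation covering the incident window would
look roughly like this (the times and destination are placeholders,
adjust them to match your logs):

    hb_report -f "2010/08/09 16:20" -t "2010/08/09 16:40" /tmp/sbd-incident

It collects the logs, the CIB, and the PE inputs (such as
pe-warn-102.bz2 above) from both nodes into a single archive you can
attach to the bugzilla.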