Hi,

On Wed, Aug 11, 2010 at 11:48:17AM +0200, philipp.achmuel...@arz.at wrote:
> i removed the clone, set the global cluster property for stonith-timeout.
> 
> the nodes need about 3-5 minutes to start up after they get "shot"
> 
> i did some more tests and found out that if the node which runs resource
> sbd_fence gets "shot", the remaining node sees the stonith resource online
> on both nodes (although one of the cluster nodes is stonithed).

You meant to say "is going to be stonithed"? Anyway, this looks like a
bug. A minor one if it doesn't influence the fencing action. Please file
a bugzilla for this and attach hb_report.
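Something along these lines should collect the logs, the configuration,
and the PE inputs from both nodes for the incident window (the times and
the destination path here are only examples):

  hb_report -f "2010/08/11 11:00" -t "2010/08/11 12:00" /tmp/sbd-fence-report

That should leave a single tarball which you can attach to the bugzilla.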
> crm_mon:
> sbd_fence       (stonith:external/sbd):  Started [ lnx0047a lnx0047b ]
> 
> looking through /var/log/messages:
> 
> Aug 11 11:24:25 lnx0047a pengine: [20618]: info: determine_online_status: Node lnx0047a is online
> Aug 11 11:24:25 lnx0047a pengine: [20618]: WARN: pe_fence_node: Node lnx0047b will be fenced because it is un-expectedly down
> Aug 11 11:24:25 lnx0047a pengine: [20618]: info: determine_online_status_fencing: ha_state=active, ccm_state=false, crm_state=online, join_state=pending, expected=member
> Aug 11 11:24:25 lnx0047a pengine: [20618]: WARN: determine_online_status: Node lnx0047b is unclean
> Aug 11 11:24:25 lnx0047a pengine: [20618]: ERROR: native_add_running: Resource stonith::external/sbd:sbd_fence appears to be active on 2 nodes
> ...
> Aug 11 11:24:26 lnx0047a sbd: [22315]: info: lnx0047b owns slot 0
> Aug 11 11:24:26 lnx0047a sbd: [22315]: info: Writing reset to node slot lnx0047b
> Aug 11 11:24:26 lnx0047a sbd: [22318]: info: lnx0047b owns slot 0
> Aug 11 11:24:26 lnx0047a sbd: [22318]: info: Writing reset to node slot lnx0047b

Was the node fenced at this point? If not, are you sure that sbd was
functional?
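You can verify that from the surviving node; replace /dev/sdc1 with your
actual sbd device (the path is only an example):

  # show the slots and any pending messages on the shared disk
  sbd -d /dev/sdc1 list
  # send a harmless test message; the sbd daemon on lnx0047b should log it
  sbd -d /dev/sdc1 message lnx0047b test

If the list command hangs or the test message never shows up in the logs
of lnx0047b, that would suggest the daemon is not watching the disk.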
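One more thing worth checking: stonith-timeout should comfortably exceed
sbd's msgwait timeout, otherwise the reboot request can expire before sbd
confirms it, which may be related to the short timeout below. You can
compare the two like this (the device path and the value are examples):

  # print the watchdog and msgwait timeouts stored in the sbd header
  sbd -d /dev/sdc1 dump
  # then set stonith-timeout well above msgwait, e.g. twice its value
  crm configure property stonith-timeout=40s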
> Aug 11 11:24:28 lnx0047a stonith-ng: [20614]: ERROR: remote_op_query_timeout: Query 37724c6f-191f-407f-ad24-68028d2b6573 for lnx0047b timed out
> Aug 11 11:24:28 lnx0047a stonith-ng: [20614]: ERROR: remote_op_timeout: Action reboot (37724c6f-191f-407f-ad24-68028d2b6573) for lnx0047b timed out
> Aug 11 11:24:28 lnx0047a stonith-ng: [20614]: info: remote_op_done: Notifing clients of 37724c6f-191f-407f-ad24-68028d2b6573 (reboot of lnx0047b from 11ea7c1e-6034-48e1-b616-a10c92e53a1d by (null)): 0, rc=-7
> Aug 11 11:24:28 lnx0047a crmd: [20619]: info: log_data_element: tengine_stonith_callback: StonithOp <remote-op state="0" st_target="lnx0047b" st_op="reboot" />
> Aug 11 11:24:28 lnx0047a stonith-ng: [20614]: info: stonith_notify_client: Sending st_fence-notification to client 20619/15310d8c-6640-4799-8655-10d125b467bd
> Aug 11 11:24:28 lnx0047a crmd: [20619]: info: tengine_stonith_callback: Stonith operation 75/17:74:0:40ea951f-0c79-43af-8adb-adf8d6defe63: Operation timed out (-7)

This timeout seems to be just a few seconds, do you know why?

> Aug 11 11:24:28 lnx0047a crmd: [20619]: ERROR: tengine_stonith_callback: Stonith of lnx0047b failed (-7)... aborting transition.
> Aug 11 11:24:28 lnx0047a crmd: [20619]: info: abort_transition_graph: tengine_stonith_callback:402 - Triggered transition abort (complete=0) : Stonith failed
> Aug 11 11:24:28 lnx0047a crmd: [20619]: info: update_abort_priority: Abort priority upgraded from 0 to 1000000
> Aug 11 11:24:28 lnx0047a crmd: [20619]: info: update_abort_priority: Abort action done superceeded by restart
> Aug 11 11:24:28 lnx0047a crmd: [20619]: info: tengine_stonith_notify: Peer lnx0047b was terminated (reboot) by (null) for lnx0047a (ref=37724c6f-191f-407f-ad24-68028d2b6573): Operation timed out
> Aug 11 11:24:28 lnx0047a crmd: [20619]: info: run_graph: ====================================================
> Aug 11 11:24:28 lnx0047a crmd: [20619]: notice: run_graph: Transition 74 (Complete=5, Pending=0, Fired=0, Skipped=5, Incomplete=1, Source=/var/lib/pengine/pe-error-942.bz2): Stopped
> ...
> 
> > these entries continue infinitely until i manually stop/start the
> > sbd_fence resource.

What happened when you did that?

> ------------
> still not sure why resource lnx0101a will not start on the remaining node...

According to the logs above, the node reboot action failed, which may be
an explanation.

Thanks,

Dejan

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker