On 25.11.2013 at 15:44, Digimer wrote:
> My first thought is that the network is congested. That is a lot of
> servers to have on the system. Do you, or can you, isolate the corosync
> traffic from the DRBD traffic?
>
> Personally, I always set up a dedicated network for corosync, another
> for DRBD, and a third for all traffic to/from the servers. With this, I
> have never had a congestion-based problem.
>
> If possible, please paste all logs from both nodes, starting just
> before the STONITH occurred until recovery completed.
Hello,
DRBD and CRM traffic go over a dedicated link (two gigabit links bonded
into one). It is never saturated or congested; it barely reaches 300 Mbps
at its peaks. I have a separate link for traffic to/from the virtual
machines, and another separate link for managing the nodes (just SSH and
SNMP). I can isolate corosync onto its own link, but it will take some
time to do.
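When I do, I suppose it means pointing a dedicated totem ring at that
interface. A minimal corosync.conf sketch of what I have in mind (the
10.10.10.0 network and the multicast address are placeholders, not my
real addressing):

totem {
        version: 2
        interface {
                ringnumber: 0
                bindnetaddr: 10.10.10.0
                mcastaddr: 226.94.1.1
                mcastport: 5405
        }
}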
Now logs...
The trouble started on November 23 at 15:14.
Here is the log from node A: http://pastebin.com/yM1fqvQ6
Node B: http://pastebin.com/nwbctcgg
Node B is the one that got hit by STONITH. It was killed at 15:18:50. I
am having trouble understanding the reasons for that.
Is the reason for the STONITH that these operations took a long time to
finish?
Nov 23 15:14:49 rivendell-B lrmd: [9526]: WARN: perform_ra_op: the
operation stop[114] on XEN-piaskownica for client 9529 stayed in
operation list for 24760 ms (longer than 10000 ms)
Nov 23 15:14:50 rivendell-B lrmd: [9526]: WARN: perform_ra_op: the
operation stop[115] on XEN-acsystemy01 for client 9529 stayed in
operation list for 25760 ms (longer than 10000 ms)
Nov 23 15:15:15 rivendell-B lrmd: [9526]: WARN: perform_ra_op: the
operation stop[116] on XEN-frodo for client 9529 stayed in operation
list for 50760 ms (longer than 10000 ms)
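If I understand the escalation correctly, a stop operation that fails or
times out leads straight to fencing when STONITH is enabled, so slow Xen
shutdowns alone could explain the kill. Would raising the stop timeouts
on the Xen primitives help? Something like this (crm shell sketch; the
xmfile path and the timeout value are just examples, not my actual
config):

primitive XEN-frodo ocf:heartbeat:Xen \
        params xmfile="/etc/xen/frodo.cfg" \
        op stop interval="0" timeout="180s"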
But I wonder what made it stop those virtual machines in the first
place. Another clue is here:
Nov 23 15:15:43 rivendell-B lrmd: [9526]: WARN: configuration advice:
reduce operation contention either by increasing lrmd max_children or by
increasing intervals of monitor operations
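I suppose I could raise max-children so that operations run in parallel
instead of queueing. If I read the cluster-glue documentation correctly,
something like this should work on a running node (the value 8 is just a
guess):

  lrmadmin -p max-children 8

and, if I am not mistaken, some distributions let me persist it via
LRMD_MAX_CHILDREN in /etc/sysconfig/pacemaker.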
And here:
coro-A.log:Nov 23 15:14:19 rivendell-A pengine: [8839]: WARN:
unpack_rsc_op: Processing failed op primitive-LVM:1_last_failure_0 on
rivendell-B: not running (7)
But why "not running"? That is not actually true. (If I read the OCF
spec correctly, rc 7 is OCF_NOT_RUNNING, as returned by a monitor
operation.) There was also some trouble with fencing:
coro-A.log:Nov 23 15:22:25 rivendell-A pengine: [8839]: WARN:
unpack_rsc_op: Processing failed op fencing-of-B_last_failure_0 on
rivendell-A: unknown error (1)
coro-A.log:Nov 23 15:22:25 rivendell-A pengine: [8839]: WARN:
common_apply_stickiness: Forcing fencing-of-B away from rivendell-A
after 1000000 failures (max=1000000)
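Once I find the root cause, I assume I also have to clear that failcount
so the fencing resource is allowed back on rivendell-A, roughly:

  # one-shot cluster status including failcounts:
  crm_mon -1 -f
  # clear the failure history of the fencing resource on rivendell-A:
  crm resource cleanup fencing-of-B rivendell-A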
Thank you!
--
Michał Margula, alche...@uznam.net.pl, http://alchemyx.uznam.net.pl/
"W życiu piękne są tylko chwile" [Ryszard Riedel]
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org