On 25.11.2013 at 15:44, Digimer wrote:
> My first thought is that the network is congested. That is a lot of
> servers to have on the system. Do you, or can you, isolate the corosync
> traffic from the DRBD traffic?
>
> Personally, I always set up a dedicated network for corosync, another
> for DRBD, and a third for all traffic to/from the servers. With this, I
> have never had a congestion-based problem.
>
> If possible, please paste all logs from both nodes, starting just
> before the STONITH occurred until recovery completed.
Hello,
DRBD and CRM traffic go over a dedicated link (two gigabit links bonded
into one). It is never saturated or congested; it barely reaches 300 Mbps
at its peaks. I have a separate link for traffic to/from the virtual
machines, and another separate link for managing the nodes (just SSH and
SNMP). I can isolate corosync onto its own link, but it will take some
time to do.
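When I do, I suppose it means pointing a dedicated totem ring at that
interface. A minimal corosync.conf sketch of what I have in mind (the
10.10.10.0 network and the multicast address are placeholders, not my
real addressing):

totem {
        version: 2
        interface {
                ringnumber: 0
                bindnetaddr: 10.10.10.0
                mcastaddr: 226.94.1.1
                mcastport: 5405
        }
}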
Now logs...
The trouble started on November 23 at 15:14.
Here is the log from node A: http://pastebin.com/yM1fqvQ6
Node B: http://pastebin.com/nwbctcgg
Node B is the one that got hit by STONITH. It was killed at 15:18:50. I
am having trouble understanding the reasons for that.
Is the reason for the STONITH that these operations took a long time to
finish?
Nov 23 15:14:49 rivendell-B lrmd: [9526]: WARN: perform_ra_op: the
operation stop[114] on XEN-piaskownica for client 9529 stayed in
operation list for 24760 ms (longer than 10000 ms)
Nov 23 15:14:50 rivendell-B lrmd: [9526]: WARN: perform_ra_op: the
operation stop[115] on XEN-acsystemy01 for client 9529 stayed in
operation list for 25760 ms (longer than 10000 ms)
Nov 23 15:15:15 rivendell-B lrmd: [9526]: WARN: perform_ra_op: the
operation stop[116] on XEN-frodo for client 9529 stayed in operation
list for 50760 ms (longer than 10000 ms)
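If I understand the escalation correctly, a stop operation that fails or
times out leads straight to fencing when STONITH is enabled, so slow Xen
shutdowns alone could explain the kill. Would raising the stop timeouts
on the Xen primitives help? Something like this (crm shell sketch; the
xmfile path and the timeout value are just examples, not my actual
config):

primitive XEN-frodo ocf:heartbeat:Xen \
        params xmfile="/etc/xen/frodo.cfg" \
        op stop interval="0" timeout="180s"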
But I wonder what made it stop those virtual machines in the first
place. Another clue is here:
Nov 23 15:15:43 rivendell-B lrmd: [9526]: WARN: configuration advice:
reduce operation contention either by increasing lrmd max_children or by
increasing intervals of monitor operations
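I suppose I could raise max-children so that operations run in parallel
instead of queueing. If I read the cluster-glue documentation correctly,
something like this should work on a running node (the value 8 is just a
guess):

  lrmadmin -p max-children 8

and, if I am not mistaken, some distributions let me persist it via
LRMD_MAX_CHILDREN in /etc/sysconfig/pacemaker.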
And here:
coro-A.log:Nov 23 15:14:19 rivendell-A pengine: [8839]: WARN:
unpack_rsc_op: Processing failed op primitive-LVM:1_last_failure_0 on
rivendell-B: not running (7)
But why "not running"? That is not actually true. (If I read the OCF
spec correctly, rc 7 is OCF_NOT_RUNNING, as returned by a monitor
operation.) There was also some trouble with fencing:
coro-A.log:Nov 23 15:22:25 rivendell-A pengine: [8839]: WARN:
unpack_rsc_op: Processing failed op fencing-of-B_last_failure_0 on
rivendell-A: unknown error (1)
coro-A.log:Nov 23 15:22:25 rivendell-A pengine: [8839]: WARN:
common_apply_stickiness: Forcing fencing-of-B away from rivendell-A
after 1000000 failures (max=1000000)
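Once I find the root cause, I assume I also have to clear that failcount
so the fencing resource is allowed back on rivendell-A, roughly:

  # one-shot cluster status including failcounts:
  crm_mon -1 -f
  # clear the failure history of the fencing resource on rivendell-A:
  crm resource cleanup fencing-of-B rivendell-A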
Thank you!
--
Michał Margula, alche...@uznam.net.pl, http://alchemyx.uznam.net.pl/
"W życiu piękne są tylko chwile" [Ryszard Riedel]
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org