On 25.11.2013 15:44, Digimer wrote:
My first thought is that the network is congested. That is a lot of
servers to have on the system. Do you, or can you, isolate the corosync
traffic from the drbd traffic?

Personally, I always set up a dedicated network for corosync, another for
drbd, and a third for all traffic to/from the servers. With this, I have
never had a congestion-based problem.

If possible, please paste all logs from both nodes, starting just before
the STONITH occurred until recovery completed.


Hello,

DRBD and CRM go over a dedicated link (two gigabit links bonded into one). It is never saturated or congested; it barely reaches 300 Mbps at peak. I have a separate link for traffic to/from the virtual machines, and another separate link for managing the nodes (just SSH and SNMP). I can isolate corosync onto its own link, but it would take some time to do.
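If I do move corosync onto its own link, I assume it mostly comes down to pointing a totem interface at that network. A minimal sketch, assuming corosync 1.x with multicast; the subnet, multicast group and port below are just placeholders, not what I actually run:

# /etc/corosync/corosync.conf (fragment) -- dedicated cluster interconnect
totem {
        version: 2
        interface {
                ringnumber: 0
                bindnetaddr: 10.10.10.0      # network address of the dedicated corosync link
                mcastaddr: 239.255.42.1      # placeholder multicast group
                mcastport: 5405
        }
}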

Now logs...

Trouble started on November 23 at 15:14.
Here is a log from "A" node: http://pastebin.com/yM1fqvQ6
Node B: http://pastebin.com/nwbctcgg

Node B is the one that got hit by STONITH. It got killed at 15:18:50. I have some trouble understanding the reasons for that.

Is the reason for the STONITH that those operations took a long time to finish?

Nov 23 15:14:49 rivendell-B lrmd: [9526]: WARN: perform_ra_op: the operation stop[114] on XEN-piaskownica for client 9529 stayed in operation list for 24760 ms (longer than 10000 ms)
Nov 23 15:14:50 rivendell-B lrmd: [9526]: WARN: perform_ra_op: the operation stop[115] on XEN-acsystemy01 for client 9529 stayed in operation list for 25760 ms (longer than 10000 ms)
Nov 23 15:15:15 rivendell-B lrmd: [9526]: WARN: perform_ra_op: the operation stop[116] on XEN-frodo for client 9529 stayed in operation list for 50760 ms (longer than 10000 ms)
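My understanding is that a stop operation that exceeds its timeout is exactly what gets escalated to fencing when STONITH is enabled, which would explain the sequence above. If so, I suppose one mitigation is to give the Xen guests more generous stop timeouts. A sketch with crmsh; the 300s value is only my guess, not something taken from the logs:

# raise the stop timeout on one of the guests, e.g. XEN-frodo
crm configure edit XEN-frodo
#   ...then inside the primitive definition adjust the operation:
#   op stop interval="0" timeout="300s"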

But I wonder what made it stop those virtual machines in the first place. Another clue is here:

Nov 23 15:15:43 rivendell-B lrmd: [9526]: WARN: configuration advice: reduce operation contention either by increasing lrmd max_children or by increasing intervals of monitor operations
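I can try to follow that advice. The monitor intervals I can change from crmsh; max_children itself is, I believe, set through lrmadmin (or LRMD_MAX_CHILDREN in /etc/sysconfig/pacemaker), but I still need to check which applies to our build. A sketch, with values that are only guesses:

# spread out monitor operations so fewer run at once, e.g. for one guest:
crm configure edit XEN-piaskownica
#   op monitor interval="60s" timeout="60s"

# raise lrmd parallelism (exact mechanism to be confirmed on our build):
# lrmadmin -p max-children 8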

And here:

coro-A.log:Nov 23 15:14:19 rivendell-A pengine: [8839]: WARN: unpack_rsc_op: Processing failed op primitive-LVM:1_last_failure_0 on rivendell-B: not running (7)

But why not running? It is not really true. There is also some trouble with fencing:

coro-A.log:Nov 23 15:22:25 rivendell-A pengine: [8839]: WARN: unpack_rsc_op: Processing failed op fencing-of-B_last_failure_0 on rivendell-A: unknown error (1)
coro-A.log:Nov 23 15:22:25 rivendell-A pengine: [8839]: WARN: common_apply_stickiness: Forcing fencing-of-B away from rivendell-A after 1000000 failures (max=1000000)
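As I read it, once fencing-of-B reaches that failure cap it will not be placed on rivendell-A again until its failcount is cleared, so after the root cause is sorted out I suppose I will need something like:

# clear the failure history so the policy engine can place the fencing resource again
crm resource cleanup fencing-of-B
# (or: crm_resource --cleanup --resource fencing-of-B)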

Thank you!

--
Michał Margula, alche...@uznam.net.pl, http://alchemyx.uznam.net.pl/
"W życiu piękne są tylko chwile" [Ryszard Riedel]

