On 12/21/2011 09:47 PM, Ivan Savčić | Epix wrote:
> Hello,
>
> We are having a problem with a 3-node cluster based on
> Pacemaker/Corosync with 2 primary DRBD+OCFS2 nodes and a quorum node.
> Nodes run on Debian Squeeze, all packages are from the stable branch
> except for Corosync (which is from backports for udpu functionality).
> Each node has a single network card.
>
> When the network is up, everything works without any problems, graceful
> shutdown of resources on any node works as intended and doesn't reflect
> on the remaining cluster partition.
>
> When the network is down on one OCFS2 node, Pacemaker
> (no-quorum-policy="stop") tries to shut the resources down on that node,
> but fails to stop the OCFS2 filesystem resource stating that it is "in
> use".
>
> *Both* OCFS2 nodes (i.e. the one with the network down and the one which
> is still up in the partition with quorum) hang with dmesg reporting that
> events, ocfs2rec and ocfs2_wq are "blocked for more than 120 seconds".

My guess would be:
The filesystem can't stop on the non-quorate node, because the network
connection is down, so DLM can't do its thing.
The filesystem is probably frozen on the quorate node, because of loss
of DLM comms.
If STONITH is configured, the non-quorate node should be killed after a
failed (or timed out) stop, and the quorate node should resume behaving
normally.
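
As a rough sketch of what that could look like with the crm shell (an
assumption on my part: the nodes have IPMI-capable management boards; the
node name, address and credentials below are placeholders, and one such
primitive would be needed per node to be fenced):

```
# Placeholders throughout: "node1", the IP address and the credentials
# are illustrative only. Substitute the fencing plugin that matches
# your hardware if it is not IPMI.
crm configure primitive st-node1 stonith:external/ipmi \
    params hostname="node1" ipaddr="192.0.2.10" userid="admin" \
           passwd="secret" interface="lan" \
    op monitor interval="60s"

# Keep the fencing resource off the node it is meant to kill.
crm configure location l-st-node1 st-node1 -inf: node1

# Enable fencing cluster-wide once the devices are defined.
crm configure property stonith-enabled="true"
```

Whatever device is used, it has to be reachable from the surviving nodes
while the failed node's network link is down, otherwise the fence
operation itself will fail.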
HTH,
Tim
--
Tim Serong
Senior Clustering Engineer
SUSE
tser...@suse.com
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org