Re: [Pacemaker] OCFS2 problems when connectivity lost

Tim Serong Wed, 21 Dec 2011 04:12:13 -0800

On 12/21/2011 09:47 PM, Ivan Savčić | Epix wrote:

Hello,



We are having a problem with a 3-node cluster based on
Pacemaker/Corosync with 2 primary DRBD+OCFS2 nodes and a quorum node.

Nodes run on Debian Squeeze, all packages are from the stable branch
except for Corosync (which is from backports for udpu functionality).
Each node has a single network card.

When the network is up, everything works without any problems; graceful
shutdown of resources on any node works as intended and doesn't affect
the remaining cluster partition.

When the network is down on one OCFS2 node, Pacemaker
(no-quorum-policy="stop") tries to shut the resources down on that node,
but fails to stop the OCFS2 filesystem resource stating that it is "in
use".

*Both* OCFS2 nodes (i.e. the one with the network down and the one which
is still up in the partition with quorum) hang with dmesg reporting that
events, ocfs2rec and ocfs2_wq are "blocked for more than 120 seconds".


My guess would be:

The filesystem can't stop on the non-quorate node, because the network connection is down, so DLM can't do its thing.

The filesystem is probably frozen on the quorate node, because of loss of DLM comms.

If STONITH is configured, the non-quorate node should be killed after a failed (or timed out) stop, and the quorate node should resume behaving normally.
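
To make the guess above a bit more concrete, here is a toy model of it in Python. This is only a sketch of the reasoning, not Pacemaker code: the names Outcome and recovery_outcome are made up for illustration, and the real policy engine takes many more inputs into account.

```python
from enum import Enum


class Outcome(Enum):
    """Possible end states for the node that lost quorum (illustrative names)."""
    STOPPED_CLEANLY = "resources stopped cleanly"
    FENCED = "node fenced; the quorate node should be able to resume"
    BLOCKED = "stop failed and no fencing available; filesystem stays stuck"


def recovery_outcome(stop_succeeded: bool, stonith_enabled: bool) -> Outcome:
    """Simplified model of a non-quorate node under no-quorum-policy="stop".

    stop_succeeded:  whether the filesystem resource stopped before its timeout
    stonith_enabled: whether a working STONITH device is configured
    """
    if stop_succeeded:
        # Nothing to recover from: the resources are down on that node.
        return Outcome.STOPPED_CLEANLY
    if stonith_enabled:
        # A failed (or timed out) stop escalates to fencing the node.
        return Outcome.FENCED
    # Without fencing the cluster cannot be sure the node is really gone.
    return Outcome.BLOCKED


if __name__ == "__main__":
    for stop_ok in (True, False):
        for stonith in (True, False):
            result = recovery_outcome(stop_ok, stonith)
            print(f"stop_ok={stop_ok!s:5} stonith={stonith!s:5} -> {result.value}")
```

The case described in the report corresponds to recovery_outcome(False, False), assuming no STONITH device is configured: the stop fails because the filesystem is "in use", nothing fences the node, and both sides stay hung.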


HTH,

Tim
--
Tim Serong
Senior Clustering Engineer
SUSE
[email protected]

_______________________________________________
Pacemaker mailing list: [email protected]
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
