On 08/06/2012 13:01, Juan M. Sierra wrote:
Problem with state: UNCLEAN (OFFLINE)
Hello,
I'm trying to bring up an ldirectord service with Pacemaker,
but I ran into a problem with the UNCLEAN (offline) state. The initial
state of my cluster was this:
    Online: [ node2 node1 ]

     node1-STONITH  (stonith:external/ipmi):  Started node2
     node2-STONITH  (stonith:external/ipmi):  Started node1
     Clone Set: Connected
         Started: [ node2 node1 ]
     Clone Set: ldirector-activo-activo
         Started: [ node2 node1 ]
     ftp-vip        (ocf::heartbeat:IPaddr):  Started node1
     web-vip        (ocf::heartbeat:IPaddr):  Started node2

    Migration summary:
    * Node node1: pingd=2000
    * Node node2: pingd=2000
       node2-STONITH: migration-threshold=1000000 fail-count=1000000
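For context, the two fencing devices are defined in the usual way for
external/ipmi, roughly like the sketch below (placeholder BMC address and
credentials, not my exact values; the location constraint keeps each
device off the node it is supposed to kill):

    crm configure primitive node1-STONITH stonith:external/ipmi \
        params hostname=node1 ipaddr=192.168.1.101 \
               userid=admin passwd=secret interface=lan \
        op monitor interval=60s
    # never run a node's fencing device on the node itself
    crm configure location node1-STONITH-placement node1-STONITH -inf: node1

Note that node2-STONITH already shows fail-count=1000000 even in this
"healthy" state, so that device had apparently never started cleanly.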
Then I pulled the power cable on node1, and the state became:
    Node node1 (8b2aede9-61bb-4a5a-aef6-25fbdefdddfd): UNCLEAN (offline)
    Online: [ node2 ]

     node1-STONITH  (stonith:external/ipmi):  Started node2 FAILED
     Clone Set: Connected
         Started: [ node2 ]
         Stopped: [ ping:1 ]
     Clone Set: ldirector-activo-activo
         Started: [ node2 ]
         Stopped: [ ldirectord:1 ]
     web-vip        (ocf::heartbeat:IPaddr):  Started node2

    Migration summary:
    * Node node2: pingd=2000
       node2-STONITH: migration-threshold=1000000 fail-count=1000000
       node1-STONITH: migration-threshold=1000000 fail-count=1000000

    Failed actions:
        node2-STONITH_start_0 (node=node2, call=22, rc=2, status=complete): invalid parameter
        node1-STONITH_monitor_60000 (node=node2, call=11, rc=14, status=complete): status: unknown
        node1-STONITH_start_0 (node=node2, call=34, rc=1, status=complete): unknown error
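Given the "invalid parameter" and "unknown error" results above, I
suppose the IPMI settings themselves deserve a check from outside the
cluster first. A quick manual test from node2 (same placeholder address
and credentials as above, assuming the BMC speaks IPMI-over-LAN):

    # can node2 actually reach node1's BMC and query its power state?
    ipmitool -I lan -H 192.168.1.101 -U admin -P secret chassis power status

If that fails, the external/ipmi agent will fail the same way; once the
parameters are fixed, the fail-counts can be cleared with
"crm resource cleanup node1-STONITH" (and likewise for node2-STONITH).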
I was hoping node2 would take over the ftp-vip resource, but it didn't
happen that way: node1 stayed in the UNCLEAN state and node2 did not
take over its resources. Only when I plugged node1 back in and it had
recovered did node2 take over the ftp-vip resource.
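From what I've read, when fencing cannot complete, an operator who has
physically verified that the node is down can apparently tell the
cluster so by hand, for example (crmsh syntax; stonith_admin --confirm
should do the same on recent Pacemaker):

    # only after physically checking that node1 is really powered off:
    # this tells the cluster to treat node1 as cleanly down
    crm node clearstate node1

but I'd like to understand whether that is the intended way to recover here.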
I've seen some similar conversations here. Could you please give me
some pointers on this subject, or a thread where it is discussed?
Thanks a lot!
Regards,
This has been discussed for resource failover, but I guess it's the same issue:
http://oss.clusterlabs.org/pipermail/pacemaker/2012-May/014260.html
The motto here (I discovered it a couple of days ago) is "better to have
a hung cluster than a corrupted one, especially with shared
filesystems/resources."
So, node1 failed, but node2 was not able to confirm its death because
STONITH apparently failed. The design choice is then for the cluster to
hang while waiting for a way to learn the real state of node1 (at reboot
in this case).
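One common gotcha with external/ipmi in exactly this test: if the BMC is
powered from the same feed as the node, pulling the cord kills the BMC
too, so the fence agent has nothing to talk to and can never confirm the
kill. A manual fencing attempt from node2 should then fail just like the
cluster's own attempts (stonith_admin ships with Pacemaker 1.1):

    # ask the fencing subsystem to reboot node1 by hand; with the BMC
    # unpowered this is expected to fail, confirming the diagnosis
    stonith_admin --reboot node1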
--
Cheers,
Florian Crouzat
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org