On 08/06/2012 13:01, Juan M. Sierra wrote:
Problem with state: UNCLEAN (OFFLINE)

Hello,

I'm trying to bring up an ldirectord service with Pacemaker.

But I ran into a problem with the UNCLEAN (offline) state. The initial
state of my cluster was this:

    Online: [ node2 node1 ]

    node1-STONITH (stonith:external/ipmi): Started node2
    node2-STONITH (stonith:external/ipmi): Started node1
    Clone Set: Connected
        Started: [ node2 node1 ]
    Clone Set: ldirector-activo-activo
        Started: [ node2 node1 ]
    ftp-vip (ocf::heartbeat:IPaddr): Started node1
    web-vip (ocf::heartbeat:IPaddr): Started node2

    Migration summary:
    * Node node1: pingd=2000
    * Node node2: pingd=2000
       node2-STONITH: migration-threshold=1000000 fail-count=1000000
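
For reference, a configuration along these lines (crm shell syntax) would produce roughly the status above. This is only a hedged sketch: the IPMI/BMC addresses, credentials, ping targets and VIP addresses are placeholders, not values from this cluster, and the ldirectord agent class is assumed:

    # Hedged sketch in crm shell syntax; all addresses/credentials are placeholders.
    primitive node1-STONITH stonith:external/ipmi \
        params hostname=node1 ipaddr=10.0.0.1 userid=admin passwd=secret interface=lan \
        op monitor interval=60s
    primitive node2-STONITH stonith:external/ipmi \
        params hostname=node2 ipaddr=10.0.0.2 userid=admin passwd=secret interface=lan \
        op monitor interval=60s
    # A STONITH resource should not run on the node it is meant to fence.
    location l-node1-stonith node1-STONITH -inf: node1
    location l-node2-stonith node2-STONITH -inf: node2
    # pingd attribute seen in the migration summary (2 targets x multiplier 1000 = 2000).
    primitive ping ocf:pacemaker:ping \
        params host_list="10.0.0.253 10.0.0.254" multiplier=1000 \
        op monitor interval=15s
    clone Connected ping
    primitive ldirectord ocf:heartbeat:ldirectord op monitor interval=20s
    clone ldirector-activo-activo ldirectord
    primitive ftp-vip ocf:heartbeat:IPaddr params ip=192.168.1.201
    primitive web-vip ocf:heartbeat:IPaddr params ip=192.168.1.202
    property stonith-enabled=true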

Then I cut the power to node1, and the state became the following:

    Node node1 (8b2aede9-61bb-4a5a-aef6-25fbdefdddfd): UNCLEAN (offline)
    Online: [ node2 ]

    node1-STONITH (stonith:external/ipmi): Started node2 FAILED
    Clone Set: Connected
        Started: [ node2 ]
        Stopped: [ ping:1 ]
    Clone Set: ldirector-activo-activo
        Started: [ node2 ]
        Stopped: [ ldirectord:1 ]
    web-vip (ocf::heartbeat:IPaddr): Started node2

    Migration summary:
    * Node node2: pingd=2000
       node2-STONITH: migration-threshold=1000000 fail-count=1000000
       node1-STONITH: migration-threshold=1000000 fail-count=1000000

    Failed actions:
        node2-STONITH_start_0 (node=node2, call=22, rc=2, status=complete): invalid parameter
        node1-STONITH_monitor_60000 (node=node2, call=11, rc=14, status=complete): status: unknown
        node1-STONITH_start_0 (node=node2, call=34, rc=1, status=complete): unknown error
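
The failed actions above ("invalid parameter", "unknown error") suggest the external/ipmi fencing devices themselves were never working. A hedged first step, assuming the BMC address and credentials below are placeholders, would be to test the fencing layer outside Pacemaker and then clear the accumulated fail-counts:

    # List the parameters the external/ipmi plugin expects (cluster-glue):
    stonith -t external/ipmi -n
    # Talk to node1's BMC directly (placeholder address/credentials):
    ipmitool -I lan -H 10.0.0.1 -U admin -P secret chassis power status
    # Once the device answers, clear the fail-counts so the STONITH
    # resources can be started again:
    crm resource cleanup node1-STONITH
    crm resource cleanup node2-STONITH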

I was hoping that node2 would take over the ftp-vip resource, but that
didn't happen: node1 stayed in the unclean state and node2 did not take
over its resources. Only when I restored node1's power and it had
recovered did node2 take over the ftp-vip resource.

I've seen some similar conversations here. Could you please give me some
pointers on this subject, or a thread where it is discussed?

Thanks a lot!

Regards,


This has been discussed in the context of resource failover, but I guess the same applies: http://oss.clusterlabs.org/pipermail/pacemaker/2012-May/014260.html

The motto here (I discovered it a couple of days ago) is "better to have a hung cluster than a corrupted one, especially with shared filesystems/resources". So node1 failed, but node2 was unable to confirm its death because STONITH apparently failed as well; the design choice is therefore for the cluster to hang while it waits for a way to learn the real state of node1 (at reboot, in this case).
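
For completeness, and only as a sketch to be checked against your Pacemaker/crmsh versions: if you are absolutely certain the lost node is really powered off, you can tell the cluster so by hand, and it will then recover that node's resources:

    # DANGEROUS unless node1 is really down: acknowledge the fencing manually.
    stonith_admin --confirm node1
    # crm shell equivalent:
    crm node clearstate node1
    # Then watch ftp-vip move to node2:
    crm_mon -1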


--
Cheers,
Florian Crouzat
