On 6/6/07, kisalay <[EMAIL PROTECTED]> wrote:
Hi,
I am running linux-ha 2.0.8 on a 2-node system.
I ran into a problem during failover from one node to another.
The sequence of actions:
1. The two nodes are running on unequal h/w: int01 is a dell-2950 with a
single cpu and 4 gb ram; int02 is a dell-2950 with dual cpus and 8 gb ram.
2. The node int02 was initially active and int01 was standby.
as in just not running any resources, or really in standby mode where
it's not allowed to run resources?
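e.g. something like this should tell you (a sketch, assuming the crm_standby
helper that ships with the heartbeat 2.x CLI tools):

    # was int01 merely idle, or explicitly marked as standby?
    crm_standby -G -U indica-int01.pune.nevisnetworks.com

    # and to flip it explicitly
    crm_standby -v on  -U indica-int01.pune.nevisnetworks.com
    crm_standby -v off -U indica-int01.pune.nevisnetworks.com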
3. In my setup, whenever failover happens, the node taking over restarts
heartbeat on the previously active node to make it forget all the failcounts.
This is done in the start script of the first resource, roughly as sketched
below.
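For illustration, a minimal sketch of what such a hook in the start script
might look like (hypothetical peer name, ssh-based restart; the real script
is more involved):

    #!/bin/sh
    # start-script hook: restart heartbeat on the previously active node so
    # that it forgets its failcounts and other transient state
    PEER=indica-int02.pune.nevisnetworks.com    # example peer name

    # restart in the background so our own start() can return promptly
    ssh "$PEER" "/etc/init.d/heartbeat restart" &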
4. Due to resource failures (process kills), failover happens from
int02 to int01.
5. When int01 becomes active, int02 is still the DC and is running pengine
and tengine.
6. When int01 starts its resources, the restart is issued to the heartbeat
on int02.
automatically or manually?
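fwiw, if the only goal of that restart is to make the old node forget its
failcounts, something like crm_failcount should do it without bouncing
heartbeat at all (a rough sketch, assuming the 2.x CLI tools and a made-up
resource name):

    # clear the failcount for one resource on the old node
    crm_failcount -D -U indica-int02.pune.nevisnetworks.com -r my_resource

    # or just read it back first to see where it stands
    crm_failcount -G -U indica-int02.pune.nevisnetworks.com -r my_resource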
7. When the int02 heartbeat is shutting down, the following logs are seen:
May 27 17:02:47 indica-int02 cib: [1686]: info: cib_shutdown: Disconnected
0 clients
May 27 17:02:47 indica-int02 cib: [1686]: info: cib_process_disconnect: All
clients disconnected...
May 27 17:02:47 indica-int02 cib: [1686]: info: initiate_exit: Sending
disconnect notification to 2 peers...
May 27 17:02:52 indica-int02 cib: [1686]: notice: cib_force_exit: Forcing
exit!
this means that when it sent out a message saying "i'm outta here",
it didn't get a response from its peer - which is odd but not
necessarily a problem
May 27 17:02:52 indica-int02 cib: [1686]: info: terminate_ha_connection:
cib_force_exit: Disconnecting heartbeat
May 27 17:02:52 indica-int02 cib: [1686]: info: cib_ha_connection_destroy:
Heartbeat disconnection complete... exiting
May 27 17:02:52 indica-int02 cib: [1686]: info: uninitializeCib: The CIB
has been deallocated.
This suggests that the cib was forced to exit on int02. Is this because of an anomaly?
8. When int02 comes up, the following logs are seen:
May 27 17:04:01 indica-int02 crmd: [21363]: info: do_state_transition:
indica-int02.pune.nevisnetworks.com: State transition
S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_IPC_MESSAGE
origin=route_message ]
May 27 17:04:01 indica-int02 cib: [21735]: info: write_cib_contents: Wrote
version 0.12.441 of the CIB to disk (digest:
f1cf5bc300318744927e2fa7c6a48d75)
May 27 17:04:03 indica-int02 cib: [21359]: WARN: cib_peer_callback:
Discarding cib_shutdown_req message (529d0) from
indica-int01.pune.nevisnetworks.com: not in our membership
May 27 17:04:08 indica-int02 cib: [21359]: WARN: cib_peer_callback:
Discarding cib_update message (529dd) from
indica-int01.pune.nevisnetworks.com: not in our membership
also at similar time, the int01 logs say:
May 27 17:04:15 indica-int01 crmd: [32642]: info: ccm_event_detail: NEW
MEMBERSHIP: trans=13, nodes=1, new=0, lost=1 n_idx=0, new_idx=1, old_idx=3
May 27 17:04:15 indica-int01 crmd: [32642]: info: ccm_event_detail:
CURRENT: indica-int01.pune.nevisnetworks.com [nodeid=0,
born=13]
May 27 17:04:15 indica-int01 cib: [32638]: info: cib_diff_notify: Update
(client: 21363, call:23): 0.12.427 -> 0.12.428 (ok)
May 27 17:04:15 indica-int01 crmd: [32642]: info: ccm_event_detail:
LOST: indica-int02.pune.nevisnetworks.com [nodeid=1,
born=10]
May 27 17:04:15 indica-int01 crmd: [32642]: info: do_election_check: Still
waiting on 1 non-votes (1 total)
So on the one hand, int02 fails to recognise int01 as being in the cluster,
and on the other hand, int01 reports that int02 is offline.
looks to me like the restart of heartbeat is happening too fast for
the CCM to handle.
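if you want to see the mismatch directly, compare what each side thinks the
membership is (a rough sketch using the stock 2.x tools; run it on both
nodes):

    # one-shot cluster status as this node sees it
    crm_mon -1

    # heartbeat-level view of the peer
    cl_status listnodes
    cl_status nodestatus indica-int01.pune.nevisnetworks.com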
9. After this, int01 has the following logs:
May 27 17:04:19 indica-int01 crmd: [32642]: info: ccm_event_detail: NEW
MEMBERSHIP: trans=15, nodes=2, new=1, lost=0 n_idx=0, new_idx=2, old_idx=4
May 27 17:04:19 indica-int01 crmd: [32642]: info: ccm_event_detail:
CURRENT: indica-int02.pune.nevisnetworks.com [nodeid=1,
born=1]
May 27 17:04:19 indica-int01 crmd: [32642]: info: ccm_event_detail:
CURRENT: indica-int01.pune.nevisnetworks.com [nodeid=0,
born=15]
May 27 17:04:19 indica-int01 crmd: [32642]: info: ccm_event_detail:
NEW: indica-int02.pune.nevisnetworks.com [nodeid=1,
born=1]
May 27 17:04:19 indica-int01 crmd: [32642]: info: do_election_check: Still
waiting on 2 non-votes (2 total)
May 27 17:04:19 indica-int01 crmd: [32642]: notice:
crmd_ha_status_callback: Status update: Node
indica-int02.pune.nevisnetworks.com now has status [init]
10. The resources have failed over to int01, but int02 fails to recognise
int01. The pengine and tengine are restarted on int02 and try to start the
resources on int02 as well, while they are still running on int01.
classic split-brain behavior i'm afraid :-(
the biggest problem being that there is no reason for the CCM to think
there is one
I went through the release notes, which say:
- When running a cluster of nodes of very different speeds temporary
membership anomalies may occasionally be seen. These correct
themselves and don't appear to be harmful. They typically
include a message something like this:
WARN: Ignoring HA message (op=vote) from XXX: not in our
membership list
and also through the description of bug 1367. Is the problem I saw somehow
related to these already-reported issues? If so, is there any deterministic
way of avoiding/reproducing the issue?
not that i'm happy about saying this, but maybe try sticking a "sleep
30" between when you stop and start heartbeat. that should give the
CCM time to sort itself out before the node comes back again.
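something along these lines, i.e. stop, wait, then start (sketch only):

    /etc/init.d/heartbeat stop
    sleep 30    # give the CCM on the surviving node time to notice the leave
    /etc/init.d/heartbeat start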
and out of interest, why are you restarting heartbeat?