Hi,
I am running Linux-HA 2.0.8 on a two-node cluster.
I ran into a problem during a failover from one node to the other.
The sequence of actions:
1. The two nodes run on unequal hardware: int01 is a Dell 2950 with a single
CPU and 4 GB RAM, while int02 is a Dell 2950 with two CPUs and 8 GB RAM.
2. The node int02 was initially active and int01 was standby.
3. In my setup, whenever a failover happens, the node taking over restarts
heartbeat on the previously active node so that it forgets all its
failcounts. This is done from the start script of the first resource (see
the sketch after this list).
4. Due to resource failures (process kills), a failover happens from int02
to int01.
5. When int01 becomes active, int02 is still the DC and is running pengine
and tengine.
6. When int01 starts its resources, the restart is issued to heartbeat on
int02.
7. While heartbeat on int02 is shutting down, the following logs are seen:
May 27 17:02:47 indica-int02 cib: [1686]: info: cib_shutdown: Disconnected 0
clients
May 27 17:02:47 indica-int02 cib: [1686]: info: cib_process_disconnect: All
clients disconnected...
May 27 17:02:47 indica-int02 cib: [1686]: info: initiate_exit: Sending
disconnect notification to 2 peers...
May 27 17:02:52 indica-int02 cib: [1686]: notice: cib_force_exit: Forcing
exit!
May 27 17:02:52 indica-int02 cib: [1686]: info: terminate_ha_connection:
cib_force_exit: Disconnecting heartbeat
May 27 17:02:52 indica-int02 cib: [1686]: info: cib_ha_connection_destroy:
Heartbeat disconnection complete... exiting
May 27 17:02:52 indica-int02 cib: [1686]: info: uninitializeCib: The CIB has
been deallocated.
This suggests that the cib forced its own exit on int02. Is this caused by
some anomaly?
8. When int02 comes back up, the following logs are seen:
May 27 17:04:01 indica-int02 crmd: [21363]: info: do_state_transition:
indica-int02.pune.nevisnetworks.com: State transition S_TRANSITION_ENGINE ->
S_IDLE [ input=I_TE_SUCCESS cause=C_IPC_MESSAGE origin=route_message ]
May 27 17:04:01 indica-int02 cib: [21735]: info: write_cib_contents: Wrote
version 0.12.441 of the CIB to disk (digest:
f1cf5bc300318744927e2fa7c6a48d75)
May 27 17:04:03 indica-int02 cib: [21359]: WARN: cib_peer_callback:
Discarding cib_shutdown_req message (529d0) from
indica-int01.pune.nevisnetworks.com: not in our membership
May 27 17:04:08 indica-int02 cib: [21359]: WARN: cib_peer_callback:
Discarding cib_update message (529dd) from
indica-int01.pune.nevisnetworks.com: not in our membership
At around the same time, the int01 logs say:
May 27 17:04:15 indica-int01 crmd: [32642]: info: ccm_event_detail: NEW
MEMBERSHIP: trans=13, nodes=1, new=0, lost=1 n_idx=0, new_idx=1, old_idx=3
May 27 17:04:15 indica-int01 crmd: [32642]: info: ccm_event_detail:
CURRENT: indica-int01.pune.nevisnetworks.com [nodeid=0, born=13]
May 27 17:04:15 indica-int01 cib: [32638]: info: cib_diff_notify: Update
(client: 21363, call:23): 0.12.427 -> 0.12.428 (ok)
May 27 17:04:15 indica-int01 crmd: [32642]: info: ccm_event_detail:
LOST: indica-int02.pune.nevisnetworks.com [nodeid=1, born=10]
May 27 17:04:15 indica-int01 crmd: [32642]: info: do_election_check: Still
waiting on 1 non-votes (1 total)
So, on the one hand, int02 fails to recognise int01 as part of the cluster,
and on the other hand, int01 reports that int02 is offline.
9. After this, int01 has the following logs:
May 27 17:04:19 indica-int01 crmd: [32642]: info: ccm_event_detail: NEW
MEMBERSHIP: trans=15, nodes=2, new=1, lost=0 n_idx=0, new_idx=2, old_idx=4
May 27 17:04:19 indica-int01 crmd: [32642]: info: ccm_event_detail:
CURRENT: indica-int02.pune.nevisnetworks.com [nodeid=1, born=1]
May 27 17:04:19 indica-int01 crmd: [32642]: info: ccm_event_detail:
CURRENT: indica-int01.pune.nevisnetworks.com [nodeid=0, born=15]
May 27 17:04:19 indica-int01 crmd: [32642]: info: ccm_event_detail:
NEW: indica-int02.pune.nevisnetworks.com [nodeid=1, born=1]
May 27 17:04:19 indica-int01 crmd: [32642]: info: do_election_check: Still
waiting on 2 non-votes (2 total)
May 27 17:04:19 indica-int01 crmd: [32642]: notice: crmd_ha_status_callback:
Status update: Node indica-int02.pune.nevisnetworks.com now has status
[init]
10. The resources have failed over to int01, but int02 still fails to
recognise int01. The pengine and tengine are restarted on int02, and they
try to start the resources on int02 as well, while they are still running
on int01.
I went through the release notes, which say:
- When running a cluster of nodes of very different speeds temporary
membership anomalies may occasionally be seen. These correct
themselves and don't appear to be harmful. They typically
include a message something like this:
WARN: Ignoring HA message (op=vote) from XXX: not in our
membership list
and also through the description of bug 1367. Is the problem I saw related
to these already-reported issues? If so, is there any deterministic way of
avoiding or reproducing it?
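In case it helps with reproducing or correlating this, the small filter below
is what I used to pull the membership-related messages out of both nodes'
logs and line them up by timestamp. The default log path is an assumption
(use whatever 'logfile' points to in your ha.cf).

import re
import sys

# Strings taken from the logs above and from the release-notes warning.
PATTERNS = re.compile(
    r"not in our membership"
    r"|Ignoring HA message \(op=vote\)"
    r"|ccm_event_detail"
)

def scan(path):
    with open(path, errors="replace") as log:
        for line in log:
            if PATTERNS.search(line):
                sys.stdout.write(line)

if __name__ == "__main__":
    # /var/log/ha-log is an assumption; adjust to your syslog/ha.cf setup.
    for logfile in sys.argv[1:] or ["/var/log/ha-log"]:
        scan(logfile)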
Thanks
Kisalay