Right, this one looks better. I'll refer to nodes as 1001 and 1002.
1002 is your DC. You have stonith enabled, but no stonith devices. Disable stonith or get and configure a stonith device (_please_ don't use ssh) - see the PS at the bottom for a one-liner.

In the ha-log of 1002, lines 926-939, node 1002 wants to shoot 1001, but cannot (l 978). It retries in l 1018 and fails again in l 1035.

Then the cluster tries to start drbd on 1002 in l 1079, followed by a bunch of kernel messages I don't understand (pretty sure _this_ is the first problem you should address!), ending up with the drbd RA not being able to see the Secondary state (l 1449) and considering the start failed. The RA code for this is:

    if do_drbdadm up $RESOURCE ; then
        drbd_get_status
        if [ "$DRBD_STATE_LOCAL" != "Secondary" ]; then
            ocf_log err "$RESOURCE start: not in Secondary mode after start."
            return $OCF_ERR_GENERIC
        fi
        ocf_log debug "$RESOURCE start: succeeded."
        return $OCF_SUCCESS
    else
        ocf_log err "$RESOURCE: Failed to start up."
        return $OCF_ERR_GENERIC
    fi

The cluster then successfully stops drbd again (l 1508-1511) and tries to start the other clone instance (l 1523). The log says:

    RA output: (Storage1:1:start:stdout) /dev/drbd0: Failure: (124) Device is attached to a disk (use detach first)
    Command 'drbdsetup /dev/drbd0 disk /dev/sdb /dev/sdb internal --set-defaults --create-device --on-io-error=pass_on' terminated with exit code 10
    Feb 11 15:39:05 lpissan1002 drbd[3473]: ERROR: Storage1 start: not in Secondary mode after start.

So this is interesting: although "stop" (basically drbdadm down) succeeded, the drbd device is still attached! Please try the following (stop the cluster first):

    drbdadm up $resource
    drbdadm up $resource    # again
    echo $?
    drbdadm down $resource
    echo $?
    cat /proc/drbd

Btw: does your userland match your kernel module version? (Again, see the PS.)

To bring this to an end: the start of the second clone instance also failed, so both instances are unrunnable on that node and no further start is tried on 1002. Interestingly, the cluster then wants to start drbd on node 1001 (I could not see any attempt before that), but it also fails, with the same kernel messages. By l 2001, each instance has a failed start on each node.

So: find out about those kernel messages. Unfortunately I can't help much with those, but there were some threads about things like that on drbd-user recently; maybe you can find answers to the problem there. And also: please verify the return codes of drbdadm in your case. Maybe that's a drbd tools bug? (I can't say for sure; for me, "up" on an already-up resource gives 1, which is OK.)

Regards
Dominik

Jason Fitzpatrick wrote:
> it seems that I had the incorrect version of openais installed (from the
> fedora repo vs the HA one)
>
> I have corrected and the hb_report ran correctly using the following
>
> hb_report -u root -f 3pm /tmp/report
>
> Please see attached
>
> Thanks again
>
> Jason
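PS: a few concrete sketches for the points above. For the stonith part: if you decide to disable stonith until you have a real fencing device, something along these lines should do it, assuming you are using the crm shell (untested against your configuration):

    # turn stonith off cluster-wide until a proper fencing device is configured
    crm configure property stonith-enabled=false

Configuring a real stonith device is of course the better fix; disabling it is only reasonable while you sort out the drbd problem.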
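For the drbdadm test, here is the same sequence as a small script, so you can paste the output back to the list. Storage1 is the resource name taken from your log - adjust it if yours differs:

    #!/bin/sh
    # Run with the cluster stopped, on the node that showed the failed start.
    RES=Storage1                      # resource name as it appears in your logs

    drbdadm up "$RES";   echo "first up:  exit $?"    # expect 0
    drbdadm up "$RES";   echo "second up: exit $?"    # should be 1 on an already-up resource
    drbdadm down "$RES"; echo "down:      exit $?"    # expect 0

    # After "down" the device should be gone from here (or show Unconfigured).
    cat /proc/drbd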
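And for the userland/kernel-module question, one way to compare the two; if I remember correctly, drbdadm -V prints the userland version defines, but treat the exact output format as an assumption on my side:

    # kernel module side; the first line looks like "version: 8.3.x (api:.../proto:...)"
    head -n1 /proc/drbd
    modinfo drbd | grep '^version'

    # userland side
    drbdadm -V | grep DRBDADM_VERSION

The two versions (and especially the api level) should match; mixed packages from different repos, like the openais mixup below, are a classic source of strange kernel messages.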