Hi Dominik, thanks again for the feedback.
I had noticed some kernel oopses since the last kernel update, and they seem to be pointing to DRBD, so I will downgrade the kernel again and see if this improves things. Re STONITH: I uninstalled it as part of the move from heartbeat v2.1 to 2.9 but must have missed this bit. Userland and kernel module all report the same version. I am on my way into the office now and will apply the changes once there.

Thanks again

Jason

2009/2/12 Dominik Klein <d...@in-telegence.net>

> Right, this one looks better.
>
> I'll refer to the nodes as 1001 and 1002.
>
> 1002 is your DC.
> You have stonith enabled, but no stonith devices. Disable stonith or get
> and configure a stonith device (_please_ don't use ssh).
>
> 1002 ha-log lines 926:939: node 1002 wants to shoot 1001, but cannot (l
> 978). It retries in l 1018 and fails again in l 1035.
>
> Then, the cluster tries to start drbd on 1002 in l 1079, followed by a
> bunch of kernel messages I don't understand (pretty sure _this_ is the
> first problem you should address!), ending up in the drbd RA not being
> able to see the Secondary state (l 1449) and considering the start failed.
>
> The RA code for this is
>
> if do_drbdadm up $RESOURCE ; then
>     drbd_get_status
>     if [ "$DRBD_STATE_LOCAL" != "Secondary" ]; then
>         ocf_log err "$RESOURCE start: not in Secondary mode after start."
>         return $OCF_ERR_GENERIC
>     fi
>     ocf_log debug "$RESOURCE start: succeeded."
>     return $OCF_SUCCESS
> else
>     ocf_log err "$RESOURCE: Failed to start up."
>     return $OCF_ERR_GENERIC
> fi
>
> The cluster then successfully stops drbd again (l 1508-1511) and tries
> to start the other clone instance (l 1523).
>
> The log says
>
> RA output: (Storage1:1:start:stdout) /dev/drbd0: Failure: (124) Device
> is attached to a disk (use detach first) Command 'drbdsetup /dev/drbd0
> disk /dev/sdb /dev/sdb internal --set-defaults --create-device
> --on-io-error=pass_on' terminated with exit code 10
> Feb 11 15:39:05 lpissan1002 drbd[3473]: ERROR: Storage1 start: not in
> Secondary mode after start.
>
> So this is interesting. Although "stop" (basically drbdadm down)
> succeeded, the drbd device is still attached!
>
> Please try:
>
> stop the cluster
> drbdadm up $resource
> drbdadm up $resource  # again
> echo $?
> drbdadm down $resource
> echo $?
> cat /proc/drbd
>
> Btw: does your userland match your kernel module version?
>
> To bring this to an end: the start of the second clone instance also
> failed, so both instances are unrunnable on the node and no further
> start is tried on 1002.
>
> Interestingly, the cluster then (I could not see any attempt before)
> wants to start drbd on node 1001, but it also fails and also gives those
> kernel messages. By l 2001, each instance has a failed start on each node.
>
> So: find out about those kernel messages. I can't help much on that,
> unfortunately, but there were some threads about things like that on
> drbd-user recently. Maybe you can find answers to that problem there.
>
> And also: please verify the return codes of drbdadm in your case. Maybe
> that's a drbd-tools bug? (I can't say for sure; for me, up on an already
> up resource gives 1, which is ok.)
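For reference, a minimal sketch of what those suggestions look like as commands, assuming the heartbeat 2.x crm_attribute tool and the drbd 8 userland; option spellings can differ between releases, and $resource stands for the actual DRBD resource name (Storage1 in the logs above):

    # Disable stonith until a real stonith device is configured
    # (crm_attribute option names vary slightly across heartbeat/pacemaker versions)
    crm_attribute -t crm_config -n stonith-enabled -v false

    # Compare userland and kernel-module versions
    drbdadm -V            # userland tools
    cat /proc/drbd        # the "version:" line shows the kernel module

    # With the cluster stopped, check drbdadm return codes by hand
    drbdadm up $resource;   echo $?
    drbdadm up $resource;   echo $?   # "up" on an already-up resource should give 1, not 0
    drbdadm down $resource; echo $?
    cat /proc/drbd                    # the device should now be unconfigured, not still attached
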
> Regards
> Dominik
>
> Jason Fitzpatrick wrote:
> > it seems that I had the incorrect version of openais installed (from the
> > fedora repo vs the HA one)
> >
> > I have corrected it and the hb_report ran correctly using the following:
> >
> > hb_report -u root -f 3pm /tmp/report
> >
> > Please see attached
> >
> > Thanks again
> >
> > Jason

_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems