Right, this one looks better.

I'll refer to the nodes as 1001 and 1002.

1002 is your DC.
You have stonith enabled, but no stonith devices. Either disable stonith
or get and configure a real stonith device (_please_ don't use ssh).
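
If you just want the fencing errors out of the way while you sort this
out, disabling it is a one-liner, assuming the crm shell:

crm configure property stonith-enabled=false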

In the ha-log of 1002, lines 926-939: node 1002 wants to shoot 1001, but
cannot (line 978). It retries in line 1018 and fails again in line 1035.

Then the cluster tries to start drbd on 1002 in line 1079, followed by a
bunch of kernel messages I don't understand (pretty sure _this_ is the
first problem you should address!), ending with the drbd RA unable to
see the Secondary state (line 1449) and considering the start failed.

The RA code for this is:

if do_drbdadm up $RESOURCE ; then
    drbd_get_status
    if [ "$DRBD_STATE_LOCAL" != "Secondary" ]; then
        ocf_log err "$RESOURCE start: not in Secondary mode after start."
        return $OCF_ERR_GENERIC
    fi
    ocf_log debug "$RESOURCE start: succeeded."
    return $OCF_SUCCESS
else
    ocf_log err "$RESOURCE: Failed to start up."
    return $OCF_ERR_GENERIC
fi
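
You can run the same check by hand; assuming DRBD 8.x tools, roughly:

drbdadm up $resource
drbdadm role $resource   # the RA expects Secondary here ('drbdadm state' on older tools)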

The cluster then successfully stops drbd again (lines 1508-1511) and
tries to start the other clone instance (line 1523).

Log says:

RA output: (Storage1:1:start:stdout) /dev/drbd0: Failure: (124) Device is attached to a disk (use detach first)
Command 'drbdsetup /dev/drbd0 disk /dev/sdb /dev/sdb internal --set-defaults --create-device --on-io-error=pass_on' terminated with exit code 10
Feb 11 15:39:05 lpissan1002 drbd[3473]: ERROR: Storage1 start: not in Secondary mode after start.

So this is interesting. Although "stop" (basically drbdadm down)
succeeded, the drbd device is still attached!
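
If you just need the device freed up, a manual detach should do it
(drbd 8.x syntax):

drbdadm detach $resource

More interesting, though, is what the tools actually return on your box.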

Please try (with the cluster stopped):

drbdadm up $resource
drbdadm up $resource    # again, on the already-up resource
echo $?
drbdadm down $resource
echo $?
cat /proc/drbd
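
If down really worked, the minor should afterwards be listed as
cs:Unconfigured in /proc/drbd (at least that's the DRBD 8.x behaviour).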

Btw: Does your userland match your kernel module version?
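
A quick way to compare the two (option spelling may differ between
releases):

cat /proc/drbd                 # first line shows the version of the loaded kernel module
modinfo drbd | grep ^version   # version of the module on disk
drbdadm --version              # userland version; if your drbdadm lacks the option, ask your package manager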

To bring this to an end: The start of the second clone instance also
failed, so both instances are unrunnable on the node and no further
start is tried on 1002.

Interestingly, only then (I could not see any attempt before) does the
cluster try to start drbd on node 1001, but that fails as well, with the
same kind of kernel messages. By line 2001, each instance has a failed
start on each node.
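
Note that even after you fix the underlying problem, the cluster won't
retry those failed starts until they are cleaned up; assuming the crm
shell (use whatever your clone is actually called):

crm resource cleanup Storage1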

So: Find out about those kernel messages. Can't help much on that
unfortunately, but there were some threads about things like that on
drbd-user recently. Maybe you can find answers to that problem there.
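
To get all of them in one place, something like:

grep -i drbd /var/log/messages   # log path varies by distro
dmesg | grep -i drbd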

And also: please verify the return codes of drbdadm in your case. Maybe
that's a drbd tools bug? (Can't say for sure; for me, up on an
already-up resource gives 1, which is OK.)

Regards
Dominik

Jason Fitzpatrick wrote:
> it seems that I had the incorrect version of openais installed (from the
> fedora repo vs the HA one)
> 
> I have corrected and the hb_report ran correctly using the following
> 
>  hb_report -u root -f 3pm /tmp/report
> 
> Please see attached
> 
> Thanks again
> 
> Jason
