(Big Cheers and celebrations from this end!!!)

Finally figured out what the problem was: it seems the kernel oopses were being caused by DRBD 8.3. Once I downgraded to 8.2.7, everything started to work as it should. Primary/secondary automatic failover is in place, and resources are now following the DRBD master!

Thanks a mill for all the help.


On Feb 12, 2009 8:48am, Jason Fitzpatrick <jayfitzpatr...@gmail.com> wrote:
Hi Dominik

Thanks again for the feedback.

I had noticed some kernel oopses since the last kernel update, and they
seem to be pointing to DRBD. I will downgrade the kernel again and see if
this improves things.

Re STONITH: I uninstalled it as part of the move from heartbeat v2.1 to
2.9, but must have missed this bit.

Userland and kernel module both report the same version.

I am on my way into the office now and will apply the changes once there.

Thanks again


2009/2/12 Dominik Klein <d...@in-telegence.net>

Right, this one looks better.

I'll refer to nodes as 1001 and 1002.

1002 is your DC.

You have stonith enabled, but no stonith devices. Disable stonith or get
and configure a stonith device (_please_ don't use ssh).
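
For the "disable stonith" option, a minimal sketch could look like the
following. It assumes a Pacemaker-based stack where crm_attribute is
available; the run wrapper and the DRY_RUN variable are hypothetical
helpers of this sketch, not part of any cluster tool.

```shell
# Sketch: switch STONITH off cluster-wide until a real fencing device exists.
# DRY_RUN is a made-up knob for this example: when set, commands are only printed.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"        # print the command instead of running it
    else
        "$@"             # run the command for real
    fi
}

disable_stonith() {
    # crm_attribute: -t selects the CIB section, -n the attribute name, -v the value
    run crm_attribute -t crm_config -n stonith-enabled -v false
}
```

Running `DRY_RUN=1 disable_stonith` first shows the command without touching
the cluster. Leaving stonith off is only sensible as a stopgap on a test
setup.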

In the 1002 ha-log, lines 926:939, node 1002 wants to shoot 1001, but
cannot (l 978). It retries in l 1018 and fails again in l 1035.
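
To follow along with the line references, something like this prints a
line range from a log file. It is only a convenience sketch; the function
name log_range is made up here, and the file name in the usage note is an
assumption about how the log from the report is named.

```shell
# Print lines START..END (inclusive) of a log file.
# $1: file, $2: first line number, $3: last line number
log_range() {
    file=$1
    start=$2
    end=$3
    sed -n "${start},${end}p" "$file"
}
```

For example, `log_range ha-log 926 939` would show the block where 1002
decides to shoot 1001.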

Then, the cluster tries to start drbd on 1001 in l 1079, followed by a
bunch of kernel messages I don't understand (pretty sure _this_ is the
first problem you should address!), ending up in the drbd RA not being
able to see the secondary state (1449) and considering the start failed.

The RA code for this is

    if do_drbdadm up $RESOURCE ; then
        drbd_get_status
        if [ "$DRBD_STATE_LOCAL" != "Secondary" ]; then
            ocf_log err "$RESOURCE start: not in Secondary mode after start."
            return $OCF_ERR_GENERIC
        fi
        ocf_log debug "$RESOURCE start: succeeded."
        return $OCF_SUCCESS
    else
        ocf_log err "$RESOURCE: Failed to start up."
        return $OCF_ERR_GENERIC
    fi
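
To reproduce that check by hand, a rough sketch follows. This is not the
RA's own status function, just an approximation: it assumes that
"drbdadm state <resource>" prints a "local/peer" pair such as
"Secondary/Unknown" (newer DRBD releases call this "role"). The helper
names and the DRBDADM override are hypothetical.

```shell
# Split a "local/peer" state string such as "Secondary/Unknown".
local_state() { printf '%s\n' "${1%%/*}"; }   # part before the first slash
peer_state()  { printf '%s\n' "${1#*/}"; }    # part after the first slash

# Succeeds only if the resource reports Secondary as its local state.
# $1: resource name. DRBDADM may be overridden (assumption, for testing only).
check_secondary() {
    state=$("${DRBDADM:-drbdadm}" state "$1") || return 1
    [ "$(local_state "$state")" = "Secondary" ]
}
```

After a manual "drbdadm up", `check_secondary Storage1 && echo ok` gives a
quick answer to the same question the RA asks.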



The cluster then successfully stops drbd again (l 1508-1511) and tries
to start the other clone instance (l 1523).

Log says

RA output: (Storage1:1:start:stdout) /dev/drbd0: Failure: (124) Device is attached to a disk (use detach first) Command 'drbdsetup /dev/drbd0 disk /dev/sdb /dev/sdb internal --set-defaults --create-device --on-io-error=pass_on' terminated with exit code 10
Feb 11 15:39:05 lpissan1002 drbd[3473]: ERROR: Storage1 start: not in Secondary mode after start.

So this is interesting. Although "stop" (basically drbdadm down)
succeeded, the drbd device is still attached!
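
One way to check whether a device really went away after "down" is to
look at its connection state in /proc/drbd. The sketch below assumes the
usual layout where each minor has a line like " 0: cs:Connected ..." and
that a fully torn-down device shows cs:Unconfigured; the function names
are made up for this example.

```shell
# Print the connection state (the cs: field) of one DRBD minor.
# $1: minor number, $2: status file (defaults to /proc/drbd)
drbd_cs() {
    awk -v minor="$1" '
        $1 == (minor ":") {
            for (i = 2; i <= NF; i++)
                if ($i ~ /^cs:/) { sub(/^cs:/, "", $i); print $i; exit }
        }
    ' "${2:-/proc/drbd}"
}

# Succeeds if the minor is reported as Unconfigured, i.e. nothing attached.
is_unconfigured() {
    [ "$(drbd_cs "$1" "${2:-/proc/drbd}")" = "Unconfigured" ]
}
```

With the cluster stopped, `drbdadm down $resource; is_unconfigured 0 || echo "still configured"`
would show whether the stop left anything behind.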

Please try:

stop the cluster
drbdadm up $resource
drbdadm up $resource #again
echo $?
drbdadm down $resource
echo $?
cat /proc/drbd
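
The same sequence can be wrapped in a small function so the exit codes
end up next to the commands that produced them. This is only a sketch:
it assumes the cluster is already stopped, it also reports the code of
the first "up", and the DRBDADM and STATUS_FILE overrides are
hypothetical and exist only so it can be tried without touching a real
device.

```shell
# Run "up, up, down" for one resource and report each exit code,
# then dump the DRBD status. $1: resource name.
drbd_rc_check() {
    res=$1
    cmd=${DRBDADM:-drbdadm}
    for action in up up down; do
        rc=0
        "$cmd" "$action" "$res" || rc=$?    # keep going even if a step fails
        echo "drbdadm $action $res -> exit code $rc"
    done
    cat "${STATUS_FILE:-/proc/drbd}"
}
```

Usage would be `drbd_rc_check Storage1`, with the output pasted back to
the list.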

Btw: Does your userland match your kernel module version?
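
A quick way to compare the two, sketched under the assumption that the
first line of /proc/drbd looks like "version: 8.2.7 (api:88/proto:86-88)".
How you obtain the userland version depends on your distribution (package
manager query, for instance), so it is passed in as an argument here. The
function names are made up.

```shell
# Print the kernel module version from the "version:" line of /proc/drbd.
# $1: status file (defaults to /proc/drbd)
drbd_module_version() {
    awk '$1 == "version:" { print $2; exit }' "${1:-/proc/drbd}"
}

# Succeeds if the given userland version equals the module version.
# $1: userland version string, $2: optional status file
versions_match() {
    [ "$1" = "$(drbd_module_version "${2:-/proc/drbd}")" ]
}
```

For example, `versions_match 8.2.7 || echo "userland and module differ"`.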

To bring this to an end: The start of the second clone instance also
failed, so both instances are unrunnable on the node and no further
start is tried on 1002.

Interestingly, then (could not see any attempt before), the cluster
wants to start drbd on node 1001, but it also fails and also gives those
kernel messages. In l 2001, each instance has a failed start on each node.

So: Find out about those kernel messages. Can't help much on that
unfortunately, but there were some threads about things like that on
drbd-user recently. Maybe you can find answers to that problem there.

And also: please verify the return codes of drbdadm in your case. Maybe
that's a drbd tools bug? (Can't say for sure; for me, "up" on an already
up resource gives 1, which is ok.)



Jason Fitzpatrick wrote:
> It seems that I had the incorrect version of openais installed (from the
> Fedora repo vs the HA one).
>
> I have corrected this, and hb_report ran correctly using the following:
>
> hb_report -u root -f 3pm /tmp/report
>
> Please see attached.
>
> Thanks again
> Jason


Linux-HA mailing list



See also: http://linux-ha.org/ReportingProblems
