(Big cheers and celebrations from this end!!!)
Finally figured out what the problem was: it seems that the kernel oopses
were being caused by the 8.3 version of DRBD. Once I downgraded to 8.2.7,
everything started to work as it should. Primary/secondary automatic
failover is in place and resources are now following the DRBD master!
Thanks a mill for all the help.
Jason
On Feb 12, 2009 8:48am, Jason Fitzpatrick <jayfitzpatr...@gmail.com> wrote:
Hi Dominik,
Thanks again for the feedback.
I have noticed some kernel oopses since the last kernel update, and
they seem to point to DRBD. I will downgrade the kernel again and see
if this improves things.
Re STONITH: I uninstalled it as part of the move from Heartbeat v2.1 to 2.9
but must have missed this bit.
Userland and kernel module both report the same version.
I am on my way into the office now and will apply the changes once there.
Thanks again,
Jason
2009/2/12 Dominik Klein <d...@in-telegence.net>
Right, this one looks better.
I'll refer to nodes as 1001 and 1002.
1002 is your DC.
You have stonith enabled, but no stonith devices. Disable stonith or get
and configure a stonith device (_please_ don't use ssh).
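For a test setup, turning stonith off might look like the sketch below. It assumes the crm shell from Pacemaker is installed; the function name disable_stonith and the CRM override variable are hypothetical and only there to keep the example self-contained.

```shell
#!/bin/sh
# Sketch only: switch off the stonith-enabled cluster property.
# Assumes Pacemaker's crm shell is available as "crm".
# CRM is a hypothetical override so the call can be dry-run with a stub.
disable_stonith() {
    "${CRM:-crm}" configure property stonith-enabled=false
}
```

Call disable_stonith once on any cluster node. A production cluster is better served by a real stonith device, as noted above.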
1002 ha-log lines 926:939, node 1002 wants to shoot 1001, but cannot (l
978). Retries in l 1018 and fails again in l 1035.
Then, the cluster tries to start drbd on 1001 in l 1079, followed by a
bunch of kernel messages I don't understand (pretty sure _this_ is the
first problem you should address!), ending up in the drbd RA not able to
see the secondary state (1449) and considering the start failed.
The RA code for this is
    if do_drbdadm up $RESOURCE ; then
        drbd_get_status
        if [ "$DRBD_STATE_LOCAL" != "Secondary" ]; then
            ocf_log err "$RESOURCE start: not in Secondary mode after start."
            return $OCF_ERR_GENERIC
        fi
        ocf_log debug "$RESOURCE start: succeeded."
        return $OCF_SUCCESS
    else
        ocf_log err "$RESOURCE: Failed to start up."
        return $OCF_ERR_GENERIC
    fi
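The same check can be reproduced by hand, outside the cluster. The sketch below is not the RA's own drbd_get_status; it assumes that "drbdadm state RESOURCE" prints the roles as "Local/Peer" (for example "Secondary/Unknown"), as the 8.2 tools do. The function names and the DRBDADM override variable are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not part of the RA): print the local DRBD role.
# Assumes "drbdadm state RESOURCE" prints "Local/Peer".
# DRBDADM is a hypothetical override so the call can be dry-run with a stub.
local_drbd_role() {
    resource=$1
    state=$("${DRBDADM:-drbdadm}" state "$resource") || return 1
    # Keep everything before the first "/" (the local side).
    printf '%s\n' "${state%%/*}"
}

# Mirror the RA's post-start check: succeed only if the local role is Secondary.
is_secondary_after_up() {
    [ "$(local_drbd_role "$1")" = "Secondary" ]
}
```

After a manual "drbdadm up", running is_secondary_after_up with the resource name and checking $? should show whether the RA's start would have passed this test.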
The cluster then successfully stops drbd again (l 1508-1511) and tries
to start the other clone instance (l 1523).
Log says
RA output: (Storage1:1:start:stdout) /dev/drbd0: Failure: (124) Device
is attached to a disk (use detach first) Command 'drbdsetup /dev/drbd0
disk /dev/sdb /dev/sdb internal --set-defaults --create-device
--on-io-error=pass_on' terminated with exit code 10
Feb 11 15:39:05 lpissan1002 drbd[3473]: ERROR: Storage1 start: not in
Secondary mode after start.
So this is interesting. Although "stop" (basically drbdadm down)
succeeded, the drbd device is still attached!
Please try:
stop the cluster
drbdadm up $resource
drbdadm up $resource #again
echo $?
drbdadm down $resource
echo $?
cat /proc/drbd
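The steps above could be wrapped in a small function so that all exit codes end up in one run. This is only a sketch: the function name drbd_updown_test and the DRBDADM and PROC_DRBD override variables are hypothetical, and unlike the list above it also prints the exit code of the first "up". Run it only with the cluster stopped.

```shell
#!/bin/sh
# Sketch of the manual test: up, up again, down, then show /proc/drbd.
# DRBDADM and PROC_DRBD are hypothetical overrides for dry runs.
drbd_updown_test() {
    resource=$1
    drbdadm_cmd=${DRBDADM:-drbdadm}

    rc=0
    "$drbdadm_cmd" up "$resource" || rc=$?
    echo "up #1 rc=$rc"

    # Second "up" on the already-up resource.
    rc=0
    "$drbdadm_cmd" up "$resource" || rc=$?
    echo "up #2 rc=$rc"

    rc=0
    "$drbdadm_cmd" down "$resource" || rc=$?
    echo "down rc=$rc"

    # Show what the kernel module reports after "down".
    cat "${PROC_DRBD:-/proc/drbd}"
}
```

Invoke it as drbd_updown_test followed by the resource name and compare the printed codes with what the log shows.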
Btw: Does your userland match your kernel module version?
To bring this to an end: The start of the second clone instance also
failed, so both instances are unrunnable on the node and no further
start is tried on 1002.
Interestingly, then (could not see any attempt before), the cluster
wants to start drbd on node 1001, but it also fails and also gives those
kernel messages. In l 2001, each instance has a failed start on each node.
So: Find out about those kernel messages. Can't help much on that
unfortunately, but there were some threads about things like that on
drbd-user recently. Maybe you can find answers to that problem there.
And also: please verify the return codes of drbdadm in your case. Maybe
that's a drbd tools bug? (Can't say for sure; for me, "up" on an already up
resource gives 1, which is ok.)
Regards
Dominik
Jason Fitzpatrick wrote:
> it seems that I had the incorrect version of openais installed (from the
> fedora repo vs the HA one)
>
> I have corrected and the hb_report ran correctly using the following
>
> hb_report -u root -f 3pm /tmp/report
>
> Please see attached
>
> Thanks again
>
> Jason
_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems