Hi Dominik

Thanks again for the feedback.

I had noticed some kernel oopses since the last kernel update, and they seem
to be pointing to DRBD. I will downgrade the kernel again and see if this
improves things.

Re: stonith, I uninstalled it as part of the move from heartbeat v2.1 to 2.9,
but I must have missed this bit.

Userland and kernel module both report the same version.

I am on my way into the office now and will apply the changes once there.

Thanks again,

Jason

2009/2/12 Dominik Klein <d...@in-telegence.net>

> Right, this one looks better.
>
> I'll refer to nodes as 1001 and 1002.
>
> 1002 is your DC.
> You have stonith enabled, but no stonith devices. Disable stonith or get
> and configure a stonith device (_please_ don't use ssh).
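>
> For reference, disabling it is a single cluster property change; a sketch
> only, and the exact syntax depends on your crm tooling (crm_attribute can
> set the same property on installs without the crm shell):
>
>   crm configure property stonith-enabled=false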
>
> In 1002's ha-log, lines 926-939, node 1002 wants to shoot 1001 but cannot
> (l 978). It retries in l 1018 and fails again in l 1035.
>
> Then the cluster tries to start drbd on 1002 in l 1079, followed by a
> bunch of kernel messages I don't understand (pretty sure _this_ is the
> first problem you should address!), ending with the drbd RA unable to
> see the Secondary state (l 1449) and considering the start failed.
>
> The RA code for this is:
>
> if do_drbdadm up $RESOURCE ; then
>     drbd_get_status
>     if [ "$DRBD_STATE_LOCAL" != "Secondary" ]; then
>         ocf_log err "$RESOURCE start: not in Secondary mode after start."
>         return $OCF_ERR_GENERIC
>     fi
>     ocf_log debug "$RESOURCE start: succeeded."
>     return $OCF_SUCCESS
> else
>     ocf_log err "$RESOURCE: Failed to start up."
>     return $OCF_ERR_GENERIC
> fi
>
> The cluster then successfully stops drbd again (l 1508-1511) and tries
> to start the other clone instance (l 1523).
>
> Log says
> RA output: (Storage1:1:start:stdout) /dev/drbd0: Failure: (124) Device
> is attached to a disk (use detach first) Command 'drbdsetup /dev/drbd0
> disk /dev/sdb /dev/sdb internal --set-defaults --create-device
> --on-io-error=pass_on' terminated with exit code 10
> Feb 11 15:39:05 lpissan1002 drbd[3473]: ERROR: Storage1 start: not in
> Secondary mode after start.
>
> So this is interesting. Although "stop" (basically drbdadm down)
> succeeded, the drbd device is still attached!
>
> Please try:
> stop the cluster
> drbdadm up $resource
> drbdadm up $resource #again
> echo $?
> drbdadm down $resource
> echo $?
> cat /proc/drbd
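>
> The same test as a small script, if that is easier to run and paste back
> (a sketch only; it assumes the resource is named Storage1, adjust as
> needed):
>
>   #!/bin/sh
>   # Run with the cluster stopped; print the exit code of each drbdadm call.
>   RES=Storage1
>   drbdadm up $RES;   echo "first up:  $?"
>   drbdadm up $RES;   echo "second up: $?"   # up on an already-up resource
>   drbdadm down $RES; echo "down:      $?"
>   cat /proc/drbd                            # resource should be unconfigured again
>
> The exit codes of the second "up" and of the "down" are the interesting
> ones.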
>
> Btw: Does your userland match your kernel module version?
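>
> (If in doubt: the first line of /proc/drbd shows the kernel module
> version, and "drbdadm -V" should print the userland version, assuming
> drbd 8.x tools.)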
>
> To bring this to an end: The start of the second clone instance also
> failed, so both instances are unrunnable on the node and no further
> start is tried on 1002.
>
> Interestingly, only then (I could not see any attempt before) does the
> cluster try to start drbd on node 1001, but that also fails and also
> gives those kernel messages. By l 2001, each instance has a failed start
> on each node.
>
> So: find out about those kernel messages. I can't help much with that
> unfortunately, but there were some threads about similar issues on
> drbd-user recently. Maybe you can find answers to that problem there.
>
> And also: please verify the return codes of drbdadm in your case. Maybe
> that's a drbd tools bug? (I can't say for sure; for me, "up" on an
> already-up resource gives 1, which is ok.)
>
> Regards
> Dominik
>
> Jason Fitzpatrick wrote:
> > It seems that I had the incorrect version of openais installed (from the
> > Fedora repo vs the HA one).
> >
> > I have corrected this, and hb_report ran correctly using the following:
> >
> >  hb_report -u root -f 3pm /tmp/report
> >
> > Please see attached
> >
> > Thanks again
> >
> > Jason
>
_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
