Did you look into the return codes and perhaps tell Linbit about it?

That would be a big issue.

Regards
Dominik

jayfitzpatr...@gmail.com wrote:
> (Big Cheers and celebrations from this end!!!)
> 
> Finally figured out what the problem was: it seems that the kernel oopses
> were being caused by the 8.3 version of DRBD. Once downgraded to 8.2.7,
> everything started to work as it should. Primary/secondary automatic
> failover is in place and resources are now following the DRBD master!
> 
> Thanks a mill for all the help.
> 
> Jason
> 
> On Feb 12, 2009 8:48am, Jason Fitzpatrick <jayfitzpatr...@gmail.com> wrote:
>> Hi Dominik
>>
>> thanks again for the feedback,
>>
>> I had noticed some kernel oopses since the last kernel update that I did,
>> and they seem to be pointing to DRBD. I will downgrade the kernel again and
>> see if this improves things.
>>
>>
>> Re STONITH: I uninstalled it as part of the move from heartbeat v2.1 to 2.9
>> but must have missed this bit.
>>
>> Userland and kernel module both report the same version.
>>
>> I am on my way into the office now and will apply the changes once there.
>>
>>
>> thanks again
>>
>> Jason
>>
>> 2009/2/12 Dominik Klein <d...@in-telegence.net>
>>
>> Right, this one looks better.
>>
>> I'll refer to nodes as 1001 and 1002.
>>
>> 1002 is your DC.
>>
>> You have stonith enabled, but no stonith devices. Disable stonith, or get
>> and configure a stonith device (_please_ don't use ssh).
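>>
>> If you just want stonith out of the way while you debug the drbd problem,
>> one way (from memory, so please double-check the option names on your
>> version) is to set the cluster property directly:
>>
>>   # disable stonith in the CRM configuration
>>   crm_attribute -t crm_config -n stonith-enabled -v false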
>>
>> 1002 ha-log lines 926:939, node 1002 wants to shoot 1001, but cannot (l
>> 978). Retries in l 1018 and fails again in l 1035.
>>
>> Then, the cluster tries to start drbd on 1001 in l 1079, followed by a
>> bunch of kernel messages I don't understand (pretty sure _this_ is the
>> first problem you should address!), ending up in the drbd RA not able to
>> see the secondary state (1449) and considering the start failed.
>>
>> The RA code for this is:
>>
>>   if do_drbdadm up $RESOURCE ; then
>>     drbd_get_status
>>     if [ "$DRBD_STATE_LOCAL" != "Secondary" ]; then
>>       ocf_log err "$RESOURCE start: not in Secondary mode after start."
>>       return $OCF_ERR_GENERIC
>>     fi
>>     ocf_log debug "$RESOURCE start: succeeded."
>>     return $OCF_SUCCESS
>>   else
>>     ocf_log err "$RESOURCE: Failed to start up."
>>     return $OCF_ERR_GENERIC
>>   fi
>>
>> The cluster then successfully stops drbd again (l 1508-1511) and tries
>> to start the other clone instance (l 1523).
>>
>> Log says:
>>
>>   RA output: (Storage1:1:start:stdout) /dev/drbd0: Failure: (124) Device
>>   is attached to a disk (use detach first) Command 'drbdsetup /dev/drbd0
>>   disk /dev/sdb /dev/sdb internal --set-defaults --create-device
>>   --on-io-error=pass_on' terminated with exit code 10
>>
>>   Feb 11 15:39:05 lpissan1002 drbd[3473]: ERROR: Storage1 start: not in
>>   Secondary mode after start.
>>
>> So this is interesting. Although "stop" (basically drbdadm down)
>> succeeded, the drbd device is still attached!
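>>
>> You can check that by hand outside the cluster (just an illustration, the
>> resource name below is an example): after a clean "down", the minor should
>> show up as Unconfigured in /proc/drbd; if a ds: state is still shown, the
>> lower-level device is still attached.
>>
>>   drbdadm down r0      # r0 = your resource name
>>   cat /proc/drbd       # expect "cs:Unconfigured" for that minor after a clean down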
>>
>> Please try:
>>
>>   stop the cluster
>>   drbdadm up $resource
>>   drbdadm up $resource   # again
>>   echo $?
>>   drbdadm down $resource
>>   echo $?
>>   cat /proc/drbd
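>>
>> Or, as one copy-pasteable snippet that records the return codes (just a
>> sketch; r0 is an example, use your real resource name):
>>
>>   res=r0
>>   drbdadm up   $res; echo "up   (1st) -> $?"
>>   drbdadm up   $res; echo "up   (2nd) -> $?"
>>   drbdadm down $res; echo "down       -> $?"
>>   cat /proc/drbd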
>>
>> Btw: Does your userland match your kernel module version?
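>>
>> If I remember the flag right, drbdadm -V prints the userland version, and
>> the first line of /proc/drbd comes from the kernel module, so comparing the
>> two is quick:
>>
>>   drbdadm -V            # userland tools
>>   head -1 /proc/drbd    # "version:" line reported by the kernel module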
>>
>> To bring this to an end: the start of the second clone instance also
>> failed, so both instances are unrunnable on the node and no further
>> start is tried on 1002.
>>
>> Interestingly, then (could not see any attempt before), the cluster
>> wants to start drbd on node 1001, but it also fails and also gives those
>> kernel messages. In l 2001, each instance has a failed start on each
>> node.
>>
>> So: Find out about those kernel messages. Can't help much on that
>> unfortunately, but there were some threads about things like that on
>> drbd-user recently. Maybe you can find answers to that problem there.
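>>
>> When you post there, it helps to include the oops itself; plain
>> dmesg/syslog output is enough (the log path and file names here are just
>> examples, adjust for your distro):
>>
>>   dmesg | grep -i -A 20 drbd     > /tmp/drbd-dmesg.txt
>>   grep -i drbd /var/log/messages > /tmp/drbd-syslog.txt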
>>
>> And also: please verify the return codes of drbdadm in your case. Maybe
>> that's a drbd tools bug? (Can't say for sure; for me, "up" on an already-up
>> resource gives 1, which is ok.)
>>
>> Regards
>> Dominik
>>
>> Jason Fitzpatrick wrote:
>>
>> > it seems that I had the incorrect version of openais installed (from the
>> > fedora repo vs the HA one)
>> >
>> > I have corrected it and the hb_report ran correctly using the following:
>> >
>> >   hb_report -u root -f 3pm /tmp/report
>> >
>> > Please see attached.
>> >
>> > Thanks again
>> >
>> > Jason

_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
