Did you look into the return codes and possibly report this to Linbit? That would be a big issue.
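As a quick sketch, the return-code check discussed in this thread can be scripted. Everything below is illustrative only: the `drbdadm` function is a stub standing in for the real binary (so the script runs on any machine, mimicking the behavior Dominik describes where "up" on an already-up resource returns 1), and `r0` is a placeholder resource name. On a real node, drop the stub and use your actual resource.

```shell
#!/bin/sh
# Stub for drbdadm: NOT the real tool. It simulates the reported behavior
# that "up" on an already-up resource exits 1, while a fresh "up" and any
# "down" exit 0. Remove this function to run against real DRBD.
state=down
drbdadm() {
  case "$1" in
    up)   [ "$state" = up ] && return 1
          state=up
          return 0 ;;
    down) state=down
          return 0 ;;
  esac
}

resource=r0   # placeholder resource name

drbdadm up   "$resource"; rc1=$?; echo "first up:  rc=$rc1"
drbdadm up   "$resource"; rc2=$?; echo "second up: rc=$rc2"   # already up
drbdadm down "$resource"; rc3=$?; echo "down:      rc=$rc3"
```

If the second "up" does not return non-zero on the affected box, that would point at a drbd tools bug rather than the RA.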
Regards
Dominik

jayfitzpatr...@gmail.com wrote:
> (Big cheers and celebrations from this end!!!)
>
> Finally figured out what the problem was: it seems that the kernel oopses
> were being caused by the 8.3 version of DRBD. Once downgraded to 8.2.7,
> everything started to work as it should. Primary/secondary automatic
> failover is in place and resources are now following the DRBD master!
>
> Thanks a mill for all the help.
>
> Jason
>
> On Feb 12, 2009 8:48am, Jason Fitzpatrick <jayfitzpatr...@gmail.com> wrote:
>> Hi Dominik,
>>
>> Thanks again for the feedback.
>>
>> I had noticed some kernel oopses since the last kernel update, and they
>> seem to be pointing to DRBD. I will downgrade the kernel again and see
>> if this improves things.
>>
>> Re stonith: I uninstalled it as part of the move from Heartbeat v2.1 to
>> 2.9 but must have missed this bit.
>>
>> Userland and kernel module both report the same version.
>>
>> I am on my way into the office now and will apply the changes once there.
>>
>> Thanks again
>>
>> Jason
>>
>> 2009/2/12 Dominik Klein <d...@in-telegence.net>
>>
>>> Right, this one looks better.
>>>
>>> I'll refer to the nodes as 1001 and 1002.
>>>
>>> 1002 is your DC.
>>>
>>> You have stonith enabled, but no stonith devices. Disable stonith, or
>>> get and configure a stonith device (_please_ don't use ssh).
>>>
>>> 1002 ha-log lines 926-939: node 1002 wants to shoot 1001, but cannot
>>> (l 978). It retries in l 1018 and fails again in l 1035.
>>>
>>> Then the cluster tries to start drbd on 1001 in l 1079, followed by a
>>> bunch of kernel messages I don't understand (pretty sure _this_ is the
>>> first problem you should address!), ending with the drbd RA unable to
>>> see the Secondary state (l 1449) and considering the start failed.
>>> The RA code for this is:
>>>
>>>   if do_drbdadm up $RESOURCE ; then
>>>     drbd_get_status
>>>     if [ "$DRBD_STATE_LOCAL" != "Secondary" ]; then
>>>       ocf_log err "$RESOURCE start: not in Secondary mode after start."
>>>       return $OCF_ERR_GENERIC
>>>     fi
>>>     ocf_log debug "$RESOURCE start: succeeded."
>>>     return $OCF_SUCCESS
>>>   else
>>>     ocf_log err "$RESOURCE: Failed to start up."
>>>     return $OCF_ERR_GENERIC
>>>   fi
>>>
>>> The cluster then successfully stops drbd again (l 1508-1511) and tries
>>> to start the other clone instance (l 1523).
>>>
>>> The log says:
>>>
>>>   RA output: (Storage1:1:start:stdout) /dev/drbd0: Failure: (124)
>>>   Device is attached to a disk (use detach first) Command 'drbdsetup
>>>   /dev/drbd0 disk /dev/sdb /dev/sdb internal --set-defaults
>>>   --create-device --on-io-error=pass_on' terminated with exit code 10
>>>
>>>   Feb 11 15:39:05 lpissan1002 drbd[3473]: ERROR: Storage1 start: not in
>>>   Secondary mode after start.
>>>
>>> So this is interesting. Although "stop" (basically drbdadm down)
>>> succeeded, the drbd device is still attached!
>>>
>>> Please try:
>>>
>>>   stop the cluster
>>>   drbdadm up $resource
>>>   drbdadm up $resource   # again
>>>   echo $?
>>>   drbdadm down $resource
>>>   echo $?
>>>   cat /proc/drbd
>>>
>>> Btw: does your userland match your kernel module version?
>>>
>>> To bring this to an end: the start of the second clone instance also
>>> failed, so both instances are unrunnable on the node and no further
>>> start is tried on 1002.
>>>
>>> Interestingly, then (I could not see any attempt before), the cluster
>>> wants to start drbd on node 1001, but it also fails and also gives
>>> those kernel messages. In l 2001, each instance has a failed start on
>>> each node.
>>>
>>> So: find out about those kernel messages. I can't help much on that
>>> unfortunately, but there were some threads about things like that on
>>> drbd-user recently.
>>> Maybe you can find answers to that problem there.
>>>
>>> And also: please verify the return codes of drbdadm in your case.
>>> Maybe that's a drbd tools bug? (I can't say for sure; for me, "up" on
>>> an already-up resource gives 1, which is OK.)
>>>
>>> Regards
>>> Dominik
>>>
>>> Jason Fitzpatrick wrote:
>>>> It seems that I had the incorrect version of openais installed (from
>>>> the Fedora repo vs the HA one).
>>>>
>>>> I have corrected this, and the hb_report ran correctly using the
>>>> following:
>>>>
>>>>   hb_report -u root -f 3pm /tmp/report
>>>>
>>>> Please see attached.
>>>>
>>>> Thanks again
>>>>
>>>> Jason

_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems