Re: [Linux-HA] DRBD in a 2 node cluster
Hi Dominik, thanks again for the feedback. I had noticed some kernel oopses since the last kernel update, and they seem to be pointing to DRBD, so I will downgrade the kernel again and see if this improves things. Re stonith: I uninstalled it as part of the move from heartbeat v2.1 to 2.9 but must have missed this bit. Userland and kernel module all report the same version. I am on my way into the office now and I will apply the changes once there.

Thanks again, Jason

2009/2/12 Dominik Klein d...@in-telegence.net:
> Right, this one looks better. I'll refer to nodes as 1001 and 1002. 1002 is your DC. [... full analysis snipped; see Dominik's original message later in the thread ...]
Re: [Linux-HA] DRBD in a 2 node cluster
(Big cheers and celebrations from this end!!!) Finally figured out what the problem was: it seems that the kernel oopses were being caused by the 8.3 version of DRBD. Once downgraded to 8.2.7, everything started to work as it should. Primary/secondary automatic failover is in place, and resources are now following the DRBD master!

Thanks a mill for all the help. Jason

On Feb 12, 2009 8:48am, Jason Fitzpatrick jayfitzpatr...@gmail.com wrote:
> Hi Dominik, thanks again for the feedback. I had noticed some kernel oopses since the last kernel update, and they seem to be pointing to DRBD [... rest of message and quoted analysis snipped ...]
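As a point of reference for anyone following along: a quick way to confirm which DRBD module and userland versions are actually in play is sketched below. This is a minimal, generic sketch; the exact package names vary by distribution:

    # kernel module version (first line; requires the module to be loaded)
    head -n 1 /proc/drbd
    # module version without loading the module
    modinfo drbd | grep -i '^version'
    # installed userland packages (package naming varies per distro)
    rpm -qa | grep -i drbd

A mismatch between userland and module versions is a common source of odd drbdadm behaviour, which is why Dominik asked about it earlier in the thread.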
Re: [Linux-HA] DRBD in a 2 node cluster
Did you look into the return codes and eventually tell linbit about it? That would be a big issue.

Regards, Dominik

jayfitzpatr...@gmail.com wrote:
> (Big cheers and celebrations from this end!!!) Finally figured out what the problem was: it seems that the kernel oopses were being caused by the 8.3 version of DRBD. Once downgraded to 8.2.7, everything started to work as it should. [... rest of quoted thread snipped ...]
Re: [Linux-HA] DRBD in a 2 node cluster
Return codes:

    [r...@lpissan1001 ~]# service heartbeat stop
    Stopping High-Availability services: [ OK ]
    [r...@lpissan1001 ~]# drbdadm up Storage1
    /dev/drbd0: Failure: (124) Device is attached to a disk (use detach first)
    Command 'drbdsetup /dev/drbd0 disk /dev/sdb /dev/sdb internal --set-defaults --create-device --on-io-error=pass_on' terminated with exit code 10
    [r...@lpissan1001 ~]# drbdadm down Storage1
    [r...@lpissan1001 ~]# drbdadm up Storage1
    [r...@lpissan1001 ~]# drbdadm up Storage1
    /dev/drbd0: Failure: (124) Device is attached to a disk (use detach first)
    Command 'drbdsetup /dev/drbd0 disk /dev/sdb /dev/sdb internal --set-defaults --create-device --on-io-error=pass_on' terminated with exit code 10
    [r...@lpissan1001 ~]# echo $?
    1
    [r...@lpissan1001 ~]# drbdadm down Storage1
    [r...@lpissan1001 ~]# echo $?
    0
    [r...@lpissan1001 ~]# cat /proc/drbd
    version: 8.3.0 (api:88/proto:86-89)
    GIT-hash: 9ba8b93e24d842f0dd3fb1f9b90e8348ddb95829 build by r...@lpissan1001.emea.leaseplancorp.net, 2009-02-12 13:13:30
     0: cs:Unconfigured
    [r...@lpissan1001 ~]#

Jason

On Feb 12, 2009 12:24pm, Dominik Klein d...@in-telegence.net wrote:
> Did you look into the return codes and eventually tell linbit about it? That would be a big issue. [... rest of quoted thread snipped ...]
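Note what the transcript shows: the second drbdadm up prints "terminated with exit code 10", yet drbdadm itself exits with 1. A repeatable way to capture that discrepancy for a bug report is a small loop; a minimal sketch, reusing the Storage1 resource from this thread:

    # run the up/up/down sequence and record each command's exit code
    for cmd in "drbdadm up Storage1" "drbdadm up Storage1" "drbdadm down Storage1"; do
        echo "# $cmd"
        $cmd
        echo "exit code: $?"
    done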
Re: [Linux-HA] DRBD in a 2 node cluster
I have disabled the services and run:

    drbdadm secondary all
    drbdadm detach all
    drbdadm down all
    service drbd stop

before testing. As far as I can see (cat /proc/drbd on both nodes), drbd is shut down:

    cat: /proc/drbd: No such file or directory

I have taken the command that heartbeat is running (drbdsetup /dev/drbd0 disk /dev/sdb /dev/sdb internal --set-defaults --create-device --on-io-error=pass_on) and run it against the nodes when heartbeat is not in control. This command will bring the resources online, but re-running it will generate the error, so I am kind of leaning towards the command being run twice?

Thanks, Jason

2009/2/11 Dominik Klein d...@in-telegence.net:
> Hi Jason, any chance you started drbd at boot, or the drbd device was active at the time you started the cluster resource? If so, read the introduction of the howto again and correct your setup. [... rest of quoted thread snipped ...]
Re: [Linux-HA] DRBD in a 2 node cluster
Hi Jason

Jason Fitzpatrick wrote:
> I have disabled the services and run drbdadm secondary all / drbdadm detach all / drbdadm down all / service drbd stop before testing. As far as I can see (cat /proc/drbd on both nodes), drbd is shut down: cat: /proc/drbd: No such file or directory

Good.

> I have taken the command that heartbeat is running (drbdsetup /dev/drbd0 disk /dev/sdb /dev/sdb internal --set-defaults --create-device --on-io-error=pass_on)

The RA actually runs drbdadm up, which translates into this.

> and run it against the nodes when heartbeat is not in control. This command will bring the resources online, but re-running it will generate the error, so I am kind of leaning towards the command being run twice?

Never seen the cluster do that. Please post your configuration and logs. hb_report should gather everything needed and put it into a nice .bz2 archive :)

Regards, Dominik

[... rest of quoted thread snipped ...]
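On the "drbdadm up translates into this" point: the drbd 8.x userland in this thread has a dry-run flag, which makes it easy to compare what the RA triggers against what was typed by hand. A sketch:

    # print the drbdsetup commands that 'up' would expand to, without executing them
    drbdadm -d up Storage1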
Re: [Linux-HA] DRBD in a 2 node cluster
Hi Dominik

Thanks for the follow-up; please find the file attached.

Jason

2009/2/11 Dominik Klein d...@in-telegence.net:
> Never seen the cluster do that. Please post your configuration and logs. hb_report should gather everything needed and put it into a nice .bz2 archive :) [... rest of quoted thread snipped ...]
Re: [Linux-HA] DRBD in a 2 node cluster
The archive only contains info for one node, and the logfile is empty. Did you use an appropriate -f time, and does ssh work between the nodes?

So far, nothing obvious to me except for the order between your FS and DRBD lacking the role definition, but that's not what your problem is about (yet *g*).

Regards, Dominik

Jason Fitzpatrick wrote:
> Hi Dominik, thanks for the follow-up; please find the file attached. Jason
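For reference, a role-aware version of that constraint typically looks like the sketch below in the crm shell. This is a minimal sketch: FS stands in for a hypothetical filesystem primitive, and DRBD_Storage is the ms resource from this thread:

    # only run the filesystem where DRBD is Master, and only after promotion
    colocation fs_on_drbd inf: FS DRBD_Storage:Master
    order fs_after_drbd inf: DRBD_Storage:promote FS:start

Without the :Master / :promote role definitions, the cluster may try to start the filesystem against a Secondary device.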
Re: [Linux-HA] DRBD in a 2 node cluster
Hi Dominik

I have re-run with the following command line:

    hb_report -u root -f 1pm /tmp/report

but get an error:

    /usr/share/heartbeat/utillib.sh: line 288:  4790 Segmentation fault      ccm_tool -p $1/$CCMTOOL_F 2>&1

I have tried to clean up and re-run from both nodes, but no difference. The new export is attached, and I am working on the SSH trust issues now (ssh from one server to the other works fine via short name / long name, with no username/password prompt).

Thanks, Jason

2009/2/11 Dominik Klein d...@in-telegence.net:
> The archive only contains info for one node, and the logfile is empty. Did you use an appropriate -f time, and does ssh work between the nodes? [... rest of quoted thread snipped ...]
Re: [Linux-HA] DRBD in a 2 node cluster
Right, this one looks better. I'll refer to the nodes as 1001 and 1002. 1002 is your DC.

You have stonith enabled, but no stonith devices. Disable stonith, or get and configure a stonith device (_please_ don't use ssh).

1002 ha-log lines 926:939: node 1002 wants to shoot 1001, but cannot (l 978). It retries in l 1018 and fails again in l 1035. Then the cluster tries to start drbd on 1001 in l 1079, followed by a bunch of kernel messages I don't understand (pretty sure _this_ is the first problem you should address!), ending up with the drbd RA not being able to see the Secondary state (1449) and considering the start failed. The RA code for this is:

    if do_drbdadm up $RESOURCE ; then
        drbd_get_status
        if [ "$DRBD_STATE_LOCAL" != "Secondary" ]; then
            ocf_log err "$RESOURCE start: not in Secondary mode after start."
            return $OCF_ERR_GENERIC
        fi
        ocf_log debug "$RESOURCE start: succeeded."
        return $OCF_SUCCESS
    else
        ocf_log err "$RESOURCE: Failed to start up."
        return $OCF_ERR_GENERIC
    fi

The cluster then successfully stops drbd again (l 1508-1511) and tries to start the other clone instance (l 1523). The log says:

    RA output: (Storage1:1:start:stdout) /dev/drbd0: Failure: (124) Device is attached to a disk (use detach first)
    Command 'drbdsetup /dev/drbd0 disk /dev/sdb /dev/sdb internal --set-defaults --create-device --on-io-error=pass_on' terminated with exit code 10
    Feb 11 15:39:05 lpissan1002 drbd[3473]: ERROR: Storage1 start: not in Secondary mode after start.

So this is interesting: although stop (basically drbdadm down) succeeded, the drbd device is still attached! Please try:

    # stop the cluster, then:
    drbdadm up $resource
    drbdadm up $resource   # again
    echo $?
    drbdadm down $resource
    echo $?
    cat /proc/drbd

Btw: does your userland match your kernel module version?

To bring this to an end: the start of the second clone instance also failed, so both instances are unrunnable on the node and no further start is tried on 1002. Interestingly, then (I could not see any attempt before), the cluster wants to start drbd on node 1001, but it also fails and also gives those kernel messages. In l 2001, each instance has a failed start on each node.

So: find out about those kernel messages. I can't help much on that unfortunately, but there were some threads about things like that on drbd-user recently. Maybe you can find answers to that problem there. And also: please verify the return codes of drbdadm in your case. Maybe that's a drbd tools bug? (I can't say for sure; for me, up on an already-up resource gives 1, which is ok.)

Regards, Dominik

Jason Fitzpatrick wrote:
> it seems that I had the incorrect version of openais installed (from the fedora repo vs the HA one). I have corrected it, and the hb_report ran correctly using the following: hb_report -u root -f 3pm /tmp/report. Please see attached. Thanks again, Jason
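If no real fencing device is available, disabling stonith as suggested above is a one-liner in the crm shell. A sketch; note that running a production cluster without a real stonith device is exactly what Dominik advises against:

    # tell pacemaker not to attempt fencing (only sensible while no stonith device exists)
    crm configure property stonith-enabled=false

The better fix remains configuring an actual stonith device rather than turning fencing off.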
Re: [Linux-HA] DRBD in a 2 node cluster
Dominik Klein wrote:
> Right, this one looks better. I'll refer to nodes as 1001 and 1002. 1002 is your DC. [...] Then the cluster tries to start drbd on 1001 in l 1079,

s/1001/1002, sorry.
Re: [Linux-HA] DRBD in a 2 node cluster
Jason Fitzpatrick wrote:
> Hi All
>
> I am having a hell of a time trying to get heartbeat to fail over my DRBD hard disk and am hoping for some help. I have a 2 node cluster; heartbeat is working, as I am able to fail over IP addresses and services successfully, but when I try to fail over my DRBD resource from secondary to primary I am hitting a brick wall. I can fail over the DRBD resource manually, so I know that it does work at some level.
>
> DRBD version 8.3
> Heartbeat version heartbeat-2.1.3-1.fc9.i386

Please upgrade. That's too old for reliable master/slave behaviour. Preferably upgrade to pacemaker and ais, or heartbeat 2.99. Read http://www.clusterlabs.org/wiki/Install for install notes.

> and using heartbeat-gui to configure

Don't use the gui to configure complex (ie clone or master/slave) resources. Once you have upgraded to the latest pacemaker, please refer to http://www.clusterlabs.org/wiki/DRBD_HowTo_1.0 for drbd's cluster configuration.

Regards, Dominik

> The DRBD resource is called Storage1; the 2 nodes are connected via 2 x-over cables (1 heartbeat, 1 replication). I have stripped down my config to the bare bones and tried every option that I can think of, but I know that I am missing something simple. I have attached my cib.xml but have removed domain names from the systems for privacy reasons.
>
> Thanks in advance
> Jason

    <cib admin_epoch="0" have_quorum="true" ignore_dtd="false" cib_feature_revision="2.0" num_peers="2" generated="true" ccm_transition="22" dc_uuid="9d8abc28-4fa3-408a-a695-fb36b0d67a48" epoch="733" num_updates="1" cib-last-written="Mon Feb 9 18:31:19 2009">
      <configuration>
        <crm_config>
          <cluster_property_set id="cib-bootstrap-options">
            <attributes>
              <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="2.1.3-node: 552305612591183b1628baa5bc6e903e0f1e26a3"/>
              <nvpair name="last-lrm-refresh" id="cib-bootstrap-options-last-lrm-refresh" value="1234204278"/>
            </attributes>
          </cluster_property_set>
        </crm_config>
        <nodes>
          <node id="df707752-d5fb-405a-8ca7-049e25a227b7" uname="lpissan1001" type="normal">
            <instance_attributes id="nodes-df707752-d5fb-405a-8ca7-049e25a227b7">
              <attributes>
                <nvpair id="standby-df707752-d5fb-405a-8ca7-049e25a227b7" name="standby" value="off"/>
              </attributes>
            </instance_attributes>
          </node>
          <node id="9d8abc28-4fa3-408a-a695-fb36b0d67a48" uname="lpissan1002" type="normal">
            <instance_attributes id="nodes-9d8abc28-4fa3-408a-a695-fb36b0d67a48">
              <attributes>
                <nvpair id="standby-9d8abc28-4fa3-408a-a695-fb36b0d67a48" name="standby" value="off"/>
              </attributes>
            </instance_attributes>
          </node>
        </nodes>
        <resources>
          <master_slave id="Storage1">
            <meta_attributes id="Storage1_meta_attrs">
              <attributes>
                <nvpair id="Storage1_metaattr_target_role" name="target_role" value="started"/>
                <nvpair id="Storage1_metaattr_clone_max" name="clone_max" value="2"/>
                <nvpair id="Storage1_metaattr_clone_node_max" name="clone_node_max" value="1"/>
                <nvpair id="Storage1_metaattr_master_max" name="master_max" value="1"/>
                <nvpair id="Storage1_metaattr_master_node_max" name="master_node_max" value="1"/>
                <nvpair id="Storage1_metaattr_notify" name="notify" value="true"/>
                <nvpair id="Storage1_metaattr_globally_unique" name="globally_unique" value="false"/>
              </attributes>
            </meta_attributes>
            <primitive id="Storage1" class="ocf" type="drbd" provider="heartbeat">
              <instance_attributes id="Storage1_instance_attrs">
                <attributes>
                  <nvpair id="273a1bb2-4867-42dd-a9e5-7cebbf48ef3b" name="drbd_resource" value="Storage1"/>
                </attributes>
              </instance_attributes>
              <operations>
                <op id="9ddc0ce9-4090-4546-a7d5-787fe47de872" name="monitor" description="master" interval="29" timeout="10" start_delay="1m" role="Master"/>
                <op id="56a7508f-fa42-46f8-9924-3b284cdb97f0" name="monitor" description="slave" interval="29" timeout="10" start_delay="1m" role="Slave"/>
              </operations>
            </primitive>
          </master_slave>
        </resources>
        <constraints/>
      </configuration>
    </cib>
Re: [Linux-HA] DRBD in a 2 node cluster
Thanks. This was the latest version in the Fedora repos; I will upgrade and see what happens.

Jason

2009/2/10 Dominik Klein d...@in-telegence.net:
> Please upgrade. That's too old for reliable master/slave behaviour. Preferably upgrade to pacemaker and ais, or heartbeat 2.99. Read http://www.clusterlabs.org/wiki/Install for install notes. [... rest of quoted message and cib.xml snipped ...]
Re: [Linux-HA] DRBD in a 2 node cluster
Hi Dominik

I have upgraded to HB 2.9xx and have been following the instructions that you provided (thanks for those), and have added a resource as follows:

    crm
    configure
    primitive Storage1 ocf:heartbeat:drbd \
        params drbd_resource=Storage1 \
        op monitor role=Master interval=59s timeout=30s \
        op monitor role=Slave interval=60s timeout=30s
    ms DRBD_Storage Storage1 \
        meta clone-max=2 notify=true globally-unique=false target-role=stopped
    commit
    exit

No errors are reported, and the resource is visible from within the hb_gui. When I try to bring the resource online with

    crm resource start DRBD_Storage

I see the resource attempt to come online and then fail. It seems to be starting the services and changing the status of the devices to attached (from detached), but not setting any device to master. The following is from the ha-log:

    crmd[8020]: 2009/02/10_17:22:32 info: do_lrm_rsc_op: Performing key=7:166:0:b57f7f7c-4e2d-4134-9c14-b1a2b7db11a7 op=Storage1:1_start_0 )
    lrmd[8016]: 2009/02/10_17:22:32 info: rsc:Storage1:1: start
    lrmd[8016]: 2009/02/10_17:22:32 info: RA output: (Storage1:1:start:stdout) /dev/drbd0: Failure: (124) Device is attached to a disk (use detach first) Command 'drbdsetup /dev/drbd0 disk /dev/sdb /dev/sdb internal --set-defaults --create-device --on-io-error=pass_on' terminated with exit code 10
    drbd[22270]: 2009/02/10_17:22:32 ERROR: Storage1 start: not in Secondary mode after start.
    crmd[8020]: 2009/02/10_17:22:32 info: process_lrm_event: LRM operation Storage1:1_start_0 (call=189, rc=1, cib-update=380, confirmed=true) complete unknown error

I have checked the DRBD device Storage1, and it is in Secondary mode after the start; should I choose, I can make it primary on either node.

Thanks, Jason

2009/2/10 Jason Fitzpatrick jayfitzpatr...@gmail.com:
> Thanks. This was the latest version in the Fedora repos; I will upgrade and see what happens. [... rest of quoted thread snipped ...]
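When a start fails like this, a quick way to inspect the failure and retry after a fix is sketched below, assuming the standard pacemaker tooling of this era:

    # one-shot, non-interactive view of resource state and failed actions
    crm_mon -1
    # after fixing the underlying problem, clear the failed-op record so the cluster retries
    crm resource cleanup DRBD_Storage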
Re: [Linux-HA] DRBD in a 2 node cluster
Hi Jason

Any chance you started drbd at boot, or that the drbd device was active at the time you started the cluster resource? If so, read the introduction of the howto again and correct your setup.

Jason Fitzpatrick wrote:
> [crm configuration and start attempt snipped; see the previous message]
> lrmd[8016]: 2009/02/10_17:22:32 info: RA output: (Storage1:1:start:stdout) /dev/drbd0: Failure: (124) Device is attached to a disk (use detach first) Command 'drbdsetup /dev/drbd0 disk /dev/sdb /dev/sdb internal --set-defaults --create-device --on-io-error=pass_on' terminated with exit code 10

This looks like drbdadm up is failing because the device is already attached to the lower level storage device.

Regards, Dominik

> drbd[22270]: 2009/02/10_17:22:32 ERROR: Storage1 start: not in Secondary mode after start.
> crmd[8020]: 2009/02/10_17:22:32 info: process_lrm_event: LRM operation Storage1:1_start_0 (call=189, rc=1, cib-update=380, confirmed=true) complete unknown error
> I have checked the DRBD device Storage1, and it is in Secondary mode after the start; should I choose, I can make it primary on either node. [... rest of quoted thread snipped ...]
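On the boot-time question: on a sysvinit system like this one, the usual way to guarantee drbd is only ever brought up by the cluster is sketched below, assuming the stock drbd init script:

    # stop the init system from starting drbd at boot;
    # the cluster resource agent should be the only thing running drbdadm up
    chkconfig drbd off
    # before starting heartbeat, verify nothing is configured:
    # devices should show cs:Unconfigured, or /proc/drbd should not exist at all
    cat /proc/drbd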