Re: [Linux-HA] DRBD in a 2 node cluster

2009-02-12 Thread Jason Fitzpatrick
Hi Dominik

thanks again for the feedback,

I had noticed some kernel oopses since the last kernel update, and they seem
to be pointing to DRBD. I will downgrade the kernel again and see if this
improves things.

Re STONITH: I uninstalled it as part of the move from heartbeat v2.1 to 2.9,
but must have missed this bit.

Userland and kernel module both report the same version.

I am on my way into the office now and I will apply the changes once there

thanks again

Jason

2009/2/12 Dominik Klein d...@in-telegence.net

 Right, this one looks better.

 I'll refer to nodes as 1001 and 1002.

 1002 is your DC.
 You have stonith enabled, but no stonith devices. Disable stonith, or get
 and configure a stonith device (_please_ don't use ssh).

 1002 ha-log lines 926:939, node 1002 wants to shoot 1001, but cannot (l
 978). Retries in l 1018 and fails again in l 1035.

 Then, the cluster tries to start drbd on 1001 in l 1079, followed by a
 bunch of kernel messages I don't understand (pretty sure _this_ is the
 first problem you should address!), ending up in the drbd RA not able to
 see the secondary state (1449) and considering the start failed.

 The RA code for this is
 if do_drbdadm up $RESOURCE ; then
   drbd_get_status
   if [ "$DRBD_STATE_LOCAL" != "Secondary" ]; then
     ocf_log err "$RESOURCE start: not in Secondary mode after start."
     return $OCF_ERR_GENERIC
   fi
   ocf_log debug "$RESOURCE start: succeeded."
   return $OCF_SUCCESS
 else
   ocf_log err "$RESOURCE: Failed to start up."
   return $OCF_ERR_GENERIC
 fi

 The cluster then successfully stops drbd again (l 1508-1511) and tries
 to start the other clone instance (l 1523).

 Log says
 RA output: (Storage1:1:start:stdout) /dev/drbd0: Failure: (124) Device
 is attached to a disk (use detach first) Command 'drbdsetup /dev/drbd0
 disk /dev/sdb /dev/sdb internal --set-defaults --create-device
 --on-io-error=pass_on' terminated with exit code 10
 Feb 11 15:39:05 lpissan1002 drbd[3473]: ERROR: Storage1 start: not in
 Secondary mode after start.

 So this is interesting. Although stop (basically drbdadm down)
 succeeded, the drbd device is still attached!

 Please try:
 stop the cluster
 drbdadm up $resource
 drbdadm up $resource #again
 echo $?
 drbdadm down $resource
 echo $?
 cat /proc/drbd

 Btw: Does your userland match your kernel module version?

 To bring this to an end: The start of the second clone instance also
 failed, so both instances are unrunnable on the node and no further
 start is tried on 1002.

 Interestingly, then (could not see any attempt before), the cluster
 wants to start drbd on node 1001, but it also fails and also gives those
 kernel messages. In l 2001, each instance has a failed start on each node.

 So: Find out about those kernel messages. Can't help much on that
 unfortunately, but there were some threads about things like that on
 drbd-user recently. Maybe you can find answers to that problem there.

 And also: please verify the return codes of drbdadm in your case. Maybe
 that's a drbd tools bug? (Can't say for sure; for me, up on an already-up
 resource gives 1, which is ok).

 Regards
 Dominik

 Jason Fitzpatrick wrote:
  it seems that I had the incorrect version of openais installed (from the
  fedora repo vs the HA one)
 
  I have corrected and the hb_report ran correctly using the following
 
   hb_report -u root -f 3pm /tmp/report
 
  Please see attached
 
  Thanks again
 
  Jason


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: Re: [Linux-HA] DRBD in a 2 node cluster

2009-02-12 Thread jayfitzpatrick

(Big Cheers and celebrations from this end!!!)

Finally figured out what the problem was: it seems that the kernel oopses
were being caused by the 8.3 version of DRBD. Once downgraded to 8.2.7,
everything started to work as it should. Primary/secondary automatic failover
is in place and resources are now following the DRBD master!


Thanks a mill for all the help.

Jason


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] DRBD in a 2 node cluster

2009-02-12 Thread Dominik Klein
Did you look into the return codes and eventually tell Linbit about it?

That would be a big issue.

Regards
Dominik

jayfitzpatr...@gmail.com wrote:
 (Big Cheers and celebrations from this end!!!)
 
 Finally figured out what the problem was, it seems that the kernel oops
 were being caused by the 8.3 version of DRBD, once downgraded to 8.2.7
 everything started to work as it should. primary / secondary automatic
 fail over is in place and resources are now following the DRBD master!
 
 Thanks a mill for all the help.
 
 Jason
 

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: Re: [Linux-HA] DRBD in a 2 node cluster

2009-02-12 Thread jayfitzpatrick

Return codes:

[r...@lpissan1001 ~]# service heartbeat stop
Stopping High-Availability services:
[ OK ]
[r...@lpissan1001 ~]# drbdadm up Storage1
/dev/drbd0: Failure: (124) Device is attached to a disk (use detach first)
Command 'drbdsetup /dev/drbd0 disk /dev/sdb /dev/sdb internal  
--set-defaults --create-device --on-io-error=pass_on' terminated with exit  
code 10

[r...@lpissan1001 ~]# drbdadm down Storage1
[r...@lpissan1001 ~]# drbdadm up Storage1
[r...@lpissan1001 ~]# drbdadm up Storage1
/dev/drbd0: Failure: (124) Device is attached to a disk (use detach first)
Command 'drbdsetup /dev/drbd0 disk /dev/sdb /dev/sdb internal  
--set-defaults --create-device --on-io-error=pass_on' terminated with exit  
code 10

[r...@lpissan1001 ~]# echo $?
1
[r...@lpissan1001 ~]# drbdadm down Storage1
[r...@lpissan1001 ~]# echo $?
0
[r...@lpissan1001 ~]# cat /proc/drbd
version: 8.3.0 (api:88/proto:86-89)
GIT-hash: 9ba8b93e24d842f0dd3fb1f9b90e8348ddb95829 build by  
r...@lpissan1001.emea.leaseplancorp.net, 2009-02-12 13:13:30

0: cs:Unconfigured
[r...@lpissan1001 ~]#


Jason


On Feb 12, 2009 12:24pm, Dominik Klein d...@in-telegence.net wrote:

 Did you look into the return codes and eventually tell Linbit about it?
 That would be a big issue.

 Regards
 Dominik


Re: [Linux-HA] DRBD in a 2 node cluster

2009-02-11 Thread Jason Fitzpatrick
I have disabled the services and run
drbdadm secondary all
drbdadm detach all
drbdadm down all
service drbd stop

before testing. As far as I can see (cat /proc/drbd on both nodes), drbd is
shut down:

cat: /proc/drbd: No such file or directory
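
As an extra check that nothing is left over, something like this confirms the
module itself is unloaded as well (a sketch):

lsmod | grep -w drbd || echo "drbd module not loaded"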


I have taken the command that heartbeat is running ('drbdsetup /dev/drbd0
disk /dev/sdb /dev/sdb internal --set-defaults --create-device
--on-io-error=pass_on') and run it against the nodes when heartbeat is not
in control, and this command will bring the resources online, but re-running
it will generate the error, so I am kind of leaning towards the command being
run twice?

Thanks

Jason

2009/2/11 Dominik Klein d...@in-telegence.net

 Hi Jason

 any chance you started drbd at boot or the drbd device was active at the
 time you started the cluster resource? If so, read the introduction of
 the howto again and correct your setup.


Re: [Linux-HA] DRBD in a 2 node cluster

2009-02-11 Thread Dominik Klein
Hi Jason

Jason Fitzpatrick wrote:
 I have disabled the services and run
 drbdadm secondary all
 drbdadm detach all
 drbdadm down all
 service drbd stop
 
 before testing as far as I can see (cat /proc/drbd on both nodes) drbd is
 shutdown
 
 cat: /proc/drbd: No such file or directory

Good.

 I have taken the command that heartbeat is running (drbdsetup /dev/drbd0
 disk /dev/sdb /dev/sdb internal --set-defaults --create-device
 --on-io-error=pass_on') 

The RA actually runs drbdadm up, which translates into this.
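
If you want to see that expansion on your own box, drbdadm has a dry-run
flag that only prints the low-level commands it would execute; a quick
sketch for this thread's resource:

# -d / --dry-run: print the drbdsetup calls "drbdadm up" would run, without executing them
drbdadm -d up Storage1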

 and run it against the nodes when heartbeat is not
 in control and this command will bring the resources online, but re-running
 this command will generate the error, so I am kind of leaning twords the
 command being run twice?

Never seen the cluster do that.

Please post your configuration and logs. hb_report should gather
everything needed and put it into a nice .bz2 archive :)

Regards
Dominik


Re: [Linux-HA] DRBD in a 2 node cluster

2009-02-11 Thread Jason Fitzpatrick
Hi Dominik

Thanks for the follow up, please find the file attached

Jason


2009/2/11 Dominik Klein d...@in-telegence.net

 Hi Jason

 Jason Fitzpatrick wrote:
  I have disabled the services and run
  drbdadm secondary all
  drbdadm detach all
  drbdadm down all
  service drbd stop
 
  before testing as far as I can see (cat /proc/drbd on both nodes) drbd is
  shutdown
 
  cat: /proc/drbd: No such file or directory

 Good.

  I have taken the command that heartbeat is running (drbdsetup /dev/drbd0
  disk /dev/sdb /dev/sdb internal --set-defaults --create-device
  --on-io-error=pass_on')

 The RA actually runs drbdadm up, which translates into this.

  and run it against the nodes when heartbeat is not
  in control and this command will bring the resources online, but
 re-running
  this command will generate the error, so I am kind of leaning twords the
  command being run twice?

 Never seen the cluster do that.

 Please post your configuration and logs. hb_report should gather
 everything needed and put it into a nice .bz2 archive :)

 Regards
 Dominik


Re: [Linux-HA] DRBD in a 2 node cluster

2009-02-11 Thread Dominik Klein
The archive only contains info for one node and the logfile is empty.
Did you use an appropriate -f time, and does ssh work between the nodes?

So far, nothing obvious to me except for the order between your FS and
DRBD lacking the role definition, but that's not what your problem is
about (yet *g*).
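
For reference, the role part would look roughly like this in crm syntax; a
sketch only, where FS1 is a placeholder for whatever the Filesystem resource
is actually called:

# placeholders: FS1 = the filesystem primitive, DRBD_Storage = the ms resource from this thread
colocation fs_with_drbd_master inf: FS1 DRBD_Storage:Master
order fs_after_drbd_promote inf: DRBD_Storage:promote FS1:start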

Regards
Dominik

Jason Fitzpatrick wrote:
 Hi Dominik
 
 Thanks for the follow up, please find the file attached
 
 Jason
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] DRBD in a 2 node cluster

2009-02-11 Thread Jason Fitzpatrick
Hi Dominik

I have re-run with the following command line

hb_report -u root -f 1pm /tmp/report

but get an error

/usr/share/heartbeat/utillib.sh: line 288:  4790 Segmentation fault
ccm_tool -p > $1/$CCMTOOL_F 2>&1

I have tried to clean up and re-run from both nodes but no difference.

the new export is attached and I am working on the SSH trust issues now (ssh
from one server to the other works fine via short name / long name, with no
username / password prompt)

Thanks

Jason

2009/2/11 Dominik Klein d...@in-telegence.net

 The archive only contains info for one node and the logfile is empty.
 Did you use appropriate -f time and does ssh work between the nodes?

 So far, nothing obvious to me except for the order between your FS and
 DRBD lacking the role definition, but that's not what your problem is
 about (yet *g*).

 Regards
 Dominik

 Jason Fitzpatrick wrote:
  Hi Dominik
 
  Thanks for the follow up, please find the file attached
 
  Jason

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] DRBD in a 2 node cluster

2009-02-11 Thread Dominik Klein
Right, this one looks better.

I'll refer to nodes as 1001 and 1002.

1002 is your DC.
You have stonith enabled, but no stonith devices. Disable stonith, or get
and configure a stonith device (_please_ don't use ssh).

1002 ha-log lines 926:939, node 1002 wants to shoot 1001, but cannot (l
978). Retries in l 1018 and fails again in l 1035.

Then, the cluster tries to start drbd on 1001 in l 1079, followed by a
bunch of kernel messages I don't understand (pretty sure _this_ is the
first problem you should address!), ending up in the drbd RA not able to
see the secondary state (1449) and considering the start failed.

The RA code for this is
if do_drbdadm up $RESOURCE ; then
  drbd_get_status
  if [ "$DRBD_STATE_LOCAL" != "Secondary" ]; then
    ocf_log err "$RESOURCE start: not in Secondary mode after start."
    return $OCF_ERR_GENERIC
  fi
  ocf_log debug "$RESOURCE start: succeeded."
  return $OCF_SUCCESS
else
  ocf_log err "$RESOURCE: Failed to start up."
  return $OCF_ERR_GENERIC
fi

The cluster then successfully stops drbd again (l 1508-1511) and tries
to start the other clone instance (l 1523).

Log says
RA output: (Storage1:1:start:stdout) /dev/drbd0: Failure: (124) Device
is attached to a disk (use detach first) Command 'drbdsetup /dev/drbd0
disk /dev/sdb /dev/sdb internal --set-defaults --create-device
--on-io-error=pass_on' terminated with exit code 10
Feb 11 15:39:05 lpissan1002 drbd[3473]: ERROR: Storage1 start: not in
Secondary mode after start.

So this is interesting. Although stop (basically drbdadm down)
succeeded, the drbd device is still attached!

Please try:
stop the cluster
drbdadm up $resource
drbdadm up $resource #again
echo $?
drbdadm down $resource
echo $?
cat /proc/drbd
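
For convenience, the same test can be run as a tiny script that prints each
return code explicitly. A rough sketch, using this thread's resource name
Storage1 (adjust to your setup):

#!/bin/sh
# sketch: run the sequence above and show each drbdadm return code
RES=Storage1
service heartbeat stop                 # stop the cluster first
drbdadm up "$RES";   echo "first  up: rc=$?"
drbdadm up "$RES";   echo "second up: rc=$?"   # the second up should fail; note the code
drbdadm down "$RES"; echo "down:      rc=$?"
cat /proc/drbd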

Btw: Does your userland match your kernel module version?

To bring this to an end: The start of the second clone instance also
failed, so both instances are unrunnable on the node and no further
start is tried on 1002.

Interestingly, then (could not see any attempt before), the cluster
wants to start drbd on node 1001, but it also fails and also gives those
kernel messages. In l 2001, each instance has a failed start on each node.

So: Find out about those kernel messages. Can't help much on that
unfortunately, but there were some threads about things like that on
drbd-user recently. Maybe you can find answers to that problem there.

And also: please verify the return codes of drbdadm in your case. Maybe
that's a drbd tools bug? (Can't say for sure; for me, up on an already-up
resource gives 1, which is ok).

Regards
Dominik

Jason Fitzpatrick wrote:
 it seems that I had the incorrect version of openais installed (from the
 fedora repo vs the HA one)
 
 I have corrected and the hb_report ran correctly using the following
 
  hb_report -u root -f 3pm /tmp/report
 
 Please see attached
 
 Thanks again
 
 Jason

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] DRBD in a 2 node cluster

2009-02-11 Thread Dominik Klein
Dominik Klein wrote:
 Right, this one looks better.
 
 I'll refer to nodes as 1001 and 1002.
 
 1002 is your DC.
 You have stonith enabled, but no stonith devices. Disable stonith, or get
 and configure a stonith device (_please_ don't use ssh).
 
 1002 ha-log lines 926:939, node 1002 wants to shoot 1001, but cannot (l
 978). Retries in l 1018 and fails again in l 1035.
 
 Then, the cluster tries to start drbd on 1001 in l 1079, 

s/1001/1002

sorry
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] DRBD in a 2 node cluster

2009-02-10 Thread Dominik Klein
Jason Fitzpatrick wrote:
 Hi All

 I am having a hell of a time trying to get heartbeat to fail over my DRBD
 harddisk and am hoping for some help.

 I have a 2 node cluster, heartbeat is working as I am able to fail over IP
 Addresses and services successfully, but when I try to fail over my DRBD
 resource from secondary to primary I am hitting a brick wall, I can fail
 over the DRBD resource manually so I know that it does work at some level

 DRBD version 8.3 Heartbeat version heartbeat-2.1.3-1.fc9.i386 

Please upgrade. That's too old for reliable master/slave behaviour.
Preferably upgrade to pacemaker and ais, or heartbeat 2.99. Read
http://www.clusterlabs.org/wiki/Install for install notes.

 and using
 heartbeat-gui to configure

Don't use the gui to configure complex (i.e. clone or master/slave) resources.

Once you upgraded to the latest pacemaker, please refer to
http://www.clusterlabs.org/wiki/DRBD_HowTo_1.0 for drbd's cluster
configuration.
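
For readers following the archive: the configuration that howto arrives at is
roughly of this shape. A sketch only, reusing this thread's resource name and
illustrative interval/timeout values, not a substitute for the howto itself:

primitive drbd_Storage1 ocf:heartbeat:drbd \
        params drbd_resource="Storage1" \
        op monitor role="Master" interval="59s" timeout="30s" \
        op monitor role="Slave" interval="60s" timeout="30s"
ms ms_Storage1 drbd_Storage1 \
        meta master-max="1" master-node-max="1" clone-max="2" \
        clone-node-max="1" notify="true" globally-unique="false"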

Regards
Dominik

 DRBD Resource is called Storage1, the 2 nodes are connected via 2 x-over
 cables (1 heartbeat 1 Replication)

 I have stripped down my config to the bare bones and tried every option
 that I can think of, but know that I am missing something simple,

 I have attached my cib.xml but have removed domain names from the systems
 for privacy reasons

 Thanks in advance

 Jason

 <cib admin_epoch="0" have_quorum="true" ignore_dtd="false"
      cib_feature_revision="2.0" num_peers="2" generated="true"
      ccm_transition="22" dc_uuid="9d8abc28-4fa3-408a-a695-fb36b0d67a48"
      epoch="733" num_updates="1" cib-last-written="Mon Feb  9 18:31:19 2009">
   <configuration>
     <crm_config>
       <cluster_property_set id="cib-bootstrap-options">
         <attributes>
           <nvpair id="cib-bootstrap-options-dc-version" name="dc-version"
                   value="2.1.3-node: 552305612591183b1628baa5bc6e903e0f1e26a3"/>
           <nvpair name="last-lrm-refresh"
                   id="cib-bootstrap-options-last-lrm-refresh" value="1234204278"/>
         </attributes>
       </cluster_property_set>
     </crm_config>
     <nodes>
       <node id="df707752-d5fb-405a-8ca7-049e25a227b7" uname="lpissan1001" type="normal">
         <instance_attributes id="nodes-df707752-d5fb-405a-8ca7-049e25a227b7">
           <attributes>
             <nvpair id="standby-df707752-d5fb-405a-8ca7-049e25a227b7" name="standby" value="off"/>
           </attributes>
         </instance_attributes>
       </node>
       <node id="9d8abc28-4fa3-408a-a695-fb36b0d67a48" uname="lpissan1002" type="normal">
         <instance_attributes id="nodes-9d8abc28-4fa3-408a-a695-fb36b0d67a48">
           <attributes>
             <nvpair id="standby-9d8abc28-4fa3-408a-a695-fb36b0d67a48" name="standby" value="off"/>
           </attributes>
         </instance_attributes>
       </node>
     </nodes>
     <resources>
       <master_slave id="Storage1">
         <meta_attributes id="Storage1_meta_attrs">
           <attributes>
             <nvpair id="Storage1_metaattr_target_role" name="target_role" value="started"/>
             <nvpair id="Storage1_metaattr_clone_max" name="clone_max" value="2"/>
             <nvpair id="Storage1_metaattr_clone_node_max" name="clone_node_max" value="1"/>
             <nvpair id="Storage1_metaattr_master_max" name="master_max" value="1"/>
             <nvpair id="Storage1_metaattr_master_node_max" name="master_node_max" value="1"/>
             <nvpair id="Storage1_metaattr_notify" name="notify" value="true"/>
             <nvpair id="Storage1_metaattr_globally_unique" name="globally_unique" value="false"/>
           </attributes>
         </meta_attributes>
         <primitive id="Storage1" class="ocf" type="drbd" provider="heartbeat">
           <instance_attributes id="Storage1_instance_attrs">
             <attributes>
               <nvpair id="273a1bb2-4867-42dd-a9e5-7cebbf48ef3b" name="drbd_resource" value="Storage1"/>
             </attributes>
           </instance_attributes>
           <operations>
             <op id="9ddc0ce9-4090-4546-a7d5-787fe47de872" name="monitor" description="master"
                 interval="29" timeout="10" start_delay="1m" role="Master"/>
             <op id="56a7508f-fa42-46f8-9924-3b284cdb97f0" name="monitor" description="slave"
                 interval="29" timeout="10" start_delay="1m" role="Slave"/>
           </operations>
         </primitive>
       </master_slave>
     </resources>
     <constraints/>
   </configuration>
 </cib>



___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] DRBD in a 2 node cluster

2009-02-10 Thread Jason Fitzpatrick
Thanks,

This was the latest version in the Fedora Repos, I will upgrade and see what
happens

Jason


Re: [Linux-HA] DRBD in a 2 node cluster

2009-02-10 Thread Jason Fitzpatrick
Hi Dominik

I have upgraded to HB 2.9xx and have been following the instructions that
you provided (thanks for those) and have added a resource as follows

crm
configure
primitive Storage1 ocf:heartbeat:drbd \
params drbd_resource=Storage1 \
op monitor role=Master interval=59s timeout=30s \
op monitor role=Slave interval=60s timeout=30s
ms DRBD_Storage Storage1 \
meta clone-max=2 notify=true globally-unique=false target-role=stopped
commit
exit

no errors are reported and the resource is visible from within the hb_gui

when I try to bring the resource online with

crm resource start DRBD_Storage

I see the resource attempt to come online and then fail. It seems to be
starting the services and changing the status of the devices to attached (from
detached), but not setting any device to master.

the following is from the ha-log

crmd[8020]: 2009/02/10_17:22:32 info: do_lrm_rsc_op: Performing
key=7:166:0:b57f7f7c-4e2d-4134-9c14-b1a2b7db11a7 op=Storage1:1_start_0 )
lrmd[8016]: 2009/02/10_17:22:32 info: rsc:Storage1:1: start
lrmd[8016]: 2009/02/10_17:22:32 info: RA output: (Storage1:1:start:stdout)
/dev/drbd0: Failure: (124) Device is attached to a disk (use detach first)
Command
 'drbdsetup /dev/drbd0 disk /dev/sdb /dev/sdb internal --set-defaults
--create-device --on-io-error=pass_on' terminated with exit code 10

drbd[22270]:2009/02/10_17:22:32 ERROR: Storage1 start: not in Secondary
mode after start.
crmd[8020]: 2009/02/10_17:22:32 info: process_lrm_event: LRM operation
Storage1:1_start_0 (call=189, rc=1, cib-update=380, confirmed=true) complete
unknown error.

I have checked the DRBD device Storage1 and it is in secondary mode after
the start, and should I choose I can make it primary on either node

Thanks

Jason


Re: [Linux-HA] DRBD in a 2 node cluster

2009-02-10 Thread Dominik Klein
Hi Jason

Any chance you started drbd at boot, or that the drbd device was active at the
time you started the cluster resource? If so, read the introduction of
the howto again and correct your setup.
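
A quick way to check and correct that on a Fedora/Red Hat style init setup;
a sketch, not specific to your boxes:

chkconfig --list drbd    # the drbd init script should be off in all runlevels
chkconfig drbd off       # disable starting drbd at boot; let the cluster manage it
service drbd status      # confirm it is not currently running outside the cluster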

Jason Fitzpatrick wrote:
 Hi Dominik
 
 I have upgraded to HB 2.9xx and have been following the instructions that
 you provided (thanks for those) and have added a resource as follows
 
 crm
 configure
 primitive Storage1 ocf:heartbeat:drbd \
 params drbd_resource=Storage1 \
 op monitor role=Master interval=59s timeout=30s \
 op monitor role=Slave interval=60s timeout=30s
 ms DRBD_Storage Storage1 \
 meta clone-max=2 notify=true globally-unique=false target-role=stopped
 commit
 exit
 
 no errors are reported and the resource is visable from within the hb_gui
 
 when I try to bring the resource online with
 
 crm resource start DRBD_Storage
 
 I see the resource attempt to come online and then fail, it seems to be
 starting the services, changing the status of the devices to attached (from
 detached) but not setting any device to master
 
 the following is from the ha-log
 
 crmd[8020]: 2009/02/10_17:22:32 info: do_lrm_rsc_op: Performing
 key=7:166:0:b57f7f7c-4e2d-4134-9c14-b1a2b7db11a7 op=Storage1:1_start_0 )
 lrmd[8016]: 2009/02/10_17:22:32 info: rsc:Storage1:1: start
 lrmd[8016]: 2009/02/10_17:22:32 info: RA output: (Storage1:1:start:stdout)
 /dev/drbd0: Failure: (124) Device is attached to a disk (use detach first)
 Command
  'drbdsetup /dev/drbd0 disk /dev/sdb /dev/sdb internal --set-defaults
 --create-device --on-io-error=pass_on' terminated with exit code 10

This looks like drbdadm up is failing because the device is already
attached to the lower level storage device.
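
A quick way to see that state directly, before the cluster touches the
resource at all (a sketch):

# if this shows anything other than cs:Unconfigured (or /proc/drbd is missing
# entirely), the device is still set up from a previous run and a fresh
# "drbdadm up" will complain
cat /proc/drbd 2>/dev/null || echo "drbd not loaded"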

Regards
Dominik

 drbd[22270]:2009/02/10_17:22:32 ERROR: Storage1 start: not in Secondary
 mode after start.
 crmd[8020]: 2009/02/10_17:22:32 info: process_lrm_event: LRM operation
 Storage1:1_start_0 (call=189, rc=1, cib-update=380, confirmed=true) complete
 unknown error.
 
 I have checked the DRBD device Storage1 and it is in secondary mode after
 the start, and should I choose I can make it primary on either node
 
 Thanks
 
 Jason
 
 2009/2/10 Jason Fitzpatrick jayfitzpatr...@gmail.com
 
 Thanks,

 This was the latest version in the Fedora Repos, I will upgrade and see
 what happens

 Jason

 2009/2/10 Dominik Klein d...@in-telegence.net

 Jason Fitzpatrick wrote:
 Hi All

 I am having a hell of a time trying to get heartbeat to fail over my
 DRBD
 harddisk and am hoping for some help.

 I have a 2 node cluster, heartbeat is working as I am able to fail over
 IP
 Addresses and services successfully, but when I try to fail over my
 DRBD
 resource from secondary to primary I am hitting a brick wall, I can
 fail
 over the DRBD resource manually so I know that it does work at some
 level
 DRBD version 8.3 Heartbeat version heartbeat-2.1.3-1.fc9.i386
 Please upgrade. Thats too old for reliable master/slave behaviour.
 Preferrably upgrade to pacemaker and ais or heartbeat 2.99. Read
 http://www.clusterlabs.org/wiki/Install for install notes.

 and using
 heartbeat-gui to configure
 Don't use the gui to configure complex (ie clone or master/slave)
 resources.

 Once you upgraded to the latest pacemaker, please refer to
 http://www.clusterlabs.org/wiki/DRBD_HowTo_1.0 for drbd's cluster
 configuration.

 Regards
 Dominik

 DRBD Resource is called Storage1, the 2 nodes are connected via 2
 x-over
 cables (1 heartbeat 1 Replication)

 I have stripped down my config to the bare bones and tried every option
 that I can think off but know that I am missing something simple,

 I have attached my cib.xml but have removed domain names from the
 systems
 for privacy reasons

 Thanks in advance

 Jason

  cib admin_epoch=0 have_quorum=true ignore_dtd=false
 cib_feature_revision=2.0 num_peers=2 generated=true
 ccm_transition=22 dc_uuid=9d8abc28-4fa3-408a-a695-fb36b0d67a48
 epoch=733 num_updates=1 cib-last-written=Mon Feb  9 18:31:19
 2009
configuration
  crm_config
cluster_property_set id=cib-bootstrap-options
  attributes
nvpair id=cib-bootstrap-options-dc-version
 name=dc-version
 value=2.1.3-node: 552305612591183b1628baa5bc6e903e0f1e26a3/
nvpair name=last-lrm-refresh
 id=cib-bootstrap-options-last-lrm-refresh value=1234204278/
  /attributes
/cluster_property_set
  /crm_config
  nodes
node id=df707752-d5fb-405a-8ca7-049e25a227b7
 uname=lpissan1001
 type=normal
  instance_attributes
 id=nodes-df707752-d5fb-405a-8ca7-049e25a227b7
attributes
  nvpair id=standby-df707752-d5fb-405a-8ca7-049e25a227b7
 name=standby value=off/
/attributes
  /instance_attributes
/node
node id=9d8abc28-4fa3-408a-a695-fb36b0d67a48
 uname=lpissan1002
 type=normal
  instance_attributes
 id=nodes-9d8abc28-4fa3-408a-a695-fb36b0d67a48
attributes
  nvpair id=standby-9d8abc28-4fa3-408a-a695-fb36b0d67a48