[Pacemaker] HA across WDM Fibre link - Nodes won't rejoin after reboot

2012-04-02 Thread Darren.Mansell
Hi everyone.

 

I have 2 nodes running on ESX hosts in 2 geographically diverse data
centres. The link between them is a DWDM fibre link which is the only
thing I can think of as being the cause of this.

 

SLES 11 SP1 with HAE. All latest updates.

 

If Corosync is set to Multicast on the default address, there are no
comms between Corosync on the nodes. If I use broadcast, it will
communicate and let the nodes join.
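(For anyone comparing configs: the only difference between the two modes is the
totem interface section of /etc/corosync/corosync.conf. A rough sketch, with
placeholder addresses rather than my real ones:

interface {
        ringnumber:     0
        bindnetaddr:    10.160.0.0
        # default multicast transport:
        # mcastaddr:    239.255.1.1
        # mcastport:    5405
        # broadcast instead of multicast:
        broadcast:      yes
        mcastport:      5405
}
)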

 

If I reboot node 2, it rejoins fine. If I reboot node 1, it enters a
pending phase for a while then just drops to offline. I can then clear
the config out again and let the nodes rejoin. Node 1 always seems to be
the DC.

 

Pending - logs from node 1, loops this every second:

 

-02: id=336371722 state=member (new) addr=r(0) ip(10.160.12.20)  votes=1
born=7912 seen=7920 proc=00151312

Apr  2 14:37:13 PPS-VMAIL-01 crmd: [3896]: info: crm_update_peer: Node
PPS-VMAIL-01: id=168599562 state=member (new) addr=r(0) ip(10.160.12.10)
(new) votes=1 (new) born=7920 seen=7920
proc=00151312 (new)

Apr  2 14:37:13 PPS-VMAIL-01 crmd: [3896]: WARN: do_log: FSA: Input
I_SHUTDOWN from revision_check_callback() received in state S_STARTING

Apr  2 14:37:13 PPS-VMAIL-01 crmd: [3896]: info: do_state_transition:
State transition S_STARTING -> S_STOPPING [ input=I_SHUTDOWN
cause=C_FSA_INTERNAL origin=revision_check_callback ]

Apr  2 14:37:13 PPS-VMAIL-01 crmd: [3896]: info: do_lrm_control:
Disconnected from the LRM

Apr  2 14:37:13 PPS-VMAIL-01 crmd: [3896]: info: do_ha_control:
Disconnected from OpenAIS

Apr  2 14:37:13 PPS-VMAIL-01 crmd: [3896]: info: do_cib_control:
Disconnecting CIB

Apr  2 14:37:13 PPS-VMAIL-01 crmd: [3896]: info: do_exit: Performing
A_EXIT_0 - gracefully exiting the CRMd

Apr  2 14:37:13 PPS-VMAIL-01 crmd: [3896]: info: free_mem: Dropping
I_NULL: [ state=S_STOPPING cause=C_FSA_INTERNAL
origin=register_fsa_error_adv ]

Apr  2 14:37:13 PPS-VMAIL-01 crmd: [3896]: info: free_mem: Dropping
I_TERMINATE: [ state=S_STOPPING cause=C_FSA_INTERNAL origin=do_stop ]

Apr  2 14:37:13 PPS-VMAIL-01 crmd: [3896]: info: do_exit: [crmd] stopped
(0)

 

Offline - logs from node 1, loops every second:

 

Apr  2 14:38:06 PPS-VMAIL-01 cib: [3510]: info: cib_replace_notify:
Local-only Replace: 0.0.0 from PP2-VMAIL-02

Apr  2 14:38:06 PPS-VMAIL-01 attrd: [3512]: info: do_cib_replaced:
Sending full refresh

Apr  2 14:38:06 PPS-VMAIL-01 attrd: [3512]: info: attrd_trigger_update:
Sending flush op to all hosts for: probe_complete (null)

Apr  2 14:38:06 PPS-VMAIL-01 cib: [3510]: info: apply_xml_diff: Digest
mis-match: expected 0cf389141d344ca552679f9924d281c5, calculated
818a100a0e3b725068393624381c9d4f

Apr  2 14:38:06 PPS-VMAIL-01 cib: [3510]: notice: cib_process_diff: Diff
0.13.642 -> 0.0.0 not applied to 0.13.642: Failed application of an
update diff

Apr  2 14:38:06 PPS-VMAIL-01 cib: [3510]: info: cib_server_process_diff:
Requesting re-sync from peer

Apr  2 14:38:06 PPS-VMAIL-01 cib: [3510]: WARN: cib_diff_notify:
Local-only Change (client:attrd, call: 1221): 0.0.0 (Application of an
update diff failed, requesting a full refresh)

 

Offline - logs from node 2, loops every second:

 

Apr  2 14:39:05 PP2-VMAIL-02 corosync[3794]:  [TOTEM ] Retransmit List:
29b7 29b8 29b9

Apr  2 14:39:05 PP2-VMAIL-02 corosync[3794]:  [TOTEM ] Retransmit List:
29bb 29bc

Apr  2 14:39:05 PP2-VMAIL-02 cib: [3801]: info: cib_process_request:
Operation complete: op cib_sync_one for section 'all'
(origin=PPS-VMAIL-01/PPS-VMAIL-01/(null), version=0.13.1538): ok (rc=0)

 

Any ideas please?

 

Thanks.

 

Darren Mansell

 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] HA across WDM Fibre link - Nodes won't rejoin after reboot

2012-04-02 Thread Darren.Mansell
   On 2012-04-02T14:53:53, darren.mans...@opengi.co.uk wrote:

    I have 2 nodes running on ESX hosts in 2 geographically diverse data
    centres. The link between them is a DWDM fibre link which is the only
    thing I can think of as being the cause of this.

    SLES 11 SP1 with HAE. All latest updates.

   That looks timing related; what bandwidth/latency do you get
   between the two sites?

No theoretical bandwidth quoted from the supplier, but in the hundreds of
Mbits I believe. 670ns latency.


   You know that geographical clusters is not an officially
supported deployment scenario for SP1, right? ;-) That's documented in
the release notes and changing only with SP2.


   Regards,
   Lars

I think the bandwidth and latency are sufficient to happily accept this
as local :) But no, I didn't know that unfortunately. If I can't get
Corosync communications working I'll just have to drop the HA and use
external load-balancers to direct traffic.
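(For anyone else hitting this, a quick way to sanity-check the link would be
something along these lines, using the node addresses from the logs above;
omping is only an option where it happens to be installed:

ping -c 20 10.160.12.20              # unicast round trip between the nodes
omping 10.160.12.10 10.160.12.20     # run on both nodes to test multicast delivery
)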

thanks.
Darren



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] Dual-Primary DRBD with OCFS2 on SLES 11 SP1

2011-09-29 Thread Darren.Mansell
(Originally sent to DRBD-user, reposted here as it may be more relevant)


 

Hello all.

 

I'm implementing a 2-node cluster using Corosync/Pacemaker/DRBD/OCFS2
for dual-primary shared FS.

 

I've followed the instructions on the DRBD applications site and it
works really well.

 

However, if I 'pull the plug' on a node, the other node continues to
operate the clones, but the filesystem is locked and inaccessible (the
monitor op works for the filesystem, but fails for the OCFS2 resource.)

 

If I reboot one node, there are no problems and I can continue to
access the OCFS2 FS.

 

After I pull the plug:

 

Online: [ test-odp-02 ]

OFFLINE: [ test-odp-01 ]

 

Resource Group: Load-Balancing

 Virtual-IP-ODP (ocf::heartbeat:IPaddr2):   Started
test-odp-02

 Virtual-IP-ODPWS   (ocf::heartbeat:IPaddr2):   Started
test-odp-02

 ldirectord (ocf::heartbeat:ldirectord):Started test-odp-02

Master/Slave Set: ms_drbd_ocfs2 [p_drbd_ocfs2]

 Masters: [ test-odp-02 ]

 Stopped: [ p_drbd_ocfs2:1 ]

Clone Set: cl-odp [odp]

 Started: [ test-odp-02 ]

 Stopped: [ odp:1 ]

Clone Set: cl-odpws [odpws]

 Started: [ test-odp-02 ]

 Stopped: [ odpws:1 ]

Clone Set: cl_fs_ocfs2 [p_fs_ocfs2]

 Started: [ test-odp-02 ]

 Stopped: [ p_fs_ocfs2:1 ]

Clone Set: cl_ocfs2mgmt [g_ocfs2mgmt]

 Started: [ test-odp-02 ]

 Stopped: [ g_ocfs2mgmt:1 ]

 

Failed actions:

p_o2cb:0_monitor_1 (node=test-odp-02, call=19, rc=-2,
status=Timed Out): unknown

exec error

 

 

test-odp-02:~ # mount

/dev/drbd0 on /opt/odp type ocfs2
(rw,_netdev,noatime,cluster_stack=pcmk)

 

test-odp-02:~ # ls /opt/odp

...just hangs forever...

 

If I then power test-odp-01 back on, everything fails back fine and the
ls command suddenly completes.

 

It seems to me that OCFS2 is trying to talk to the node that has
disappeared and doesn't time out. Does anyone have any ideas? (attached
CRM and DRBD configs)

 

Many thanks.

 

Darren Mansell

 



drbd.conf
Description: drbd.conf
node test-odp-01
node test-odp-02 \
attributes standby=off
primitive Virtual-IP-ODP ocf:heartbeat:IPaddr2 \
params lvs_support=true ip=2.21.15.100 cidr_netmask=8 broadcast=2.255.255.255 \
op monitor interval=1m timeout=10s \
meta migration-threshold=10 failure-timeout=600
primitive Virtual-IP-ODPWS ocf:heartbeat:IPaddr2 \
params lvs_support=true ip=2.21.15.103 cidr_netmask=8 broadcast=2.255.255.255 \
op monitor interval=1m timeout=10s \
meta migration-threshold=10 failure-timeout=600
primitive ldirectord ocf:heartbeat:ldirectord \
params configfile=/etc/ha.d/ldirectord.cf \
op monitor interval=2m timeout=20s \
meta migration-threshold=10 failure-timeout=600
primitive odp lsb:odp \
op monitor interval=10s enabled=true timeout=10s \
meta migration-threshold=10 failure-timeout=600
primitive odpwebservice lsb:odpws \
op monitor interval=10s enabled=true timeout=10s \
meta migration-threshold=10 failure-timeout=600
primitive p_controld ocf:pacemaker:controld \
op monitor interval=10s enabled=true timeout=10s \
meta migration-threshold=10 failure-timeout=600
primitive p_drbd_ocfs2 ocf:linbit:drbd \
params drbd_resource=r0 \
op monitor interval=10s enabled=true timeout=10s \
meta migration-threshold=10 failure-timeout=600
primitive p_fs_ocfs2 ocf:heartbeat:Filesystem \
params device=/dev/drbd/by-res/r0 directory=/opt/odp fstype=ocfs2 options=rw,noatime \
op monitor interval=10s enabled=true timeout=10s \
meta migration-threshold=10 failure-timeout=600
primitive p_o2cb ocf:ocfs2:o2cb \
op monitor interval=10s enabled=true timeout=10s \
meta migration-threshold=10 failure-timeout=600
group Load-Balancing Virtual-IP-ODP Virtual-IP-ODPWS ldirectord
group g_ocfs2mgmt p_controld p_o2cb
ms ms_drbd_ocfs2 p_drbd_ocfs2 \
meta master-max=2 clone-max=2 notify=true
clone cl-odp odp
clone cl-odpws odpws
clone cl_fs_ocfs2 p_fs_ocfs2 \
meta target-role=Started
clone cl_ocfs2mgmt g_ocfs2mgmt \
meta interleave=true
location Prefer-Node1 ldirectord \
rule $id=prefer-node1-rule 100: #uname eq test-odp-01
order o_ocfs2 inf: ms_drbd_ocfs2:promote cl_ocfs2mgmt:start cl_fs_ocfs2:start
order tomcatlast1 inf: cl_fs_ocfs2 cl-odp
order tomcatlast2 inf: cl_fs_ocfs2 cl-odpws
property $id=cib-bootstrap-options \
dc-version=1.1.5-5bd2b9154d7d9f86d7f56fe0a74072a5a6590c60 \
cluster-infrastructure=openais \
expected-quorum-votes=2 \
no-quorum-policy=ignore \
start-failure-is-fatal=false \
stonith-action=reboot \
stonith-enabled=false \
last-lrm-refresh=1317207361
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: 

Re: [Pacemaker] Dual-Primary DRBD with OCFS2 on SLES 11 SP1

2011-09-29 Thread Darren.Mansell
Sorry for top-posting, I'm Outlook-afflicted.

This is also my problem. In the full production environment there will be
low-level hardware fencing by means of IBM RSA/ASM, but this is a VMware test
environment. The VMware STONITH plugin is dated and doesn't seem to work
correctly (I gave up quickly after the author of the plugin stated on this
list that it probably won't work), and SSH STONITH seems to have been removed,
not that it would do much good in this circumstance.

Therefore there's no way to set up STONITH in a VMware test environment, which
is where I believe a lot of people architect solutions these days, so there's
no way to prove a solution works.

I'll attempt to modify and improve the VMware STONITH agent, but I'm not sure
how STONITH could help in this situation, where one node has gone away and the
single remaining node is itself failing. Is this where the suicide agent comes
in?

Regards,
Darren

-Original Message-
From: Nick Khamis [mailto:sym...@gmail.com] 
Sent: 29 September 2011 15:48
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Dual-Primary DRBD with OCFS2 on SLES 11 SP1

Hello Dejan,

Sorry to hijack, I am also working on the same type of setup as a prototype.
What is the best way to get stonith included for VM setups? Maybe an SSH 
stonith?
Again, this is just for the prototype.

Cheers,

Nick.

On Thu, Sep 29, 2011 at 9:28 AM, Dejan Muhamedagic deja...@fastmail.fm wrote:
 Hi Darren,

 On Thu, Sep 29, 2011 at 02:15:34PM +0100, darren.mans...@opengi.co.uk wrote:
 (Originally sent to DRBD-user, reposted here as it may be more 
 relevant)




 Hello all.



 I'm implementing a 2-node cluster using Corosync/Pacemaker/DRBD/OCFS2 
 for dual-primary shared FS.



 I've followed the instructions on the DRBD applications site and it 
 works really well.



 However, if I 'pull the plug' on a node, the other node continues to 
 operate the clones, but the filesystem is locked and inaccessible 
 (the monitor op works for the filesystem, but fails for the OCFS2 
 resource.)



 If I do a reboot one node, there are no problems and I can continue 
 to access the OCFS2 FS.



 After I pull the plug:



 Online: [ test-odp-02 ]

 OFFLINE: [ test-odp-01 ]



 Resource Group: Load-Balancing

      Virtual-IP-ODP     (ocf::heartbeat:IPaddr2):       Started
 test-odp-02

      Virtual-IP-ODPWS   (ocf::heartbeat:IPaddr2):       Started
 test-odp-02

      ldirectord (ocf::heartbeat:ldirectord):    Started test-odp-02

 Master/Slave Set: ms_drbd_ocfs2 [p_drbd_ocfs2]

      Masters: [ test-odp-02 ]

      Stopped: [ p_drbd_ocfs2:1 ]

 Clone Set: cl-odp [odp]

      Started: [ test-odp-02 ]

      Stopped: [ odp:1 ]

 Clone Set: cl-odpws [odpws]

      Started: [ test-odp-02 ]

      Stopped: [ odpws:1 ]

 Clone Set: cl_fs_ocfs2 [p_fs_ocfs2]

      Started: [ test-odp-02 ]

      Stopped: [ p_fs_ocfs2:1 ]

 Clone Set: cl_ocfs2mgmt [g_ocfs2mgmt]

      Started: [ test-odp-02 ]

      Stopped: [ g_ocfs2mgmt:1 ]



 Failed actions:

     p_o2cb:0_monitor_1 (node=test-odp-02, call=19, rc=-2, 
 status=Timed Out): unknown

 exec error





 test-odp-02:~ # mount

 /dev/drbd0 on /opt/odp type ocfs2
 (rw,_netdev,noatime,cluster_stack=pcmk)



 test-odp-02:~ # ls /opt/odp

 ...just hangs forever...



 If I then power test-odp-01 back on, everything fails back fine and 
 the ls command suddenly completes.



 It seems to me that OCFS2 is trying to talk to the node that has 
 disappeared and doesn't time out. Does anyone have any ideas? 
 (attached CRM and DRBD configs)

 With stonith disabled, I doubt that your cluster can behave as it 
 should.

 Thanks,

 Dejan



 Many thanks.



 Darren Mansell





 Content-Description: crm.txt
 node test-odp-01
 node test-odp-02 \
         attributes standby=off
 primitive Virtual-IP-ODP ocf:heartbeat:IPaddr2 \
         params lvs_support=true ip=2.21.15.100 cidr_netmask=8 
 broadcast=2.255.255.255 \
         op monitor interval=1m timeout=10s \
         meta migration-threshold=10 failure-timeout=600
 primitive Virtual-IP-ODPWS ocf:heartbeat:IPaddr2 \
         params lvs_support=true ip=2.21.15.103 cidr_netmask=8 
 broadcast=2.255.255.255 \
         op monitor interval=1m timeout=10s \
         meta migration-threshold=10 failure-timeout=600
 primitive ldirectord ocf:heartbeat:ldirectord \
         params configfile=/etc/ha.d/ldirectord.cf \
         op monitor interval=2m timeout=20s \
         meta migration-threshold=10 failure-timeout=600
 primitive odp lsb:odp \
         op monitor interval=10s enabled=true timeout=10s \
         meta migration-threshold=10 failure-timeout=600
 primitive odpwebservice lsb:odpws \
         op monitor interval=10s enabled=true timeout=10s \
         meta migration-threshold=10 failure-timeout=600
 primitive p_controld ocf:pacemaker:controld \
         op monitor interval=10s enabled=true timeout=10s \
         meta migration-threshold=10 failure-timeout=600
 

Re: [Pacemaker] Help With Cluster Failure

2011-04-08 Thread Darren.Mansell
-Original Message-
From: Andrew Beekhof [mailto:and...@beekhof.net] 
Sent: 08 April 2011 08:15
To: The Pacemaker cluster resource manager
Cc: Darren Mansell
Subject: Re: [Pacemaker] Help With Cluster Failure

On Thu, Apr 7, 2011 at 12:12 PM,  darren.mans...@opengi.co.uk wrote:
 Hi all.



 One of my clusters had a STONITH shoot-out last night and then refused

 to do anything but sit there from 0400 until 0735 after I'd been woken

 up to fix it.



 In the end, just a resource cleanup fixed it, which I don't think 
 should be the case.



 I have an 8MB hb_report file. Is that too big to attach to send here? 
 Should I upload it somewhere?

Is there somewhere you can put it and send us a URL?



Absolutely. Thanks Andrew.

www.mysqlsimplecluster.com/HB_report/DM_report_1.tar.bz2 

Darren

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


[Pacemaker] Help With Cluster Failure

2011-04-07 Thread Darren.Mansell
Hi all. 

 

One of my clusters had a STONITH shoot-out last night and then refused
to do anything but sit there from 0400 until 0735 after I'd been woken
up to fix it.

 

In the end, just a resource cleanup fixed it, which I don't think should
be the case.

 

I have an 8MB hb_report file. Is that too big to attach to send here?
Should I upload it somewhere?

 

Thanks.

Darren Mansell

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] IPaddr2 Netmask Bug Fix Issue

2011-03-30 Thread Darren.Mansell
From: Pavel Levshin [mailto:pa...@levshin.spb.ru] 
Sent: 25 March 2011 19:50
To: pacemaker@oss.clusterlabs.org
Subject: Re: [Pacemaker] IPaddr2 Netmask Bug Fix Issue

 

25.03.2011 18:47, darren.mans...@opengi.co.uk: 





We configure a virtual IP on the non-arping lo interface of both servers
and then configure the IPaddr2 resource with lvs_support=true. This RA
will remove the duplicate IP from the lo interface when it becomes
active. Grouping the VIP with ldirectord/LVS we can have the
load-balancer and VIP on one node, balancing traffic to the other node
with failover where both resources failover together.

 

To do this we need to configure the VIP on lo as a 32 bit netmask but
the VIP on the eth0 interface needs to have a 24 bit netmask. This has
worked fine up until now and we base all of our clusters on this method.
Now what happens is that the find_interface() routine in IPaddr2 doesn't
remove the IP from lo when starting the VIP resource as it can't find it
due to the netmask not matching.
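(To make the pattern concrete, with made-up addresses for illustration:

# on each real server the VIP sits on lo with a host mask, so it never ARPs:
ip addr add 10.0.0.100/32 dev lo

# the director side is the clustered IPaddr2 resource with the real /24:
primitive Virtual-IP ocf:heartbeat:IPaddr2 \
        params lvs_support=true ip=10.0.0.100 cidr_netmask=24 \
        op monitor interval=1m timeout=10s
)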


Do you really need the address to be deleted from lo? Having two
identical addresses on the Linux machine should not harm, if routing was
not affected. In your case, with /32 netmask on lo, I do not foresee any
problems.

We use it in this way, i.e. with the address set on lo permanently.


--
Pavel Levshin

 

 

Thanks Pavel,

 

However, this means I would have to disable LVS support for the
resource. Which means that to make it work with LVS I have to set
lvs_support to false.

 

Of course, I'll do whatever it takes on my set up to make it work, but
it's not intuitive for other users.

 

Regards

Darren Mansell

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Lots of Issues with Live Pacemaker Cluster

2011-03-15 Thread Darren.Mansell
On Mon, 2011-03-14 at 12:19 +0100, Andrew Beekhof wrote:
> > 1.    DRBD doesn't promote/demote correctly. Whenever I have a
> failover,
> > the DRBD resource will just sit there on the wrong node, holding up
> all
> > other operations. It's like the demote never happens. Nothing is
> logged when
> > this happens, it just sits forever with half of the resources
> stopped and
> > DRBD master on the wrong node.
>
> For this at least I'd encourage a bug report with a hb_report archive.
> Without the logs, the configuration alone won't tell us much.

Ah. I was looking for hb_report and assumed it had been removed from my
packages or wasn't in the packages I had. I'll do that next time I can
quantify exactly when it happens.

Regards,
Darren
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Lots of Issues with Live Pacemaker Cluster

2011-03-15 Thread Darren.Mansell
On Mon, 2011-03-14 at 17:35 +0100, Dejan Muhamedagic wrote:
 Hi,
 
 On Mon, Mar 14, 2011 at 10:57:27AM -, darren.mans...@opengi.co.uk wrote:
  Hello everyone.
  
   
  
  I built and put into production without adequate testing a 2 node
  cluster running Ubuntu 10.04 LTS with Pacemaker and associated packages
  from the Ubuntu-HA-maintainers repo
  (https://launchpad.net/~ubuntu-ha-maintainers/+archive/ppa). 
 
 Not good to go live without sufficient testing. Testing is as
 important as anything else. Or even more important. If there
 isn't enough time for testing, then better to go without
 clustering.

I've very quickly realised this fact. Even if under pressure to put a
cluster live, don't give in until you're 100% happy with it. It WILL
bite you, and it won't be anyone else's fault but yours.

 
  2.   Crm shell won't load from a text file. When I use crm configure
   crm.txt, it will run through the file, complaining about the default
  timeout being less than 240, but doesn't load anything. So I go into the
  crm shell and set default-action-timeout to 240, commit and exit and do
  the same. This time it just exits silently, without loading the config.
 
 Strange. I assume that you run version 1.0.x which I don't use
 very often, but I cannot recall seeing this problem.

I'm not sure if I need to put a commit at the end of the input file? I
always assumed it had an implicit commit. I'll test this next time I get
chance.

 
  If I go into the crm shell and use load replace crm.txt it will work.
 
 Loading from a file was really meant to be done with configure
 load. Now, if there are errors/warnings in the configuration,
 what happens depends on check-* options for semantic checks.

I'll try that armed with this information next time.
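(For the archives: assuming a 1.0.x crm shell, the non-interactive forms are
roughly:

crm configure load update crm.txt     # merge the file into the running config
crm configure load replace crm.txt    # replace the whole config with the file
)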

 
  3.   Crm shell tab completes don't work unless you put an incorrect
  entry in first. I'm sure this is a python readline problem, as it also
  happens in SLE 11 HAE SP1 (but not in pre-SP1). I assume everyone
  associated (Dejan?) is aware of the problem, but highlighting it just in
  case.
 
 No, I'm not aware of it. Tab completion works here, though a bit
 differently from 1.0 due to lazy creation of the completion
 tables. You need to enter another level at least once before the
 tab completion is going to work for that level. For instance,
 it won't work in this case:
 
 crm(live)# resource <TAB><TAB>
 
 But it would once the user enters the resource level:
 
 crm(live)resource# <TAB><TAB>
 bye   failcount move  restart unmigrate 
 cdhelp  param show  unmove 
 cleanup   list  promote   start up 
 demotemanagequit  status  utilization 
 end   meta  refresh   stop  
 exit  migrate   reprobe   unmanage  
 
 Can you elaborate "put incorrect entry first"?

I think this is more down to my lack of understanding of how it's
changed then. I'm used to  1.0 clusters and the crm shell would always
tab complete *almost* everything. IIRC only location score rules etc
wouldn't.

However, I think my confusion has arisen due to this behaviour:

crm(live)# resource mi<TAB><TAB>
nothing
crm(live)# resource mi<enter>
ERROR: syntax: mi
crm(live)# resource mi<TAB>
crm(live)# resource migrate

It will tab-complete the first and second level, if you've already
entered an incorrect parameter.

Regards,
Darren Mansell

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Lots of Issues with Live Pacemaker Cluster

2011-03-15 Thread Darren.Mansell
I'm sorry if it came through to you that way. It's one of the challenges I face as an
enterprise IT worker using Linux as my desktop: either I use Outlook in a
Windows VM and top-post like this, or I use Evolution and quote correctly but
potentially cause encoding/unicode issues.

Regards,
Darren Mansell

-Original Message-
From: Digimer [mailto:li...@alteeve.com] 
Sent: 15 March 2011 14:41
To: The Pacemaker cluster resource manager
Cc: Darren Mansell; and...@beekhof.net
Subject: Re: [Pacemaker] Lots of Issues with Live Pacemaker Cluster

On 03/15/2011 10:15 AM, darren.mans...@opengi.co.uk wrote:
 Gibberish

I'm not sure what that was supposed to be about, but it doesn't look like sane 
Chinese. Despite a few kana, it certainly wasn't Japanese. If it was spam...

--
Digimer
E-Mail: digi...@alteeve.com
AN!Whitepapers: http://alteeve.com
Node Assassin:  http://nodeassassin.org
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] [Linux-HA] Solved: SLES 11 HAE SP1 Signon to CIB Failed

2011-02-09 Thread Darren.Mansell
  
 So I compared the /etc/ais/openais.conf in non-sp1 with 
 /etc/corosync/corosync.conf from sp1 and found this bit missing which 
 could be quite useful...
  
 service { 
 # Load the Pacemaker Cluster Resource Manager 
 ver:   0 
 name:  pacemaker 
 use_mgmtd: yes 
 use_logd:  yes
 }
  
 Added it and it works. Doh. 
  
 It seems the example corosync.conf that is shipped won't start 
 pacemaker, I'm not sure if that's on purpose or not, but I found it a 
 bit confusing after being used to it 'just working' previously.

Ah.  Understandably confusing.  That got fixed post-SP1, in a
maintenance update that went out in September or thereabouts.

Regards,

Tim


--
Tim Serong tser...@novell.com
Senior Clustering Engineer, OPS Engineering, Novell Inc.


---

Thanks Tim.

Although the media that can be downloaded *now* from Novell downloads
still has this issue, so any new clusters will fall foul of it.
Generally with a test build you won't perform updates as it burns a
licence you would need for the production system. Should the
downloadable media have the issue fixed?

Regards,
Darren

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


[Pacemaker] Solved: [Linux-HA] SLES 11 HAE SP1 Signon to CIB Failed

2011-02-03 Thread Darren.Mansell

On Fri, Jan 28, 2011 at 1:06 PM,  darren.mans...@opengi.co.uk wrote:
 Hi all, this seems like it should be an easy one to fix, I'll raise a 
 support call with Novell if required.



 Base install of SLES 11 32 bit SP1 with HAE SP1 and crm_mon gives 
 'signon to CIB failed'. Same thing with the CRM shell etc.

Too many open file descriptors?
lsof might show something interesting



---


Unfortunately not.

It seems that corosync doesn't spawn anything else, which is causing
this issue:

From a SLES 11 HAE install:

root  7342  5.6  0.2 166048 38924 ?SLl   2010 5685:08
aisexec
root  7349  0.0  0.0  67768 10516 ?SLs   2010   3:02  \_
/usr/lib64/heartbeat/stonithd
90        7350  0.0  0.0  65028  4656 ?S 2010   7:43  \_
/usr/lib64/heartbeat/cib
nobody    7351  0.0  0.0  61600  1832 ?S 2010   8:24  \_
/usr/lib64/heartbeat/lrmd
90        7352  0.0  0.0  66284  2320 ?S 2010   0:00  \_
/usr/lib64/heartbeat/attrd
90        7353  0.0  0.0  67536  3588 ?S 2010   1:24  \_
/usr/lib64/heartbeat/pengine
90        7354  0.0  0.0  72392  3712 ?S 2010   6:01  \_
/usr/lib64/heartbeat/crmd
root  7355  0.0  0.0  75148  2504 ?S 2010   2:25  \_
/usr/lib64/heartbeat/mgmtd
root  4040  0.0  0.0  0 0 ?Z 2010   0:00  \_
[aisexec] defunct
root  4059  0.0  0.0  0 0 ?Z 2010   0:00  \_
[aisexec] defunct

From a SLES 11 SP1 HAE install:

root  9109  0.0  0.4  13308  2288 tty1 Ss+  Feb02   0:00  \_
-bash
root  8989  0.0  0.1   4344   744 tty2 Ss+  Feb02   0:00
/sbin/mingetty tty2
root  8990  0.0  0.1   4344   752 tty3 Ss+  Feb02   0:00
/sbin/mingetty tty3
root  8991  0.0  0.1   4344   748 tty4 Ss+  Feb02   0:00
/sbin/mingetty tty4
root  8992  0.0  0.1   4344   748 tty5 Ss+  Feb02   0:00
/sbin/mingetty tty5
root  8993  0.0  0.1   4344   744 tty6 Ss+  Feb02   0:00
/sbin/mingetty tty6
root 24883  0.0  0.8  89808  4424 ?Ssl  Feb02   0:34
/usr/sbin/corosync
lookup-01:~ # 

So I compared the /etc/ais/openais.conf in non-sp1 with
/etc/corosync/corosync.conf from sp1 and found this bit missing which
could be quite useful...

service {
# Load the Pacemaker Cluster Resource Manager
ver:   0
name:  pacemaker
use_mgmtd: yes
use_logd:  yes
}

Added it and it works. Doh.

It seems the example corosync.conf that is shipped won't start
pacemaker, I'm not sure if that's on purpose or not, but I found it a
bit confusing after being used to it 'just working' previously.

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


[Pacemaker] SLES 11 HAE SP1 Signon to CIB Failed

2011-01-28 Thread Darren.Mansell
Hi all, this seems like it should be an easy one to fix, I'll raise a
support call with Novell if required.

 

Base install of SLES 11 32 bit SP1 with HAE SP1 and crm_mon gives
'signon to CIB failed'. Same thing with the CRM shell etc.

 

All the logs look fine and I'm root. It's using corosync / pacemaker.

 

Any ideas? Has anyone seen this before?

 

thanks

 

Darren Mansell

 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] [lvs-users] is it possible to have ldirector and real cluster server on same physical machine?

2010-12-06 Thread Darren.Mansell
Check the /var/log/ldirectord.log file for errors and check you can manually 
start it yourself: 

rcldirectord restart

I've had to compile a Perl module myself for ldirector in SLES 11 HAE: 
http://www.clusterlabs.org/wiki/Load_Balanced_MySQL_Replicated_Cluster#Missing_Perl_Socket6
 

You also need lvs_support=true in your ldirectord resource.

I've added this to the pacemaker list as it may be more suited for support 
there.

Darren Mansell


-Original Message-
From: lvs-users-boun...@linuxvirtualserver.org 
[mailto:lvs-users-boun...@linuxvirtualserver.org] On Behalf Of Mrvka Andreas
Sent: 06 December 2010 08:44
To: LinuxVirtualServer.org users mailing list.
Subject: Re: [lvs-users] is it possible to have ldirector and real cluster 
server on same physical machine?

Hello list,

sorrily I didn't succeed last week in deploying the cluster.
Please can anybody show me the error? It has to be somewhere very deep inside.

I only want to have a two node cluster with apache load balanced and 
fail-overing.
 It shouldn't be that complex - but where is the error?

Maby the solution or this configs will help others.

Here is my ldirectord.cf (with TABs):
autoreload = yes
checkinterval = 10
checktimeout = 3
logfile = /var/log/ldirectord.log
quiescent = yes
virtual = 10.10.11.60:80
	checktype = negotiate
	fallback = 127.0.0.1:80
	protocol = tcp
	real = 10.10.11.61:80 gate
	real = 10.10.11.62:80 gate
	receive = Still alive
	request = test.html
	scheduler = wlc
	service = http

My crm configure:

node linlbtemp01
node linlbtemp02
primitive ClusterIP ocf:heartbeat:IPaddr2 \
operations $id=ClusterIP-operations \
op monitor interval=5s timeout=20s \
params ip=10.10.11.60 nic=lo cidr_netmask=16 lvs_support=true
primitive Virtual-IP-Apache ocf:heartbeat:IPaddr2 \
params lvs_support=true ip=10.10.11.60 cidr_netmask=16 broadcast=255.255.255.255 \
op monitor interval=1m timeout=10s \
meta migration-threshold=10
primitive apache ocf:heartbeat:apache \
op monitor interval=30s timeout=10s \
meta migration-threshold=10 target-role=Started \
params configfile=/etc/apache2/httpd.conf httpd=/usr/sbin/httpd testurl=/test.html
primitive ldirectord ocf:heartbeat:ldirectord \
params configfile=/etc/ha.d/ldirectord.cf \
op monitor interval=2m timeout=20s \
meta migration-threshold=10 target-role=Started
group Load-Balancing Virtual-IP-Apache ldirectord
clone cl-apache apache
location Prefer-Node1 ldirectord \
rule $id=prefer-node1-rule 100: #uname eq linlbtemp01
property $id=cib-bootstrap-options \
dc-version=1.1.2-ecb1e2ea172ba2551f0bd763e557fccde68c849b \
cluster-infrastructure=openais \
expected-quorum-votes=2

My /etc/sysctl:
# Disable response to broadcasts.
# You don't want yourself becoming a Smurf amplifier.
net.ipv4.icmp_echo_ignore_broadcasts = 1
# enable route verification on all interfaces
net.ipv4.conf.all.rp_filter = 1
# enable ipV6 forwarding
#net.ipv6.conf.all.forwarding = 1
# increase the number of possible inotify(7) watches
fs.inotify.max_user_watches = 65536
# avoid deleting secondary IPs on deleting the primary IP
#net.ipv4.conf.default.promote_secondaries = 1
#net.ipv4.conf.all.promote_secondaries = 1
#net.ipv4.conf.lo.arp_ignore = 1
#net.ipv4.conf.lo.arp_announce = 2
#net.ipv4.conf.all.arp_ignore = 1
#net.ipv4.conf.all.arp_announce = 2
net.ipv4.conf.all.arp_ignore = 1
net.ipv4.conf.eth0.arp_ignore = 1
net.ipv4.conf.all.arp_announce = 2
net.ipv4.conf.eth0.arp_announce = 2
net.ipv4.ip_forward = 1


My ifcfg-lo:

IPADDR=127.0.0.1
NETMASK=255.0.0.0
NETWORK=127.0.0.0
BROADCAST=127.255.255.255
IPADDR_2=127.0.0.2/8
STARTMODE=onboot
USERCONTROL=no
FIREWALL=no
IPADDR_0=10.10.11.60   #VIP
NETMASK_0=255.255.255.255
NETWORK_0=10.10.11.0
BROADCAST_0=10.10.11.255
LABEL_0=0


Actually it seems, that my ldirectord out of openais does not start.

Can anybody point me to the error?

Thanks a lot in advance.
Andrew


-Original Message-
From: lvs-users-boun...@linuxvirtualserver.org 
[mailto:lvs-users-boun...@linuxvirtualserver.org] On Behalf Of 
darren.mans...@opengi.co.uk
Sent: Freitag, 3. Dezember 2010 14:53
To: lvs-us...@linuxvirtualserver.org
Subject: Re: [lvs-users] is it possible to have ldirector and real cluster server 
on same physical machine?

Glad it helped. This is my original howto for this kind of setup:

http://www.clusterlabs.org/wiki/Load_Balanced_MySQL_Replicated_Cluster 

darren


-Original Message-
From: lvs-users-boun...@linuxvirtualserver.org 
[mailto:lvs-users-boun...@linuxvirtualserver.org] On Behalf Of Mrvka Andreas
Sent: 03 December 2010 13:46
To: 'LinuxVirtualServer.org users mailing list.'
Subject: Re: [lvs-users] is it possible to have ldirector and real cluster server 
on same physical machine?

Hi Darren,

thank you for the detailed infos.
I've read out of your messages that in 

[Pacemaker] Help with understanding CIB scores

2010-07-05 Thread Darren.Mansell
Hi all.

 

Could anyone give me any pointers on how to easily find out what is
stopping resources moving to a preferred node as expected?

 

I'm looking at the ptest -Ls output and can see there is a greater score
for a resource on another node than the node I am specifically locating.
I can't see in the logs using grep lrmd.*$res /var/log/syslog anything
that would indicate what's going wrong.
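(Concretely, something along these lines, with my-resource standing in for the
real resource id; the allocation scores show up as native_color/clone_color
lines in the ptest output:

ptest -Ls 2>/dev/null | grep -E 'native_color|clone_color' | grep my-resource
grep 'lrmd.*my-resource' /var/log/syslog
)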

 

I'm using Pacemaker 1.0.8+hg15494-2ubuntu2 on Ubuntu Lucid (10.04) with
quite a large CIB.

 

Many thanks.

Darren Mansell

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


[Pacemaker] Time/Date Based Expressions in the CRM Shell

2010-03-31 Thread Darren.Mansell
Apologies if this is in the documentation but I can't see how to use the
time/date based expression resource constraints in the CRM shell.

 

Can anyone provide an example config or point me to any documentation
for how to use it?

 

I'm trying to use these constraints to run scripts at certain times
using the Anything RA, essentially using Pacemaker like an advanced
cron. Does this sound like the right way to do it?

 

Cheers

Darren

___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] Time/Date Based Expressions in the CRM Shell

2010-03-31 Thread Darren.Mansell
 -Original Message-
 From: Dejan Muhamedagic [mailto:deja...@fastmail.fm]
 Sent: 31 March 2010 11:09
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Time/Date Based Expressions in the CRM Shell
 Hi,
 
 On Wed, Mar 31, 2010 at 10:56:29AM +0100, darren.mans...@opengi.co.uk
 wrote:
 Apologies if this is in the documentation but I can't see how to use
 the time/date based expression resource constraints in the CRM shell.
 
 There's usage in the crm shell documentation and help for the location
 constraint (crm configure help location). There's nothing about the
date
 format, but you should be able to find about that in the configuration
 explained document. It's some ISO standard (looks a bit awkward).
 
 Thanks,
 
 Dejan
 

I've added the resource and location constraint:

primitive QuoteCountGrab ocf:heartbeat:anything \
params binfile=/usr/local/bin/qc-grab.sh errlogfile=/var/log/qc-grab.err
logfile=/var/log/qc-grab.log pidfile=/var/run/qc-grab.pid user=root

location QuoteCountGrabSchedule QuoteCountGrab rule inf: date date_spec
hours=0-4

but it seems to have started the resource right away and ignored the
date_spec. I've read the section on ensuring time based rules take
effect and Ive set cluster-recheck-interval to 5m but the resource is
still running.

Any ideas?

Thanks
Darren 

___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] DRBD Recovery Policies

2010-03-12 Thread Darren.Mansell
The odd thing is - it didn't. From my test, it failed back, re-promoted
NodeA to be the DRBD master and failed all grouped resources back too.

Everything was working with the ~7GB of data I had put onto NodeB while
NodeA was down, now available on NodeA...

/proc/drbd on the slave said Secondary/Primary UpToDate/Inconsistent
while it was syncing data back - so it was able to mount the
inconsistent data on the primary node and access the files that hadn't
yet sync'd over?! I mounted a 4GB ISO that shouldn't have been able to
be there yet and was able to access data inside it..

Is my understanding of DRBD limited and it's actually able to provide
access to not fully sync'd files over the network link or something?

If so - wow.

I'm confused ;)


-Original Message-
From: Menno Luiten [mailto:mlui...@artifix.net] 
Sent: 11 March 2010 19:35
To: pacemaker@oss.clusterlabs.org
Subject: Re: [Pacemaker] DRBD Recovery Policies

Hi Darren,

I believe that this is handled by DRBD by fencing the Master/Slave 
resource during resync using Pacemaker. See 
http://www.drbd.org/users-guide/s-pacemaker-fencing.html. This would 
prevent Node A to promote/start services with outdated data 
(fence-peer), and it would be forced to wait with takeover until the 
resync is completed (after-resync-target).
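(Concretely, the setup that page describes boils down to something like this in
drbd.conf; the handler paths below are the stock DRBD 8.3 scripts and may
differ on other builds:

disk {
    fencing resource-only;
}
handlers {
    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
}
)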

Regards,
Menno

On 11-3-2010 15:52, darren.mans...@opengi.co.uk wrote:
 I've been reading the DRBD Pacemaker guide on the DRBD.org site and
I'm
 not sure I can find the answer to my question.

 Imagine a scenario:

 (NodeA

 NodeB

 Order and group:

 M/S DRBD Promote/Demote

 FS Mount

 Other resource that depends on the F/S mount

 DRBD master location score of 100 on NodeA)

 NodeA is down, resources failover to NodeB and everything happily runs
 for days. When NodeA is brought back online it isn't treated as
 split-brain as a normal demote/promote would happen. But the data on
 NodeA would be very old and possibly take a long time to sync from
NodeB.

 What would happen in this scenario? Would the RA defer the promote
until
 the sync is completed? Would the inability to promote cause the
failback
 to not happen and a resource cleanup is required once the sync has
 completed?

 I guess this is really down to how advanced the Linbit DRBD RA is?

 Thanks

 Darren



 ___
 Pacemaker mailing list
 Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] DRBD Recovery Policies

2010-03-12 Thread Darren.Mansell
Fairly standard, but I don't really want it to be fenced, as I want to
keep the data that has been updated on the single remaining nodeB while
NodeA was being repaired:

global {
  dialog-refresh   1;
  minor-count  5;
}
common {
  syncer { rate 10M; }
}
resource cluster_disk {
  protocol  C;
  disk {
 on-io-error   pass_on;
  }
  syncer {
  }
handlers {
  split-brain /usr/lib/drbd/notify-split-brain.sh root;
  }
net {
 after-sb-1pri discard-secondary;
  }
startup {
 wait-after-sb; 
 }
  on cluster1 {
 device/dev/drbd0;
 address   12.0.0.1:7789;
 meta-disk internal;
 disk  /dev/sdb1;
  }
  on cluster2 {
 device/dev/drbd0;
 address   12.0.0.2:7789;
 meta-disk internal;
 disk  /dev/sdb1;
  }
}



-Original Message-
From: Menno Luiten [mailto:mlui...@artifix.net] 
Sent: 12 March 2010 10:05
To: pacemaker@oss.clusterlabs.org
Subject: Re: [Pacemaker] DRBD Recovery Policies

Are you absolutely sure you set the resource-fencing parameters 
correctly in your drbd.conf (you can post your drbd.conf if unsure) and 
reloaded the configuration?

On 12-03-10 10:48, darren.mans...@opengi.co.uk wrote:
 The odd thing is - it didn't. From my test, it failed back,
re-promoted
 NodeA to be the DRBD master and failed all grouped resources back too.

 Everything was working with the ~7GB of data I had put onto NodeB
while
 NodeA was down, now available on NodeA...

 /proc/drbd on the slave said Secondary/Primary UpToDate/Inconsistent
 while it was syncing data back - so it was able to mount the
 inconsistent data on the primary node and access the files that hadn't
 yet sync'd over?! I mounted a 4GB ISO that shouldn't have been able to
 be there yet and was able to access data inside it..

 Is my understanding of DRBD limited and it's actually able to provide
 access to not fully sync'd files over the network link or something?

 If so - wow.

 I'm confused ;)


 -Original Message-
 From: Menno Luiten [mailto:mlui...@artifix.net]
 Sent: 11 March 2010 19:35
 To: pacemaker@oss.clusterlabs.org
 Subject: Re: [Pacemaker] DRBD Recovery Policies

 Hi Darren,

 I believe that this is handled by DRBD by fencing the Master/Slave
 resource during resync using Pacemaker. See
 http://www.drbd.org/users-guide/s-pacemaker-fencing.html. This would
 prevent Node A to promote/start services with outdated data
 (fence-peer), and it would be forced to wait with takeover until the
 resync is completed (after-resync-target).

 Regards,
 Menno

 On 11-3-2010 15:52, darren.mans...@opengi.co.uk wrote:
 I've been reading the DRBD Pacemaker guide on the DRBD.org site and
 I'm
 not sure I can find the answer to my question.

 Imagine a scenario:

 (NodeA

 NodeB

 Order and group:

 M/S DRBD Promote/Demote

 FS Mount

 Other resource that depends on the F/S mount

 DRBD master location score of 100 on NodeA)

 NodeA is down, resources failover to NodeB and everything happily
runs
 for days. When NodeA is brought back online it isn't treated as
 split-brain as a normal demote/promote would happen. But the data on
 NodeA would be very old and possibly take a long time to sync from
 NodeB.

 What would happen in this scenario? Would the RA defer the promote
 until
 the sync is completed? Would the inability to promote cause the
 failback
 to not happen and a resource cleanup is required once the sync has
 completed?

 I guess this is really down to how advanced the Linbit DRBD RA is?

 Thanks

 Darren



 ___
 Pacemaker mailing list
 Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 ___
 Pacemaker mailing list
 Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 ___
 Pacemaker mailing list
 Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


[Pacemaker] DRBD Recovery Policies

2010-03-11 Thread Darren.Mansell
I've been reading the DRBD Pacemaker guide on the DRBD.org site and I'm
not sure I can find the answer to my question.

 

Imagine a scenario:

 

(NodeA

NodeB

 

Order and group:

M/S DRBD Promote/Demote

FS Mount

Other resource that depends on the F/S mount

 

DRBD master location score of 100 on NodeA)

 

NodeA is down, resources failover to NodeB and everything happily runs
for days. When NodeA is brought back online it isn't treated as
split-brain as a normal demote/promote would happen. But the data on
NodeA would be very old and possibly take a long time to sync from
NodeB.

 

What would happen in this scenario? Would the RA defer the promote until
the sync is completed? Would the inability to promote cause the failback
to not happen and a resource cleanup is required once the sync has
completed?

 

I guess this is really down to how advanced the Linbit DRBD RA is?

 

Thanks

Darren

___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] DRBD and fencing

2010-03-10 Thread Darren.Mansell

On Wed, Mar 10, 2010 at 02:32:05PM +0800, Martin Aspeli wrote:
 Florian Haas wrote:
 On 03/09/2010 06:07 AM, Martin Aspeli wrote:
 Hi folks,

 Let's say have a two-node cluster with DRBD and OCFS2, with a
database
 server that's supposed to be active on one node at a time, using the
 OCFS2 partition for its data store.
 *cringe* Which database is this?

 Postgres.

 Why are you cringing? From my reading, I had gathered this was a
pretty
 common setup to support failover of Postgres without the luxury of a
 SAN. Are you saying it's a bad idea?

PgSQL on top of DRBD is OK.  PgSQL on top of OCFS2 is a disaster waiting
to
gnaw your leg off.


--

Please forgive my ignorance, I seem to have missed the specifics about
using OCFS2 on DRBD dual-primary but what are the main issues? How can
you use PgSQL on dual-primary without OCFS2?

Thanks
Darren

___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] Help with OCFS2 / DLM Stability

2010-03-10 Thread Darren.Mansell
Sorry, please ignore this mail. Client issues!


-Original Message-
From: darren.mans...@opengi.co.uk [mailto:darren.mans...@opengi.co.uk] 
Sent: 10 March 2010 13:53
To: deja...@fastmail.fm
Cc: pacemaker@oss.clusterlabs.org
Subject: Re: Re: [Pacemaker] Help with OCFS2 / DLM Stability

On Wed, 2010-03-10 at 13:28 +0100, Dejan Muhamedagic wrote:
 Hi,

 On Tue, Mar 09, 2010 at 11:37:02AM -, darren.mans...@opengi.co.uk wrote:
  Hi everyone.

  Further to some discussions a couple of weeks ago with regard to
  OCFS2 on SLES 11 HAE I'm looking to finally nail this problem.

  We have a 3 node cluster that has a STONITH shootout every week.
  This morning one node got stuck in a state where it couldn't be
  fenced due to the RSA not being responsive.

  I'm not sure if the problem is due to:

  * Network interruption causing Totem failures.
  * Java (Tomcat) processes falling over.

 I suppose that those are activequote and activequoteadmin. You should
 increase the timeouts, 10 seconds is too short in general, and for
 java/tomcat probably even more so.

  * DLM falling over.
  * Any of the above in any combination.

  I've attached a hb_report. Could you see if you can see anything?

 Any good reason to ignore quorum? For a three node cluster you should
 remove the no-quorum-policy property or, perhaps because of ocfs2, set
 it to freeze.

 Pacemaker is 1.0.3, perhaps it's time to upgrade too. There is a
 SLE11 HAE update available.

 From the logs:

 Mar  9 06:28:43 OGG-ACTIVEQUOTE-02 pengine: [5540]: WARN: unpack_rsc_op:
 Processing failed op activequote:1_monitor_1 on OGG-ACTIVEQUOTE-03:
 unknown exec error

 Interestingly, there is no lrmd log for this on 03.

 Then there are several operation timeouts, perhaps due to ocfs2
 hanging, two activequote and activequoteadmin stop operations could
 not be killed even with -9, so they were probably waiting for the
 disk.

 Mar  9 06:29:40 OGG-ACTIVEQUOTE-02 openais[5439]: [crm  ] info:
 pcmk_peer_update: lost: OGG-ACTIVEQUOTE-03 504997642

 Do you know why the node vanished? You should try to keep your
 networking healthy.

 Thanks,

 Dejan

  Thanks

  Darren Mansell

  ___
  Pacemaker mailing list
  Pacemaker@oss.clusterlabs.org
  http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 ___
 Pacemaker mailing list
 Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker


___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] cluster/load balancing in openvz containers

2010-02-25 Thread Darren.Mansell
That's about where I got to last time I looked at it. Openvz and
Linux-HA should be great together, there's just lots of little
configuration issues to get around. I went with the ldirector / ipvsadm
route but run into issues with ARP config and didn't really have enough
time to look into it.

If you could create a config where you have a cluster with VZ instances
controlled by RA's that do live migrations it would be fantastic, a
proper self-contained virtual machine cluster.


-Original Message-
From: wessel [mailto:wes...@techtribe.nl] 
Sent: 25 February 2010 10:57
To: pacemaker@oss.clusterlabs.org
Subject: [Pacemaker] cluster/load balancing in openvz containers

Hi,

I am trying to get load balancing working on my test configuration which

consist of openvz containers (I would even like to use openvz on the 
production machines if possible, makes configuration/migration etc
easy).

I have the HA part working, I created venet interfaces on the containers

en added them together with the host interface in a bridge on the host. 
This even works with 3 containers on 2 different hardware hosts.

The load balancing part is a bit more problematic: I created a clone of 
the ip and of the website, and it starts the web server on both 
containers, so till there it looks fine.
But if I do requests it always seems to come from the same node. Until I

put that node in standby, than I get the requests from another node.

My host is ubuntu hardy 8.04, my containers are debian lenny. Could the 
problem be because of the http://www.linux-ha.org/ClusterIP , 
ipt_CLUSTERIP which is probably missing in my container kernel? I don't 
get any error messages in the logfile. At least none that looks related.

Below is my config.

Thanks for any help/suggestions!

Wessel

node test2 \
 attributes standby=off
node test3 \
 attributes standby=off
node test4 \
 attributes standby=on
primitive Website ocf:heartbeat:apache \
 params configfile=/etc/apache2/apache2.conf \
 op monitor interval=10s
primitive failover-ip ocf:heartbeat:IPaddr \
 params ip=10.111.112.34 \
 op monitor interval=10s
clone WebIP failover-ip \
 meta globally-unique=true clone-max=2 clone-node-max=2
clone WebsiteClone Website
colocation website-with-ip inf: WebsiteClone WebIP
order apache-after-ip inf: WebIP WebsiteClone
property $id=cib-bootstrap-options \
 dc-version=1.0.7-54d7869bfe3691eb723b1d47810e5585d8246b58 \
 cluster-infrastructure=openais \
 expected-quorum-votes=3 \
 stonith-enabled=false \
 no-quorum-policy=ignore


___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] OCFS2 fencing regulated by Pacemaker?

2010-02-11 Thread Darren.Mansell
Hello.

Yes, we get the same kind of thing. SLES11 HAE 64-bit.

Average uptime of the boxes is about a week at the moment. Also 3 nodes
using OCFS2 / cLVMD / OCFS2:

node OGG-NODE-01
node OGG-NODE-02 \
attributes standby=off
node OGG-NODE-03
primitive STONITH-1 stonith:external/ibmrsa-telnet \
params nodename=OGG-NODE-01 ip_address=192.168.1.12 password=PASSWORD username=USERID \
op monitor interval=1h timeout=1m \
op startup interval=0 timeout=1m \
meta target-role=Started
primitive STONITH-2 stonith:external/ibmrsa-telnet \
params nodename=OGG-NODE-02 ip_address=192.168.1.22 password=PASSWORD username=USERID \
op monitor interval=1h timeout=1m \
op startup interval=0 timeout=1m \
meta target-role=Started
primitive STONITH-3 stonith:external/ibmrsa-telnet \
params nodename=OGG-NODE-03 ip_address=192.168.1.32 password=PASSWORD username=USERID \
op monitor interval=1h timeout=1m \
meta target-role=Started
primitive Virtual-IP-App1 ocf:heartbeat:IPaddr2 \
params lvs_support=true ip=192.168.1.100 cidr_netmask=24 broadcast=192.168.1.255 \
op monitor interval=1m timeout=10s \
meta migration-threshold=10
primitive Virtual-IP-App2 ocf:heartbeat:IPaddr2 \
params lvs_support=true ip=192.168.1.103 cidr_netmask=24 broadcast=192.168.1.255 \
op monitor interval=1m timeout=10s \
meta migration-threshold=10
primitive ldirectord ocf:heartbeat:ldirectord \
params configfile=/etc/ha.d/ldirectord.cf \
op monitor interval=2m timeout=20s \
meta migration-threshold=10 target-role=Started
primitive App1 lsb:App1 \
op monitor interval=10s enabled=true timeout=10s \
meta target-role=Started
primitive App2 lsb:App2 \
op monitor interval=10s enabled=true timeout=10s \
meta target-role=Started
primitive dlm ocf:pacemaker:controld \
op monitor interval=120s
primitive o2cb ocf:ocfs2:o2cb \
op monitor interval=2m
primitive fs ocf:heartbeat:Filesystem \
params device=/dev/dm-0 directory=/opt/SAN/ fstype=ocfs2 \
op monitor interval=120s
group Load-Balancing Virtual-IP-App1 Virtual-IP-App2 ldirectord
clone cl-App1 App1
clone cl-App2 App2
clone dlm-clone dlm \
meta globally-unique=false interleave=true target-role=Started
clone o2cb-clone o2cb \
meta globally-unique=false interleave=true target-role=Started
clone fs-clone fs \
meta interleave=true ordered=true target-role=Started
location l-st-1 STONITH-1 -inf: OGG-NODE-01
location l-st-2 STONITH-2 -inf: OGG-NODE-02
location l-st-3 STONITH-3 -inf: OGG-NODE-03
location Prefer-Node1 ldirectord \
rule $id=prefer-node1-rule 100: #uname eq OGG-NODE-01
colocation o2cb-with-dlm inf: o2cb-clone dlm-clone
colocation fs-with-o2cb inf: fs-clone o2cb-clone
order start-o2cb-after-dlm inf: dlm-clone o2cb-clone
order start-fs-after-o2cb inf: o2cb-clone fs-clone
order start-app1-after-fs inf: fs-clone cl-App1
order start-app2-after-fs inf: fs-clone cl-App2
property $id=cib-bootstrap-options \
dc-version=1.0.3-0080ec086ae9c20ad5c4c3562000c0ad68374f0a \
expected-quorum-votes=3 \
no-quorum-policy=ignore \
start-failure-is-fatal=false \
stonith-action=reboot \
last-lrm-refresh=1265882628 \
stonith-enabled=true

We seem to have randomly picked up a standby=off node attribute. I
can't see that it's causing any problems, but I'm too afraid to make any
changes at the moment in case we have a(nother) shootout.

-Original Message-
From: Sander van Vugt [mailto:m...@sandervanvugt.nl] 
Sent: 11 February 2010 08:30
To: pacema...@clusterlabs.org
Subject: [Pacemaker] OCFS2 fencing regulated by Pacemaker?

Hi,

I'm trying to set up OCFS2 in a pacemaker environment (SLES11 with HAE),
in a 3 node cluster. Now I succesfully configured two volumes, the dlm
and the o2cb resource. But: if I shut down one of the nodes, at least
one other node (and sometimes even two!) is fencing itself. 

I've been looking for the a way to control this behavior, but can't find
anything. 

Does anyone have a clue?
Thanks,
Sander


___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] OCFS2 fencing regulated by Pacemaker?

2010-02-11 Thread Darren.Mansell
Once again, I apologise for the top-posting. I wish I could use a real
mail client but nothing apart from Outlook works properly with Exchange
:(.

Anyway - yes, we've had a really hard time with our 3-node SAN-based
cluster. We implemented OCFS2 on top of a shared disk using o2cb and
dlm clones. It seemed to work in the test environment, but since going
live it's been a real nightmare. It seems that if you even breathe on it,
it will start a shootout, but as it's now a production system I can't do
much about it.

Some mornings we arrive in and see that all 3 servers got STONITHd
overnight, but we can't see any reason why. We would like to disable
STONITH to see what state the cluster gets into before fencing, but as
things stand the worst that happens is about 10 minutes of service
unavailability, which is a lot better than 12 hours.

To complicate matters further, the apps we are using on the cluster /
shared storage are Tomcat-based and allegedly don't work too well with
other file-locking mechanisms. This is developer hearsay though; I can't
substantiate it. The only lead I have is that the DLM seems to lose
quorum and set the fencing ops off. The logs never seem to tie up,
though, so it's very difficult to fault-find.
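
(For what it's worth, one way to at least line the logs up is to pull them from
all three nodes into a single report covering the window around a shootout - a
sketch, assuming the hb_report tool shipped with SLES 11 HAE; the times and
destination are illustrative:

hb_report -f "2010-02-10 22:00" -t "2010-02-11 06:00" /tmp/fence-shootout

That collects syslog, the CIB and the PE inputs from each node so the timelines
can at least be compared side by side.)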

With all this in mind, I haven't been able to file any bugs or make
support requests to Novell due to not knowing exactly what is causing
the issue. At the moment, if we leave well alone it performs well. If I
had to reboot a node, I would expect the others to get fenced
afterwards.

Thanks for the help
Darren

-Original Message-
From: Dejan Muhamedagic [mailto:deja...@fastmail.fm] 
Sent: 11 February 2010 14:12
To: pacemaker@oss.clusterlabs.org; m...@sandervanvugt.nl
Subject: Re: [Pacemaker] OCFS2 fencing regulated by Pacemaker?

Hi,

On Thu, Feb 11, 2010 at 01:16:20PM +0100, Sander van Vugt wrote:
 On Thu, 2010-02-11 at 13:03 +0100, Dejan Muhamedagic wrote:
  Hi,
  
  On Thu, Feb 11, 2010 at 10:11:33AM -,
darren.mans...@opengi.co.uk wrote:
   Hello.
   
   Yes, we get the same kind of thing. SLES11 HAE 64-bit.
  
  Is there a bugzilla for this?
  
 Nope. Before filing a bug, I'd first like to be as sure as possible that
 it really is a bug and not a problem behind the keyboard. 

If you have strong doubts, closing a bugzilla is easy :) BTW,
this was meant for Darren actually, as it seemed like he was
having a really hard time dealing with his cluster.

 BTW: I don't see where on bugzilla.novell.com I should enter a bug for
 something that is in the SLES HAE (and the Bugzilla FAQ didn't help
 me). 

Use SUSE Linux Enterprise High Availability Extension for the
product line.

Thanks,

Dejan

 Thanks,
 Sander
 
 
 
 ___
 Pacemaker mailing list
 Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] Announce: Hawk (HA Web Konsole) 0.2.0

2010-02-10 Thread Darren.Mansell
On Tue, 2010-02-09 at 16:38 -0700, Tim Serong wrote:

 So, by "fixed" I clearly meant fixed in only one of the two places that
 require fixing.  Please try the following change (the relevant file will
 be /srv/www/hawk/public/javascripts/application.js):

This now works great, thank you :)

(Sorry about the strange characters, can you get your colleagues to fix
Evolution? ;) )

Regards,
Darren
___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] Announce: Hawk (HA Web Konsole) 0.2.0

2010-02-09 Thread Darren.Mansell
On Tue, 2010-02-09 at 04:06 -0700, Tim Serong wrote:
On 2/9/2010 at 09:15 PM, darren.mans...@opengi.co.uk wrote: 
  Hi Tim. Thanks for this project, it seems to be exactly what we're 
  looking for. 
 
 Well, I certainly hope so :)
 
  I've installed it (it required spawn-fcgi too on SLES11 64) but I just 
  get a blank page. I've looked at the page source and the divs have 
  style="display: none". Not sure why that's happening, can you think of 
  anything? 
 
 style="display: none" is used in two cases; one is for unexpanded
 children of a collapsible panel (but the header will still be visible).
 The other is if it thinks it can't see any useful information from
 cibadmin, in which case the expected behaviour would be an error message
 of some description.
 
 Can you please tell me:
 
  - What version of Pacemaker you're running
  - If you run "cibadmin -Ql | grep cluster-infrastructure", do you
    see any output?  If so, what?
 
 Thanks,
 
 Tim
 
It's pacemaker-1.0.3-4.1

No output for cluster-infrastructure.
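
(For comparison, on a node where the openais plugin is healthy that grep would
normally return a single nvpair along these lines - the id here is
illustrative:

cibadmin -Ql | grep cluster-infrastructure
  <nvpair id="cib-bootstrap-options-cluster-infrastructure" name="cluster-infrastructure" value="openais"/>

Getting nothing back usually means either the property hasn't been written yet
or cibadmin isn't talking to the cib process.)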

But the HTML source does contain information, just style="display: none"
hides it:

<div id="summary" style="display: none">
  <table>
    <tr><th>Stack:</th>      <td><span id="summary::stack"></span></td></tr>
    <tr><th>Version:</th>    <td><span id="summary::version">1.0.3-0080ec086ae9</span></td></tr>
    <tr><th>Current DC:</th> <td><span id="summary::dc">dm-ha1</span></td></tr>
    <tr><th>Stickiness:</th> <td><span id="summary::default_resource_stickiness">0</span></td></tr>
    <tr><th>STONITH:</th>    <td><span id="summary::stonith_enabled">Enabled</span></td></tr>
    <tr><th>Cluster is:</th> <td><span id="summary::symmetric_cluster">Symmetric</span></td></tr>
    <tr><th>No Quorum:</th>  <td><span id="summary::no_quorum_policy">stop</span></td></tr>
  </table>
</div>

Thanks
Darren

___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] ocf:heartbeat:mysql RA: op monitor

2010-02-09 Thread Darren.Mansell
On Tue, 2010-02-09 at 17:01 +0100, Oscar Remírez de Ganuza Satrústegui
wrote:
 Hello!

 I have one question regarding the ocf:heartbeat:mysql RA.

 I supposed that the following params defined on the resource were to
 be used by the monitor operation to check the status of the mysql
 service: test_passwd=password test_table=ldirectord.connectioncheck
 test_user=servicecheck
 And that's what the crm is telling me:

 * test_user (string, [root]): MySQL test user
 MySQL test user
 * test_passwd (string): MySQL test user password
 MySQL test user password
 * test_table (string, [mysql.user]): MySQL test table
 Table to be tested in monitor statement (in database.table notation)

 But in my tests, they are not working as expected: they are always
 telling me that the service is ok, even if I do not have the user
 servicecheck defined on the database, and also if I stop (kill
 -SIGSTOP) the mysql process.

 How can they be used to check the status of the mysql service?

 Thanks very much again!!

 ---
 Oscar Remírez de Ganuza
 Servicios Informáticos
 Universidad de Navarra
 Ed. de Derecho, Campus Universitario
 31080 Pamplona (Navarra), Spain
 tfno: +34 948 425600 Ext. 3130
 http://www.unav.es/SI

Very odd. The RA does the following:

buf=`echo "SELECT * FROM $OCF_RESKEY_test_table" | mysql --user=$OCF_RESKEY_test_user --password=$OCF_RESKEY_test_passwd --socket=$OCF_RESKEY_socket -O connect_timeout=1 2>&1`
rc=$?
if [ ! $rc -eq 0 ]; then
    ocf_log err "MySQL $test_table monitor failed:";
    if [ ! -z "$buf" ]; then ocf_log err "$buf"; fi
    return $OCF_ERR_GENERIC;
else
    ocf_log info "MySQL monitor succeded";
    return $OCF_SUCCESS;
fi

And if I kill the MySQL process on mine the monitor detects this. What's
in your logs?
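
(If it helps to reproduce this outside the cluster, the monitor action can be
run by hand - a rough sketch, assuming the stock RA path on SLES and the values
from your configuration; add OCF_RESKEY_socket etc. if you set them:

export OCF_ROOT=/usr/lib/ocf
export OCF_RESKEY_test_user=servicecheck
export OCF_RESKEY_test_passwd=password
export OCF_RESKEY_test_table=ldirectord.connectioncheck
/usr/lib/ocf/resource.d/heartbeat/mysql monitor; echo "rc=$?"

A working setup should flip between rc=0 and a non-zero code as MySQL is
stopped and started.)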

Darren
___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] [PATCH] Allow the user to insert a startup configuration

2009-12-15 Thread Darren.Mansell
Perhaps change:

crm_info("Using initial configuration file : %s", static_config_file);

To:

crm_warn("Using initial configuration file : %s", static_config_file);

?

Anyone who would know to put a static config file in there in the first
place would be proficient enough to look in the log file for clues about
why their CIB keeps resetting?


-Original Message-
From: Dejan Muhamedagic [mailto:deja...@fastmail.fm] 
Sent: 15 December 2009 10:48
To: pacemaker@oss.clusterlabs.org
Subject: Re: [Pacemaker] [PATCH] Allow the user to insert a startup
configuration

Hi,

On Tue, Dec 15, 2009 at 11:37:48AM +0100, Andrew Beekhof wrote:
 Anyone else interested in this feature being added?

The configuration is not explicitly given to the cluster, but
placed in a file. What happens on next startup? Who removes the
file so that the cluster doesn't load it again? If the answer to
the last question is the admin, I'm against the feature.

Thanks,

Dejan

 On Dec 10, 2009, at 9:53 PM, frank.di...@bigbandnet.com wrote:
 
 # HG changeset patch
 # User Frank DiMeo frank.di...@bigbandnet.com
 # Date 1260478129 18000
 # Branch stable-1.0
 # Node ID e7067734add7f3b148cb534b85b5af256db9fad7
 # Parent  381160def02a34ae554637e0a26efda850ccc015
 initial load of static configuration file
 
 diff -r 381160def02a -r e7067734add7 cib/io.c
 --- a/cib/io.c   Thu Dec 10 09:07:45 2009 -0500
 +++ b/cib/io.c   Thu Dec 10 15:48:49 2009 -0500
 @@ -261,7 +261,7 @@
  crm_err("%s exists but does NOT contain valid XML. ", filename);
  crm_warn("Continuing but %s will NOT used.", filename);
  
 -} else if(validate_cib_digest(root, sigfile) == FALSE) {
 +} else if(sigfile && ( validate_cib_digest(root, sigfile) == FALSE )) {
  crm_err("Checksum of %s failed!  Configuration contents ignored!", filename);
  crm_err("Usually this is caused by manual changes, "
  "please refer to http://clusterlabs.org/wiki/FAQ#cib_changes_detected");
 @@ -282,11 +282,12 @@
 readCibXmlFile(const char *dir, const char *file, gboolean
 discard_status)
 {
  int seq = 0;
 -char *filename = NULL, *sigfile = NULL;
 +char *filename = NULL, *sigfile = NULL, *static_config_file =
NULL;
  const char *name = NULL;
  const char *value = NULL;
  const char *validation = NULL;
  const char *use_valgrind = getenv("HA_VALGRIND_ENABLED");
 +   struct stat buf;
  
  xmlNode *root = NULL;
  xmlNode *status = NULL;
 @@ -300,7 +301,23 @@
  sigfile  = crm_concat(filename, "sig", '.');
 
  cib_status = cib_ok;
 -root = retrieveCib(filename, sigfile, TRUE);
 +
 +   /*
 +   ** we might drop a static config file in there as a known
 startup point
 +   ** if we do, use it.  Its called file.xml.static_init
 +   */
 +   static_config_file = crm_concat(filename, "static_init", '.');
 +
 +   crm_info("Looking for static initialization file : %s", static_config_file);
 +
 +   if(stat(static_config_file, &buf) == 0) {
 +  crm_info("Using initial configuration file : %s", static_config_file);
 +  root = retrieveCib(static_config_file, NULL, TRUE);
 +   }
 +   else {
 +  crm_info("Using found configuration file : %s", filename);
 +  root = retrieveCib(filename, sigfile, TRUE);
 +   }
 
  if(root == NULL) {
  crm_warn("Primary configuration corrupt or unusable, trying backup...");
 @@ -308,7 +325,6 @@
  }
  
  while(root == NULL) {
 -struct stat buf;
  char *backup_file = NULL;
  crm_free(sigfile);
 
 @@ -409,6 +425,7 @@
  }
  }
 
 +crm_free(static_config_file);
  crm_free(filename);
  crm_free(sigfile);
  return root;
 
 -- Andrew
 
 
 
 
 ___
 Pacemaker mailing list
 Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] I Like This But...

2009-12-07 Thread Darren.Mansell
Depends on what you're doing with it to make it challenging or not. The
old Linux-HA had a very steep learning curve that isn't there as much
anymore. A decent level of networking knowledge is required but the
documentation is now excellent and with the CRM shell, Pacemaker is a
lot easier to work with.

At the moment it is a great opportunity - the technology isn't widely
known and people seem to get very scared when you start talking about
what it does. It means you can deliver great solutions for very low
costs and you're seen as very capable for understanding it, or at least
making it work.

But you shouldn't worry about how difficult configuring it is for other
people. If you can read the documentation, understand it and configure
it correctly then all management should be interested in is the result.
As you can get great results quickly and cheaply then they should have
no cause to complain and every cause to give you a pay rise :)

It won't be long until Pacemaker gets more exposure and then suddenly
every man and his dog is using and configuring it. Until then it's great
to have a technology you can rely upon to give you great results and
make you look good in the process.

Darren

-Original Message-
From: Fraser Doswell [mailto:doswe...@acanac.net] 
Sent: 05 December 2009 03:56
To: pacemaker@oss.clusterlabs.org
Subject: [Pacemaker] I Like This But...

While pacemaker is great, the process of configuring it is still
challenging - the pieces seem to be everywhere.

As a consultant, this is a great opportunity - pull the pieces together
and make it all work.

But, any normal systems admin with a micromanaging manager breathing
down their neck
will have a hard time using this software. I was in the trenches once -
and understand why
they may not always read the docs. Someone is always interrupting - any
time of the day!

Nonetheless, systems admins - please start at the beginning. Doing too
much too fast
creates a house of cards that is not understood. And read, read, read...

THIS WILL TAKE TIME - THERE IS NO WAY AROUND IT

The software is already better than the competition. Keep up the great
work!

Fraser Doswell
Addington IR Inc.


___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] is ptest 1.06 working correctly?

2009-11-30 Thread Darren.Mansell
I've never really understood the correct time to do the ptest graphs. I
initiated a failover once and did the graph very quickly while it was in
a transitional state, but I've always wondered if there is an easier way,
i.e. "show me a graph of the migration plan if such-and-such were to
happen".

-Original Message-
From: Frank DiMeo [mailto:frank.di...@bigbandnet.com] 
Sent: 30 November 2009 16:56
To: pacemaker@oss.clusterlabs.org
Subject: Re: [Pacemaker] is ptest 1.06 working correctly?

Actually, I don't know what you mean by the phrase "you don't have any
transitions in live cib".  Shouldn't ptest generate a graphical
representation of the actions to be carried out on resources?

-Frank

 -Original Message-
 From: Rasto Levrinc [mailto:rasto.levr...@linbit.com]
 Sent: Monday, November 30, 2009 11:38 AM
 To: pacemaker@oss.clusterlabs.org
 Subject: Re: [Pacemaker] is ptest 1.06 working correctly?
 
 
 On Mon, November 30, 2009 5:21 pm, Frank DiMeo wrote:
  I actually did use -- on the long options, for some reason the
  cut/paste in MS outlook collapsed them.  As you see from the
enclosed
  files in my previous posting, the files are actually generated,
 there's
  just not much in them.
 
 
 Oh, I see. It is because you don't have any transitions in live cib.
It
 works correctly as far as I can tell.
 
 Rasto
 
 
 --
 : Dipl-Ing Rastislav Levrinc
 : DRBD-MC http://www.drbd.org/mc/management-console/
 : DRBD/HA support and consulting http://www.linbit.com/
 DRBD(R) and LINBIT(R) are registered trademarks of LINBIT, Austria.
 
 
 
 ___
 Pacemaker mailing list
 Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] is ptest 1.06 working correctly?

2009-11-30 Thread Darren.Mansell
This sounds very interesting. I look forward to trying it :)

(sorry for Outlook-affliction)

-Original Message-
From: Dejan Muhamedagic [mailto:deja...@fastmail.fm] 
Sent: 30 November 2009 17:28
To: pacemaker@oss.clusterlabs.org
Subject: Re: [Pacemaker] is ptest 1.06 working correctly?

Hi,

On Mon, Nov 30, 2009 at 05:04:35PM -, darren.mans...@opengi.co.uk
wrote:
 I've never really understood the correct time to do the ptest graphs. I
 initiated a failover once and did the graph very quickly while it was in
 a transitional state but I've always wondered if there is an easier way
 i.e. show me a graph of the migration plan if such and such were to
 happen.

There's a fairly new feature in the crm shell with which it
is possible to edit the status section, e.g. to simulate a
resource failure or the node lost event. Then you can try
the ptest command (in configure) and it will show you what
would happen. This feature has not been complete at the time
when 1.0.6 was released and may still change.

Also, if you change the configuration and run ptest _before_
commit, that will also display the graph of what would
happen if the new configuration had been committed.
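
In practice that looks something like the sketch below - crm shell syntax as of
the 1.0.x series, so details may differ slightly:

crm
  cib new sandbox      # work on a shadow copy rather than the live CIB
  configure
    # ...edit resources/constraints, or play with the status section...
    ptest              # show what the PE would do with these changes
  cib delete sandbox   # throw the experiment away (or: cib commit sandbox)

Nothing touches the cluster until the shadow CIB is committed.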

Thanks,

Dejan

 -Original Message-
 From: Frank DiMeo [mailto:frank.di...@bigbandnet.com] 
 Sent: 30 November 2009 16:56
 To: pacemaker@oss.clusterlabs.org
 Subject: Re: [Pacemaker] is ptest 1.06 working correctly?
 
 Actually, I don't know what you mean by the phrase you don't have any
 transitions in live cib.  Shouldn't ptest generate a graphical
 representation of the actions to be carried out on resources?
 
 -Frank
 
  -Original Message-
  From: Rasto Levrinc [mailto:rasto.levr...@linbit.com]
  Sent: Monday, November 30, 2009 11:38 AM
  To: pacemaker@oss.clusterlabs.org
  Subject: Re: [Pacemaker] is ptest 1.06 working correctly?
  
  
  On Mon, November 30, 2009 5:21 pm, Frank DiMeo wrote:
   I actually did use -- on the long options, for some reason the
   cut/paste in MS outlook collapsed them.  As you see from the
 enclosed
   files in my previous posting, the files are actually generated,
  there's
   just not much in them.
  
  
  Oh, I see. It is because you don't have any transitions in live cib.
 It
  works correctly as far as I can tell.
  
  Rasto
  
  
  --
  : Dipl-Ing Rastislav Levrinc
  : DRBD-MC http://www.drbd.org/mc/management-console/
  : DRBD/HA support and consulting http://www.linbit.com/
  DRBD(R) and LINBIT(R) are registered trademarks of LINBIT, Austria.
  
  
  
  ___
  Pacemaker mailing list
  Pacemaker@oss.clusterlabs.org
  http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 ___
 Pacemaker mailing list
 Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 ___
 Pacemaker mailing list
 Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] no ais-keygen with ubuntu hardy and launchpad?

2009-11-04 Thread Darren.Mansell
Try corosync-keygen and work with /etc/corosync as if it were /etc/ais/
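
Roughly (paths as per a stock corosync package; adjust as needed):

corosync-keygen                        # writes /etc/corosync/authkey
scp /etc/corosync/authkey node2:/etc/corosync/
chmod 400 /etc/corosync/authkey        # on both nodes
vi /etc/corosync/corosync.conf         # where openais.conf used to live

then start openais/corosync on both nodes as usual.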

-Original Message-
From: Dirk Taggesell [mailto:dirk.tagges...@proximic.com] 
Sent: 04 November 2009 11:39
To: pacemaker@oss.clusterlabs.org
Subject: [Pacemaker] no ais-keygen with ubuntu hardy and launchpad?

Hi all,

I am about to get a simple HA cluster up and running and as the docs at
clusterlabs recommend, I tried openais instead of heartbeat.

Thus I incorporated
deb http://ppa.launchpad.net/ubuntu-ha/ppa/ubuntu hardy main
deb-src http://ppa.launchpad.net/ubuntu-ha/ppa/ubuntu hardy main

into /etc/apt/sources.list and installed:

openais, pacemaker, pacemaker-openais along with what is pulled in as
well because of dependencies.

But when I want to follow this:

http://clusterlabs.org/wiki/Initial_Configuration

there is no ais-keygen, nor is there an /etc/ais directory. At least
/etc/corosync exists, but neither openais nor corosync provides
sufficient documentation.

So, did I miss a package to be installed or how does even the very basic
configuration work?




___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] why use ocf::linbit:drbd instead ofocf::heartbeat:drbd?

2009-10-12 Thread Darren.Mansell
On 2009-10-10 10:37, xin.li...@cs2c.com.cn wrote:
  Hi all:
 
  As I known, drbd (8.3.2 and above) in pacemaker has 2 ocf scripts, 
 one is from linbit, the other one is from heartbeat .
 
  In Andrew's "Clusters from Scratch - Fedora 11", "Configure the 
 Cluster for DRBD", he uses ocf::linbit:drbd instead of 
 ocf::heartbeat:drbd
 
  why?

Because the heartbeat one is broken. It's that simple. Don't use it.



Can you say which parts are broken, though? We have just completed 2 large 
projects using the heartbeat RA for DRBD, as the Linbit version was not 
available at the start. SLES 11 HAE only ships DRBD 8.2.7, and using the later 
Linbit OCF RA means compiling later DRBD usertools and the module from source, 
at which point it's no longer supported by Novell.

We haven't encountered any problems with the heartbeat RA yet, and we can't 
just change to the later version without a lot of testing.

Thanks.
___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] Low cost stonith device

2009-09-25 Thread Darren.Mansell
I find the riloe plugin to be very good so if you can get cheap HP
servers with iLO then that could constitute a low cost STONITH device.
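
A rough sketch of what that looks like in the CIB, using the external/riloe
plugin (hostnames, iLO address and credentials are placeholders):

primitive st-node1 stonith:external/riloe \
        params hostlist=node1 ilo_hostname=10.0.0.11 ilo_user=stonith ilo_password=secret \
        op monitor interval=60m
location l-st-node1 st-node1 -inf: node1

One such primitive per node, each banned from running on the node it fences.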

-Original Message-
From: Mario Giammarco [mailto:mgiamma...@gmail.com] 
Sent: 24 September 2009 19:08
To: pacema...@clusterlabs.org
Subject: [Pacemaker] Low cost stonith device

Hello,

Can you suggest a list of STONITH devices compatible with
pacemaker?

I need a low-cost one.

I also have another idea for building a low-cost STONITH device:

I have intelligent switches. To STONITH a node, I can send a switch the
command to turn off all the ethernet ports linked to the node to be fenced.

So the node stays powered on, but it cannot do any harm because it is
disconnected from the network.

Is it a good idea? How can I implement it?

Thanks in advance for any help.

Mario



___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] Arp and configuration advice

2009-09-09 Thread Darren.Mansell
 Greetings,
 
 I have a two webserver/two database server clustered setup. I've got
 ldirector and LVS managed by pacemaker and configured to be able to
run
 on either database machine.
 
 I know how to disable ARP for the machine not running ldirector,
 unfortunately I'm not sure how to dynamically get the webservers to
 update their ARP cache when ldirector gets moved upon failure.
 
 Is it possible to set up a service for the two web servers to delete
 the
 ARP cache for the VIP on the event that ldirector gets moved?

The IPaddr2 RA runs send_arp when the start action is called, I believe:


#
# Run send_arp to note peers about new mac address
#
run_send_arp() {
    ARGS="-i $ARP_INTERVAL_MS -r $ARP_REPEAT -p $SENDARPPIDFILE $NIC $BASEIP auto not_used not_used"
    if [ "x$IP_CIP" = "xyes" ] ; then
        if [ x = "x$IF_MAC" ] ; then
            MY_MAC=auto
        else
            MY_MAC=`echo ${IF_MAC} | sed -e 's/://'`
        fi
        ARGS="-i $ARP_INTERVAL_MS -r $ARP_REPEAT -p $SENDARPPIDFILE $NIC $BASEIP $MY_MAC not_used not_used"
    fi
    ocf_log info "$SENDARP $ARGS"
    case $ARP_BACKGROUND in
    yes)
        ($SENDARP $ARGS || ocf_log err "Could not send gratuitous arps" ) >&2 &
        ;;
    *)
        $SENDARP $ARGS || ocf_log err "Could not send gratuitous arps"
        ;;
    esac
}


So when the VIP is started on another node, the other nodes should be
notified that the IP has changed hosts. Doesn't it work for you?
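
(If it isn't working, a quick way to check is to watch for the gratuitous ARP
on one of the web servers while failing the VIP over - interface and address
here are illustrative:

tcpdump -ni eth0 arp and host 192.168.1.100
arp -d 192.168.1.100     # flush a stale entry by hand if one is cached

If no ARP traffic shows up when the VIP moves, send_arp itself is the thing to
look at.)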

 
 I can build my own OCF script to update the arp cache, that's not an
 issue. I simply don't know how to configure pacemaker to say Oop. db2
 died. Move ldirector to db1 and tell the webservers to update their
ARP
 cache.
 
 Any suggestions?
 
 Thanks in advance!
 
 Justin
 

Regards,
Darren

___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


[Pacemaker] Load-Balancing Confusion

2009-09-02 Thread Darren.Mansell
Can anyone help me clear up my confusion with load-balancing /
load-sharing using Linux-HA please?

 

I've always used ldirectord/LVS with an IPaddr2 resource (not cloned),
colocated them and put the virtual IP address on the loopback of all
nodes. When the IPaddr2 resource starts on any node it removes the
VIP from the loopback on that node. Traffic hits the node with the
IPaddr2 and ldirectord resources and then gets redirected off to the
other nodes, which accept it on their lo devices as they don't ARP.
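
(Concretely, the by-hand part on each real server looks something like this -
VIP and interface are illustrative:

ip addr add 192.168.1.100/32 dev lo label lo:vip
sysctl -w net.ipv4.conf.all.arp_ignore=1
sysctl -w net.ipv4.conf.all.arp_announce=2

i.e. the standard LVS direct-routing arrangement.)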

 

This has worked fine so far bar a few issues but I'm not sure I'm doing
it right.

 

I'm using lvs_support=true in my CIB to allow it to work, but it doesn't
do what it says: it doesn't set the IP on the loopback device of a node
that isn't active, so I have to do that myself. Should I be cloning the
IPaddr2 so it runs on both nodes? Would it need making into a
multi-state resource for that to happen?

 

Sorry for all the questions - I've just opened a can of worms with this,
all because I've found I can't run more than one service on
127.0.0.1:80 and so can't load-balance more than one web server.

 

Thanks.

 

 

___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] Resource Failover in 2 Node Cluster

2009-08-19 Thread Darren.Mansell
I've now re-installed the SLES 11 HAE DRBD module and usertools and set
the cluster to use the heartbeat RA, and it now fails over as expected.
Does the Linbit-provided RA work differently? Is the following, from the
logs, anything to do with it?

 

Aug 19 11:08:37 gihub2 pengine: [4837]: notice: clone_print:
Master/Slave Set: MS-DRBD-Disk

Aug 19 11:08:37 gihub2 crmd: [4838]: info: unpack_graph: Unpacked
transition 126: 29 actions in 29 synapses

Aug 19 11:08:37 gihub2 pengine: [4837]: notice: print_list: Masters:
[ gihub1 ]

Aug 19 11:08:37 gihub2 crmd: [4838]: info: do_te_invoke: Processing
graph 126 (ref=pe_calc-dc-1250676517-500) derived from
/var/lib/pengine/pe-warn-1356.bz2

Aug 19 11:08:37 gihub2 pengine: [4837]: notice: print_list: Slaves:
[ gihub2 ]

Aug 19 11:08:37 gihub2 crmd: [4838]: info: te_pseudo_action: Pseudo
action 36 fired and confirmed

Aug 19 11:08:37 gihub2 pengine: [4837]: notice: group_print: Resource
Group: Resource-Group

Aug 19 11:08:37 gihub2 crmd: [4838]: info: te_pseudo_action: Pseudo
action 46 fired and confirmed

Aug 19 11:08:37 gihub2 pengine: [4837]: notice: native_print:
FileSystem(ocf::heartbeat:Filesystem):Started gihub1

Aug 19 11:08:37 gihub2 crmd: [4838]: info: te_rsc_command: Initiating
action 43: stop Virtual-IP_stop_0 on gihub1

Aug 19 11:08:37 gihub2 pengine: [4837]: notice: native_print:
ProFTPD   (lsb:proftpd):  Started gihub1

Aug 19 11:08:37 gihub2 crmd: [4838]: info: te_rsc_command: Initiating
action 62: notify DRBD-Disk:0_pre_notify_demote_0 on gihub2 (local)

Aug 19 11:08:37 gihub2 pengine: [4837]: notice: native_print: Tomcat
(lsb:tomcat):   Started gihub1

Aug 19 11:08:37 gihub2 crmd: [4838]: info: do_lrm_rsc_op: Performing
key=62:126:0:76a53bb6-ce93-4f38-81b5-f3af04223710
op=DRBD-Disk:0_notify_0 )

Aug 19 11:08:37 gihub2 pengine: [4837]: notice: native_print:
Virtual-IP(ocf::heartbeat:IPaddr2):   Started gihub1

Aug 19 11:08:37 gihub2 crmd: [4838]: info: te_rsc_command: Initiating
action 65: notify DRBD-Disk:1_pre_notify_demote_0 on gihub1

Aug 19 11:08:37 gihub2 pengine: [4837]: WARN: native_color: Resource
DRBD-Disk:1 cannot run anywhere
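
(For reference, the allocation scores behind a "cannot run anywhere" decision
can be inspected with something like the following - a sketch, option spellings
as of pacemaker 1.0:

ptest -L -s | grep DRBD-Disk     # allocation scores from the live CIB
crm_verify -L -V                 # sanity-check the configuration

which at least shows whether a constraint or a failed action is pinning
DRBD-Disk:1 down.)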

 

From: darren.mans...@opengi.co.uk [mailto:darren.mans...@opengi.co.uk] 
Sent: 19 August 2009 10:20
To: pacemaker@oss.clusterlabs.org
Subject: [Pacemaker] Resource Failover in 2 Node Cluster

 

Hello everyone. I'm a little confused about how this setup should work.
This is my config:

 

node gihub1
node gihub2
primitive stonith-SSH stonith:ssh \
params hostlist="gihub1 gihub2"
primitive DRBD-Disk ocf:linbit:drbd \
params drbd_resource=gihub_disk \
op monitor interval=59s role=Master timeout=30s \
op monitor interval=60s role=Slave timeout=30s
primitive FileSystem ocf:heartbeat:Filesystem \
params fstype=ext3 directory=/www device=/dev/drbd0 \
op monitor interval=30s timeout=15s \
meta migration-threshold=10
primitive ProFTPD lsb:proftpd \
op monitor interval=20s timeout=10s \
meta migration-threshold=10
primitive Tomcat lsb:tomcat \
op monitor interval=20s timeout=10s \
meta migration-threshold=10
primitive Virtual-IP ocf:heartbeat:IPaddr2 \
params ip=2.21.4.45 broadcast=2.255.255.255 nic=eth0 cidr_netmask=8 \
op monitor interval=30s timeout=15s \
meta migration-threshold=10
group Resource-Group FileSystem ProFTPD Tomcat Virtual-IP
ms MS-DRBD-Disk DRBD-Disk \
meta clone-max=2 notify=true globally-unique=false
clone STONITH-clone stonith-SSH
location DRBD-Master-Prefers-GIHub1 MS-DRBD-Disk \
rule $id=drbd_loc_rule $role=master 100: #uname eq gihub1
colocation Resource-Group-With-DRBD-Master inf: Resource-Group MS-DRBD-Disk:Master
order Start-DRBD-Before-Filesystem inf: MS-DRBD-Disk:promote FileSystem:start
property $id=cib-bootstrap-options \
dc-version=1.0.3-0080ec086ae9c20ad5c4c3562000c0ad68374f0a \
expected-quorum-votes=2 \
no-quorum-policy=ignore \
start-failure-is-fatal=false \
stonith-action=poweroff \
last-lrm-refresh=1250615730 \
stonith-enabled=false

 

 

I had assumed (and I'm sure it worked like this before) that if I reboot
gihub1, all the resources should instead start on gihub2. I have tried
with stonith-enabled=true, which doesn't seem to change anything. Can
anyone see from my config or the attached messages log what is going on?
I've compiled DRBD 8.3.2 and I'm using the new Linbit DRBD RA. I'll try
using the heartbeat RA in the meantime.

 

Many thanks

Darren Mansell

___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


[Pacemaker] Temporarily Stop Cloned Resource on 1 Node

2009-08-07 Thread Darren.Mansell
Hello all.

 

I have a cloned resource that I need to stop temporarily on one node. Am
I missing something quite obvious? I can't figure out how to do it
without reconfiguring the CIB.
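
(The usual trick seems to be a temporary -inf location constraint rather than a
CIB rework - a sketch, with made-up resource and node names:

crm configure location ban-myclone-node1 myclone-clone -inf: node1
# ...do the maintenance...
crm configure delete ban-myclone-node1

which stops just that node's clone instance and lets it start again once the
constraint is removed.)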

 

Pacemaker 1.0.3 on SLES 11.

 

Thanks.

Darren

___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] setup for flexlm lmgrd failover

2009-07-02 Thread Darren.Mansell
 
 Hi
 
 I'm looking for a How-To on setting up Pacemaker for a failover pair
 of SUSE 10.2 FlexLM license manager servers, for both Portland Group
 and Intel compiler floating licenses.
 
 mac address take-over then start lmgrd etc.
 
 Many thanks
 
 Jonathan
 
 ___

Hello. Can I just suggest you use SLES 11 HAE instead of SLES 10.2? The
former works much better for HA.
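
The usual shape of it is an IPaddr2 resource grouped with whatever starts
lmgrd - a minimal sketch, assuming an LSB init script for lmgrd exists, with
illustrative addresses:

primitive lm-ip ocf:heartbeat:IPaddr2 \
        params ip=10.0.0.50 cidr_netmask=24 \
        op monitor interval=30s
primitive lm-daemon lsb:lmgrd \
        op monitor interval=30s
group lm-group lm-ip lm-daemon

FlexLM clients then point at the virtual IP (or a hostname resolving to it);
the gratuitous ARP sent by IPaddr2 takes care of the address move.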

Darren

___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker