Re: [Linux-HA] type: pseduo

2011-07-04 Thread Andrew Beekhof
On Mon, Jul 4, 2011 at 7:06 PM, Ulrich Windl
ulrich.wi...@rz.uni-regensburg.de wrote:
 Hi,

 found this syslog message on a SLES11 SP1 system:
 Jul  4 10:55:14 rksaph02 crmd: [11517]: WARN: print_elem:     [Action 83]: 
 Pending (id: grp_t11_as2_stopped_0, type: pseduo, priority: 2570)

 I guess the type should be pseudo...

yes
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Q: ms-resources and grouping

2011-07-01 Thread Andrew Beekhof
On Thu, Jun 30, 2011 at 7:41 PM, Ulrich Windl
ulrich.wi...@rz.uni-regensburg.de wrote:
 Hi!

 I have a question: when I want to have a filesystem on a logical volume, 
 where the VG is on a RAID1, I would typically have three resources to handle 
 that. Now if I wish to have a clone or ms resource, how could I connect the 
 resources so that the resource's nodes find the desired filesystems?

 order and colocation?

Correct :)
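
For the archives, a minimal sketch of what that looks like in the crm shell
(all resource names, devices and mount points below are made up):

  primitive p_raid1 ocf:heartbeat:Raid1 \
      params raidconf="/etc/mdadm.conf" raiddev="/dev/md0"
  primitive p_lvm ocf:heartbeat:LVM \
      params volgrpname="vg_data"
  primitive p_fs ocf:heartbeat:Filesystem \
      params device="/dev/vg_data/lv_data" directory="/srv/data" fstype="ext3"
  # start order RAID -> VG -> FS, and keep all three on the same node
  order o_lvm_after_raid inf: p_raid1 p_lvm
  order o_fs_after_lvm inf: p_lvm p_fs
  colocation c_lvm_with_raid inf: p_lvm p_raid1
  colocation c_fs_with_lvm inf: p_fs p_lvm

Against a clone or master/slave resource the same idea applies, e.g.
"colocation c_fs_on_master inf: p_fs ms_storage:Master" plus an order on
ms_storage:promote.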
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Help for the floating IPaddress monitoing

2011-07-01 Thread Andrew Beekhof
On Thu, Jun 30, 2011 at 3:47 PM, 徐斌 robin@163.com wrote:
 Hi Gent,

 I want the floating IP to come back up after restarting the network.
 But I ran into an issue when I enabled monitoring for the floating IP (using
 ocf:heartbeat:IPaddr2).

 [root@master ~]# crm configure show ip2
 primitive ip2 ocf:heartbeat:IPaddr2 \
    params ip=172.20.33.88 nic=eth1 iflabel=0 cidr_netmask=255.255.255.0 \
    op monitor interval=10s

 The primary IP address configured on eth1 was lost, and worse, I cannot bring
 the NIC's address back up until I stop heartbeat.
 eth1      Link encap:Ethernet  HWaddr 08:00:27:11:87:63
          inet6 addr: fe80::a00:27ff:fe11:8763/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:127723 errors:0 dropped:0 overruns:0 frame:0
          TX packets:7288 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:11752527 (11.2 MiB)  TX bytes:1331914 (1.2 MiB)

 eth1:0    Link encap:Ethernet  HWaddr 08:00:27:11:87:63
          inet addr:172.20.33.88  Bcast:172.20.33.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1

 [root@master ~]# ifup eth1
 [root@master ~]# ifconfig eth1
 eth1      Link encap:Ethernet  HWaddr 08:00:27:11:87:63
          inet6 addr: fe80::a00:27ff:fe11:8763/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:127956 errors:0 dropped:0 overruns:0 frame:0
          TX packets:7362 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:11773863 (11.2 MiB)  TX bytes:1341574 (1.2 MiB)

 I think there may be a race between '/etc/init.d/network' and 'pacemaker':
 if 'pacemaker' brings up eth1:0 first, then the network script will not set
 the IP address for eth1.

Right, the resource agent assumes the device is always up and only
adds/removes aliases.


 Does anyone else have this issue? And is there any other way to restart the
 floating IP without enabling the monitor operation?

 Regards,
 -robin


 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] Question on order rule

2011-06-30 Thread Andrew Beekhof
On Thu, Jun 9, 2011 at 4:58 AM, Alessandro Iurlano
alessandro.iurl...@gmail.com wrote:
 Hello.

 I'm trying to set up a highly available OpenVZ cluster. As OpenVZ only
 supports disk quotas on ext3/4 local filesystems (nfs and gfs/ocfs2
 don't work), I have set up two iscsi volumes on a highly available
 storage array where the VMs will be stored.
 I would like to use three servers, two active and one spare, for this
 cluster. That's because OpenVZ expects all the VMs to be under the /vz or
 /var/lib/vz directory. This leads to the constraint that each server
 can have only one iscsi volume attached and mounted on /vz.
 Also, as ext3 is not a cluster filesystem, each iscsi volume
 has to be mounted on a single server at a time.

 So I need to map two iscsi volumes to three servers. I have created a
 pacemaker configuration with two groups, group-iscsi1 and
 group-iscsi2, that take care of connecting the iscsi devices and
 mounting the filesystems on /vz. A negative colocation directive
 forbids the two groups from being active on the same cluster node at
 the same time.

 So far things are working. The problem is with the resource that
 controls OpenVZ. It is a lsb:vz primitive that needs to be a clone (I
 can't make it into separate lsb-vz1 or lsb-vz2 primitives as the
 cluster sees both active on every node because they refer to the same
 /etc/init.d/vz script).
 After creating the clone clone-vz, I defined the location constraints
 as location vz-on-iscsi inf: group-iscsi1 clone-vz and did the same
 for group-iscsi2.
 The clone resource needs to be started after filesystems are mounted,
 so I need two order constraints like order vz-after-fs1 inf:
 group-iscsi1:start clone-vz:start and order vz-after-fs2 inf:
 group-iscsi2:start clone-vz:start
 This works perfectly when both iscsi volumes are up. But if one of
 them is stopped, clone-vz is not starting. I guess this is because the
 two order constraints create dependencies for clone-vz on both the
 iscsi groups so that the clone is started only when group-iscsi1 AND
 group-iscsi2 have started.

 Is there a way I can tell pacemaker to start clone-vz on a node if
 that node has resource group-iscsi1 OR group-iscsi2?

Not yet, I'm afraid.  Resource sets will eventually allow this, but I
don't think they're there yet.

 The complete pacemaker configuration is here:
 http://nopaste.info/8c2ba79159.html

 Thanks,
 Alessandro
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Web resource monitoring

2011-06-29 Thread Andrew Beekhof
On Mon, Jun 27, 2011 at 3:19 AM, Maxim Ianoglo dot...@gmail.com wrote:
 Hello,

 The http monitoring code should be split off from the apache RA.
 Then a simple stateless (see the Dummy RA for a sample) RA, say
 httpmon, can be created which would source the http monitoring.
 Patches accepted! Guidance and constructive critique offered :)
 Ok, thank you for the suggestion on the name of the RA :)
 Here is what I wrote: 
 https://github.com/dotNox/heartbeat_resources/blob/master/httpmon
 Small description is available at: 
 http://dotnox.net/2011/06/multiple-ha-resources-based-on-same-service-heartbeat-httpmon-ra/
 I did not want to patch or change anything in the apache RA, because someone
 who does not have overlapping resources like mine will find it easier to just
 use the apache RA.

So if apache fails, how does this agent organise recovery?
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] serial cable or ethnet cable for heartbeat, which one is better?

2011-06-26 Thread Andrew Beekhof
On Mon, Jun 27, 2011 at 3:03 AM, Hai Tao taoh...@hotmail.com wrote:

 Which one is better for heartbeat, a serial cable or a dedicated ethernet 
 cable?

 Can the bandwidth of a serial cable be a bottleneck?

If you're running pacemaker - yes.

 How much data is transferred on the heartbeat link?


 Thanks.

 Hai Tao




 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] CIB process quits and could not connect to CRM

2011-06-26 Thread Andrew Beekhof
On Mon, May 16, 2011 at 11:36 PM, Mateusz Kalisiak
mateusz.kalis...@gmail.com wrote:
 Hello,

 I'm struggling with the same problem on RHEL 6. Does anyone have an idea
 how to solve this?
 Any help would be appreciated.

You'd need to provide more details than that.
Have you tried reading the logs?


 Best Regards,
 Mateusz
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Best way for colocating resource on a dual primary drbd

2011-06-26 Thread Andrew Beekhof
On Mon, May 16, 2011 at 5:38 PM, RaSca ra...@miamammausalinux.org wrote:
 On Mon 16 May 2011 09:01:08 CET, Andrew Beekhof wrote:
 [...]

 Does it imply that once the resource goes away it becomes a slave?

 Pretty sure this is a bug in 1.0.
 Have you tried 1.1.5 ?

 Not yet, but Andrew, are you saying that keeping the colocation, even with a
 dual-primary drbd, is the best thing to do?

Yes.
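
For reference, the usual pair of constraints for this case, using the resource
names from the earlier post (the order constraint is an assumption about how
the VM should be started, not something taken from the original configuration):

  colocation vm-test_virtualdomain_ON_vm-test_ms-r0 inf: \
      vm-test_virtualdomain vm-test_ms-r0:Master
  order vm-test_virtualdomain_AFTER_vm-test_ms-r0 inf: \
      vm-test_ms-r0:promote vm-test_virtualdomain:start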


 --
 RaSca
 Mia Mamma Usa Linux: Nothing is impossible to understand, if you explain it well!
 ra...@miamammausalinux.org
 http://www.miamammausalinux.org


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Colocation of VIP and httpd

2011-06-26 Thread Andrew Beekhof
On Wed, Jun 1, 2011 at 12:04 PM, 吴鸿宇 whyfo...@gmail.com wrote:
 Thank you for your reply.

 My requirement is like this:
 The httpd service runs on every node in the cluster and is monitored by
 watchdog. The VIP only runs on one node at a time. Heartbeat will check the
 status of httpd on each node and make sure the VIP runs on a node that has
 httpd running. Note that Heartbeat is not expected to control the httpd
 service, only to monitor it.

 Say I have Heartbeat, then I have the following configuration questions:
 1) Should I use clone for monitoring httpd?

No, you should clone the httpd service.
Each instance of the clone is responsible for monitoring itself.
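
A minimal sketch of that layout in the crm shell (IP address and paths are
placeholders, and this assumes a Pacemaker/crm-based configuration):

  primitive p_httpd ocf:heartbeat:apache \
      params configfile="/etc/httpd/conf/httpd.conf" \
      op monitor interval="30s"
  clone cl_httpd p_httpd
  primitive p_vip ocf:heartbeat:IPaddr2 \
      params ip="192.168.1.100" cidr_netmask="24" \
      op monitor interval="10s"
  # keep the VIP on a node that has a healthy httpd instance
  colocation c_vip_with_httpd inf: p_vip cl_httpd
  order o_httpd_before_vip inf: cl_httpd p_vip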

 2) Which value should I specify for the httpd service's failure action:
 fence, block, or something else?

It depends what you want. Try reading the documentation for those options.


 Is the combination clone+action+colocation enough for the requirement above?
 If not, what other special configuration do I need?

 Thank you for any advice!
 Hongyu

 On Tue, May 24, 2011 at 12:48 AM, RaSca ra...@miamammausalinux.org wrote:

 On Thu 19 May 2011 19:25:54 CET, 吴鸿宇 wrote:

  Hi All,
 I have a 2 node cluster. My intention is ensuring the VIP is always on the
 node that has httpd running, i.e. if service httpd on the VIP node is
 stopped and fails to start, the VIP should switch to the other node.
 With the configuration below, I observed that when httpd stops and fails to
 start, the VIP is also stopped but is not switched to the other node, which
 has a healthy httpd. I appreciate any ideas.

 [...]

 Some questions:
 Why is httpd cloned? Are you sure you want INFINITY stickiness? Are the logs
 saying anything helpful?

 Anyway, like Nikita said, consider upgrading Heartbeat to version 3.

 --
 RaSca
 Mia Mamma Usa Linux: Nothing is impossible to understand, if you explain it well!
 ra...@miamammausalinux.org
 http://www.miamammausalinux.org


 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] need help on email alerts

2011-06-26 Thread Andrew Beekhof
On Sun, Jun 5, 2011 at 8:53 PM, Amit Jathar amit.jat...@alepo.com wrote:
 Hi,

 I have configured email alerts for corosync as follows :-
 Crm configure show
 ---SNIP-
 primitive resMON ocf:pacemaker:ClusterMon \
        operations $id=resMON-operations \
        op monitor interval=180 timeout=20 \
        params extra_options=--mail-to x...@gmail.com
 ---SNIP---

 I can see this resource is started :-
 crm_mon -1
 ---SNIP
 resMON (ocf::pacemaker:ClusterMon):    Started xx
 ---SNIP

 I can send mail from my machine :-
 [root@localhost] mail -s testmail xx

We don't rely on a local mail server; instead we use libesmtp.
You'll need to make sure that is configured - or call mail from a
script referenced by --external-agent.
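
A minimal sketch of the external-agent variant (the script path is made up;
the script itself would call mail and can read the CRM_notify_* environment
variables that crm_mon sets for the agent):

  primitive resMON ocf:pacemaker:ClusterMon \
      params extra_options="--external-agent /usr/local/bin/crm_notify.sh" \
      op monitor interval="180" timeout="20"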

 .
 Cc:
 Null message body; hope that's ok

 I do not get any mails when my cluster status changes. I could not see
 anything in /var/log/maillog either.

 Is there any hint as to what configuration I am missing?

 Thanks,
 Amit


 
 


 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Virtual mysql cluster ip is not accessible on port 3306

2011-06-26 Thread Andrew Beekhof
On Thu, Jun 23, 2011 at 12:06 AM, Calistus Che calistus@gmail.com wrote:
 Hi Guys,

 could any one of you help me?

 I just set up 2 load balancers (master and slave) and 2 mysql cluster nodes, db1 and db2.

 The servers have 2 interfaces, private and public, and load balancing is
 running on the private network.

 Everything has been running pretty much fine until now; the only problem is
 access to the virtual IP.

 I would greatly appreciate your help.

Based on what?


 Regards

 KC
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Always Get a Billion Failed Actions

2011-06-26 Thread Andrew Beekhof
On Thu, Jun 16, 2011 at 8:38 PM, Robinson, Eric eric.robin...@psmnv.com wrote:
 crm_mon on my system displays a lot of failed actions, I guess because
 the init script for the resource is not fully lsb compliant?

 In any case, the resources seem to work okay and failover okay.

 How can I get rid of all those failed actions?

This is the cluster detecting that RAs don't exist on those nodes.
I think we added some extra logic to 1.1 that hid these when
symmetric-cluster=false was specified.
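
A sketch of that approach, using the node and group names from the output
below (the scores and the choice of backup node are assumptions about the
intended layout):

  property symmetric-cluster="false"
  location loc_clust04_primary g_clust04 200: ha07a.mydomain.com
  location loc_clust04_backup g_clust04 100: ha07b.mydomain.com

After that, "crm resource cleanup g_clust04" clears the failure history that
has already accumulated.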


 crm_mon output follows...


 
 Last updated: Thu Jun 16 03:32:32 2011
 Stack: Heartbeat
 Current DC: ha07b.mydomain.com (6080642c-bad3-4bb8-80ba-db6b1f7a0735) -
 partition with quorum
 Version: 1.0.9-89bd754939df5150de7cd76835f98fe90851b677
 3 Nodes configured, unknown expected votes
 4 Resources configured.
 

 Online: [ ha07c.mydomain.com ha07b.mydomain.com ha07a.mydomain.com ]

  Resource Group: g_clust04
     p_fs_clust04       (ocf::heartbeat:Filesystem):    Started ha07a.mydomain.com
     p_vip_clust04      (ocf::heartbeat:IPaddr2):       Started ha07a.mydomain.com
     p_mysql_001        (lsb:mysql_001):        Started ha07a.mydomain.com
     p_mysql_230        (lsb:mysql_230):        Started ha07a.mydomain.com
     p_mysql_231        (lsb:mysql_231):        Started ha07a.mydomain.com
     p_mysql_232        (lsb:mysql_232):        Started ha07a.mydomain.com
     p_mysql_233        (lsb:mysql_233):        Started ha07a.mydomain.com
     p_mysql_234        (lsb:mysql_234):        Started ha07a.mydomain.com
     p_mysql_235        (lsb:mysql_235):        Started ha07a.mydomain.com
     p_mysql_236        (lsb:mysql_236):        Started ha07a.mydomain.com
     p_mysql_237        (lsb:mysql_237):        Started ha07a.mydomain.com
     p_mysql_238        (lsb:mysql_238):        Started ha07a.mydomain.com
     p_mysql_239        (lsb:mysql_239):        Started ha07a.mydomain.com
     p_mysql_240        (lsb:mysql_240):        Started ha07a.mydomain.com
     p_mysql_241        (lsb:mysql_241):        Started ha07a.mydomain.com
     p_mysql_242        (lsb:mysql_242):        Started ha07a.mydomain.com
     p_mysql_243        (lsb:mysql_243):        Started ha07a.mydomain.com
     p_mysql_244        (lsb:mysql_244):        Started ha07a.mydomain.com
     p_mysql_245        (lsb:mysql_245):        Started ha07a.mydomain.com
     p_mysql_246        (lsb:mysql_246):        Started ha07a.mydomain.com
     p_mysql_247        (lsb:mysql_247):        Started ha07a.mydomain.com
     p_mysql_248        (lsb:mysql_248):        Started ha07a.mydomain.com
     p_mysql_249        (lsb:mysql_249):        Started ha07a.mydomain.com
     p_mysql_250        (lsb:mysql_250):        Started ha07a.mydomain.com
     p_mysql_251        (lsb:mysql_251):        Started ha07a.mydomain.com
     p_mysql_252        (lsb:mysql_252):        Started ha07a.mydomain.com
     p_mysql_253        (lsb:mysql_253):        Started ha07a.mydomain.com
     p_mysql_254        (lsb:mysql_254):        Started ha07a.mydomain.com
     p_mysql_255        (lsb:mysql_255):        Started ha07a.mydomain.com
     p_mysql_256        (lsb:mysql_256):        Started ha07a.mydomain.com
     p_mysql_257        (lsb:mysql_257):        Started ha07a.mydomain.com
     p_mysql_258        (lsb:mysql_258):        Started ha07a.mydomain.com
     p_mysql_259        (lsb:mysql_259):        Started ha07a.mydomain.com
     p_mysql_260        (lsb:mysql_260):        Started ha07a.mydomain.com
     p_mysql_261        (lsb:mysql_261):        Started ha07a.mydomain.com
     p_mysql_262        (lsb:mysql_262):        Started ha07a.mydomain.com
     p_mysql_263        (lsb:mysql_263):        Started ha07a.mydomain.com
     p_mysql_264        (lsb:mysql_264):        Started ha07a.mydomain.com
     p_mysql_265        (lsb:mysql_265):        Started ha07a.mydomain.com
     p_mysql_266        (lsb:mysql_266):        Started ha07a.mydomain.com
     p_mysql_267        (lsb:mysql_267):        Started ha07a.mydomain.com
     p_mysql_268        (lsb:mysql_268):        Started ha07a.mydomain.com
     p_mysql_269        (lsb:mysql_269):        Started ha07a.mydomain.com
     p_mysql_270        (lsb:mysql_270):        Started ha07a.mydomain.com
     p_mysql_271        (lsb:mysql_271):        Started ha07a.mydomain.com
     p_mysql_272        (lsb:mysql_272):        Started ha07a.mydomain.com
     p_mysql_273        (lsb:mysql_273):        Started ha07a.mydomain.com
     p_mysql_274        (lsb:mysql_274):        Started ha07a.mydomain.com
     p_mysql_275        (lsb:mysql_275):        Started ha07a.mydomain.com
     p_mysql_276        (lsb:mysql_276):        Started ha07a.mydomain.com
     p_mysql_277        (lsb:mysql_277):        Started ha07a.mydomain.com
     p_mysql_009        (lsb:mysql_009):        Started ha07a.mydomain.com
     p_mysql_021        (lsb:mysql_021):        Started ha07a.mydomain.com
     p_mysql_052

Re: [Linux-HA] crm_report versus hb_report

2011-06-26 Thread Andrew Beekhof
Can you run it with -x and send me the screen output please?
It should be copying /var/log/syslog to the collector directory before
trying to call node_events on it.

On Fri, Jun 17, 2011 at 6:04 PM,  alain.mou...@bull.net wrote:
 Hi Andrew,
 ok thanks.
 So I tried it and got an error msg in the collector script around the
 node_events call :
 node_events `basename $logfile` >> $EVENTS_F
 and it outputs:
 grep: syslog: No such file or directory

 whereas when I trace things around it:
 # Parse for events
 echo $logfile
 echo $EXTRA_LOGS
 for l in $logfile $EXTRA_LOGS; do
    node_events `basename $logfile` >> $EVENTS_F

 I get logfile : /var/log/syslog
 and the file does exist:
 -rw-r--r-- 1 root root 10549069 Jun 17 09:40 /var/log/syslog
 and variable EXTRA_LOGS is empty

 So the call seems to be :
 node_events syslog >> $EVENTS_F

 So at the end, in the whole report, I get a cluster-log.txt linked to
 a syslog file which does not exist.

 I tried to modify the line, substituting it with:
 node_events /var/log/syslog >> $EVENTS_F
 and it no longer displays the grep: syslog: No such file or
 directory error.

 but I don't know if it is really the right fix (in both cases I got an
 empty events.txt, but perhaps that is a coincidence ...)

 Any idea ?

 Thanks
 Alain



 From:    Andrew Beekhof and...@beekhof.net
 To:      General Linux-HA mailing list linux-ha@lists.linux-ha.org
 Date:    17/06/2011 09:31
 Subject: Re: [Linux-HA] crm_report versus hb_report
 Sent by: linux-ha-boun...@lists.linux-ha.org



 On Fri, Jun 17, 2011 at 5:30 PM, Andrew Beekhof and...@beekhof.net
 wrote:
 On Fri, Jun 17, 2011 at 4:47 PM,  alain.mou...@bull.net wrote:
 Hi,

  I just discovered that on RHEL6 there is no more hb_report; it has been
  removed from the cluster-glue rpm.
  Does the crm_report delivered in the pacemaker rpm give the same results
  as hb_report?

 Yes.  It re-uses much of the same gathering code but with a slightly
 revised design.

 In fact it's also flag-compatible with hb_report.
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems

 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] heartbeat three node configuration

2011-06-26 Thread Andrew Beekhof
On Thu, Jun 9, 2011 at 11:54 PM, Ricardo F ri...@hotmail.com wrote:

 What is the configuration to create a three-node cluster?

Essentially you need Pacemaker on top.
haresources-based clusters were only designed for 2 nodes.
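
Concretely, that means keeping ha.cf for membership only and letting Pacemaker
manage the resources. A sketch, assuming the pacemaker packages are installed
alongside heartbeat 3.0.x:

  # ha.cf -- same membership settings as before, plus:
  crm respawn        # or 'pacemaker respawn' on newer heartbeat builds
  node host1
  node host2
  node host3

haresources is then no longer used; the shared IP becomes a Pacemaker
resource instead, e.g.:

  crm configure primitive p_ip ocf:heartbeat:IPaddr2 \
      params ip="192.168.1.10" cidr_netmask="24" nic="bond0" \
      op monitor interval="10s"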

 I have this, but the servers bring up the shared IP at the same time:

 ha.cf:
 logfacility local0
 keepalive 2
 deadtime 10
 warntime 5
 initdead 30
 auto_failback off
 ucast bond0 host1 host2 host3
 node host1
 node host2
 node host3

 haresources:
 host1 192.168.1.10/24/bond0

 I use heartbeat 3.0.3 on Debian squeeze on all of the nodes; they all have
 the other nodes' IPs in /etc/hosts and I can propagate the conf with
 ha_propagate.

 Thanks
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] ClusterIP clone resource failover and migration issue

2011-06-26 Thread Andrew Beekhof
On Mon, Jun 6, 2011 at 10:22 PM, Randy Wilson randyedwil...@gmail.com wrote:
 Hi,

 I've setup two ClusterIP instances on a two node cluster using the
 below configuration:

 node node1.domain.com
 node node2.domain.com
 primitive clusterip_33 ocf:heartbeat:IPaddr2 \
     params ip=xxx.xxx.xxx.33 cidr_netmask=27 nic=eth0:10 \
     clusterip_hash=sourceip-sourceport-destport mac=01:XX:XX:XX:XX:XX
 primitive clusterip_34 ocf:heartbeat:IPaddr2 \
     params ip=xxx.xxx.xxx.34 cidr_netmask=27 nic=eth0:11 \
     clusterip_hash=sourceip-sourceport-destport mac=01:XX:XX:XX:XX:XX
 clone clone_clusterip_33 clusterip_33 \
     meta globally-unique=true clone-max=2 clone-node-max=2 \
     notify=true target-role=Started \
     params resource-stickiness=0
 clone clone_clusterip_34 clusterip_34 \
     meta globally-unique=true clone-max=2 clone-node-max=2 \
     notify=true target-role=Started \
     params resource-stickiness=0
 property $id=cib-bootstrap-options \
     dc-version=1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd \
     cluster-infrastructure=openais \
     stonith-enabled=false \
     expected-quorum-votes=2 \
     last-lrm-refresh=1307352624

 The resources start up on each node, with the correct iptables rules
 being assigned.

 
 Last updated: Mon Jun  6 11:29:24 2011
 Stack: openais
 Current DC: node1.domain.com - partition with quorum
 Version: 1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd
 2 Nodes configured, 2 expected votes
 2 Resources configured.
 

 Online: [ node1.domain.com node2.domain.com ]

  Clone Set: clone_clusterip_33 (unique)
  clusterip_33:0    (ocf::heartbeat:IPaddr2):    Started node1.domain.com
  clusterip_33:1    (ocf::heartbeat:IPaddr2):    Started node2.domain.com
  Clone Set: clone_clusterip_34 (unique)
  clusterip_34:0    (ocf::heartbeat:IPaddr2):    Started node1.domain.com
  clusterip_34:1    (ocf::heartbeat:IPaddr2):    Started node2.domain.com

 I receive an error whenever I attempt to migrate one of the resources,
 so that a single node handles all the ClusterIP traffic.

 crm(live)resource# migrate clusterip_33:1 node1.domain.com
 Error performing operation: Update does not conform to the configured 
 schema/DTD

You can't (yet) migrate individual instances.  Although
   migrate clusterip_33 node1.domain.com
might still do what you want.
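
If that does work, keep in mind that migrate adds a location constraint which
has to be removed again later; a sketch:

  crm resource migrate clusterip_33 node1.domain.com
  # ... later, when node2 may run an instance again:
  crm resource unmigrate clusterip_33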

 crm(live)resource# migrate clusterip_34:1 node1.domain.com
 Error performing operation: Update does not conform to the configured 
 schema/DTD

 And when one of the nodes is taken offline, by stopping corosync, the
 resources are stopped on the remaining node and cannot be started
 without the other node being brought back online.

 
 Last updated: Mon Jun  6 12:42:21 2011
 Stack: openais
 Current DC: node1.domain.com - partition WITHOUT quorum
 Version: 1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd
 2 Nodes configured, 2 expected votes
 2 Resources configured.
 

 Online: [ node1.domain.com ]
 OFFLINE: [ node2.domain.com ]

 If I add a colocation to the config:

 colocation coloc_clusterip inf: clone_clusterip_33 clone_clusterip_34

 When the offline node is brought back up, all the resources are
 started on the other node.

 
 Last updated: Mon Jun  6 13:00:39 2011
 Stack: openais
 Current DC: node1.domain.com - partition WITHOUT quorum
 Version: 1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd
 2 Nodes configured, 2 expected votes
 2 Resources configured.
 

 Online: [ node1.domain.com node2.domain.com ]

  Clone Set: clone_clusterip_33 (unique)
  clusterip_33:0    (ocf::heartbeat:IPaddr2):    Started node1.domain.com
  clusterip_33:1    (ocf::heartbeat:IPaddr2):    Started node1.domain.com
  Clone Set: clone_clusterip_34 (unique)
  clusterip_34:0    (ocf::heartbeat:IPaddr2):    Started node1.domain.com
  clusterip_34:1    (ocf::heartbeat:IPaddr2):    Started node1.domain.com

 Can anyone see where I'm going wrong with this?


 Many thanks,

 REW
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] using the pacemaker logo for the xing group

2011-06-21 Thread Andrew Beekhof
On Tue, Jun 21, 2011 at 5:22 PM, Keisuke MORI keisuke.mori...@gmail.com wrote:
 Hi Erkan,

 As I already said in a personal email to you, and as Ikeda-san replied,
 anybody may use the logo in conjunction with any Pacemaker / Linux-HA
 related project.

 The logo is a contribution from the Japanese Pacemaker / Linux-HA community,
 so asking for permission on the Japanese mailing list as you did is
 right, but here is also fine.

Although its presence does imply some degree of official connection
with the project, so it's nice to ask here too :-)


 You can obtain the logo from here: (sorry it's in Japanese)
 http://linux-ha.sourceforge.jp/wp/archives/369

 Regards,
 Keisuke MORI
 Linux-HA Japan Project.

 2011/6/21 Junko IKEDA tsukishima...@gmail.com:
 Hi Erkan,

 The Pacemaker logos were created by the NTT group.
 I asked for the boss's permission,
 so I think I can send them to you directly soon :)

 Did you post a similar mail to the Japanese mailing list before this?
 Sorry for the inconvenience.

 Thanks,
 Junko IKEDA

 NTT DATA INTELLILINK CORPORATION

 2011/6/20 erkan yanar erkan.ya...@linsenraum.de:

 Moin,

 I would like to use the (red/rabbit) pacemaker logo for the linux cluster 
 group in xing.
 Who do I have to ask for permission to use it?

 Regards
 Erkan


 --
 beyond the borders, freedom must surely be cloudless

 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems

 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems




 --
 Keisuke MORI
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] crm_report versus hb_report

2011-06-17 Thread Andrew Beekhof
On Fri, Jun 17, 2011 at 4:47 PM,  alain.mou...@bull.net wrote:
 Hi,

 I just discovered that on RHEL6 there is no more hb_report; it has been removed
 from the cluster-glue rpm.
 Does the crm_report delivered in the pacemaker rpm give the same results as
 hb_report?

Yes.  It re-uses much of the same gathering code but with a slightly
revised design.

 Are there other tools for traces, or is it sufficient in all cases for a
 Pacemaker/corosync stack?

 Thanks
 Alain
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] crm_report versus hb_report

2011-06-17 Thread Andrew Beekhof
On Fri, Jun 17, 2011 at 5:30 PM, Andrew Beekhof and...@beekhof.net wrote:
 On Fri, Jun 17, 2011 at 4:47 PM,  alain.mou...@bull.net wrote:
 Hi,

  I just discovered that on RHEL6 there is no more hb_report; it has been removed
  from the cluster-glue rpm.
  Does the crm_report delivered in the pacemaker rpm give the same results as
  hb_report?

 Yes.  It re-uses much of the same gathering code but with a slightly
 revised design.

In fact it's also flag-compatible with hb_report.
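
For anyone new to it, a typical invocation therefore looks the same as it did
with hb_report (times and destination here are just placeholders):

  crm_report -f "2011-06-17 06:00" -t "2011-06-17 10:00" /tmp/outage-report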
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Status about the four stack options

2011-05-26 Thread Andrew Beekhof
On Tue, May 24, 2011 at 9:54 AM,  alain.mou...@bull.net wrote:
 Hi
 Many thanks for this status. I suppose the status is the same on RHEL6, as
 SUSE is likely to be ahead of RHEL6 with regard to Pacemaker & corosync
 evolution?

This is not implied.

On RHEL6, although Pacemaker is not officially supported yet, people
are encouraged to use option 2 or 3 due to the improved
startup/shutdown reliability.

I'd imagine option 4 is about a year or so away.


 Alain



 From:    Lars Marowsky-Bree l...@suse.de
 To:      General Linux-HA mailing list linux-ha@lists.linux-ha.org
 Date:    23/05/2011 13:10
 Subject: Re: [Linux-HA] Status about the four stack options
 Sent by: linux-ha-boun...@lists.linux-ha.org



 On 2011-05-23T10:49:16, alain.mou...@bull.net wrote:

 Hi

 I just wonder about the status of the 4 stack options:
 from which releases of Pacemaker & corosync are options 3 and 4
 available, and on which distribution? RHEL6?

 1.      corosync + pacemaker plugin (v0)

 This is what SUSE Linux Enterprise High-Availability Extension uses, and
 is fully supported.

 2.      corosync + pacemaker plugin (v1) + mcp

 We may switch to this at a later time during the SLE HA cycle.

 3.      corosync + cpg + cman + mcp
 4.      corosync + cpg + quorumd + mcp

 On SLE, we're bound to skip 3, but 4) is probably somewhere in the very
 late future, once it is fully stabilized and integrated.


 Regards,
    Lars

 --
 Architect Storage/HA, OPS Engineering, Novell, Inc.
 SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix
 Imendörffer, HRB 21284 (AG Nürnberg)
 Experience is the name everyone gives to their mistakes. -- Oscar Wilde

 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems

 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] ocfs2

2011-05-26 Thread Andrew Beekhof
On Tue, May 24, 2011 at 2:33 PM, Eric Warnke ewar...@albany.edu wrote:

 Fedora 14 lacks dlm-pcmk since it has been deprecated.  Really
 frustrating, as whatprovides shows a file but yum install says nothing to
 do without installing it.  Most of the existing quick-start docs are
 therefore inapplicable, as they presume that you have dlm_controld.pcmk.

 https://www.redhat.com/archives/linux-cluster/2011-March/msg00084.html

 Overall I went back over a number of steps and found some errors and was
 able to get it up and running.

You need to use pacemaker + cman and the regular *_controld daemons.
See the 1.1 version of clusters from scratch up at clusterlabs.
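
Once cman provides dlm_controld (and ocfs2-tools-cman the o2cb pieces), the
mount itself is just a cloned Filesystem resource; a sketch with a made-up
device and mount point:

  primitive p_fs_ocfs2 ocf:heartbeat:Filesystem \
      params device="/dev/disk/by-label/shared" directory="/mnt/shared" fstype="ocfs2" \
      op monitor interval="20s" timeout="40s"
  clone cl_fs_ocfs2 p_fs_ocfs2 \
      meta interleave="true"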



 1) Went back over and reconfigured cman + pacemaker without corosync
 2) Since I had presumed that I would be integrating with pacemaker I had
 failed to install the ocfs2-tools-cman package
 3) Somewhere along the way I setup the cluster.conf with the full hostname
 leading to all sorts of fun with pacemaker listing 6 nodes rather than
 three.  crm configure erase nodes was able to clear that up once the
 cluster.conf files were stable.
 4) Once those two things were stable I was able to bring up o2cb and ocfs2
 clones under pacemaker ( my understanding is dlm is already up thanks to
 cman ).

 At this point I'll probably have to take a step back and try rebuilding
 this cluster to make sure I have the flow right.

 Am I correct in presuming that, short of membership and quorum in cman,
 pacemaker is where I configure STONITH and obviously all services?

 Cheers,
 Eric


 On 5/23/11 4:18 PM, asimonell...@gmail.com asimonell...@gmail.com
 wrote:

I found the following link extremely useful for setting up OCFS2 with
OpenAIS/Corosync:

http://www.novell.com/documentation/sle_ha

-Anthony
--Original Message--
From: Eric Warnke
Sender: linux-ha-boun...@lists.linux-ha.org
To: Linux-HA mailing list
ReplyTo: General Linux-HA mailing list
Subject: [Linux-HA] ocfs2
Sent: May 23, 2011 3:13 PM


I have been chasing my tail all day trying to get a simple 3 node cluster to
mount an ocfs2 filesystem over iscsi on Fedora 14.  Up until this morning it
was working wonderfully for testing HA NFSv4 where the filesystems were
non-clustered xfs volumes.

Is there any useful documentation for converting a simple corosync +
pacemaker installation to being able to mount an ocfs2 filesystem?

-Eric


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Sent via BlackBerry from T-Mobile
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] DO NOT start using heartbeat 2.x in crm mode, but just use Pacemaker, please! [was: managing resource httpd in heartbeat]

2011-05-19 Thread Andrew Beekhof
On Thu, May 19, 2011 at 12:36 PM, Lars Ellenberg
lars.ellenb...@linbit.com wrote:
 On Wed, May 18, 2011 at 05:21:35PM -0700, Vinay Nagrik wrote:
 Hello everybody,

 I am running CentOS 5.2 with heartbeat 2.1.3; we as a group run it on
 appliances, and *it is not readily possible to suddenly upgrade heartbeat to
 a later version which runs pacemaker*.

 Oh yes you can.  If you want to use the cib based crmd style
 configuration, you will have to.  Why do you think you can not?

 You cannot change the facts by asking the same question again.

 But if it helps to get the message across, please, anyone that wants to,
 feel free to add
        Confirmed by

The guy that wrote most of the stuff being discussed.

Seriously, are we still having this conversation?  It's two years since I wrote:
   http://www.mail-archive.com/linux-ha@lists.linux-ha.org/msg12684.html

and it's even more true now.
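
For comparison, the requirement described below (a VIP that follows a healthy
Apache on a two-node cluster) is only a few lines in a current Pacemaker
configuration; a sketch with placeholder values:

  primitive p_ip ocf:heartbeat:IPaddr2 \
      params ip="192.168.1.50" cidr_netmask="24" \
      op monitor interval="10s"
  primitive p_apache ocf:heartbeat:apache \
      params configfile="/etc/httpd/conf/httpd.conf" \
      op monitor interval="30s"
  group g_web p_ip p_apache
  # move the whole group away after one failed local recovery attempt
  rsc_defaults migration-threshold="1"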

 to this thread.

  ;-)


 My requirement as a developer is to explore possibilities to manage resources,
 especially the Apache web server, in such a fashion that, on a two-node cluster,
 if *the Apache service goes down on the active node*, then heartbeat should
 shut down that active node and transfer control to the other node, which has
 Apache serviceable.

 Could someone please point me to the literature where I can find the relevant
 configuration parameters?

 I know it can be done in a cib.xml file, which is part of heartbeat 2.1.3,
 and also with other utilities like cibadmin.

 If, for whatever (non-technical) reason, you think you are stuck
 with heartbeat 2.1.x, stay with haresources, possibly add mon to
 your setup, and hope for the best.

 If you want to go cib and crm, you absolutely have to use Pacemaker.
 Whether you then use Pacemaker on top of heartbeat (3.0.x) or corosync
 is another decision, and unaffected by this.

 But please accept that Pacemaker is the very same CRMD (cluster resource
 manager daemon) that came with heartbeat 2.1.x, only four (4!) years of
 bug fixing and development later.

 So if you insist on using some known to be buggy 4 year old piece of
 software, just because one component of it now goes by a different name,
 I'm sorry, but you should not expect much help.

 I think we told you that already?


 Still, you asked to be pointed to documentation.

 There is the section Documentation For Older Releases
 http://www.clusterlabs.org/wiki/Documentation

 Which links to some relevant docs about 2.1.x.

 Just in case you absolutely love to hurt yourself

  ;-)

 --
 : Lars Ellenberg
 : LINBIT | Your Way to High Availability
 : DRBD/HA support and consulting http://www.linbit.com

 DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Best way for colocating resource on a dual primary drbd

2011-05-16 Thread Andrew Beekhof
On Sat, May 14, 2011 at 9:31 AM, RaSca ra...@miamammausalinux.org wrote:
 On Fri 13 May 2011 16:09:14 CET, Viacheslav Biriukov wrote:
 In your case you have two drbd masters. So, I think, it is not a good
 idea to create that colocation. Instead, you can set location
 directives to place vm-test_virtualdomain where you want it by default.
 For example:
 location L_vm-test_virtualdomain_01 vm-test_virtualdomain 100: master1.node
 location L_vm-test_virtualdomain_02 vm-test_virtualdomain 10:   master2.node

 And I agree with your point of view (since I have tested that the colocation
 is not working). But the point is: why? I mean, the colocation defines that
 the resource must run on a node where drbd is Master. Why does Pacemaker
 put drbd into slave mode on the node where the migration starts? Does a
 colocation like this:

 colocation vm-test_virtualdomain_ON_vm-test_ms-r0 inf:
 vm-test_virtualdomain vm-test_ms-r0:Master

 imply that once the resource goes away it becomes a slave?

Pretty sure this is a bug in 1.0.
Have you tried 1.1.5 ?
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Behaviour when rebooting inactive node

2011-05-10 Thread Andrew Beekhof
On Mon, May 9, 2011 at 2:38 PM, Nicolas Guenette
nicol...@ho.reitmans.com wrote:
 Hello,

 I have a two node cluster and a question about the cluster's behaviour
 when I reboot the inactive node.

 The situation is this: if the resources are running on serverA and I
 reboot serverB, serverA re-acquires the resources when it detects
 serverB leaving. Meaning, it actually re-runs my startup scripts!
 That's not good...

 Is there any way to configure Linux-HA so that it doesn't behave like
 that when its other node is rebooted? How can I stop this
 re-acquiring of resources?

Hard to say without some idea of what version you're running and what
your config looks like.
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-ha-dev] Filesystem ocf file

2011-05-06 Thread Andrew Beekhof
On Fri, May 6, 2011 at 9:37 AM, Florian Haas florian.h...@linbit.com wrote:
 On 2011-05-06 09:26, Darren Thompson wrote:
 Team

 I was reviewing some errors on a cluster-mounted filesystem that caused
 me to review the Filesystem OCF file.

 I noticed that it uses an undeclared parameter, OCF_CHECK_LEVEL, to
 determine what degree of testing of the filesystem is required in monitor.

 I have now updated it to work more formally with a check_level value
 with the more obvious values of mounted, read & write (my updated
 version attached).

 Could someone (Florian is this something you can do?) please review this
 with a view to patching the upstream Filesystem ocf file.

 NACK, sorry. The OCF_CHECK_LEVEL is specific to the monitor action and
 described as such in the OCF spec; this will not be changed without a
 change to the spec.

 To use it, set op monitor interval=X OCF_CHECK_LEVEL=Y

 Yes, it's poorly designed, it makes no sense why this is pretty much the
 only sensible time to set a parameter specifically for an operation (as
 opposed to on a resource), it's inexplicable why it's all caps, etc.,
 but that's the way it is.

Honest. It was broken when we got here.  Maybe it was the neighbor's dog?
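
For the archives, this is what it looks like in practice in the crm shell
(device and mount point are placeholders); for the Filesystem agent, level 10
adds a read test and level 20 a write test:

  primitive p_fs ocf:heartbeat:Filesystem \
      params device="/dev/vg_data/lv_data" directory="/srv/data" fstype="ext3" \
      op monitor interval="20s" timeout="40s" OCF_CHECK_LEVEL="10" \
      op monitor interval="120s" timeout="60s" OCF_CHECK_LEVEL="20"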
___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


Re: [Linux-ha-dev] [ha-wg] Cluster Stack - Ubuntu Developer Summit

2011-05-05 Thread Andrew Beekhof
On Thu, May 5, 2011 at 10:25 AM, Florian Haas florian.h...@linbit.com wrote:
 On 2011-04-26 19:33, Andres Rodriguez wrote:
 UDS' are open-to-public events, and I believe it would be great if
 upstream could participate and maybe even further the discussion about
 the Cluster Stack. For more information about UDS, please visit [1]. The
 specific date/time for the Cluster Stack session is not yet available.

 If you require any further information please don't hesitate to contact me.

 Andres already knows this, but FWIW I'll repost here that I'll be at UDS
 in time for the cluster stack session at 12 noon on 5/12. I'll stay in
 Budapest that evening and will probably join the Budapest sightseeing
 tour that the Hungarian Ubuntu team is organizing, so if anyone wants to
 link up with Andres and me for a few beverages please let us know.

 Andrew, interested in making a day trip to Budapest while you're still
 on this continent?

With under 4 weeks to go - not a chance :-)
___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


Re: [Linux-ha-dev] ACLs and privilege escalation (was Re: New OCF RA: symlink)

2011-05-05 Thread Andrew Beekhof
On Thu, May 5, 2011 at 9:09 AM, Florian Haas florian.h...@linbit.com wrote:
 Rather than going into ACLs in more detail, I wanted to highlight that
 however we limit access to the CIB, the resource agents still _execute_
 as root, so we will always have what would normally be considered a
 privilege escalation issue.

 Now, we could agree on security guidelines for RAs, and some of those
 would certainly be no-brainers to define (such as, don't ever eval
 unsanitized user input), but I refuse to even suggest to tackle any such
 guidelines before the OCF spec update has gotten off the ground.

 One such thing that could be added to the spec would be optional meta
 variables named user and group, directing the LRM (or any successor)
 to execute the RA as that user rather than root. Just an idea.

Seems plausible.
___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


Re: [Linux-ha-dev] New OCF RA: symlink

2011-05-05 Thread Andrew Beekhof
On Wed, May 4, 2011 at 4:36 PM, Lars Ellenberg
lars.ellenb...@linbit.com wrote:
  Services running under Pacemaker control are probably critical,
  so a malicious person with even only stop access on the CIB
  can do a DoS. I guess we have to assume people with any write access
  at all to the CIB are trusted, and not malicious.

Exactly. If the cluster (or access to it) has been compromised, you're
in for so much pain that a symlink RA is the least of your problems.
A generic cluster manager is, by design, a way to run arbitrary
scripts as root - there's no coming back from there.
___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


Re: [Linux-HA] get haresources2cib.py

2011-05-04 Thread Andrew Beekhof
On Tue, May 3, 2011 at 9:50 PM, Vinay Nagrik vnag...@gmail.com wrote:
 Hello Andrew,

 We have been going into small details and I still have not gotten an answer
 which will put me on the right path.

 I apologise to ask you these questions.  But these are important for my
 work.

 I have down loaded

 *eat-3-0-STABLE-3.0.4.tar.bz2*

heartbeat != pacemaker

 and unzipped it.

 I looked for crm shell and any .dtd .DTD file and did not find any.

 Please tell me where to get the crm shell, what the steps are, or whether I
 downloaded the wrong .tar.bz2 file.

 There were these files as well

 glue-1.0.7.tar.bz2 http://hg.linux-ha.org/glue/archive/glue-1.0.7.tar.bz2
 and
 *gents-1.0.4.tar.gz*

 Do I have to download these files also.

 My very first step is to create a cib.xml file, and I am running
 in small circles.

 Kindly help.  I will greatly appreciate this.

 Thanks.

 arun
 On Mon, May 2, 2011 at 11:16 PM, Andrew Beekhof and...@beekhof.net wrote:

 On Mon, May 2, 2011 at 9:33 PM, Vinay Nagrik vnag...@gmail.com wrote:
  Thank you Andrew.
 
  Could you please tell me where to get the DTD for cib.xml and where from
 can
  I download crm shell.

 Both get installed with the rest of pacemaker

 
  thanks in anticipation.
 
  With best regards.
 
  nagrik
 
  On Mon, May 2, 2011 at 12:56 AM, Andrew Beekhof and...@beekhof.net
 wrote:
 
  On Sun, May 1, 2011 at 9:26 PM, Vinay Nagrik vnag...@gmail.com wrote:
   Dear Andrew,
  
   I read your document clusters from scratch and found it very
 detailed.
   It
   gave lots of information, but I was looking for creating a cib.xml and
  could
   not decipher the language as to the syntax and different fields to be
 put
  in
   cib.xml.
 
  Don't look at the xml.  Use the crm shell.
 
  
   I am still looking for the haresources2cib.py script.
 
  Don't. It only creates configurations conforming to the older and now
  unsupported syntax.
 
    I searched the web
   but could not find anywhere.
  
   I have 2 more questions.
  
   Do I have to create the cib.xml file on the nodes I am running
 heartbeat
  v.2
   software.
    Does cib.xml have to reside in the /var/lib/crm directory or can it reside
   anywhere else.
  
   Kindly provide these answers.  I will greatly appreciate your help.
  
   Have a nice day.
  
   Thanks.
  
   nagrik
  
   On Sat, Apr 30, 2011 at 1:32 AM, Andrew Beekhof and...@beekhof.net
  wrote:
  
   Forget the conversion.
   Use the crm shell to create one from scratch.
  
   And look for the clusters from scratch doc relevant to your version
   - its worth the read.
  
   On Sat, Apr 30, 2011 at 1:19 AM, Vinay Nagrik vnag...@gmail.com
  wrote:
Hello Group,
   
Kindly tell me where can I download
   
haresources2cib.py file
   
from.
   
Please also tell me can I convert haresources file on a node where
 I
  am
   not
running high availability service and then can I copy the converted
  ..xml
file in
   
/var/lib/heartbeat
   
directory on which I am running the high availability.
   
Also does
   
cib file
   
 must reside under
   
/var/lib/heartbeat
   
directory or can it reside under any directory like under
   
/etc.
   
please let me know.  I am just a beginner.
   
Thanks in advance.
   
--
Thanks
   
Nagrik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
   
   ___
   Linux-HA mailing list
   Linux-HA@lists.linux-ha.org
   http://lists.linux-ha.org/mailman/listinfo/linux-ha
   See also: http://linux-ha.org/ReportingProblems
  
  
  
  
   --
   Thanks
  
   Nagrik
   ___
   Linux-HA mailing list
   Linux-HA@lists.linux-ha.org
   http://lists.linux-ha.org/mailman/listinfo/linux-ha
   See also: http://linux-ha.org/ReportingProblems
  
  ___
  Linux-HA mailing list
  Linux-HA@lists.linux-ha.org
  http://lists.linux-ha.org/mailman/listinfo/linux-ha
  See also: http://linux-ha.org/ReportingProblems
 
 
 
 
  --
  Thanks
 
  Nagrik
  ___
  Linux-HA mailing list
  Linux-HA@lists.linux-ha.org
  http://lists.linux-ha.org/mailman/listinfo/linux-ha
  See also: http://linux-ha.org/ReportingProblems
 
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems




 --
 Thanks

 Nagrik
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org

Re: [Linux-HA] Filesystem do not start on Pacemaker-Cluster

2011-05-04 Thread Andrew Beekhof
On Tue, May 3, 2011 at 10:39 AM, KoJack kojac...@web.de wrote:

 Hi,
 I was trying to set up a pacemaker cluster. After I added all resources, the
 filesystem will not start on one node.

 crm_verify -L -V


 crm_verify[30068]: 2011/05/03_10:35:39 WARN: unpack_rsc_op: Processing
 failed op WebFS:0_start_0 on apache01: unknown error (1)
 crm_verify[30068]: 2011/05/03_10:35:39 WARN: unpack_rsc_op: Processing
 failed op WebFS:0_stop_0 on apache01: unknown exec error (-2)
 crm_verify[30068]: 2011/05/03_10:35:39 WARN: common_apply_stickiness:
 Forcing WebFSClone away from apache01 after 100 failures (max=100)
 crm_verify[30068]: 2011/05/03_10:35:39 WARN: common_apply_stickiness:
 Forcing WebFSClone away from apache01 after 100 failures (max=100)
 crm_verify[30068]: 2011/05/03_10:35:39 WARN: common_apply_stickiness:
 Forcing WebFSClone away from apache01 after 100 failures (max=100)
 crm_verify[30068]: 2011/05/03_10:35:39 ERROR: clone_rsc_colocation_rh:
 Cannot interleave clone WebSiteClone and WebIP because they do not support
 the same number of resources per node
 crm_verify[30068]: 2011/05/03_10:35:39 ERROR: clone_rsc_colocation_rh:
 Cannot interleave clone WebSiteClone and WebIP because they do not support
 the same number of resources per node
 crm_verify[30068]: 2011/05/03_10:35:39 WARN: should_dump_input: Ignoring
 requirement that WebFS:0_stop_0 comeplete before WebFSClone_stopped_0:
 unmanaged failed resources cannot prevent clone shutdown
 Errors found during check: config not valid


 crm configure show

 node apache01
 node apache02
 primitive ClusterIP ocf:heartbeat:IPaddr2 \
        params ip=10.1.1.5 cidr_netmask=8 nic=eth0
 clusterip_hash=sourceip \
        op monitor interval=30s
 primitive WebData ocf:linbit:drbd \
        params drbd_resource=wwwdata \
        op monitor interval=60s \
        op start interval=0 timeout=240s \
        op stop interval=0 timeout=100s
 primitive WebFS ocf:heartbeat:Filesystem \
        params device=/dev/drbd/by-res/wwwdata directory=/var/www/html
 fstype=gfs2 \
        op start interval=0 timeout=60s \
        op stop interval=0 timeout=60s
 primitive WebSite ocf:heartbeat:apache \
        params configfile=/etc/httpd/conf/httpd.conf \
        op monitor interval=1min \
        op start interval=0 timeout=40s \
        op stop interval=0 timeout=60s
 primitive dlm ocf:pacemaker:controld \
        op monitor interval=120s \
        op start interval=0 timeout=90s \
        op stop interval=0 timeout=100s
 primitive gfs-control ocf:pacemaker:controld \
        params daemon=gfs_controld.pcmk args=-g 0 \
        op monitor interval=120s \
        op start interval=0 timeout=90s \
        op stop interval=0 timeout=100s
 ms WebDataClone WebData \
        meta master-max=2 master-node-max=1 clone-max=2
 clone-node-max=1 notify=true
 clone WebFSClone WebFS
 clone WebIP ClusterIP \
        meta globally-unique=true clone-max=2 clone-node-max=2
 clone WebSiteClone WebSite
 clone dlm-clone dlm \
        meta interleave=true
 clone gfs-clone gfs-control \
        meta interleave=true
 colocation WebFS-with-gfs-control inf: WebFSClone gfs-clone
 colocation WebSite-with-WebFS inf: WebSiteClone WebFSClone
 colocation fs_on_drbd inf: WebFSClone WebDataClone:Master
 colocation gfs-with-dlm inf: gfs-clone dlm-clone
 colocation website-with-ip inf: WebSiteClone WebIP
 order WebFS-after-WebData inf: WebDataClone:promote WebFSClone:start
 order WebSite-after-WebFS inf: WebFSClone WebSiteClone
 order apache-after-ip inf: WebIP WebSiteClone
 order start-WebFS-after-gfs-control inf: gfs-clone WebFSClone
 order start-gfs-after-dlm inf: dlm-clone gfs-clone
 property $id=cib-bootstrap-options \
        dc-version=1.1.4-ac608e3491c7dfc3b3e3c36d966ae9b016f77065 \
        cluster-infrastructure=openais \
        expected-quorum-votes=2 \
        stonith-enabled=false \
        no-quorum-policy=ignore
 rsc_defaults $id=rsc-options \
        resource-stickiness=100


 Did you see any mistake in my configuration?

You mean apart from the 2 errors and the apache resource that can't
stop in the crm_verify output?
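
Once the underlying problems are fixed, the accumulated failures also have to
be cleared before the clone is allowed back onto apache01; a sketch:

  crm resource cleanup WebFSClone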


 Thanks a lot
 --
 View this message in context: 
 http://old.nabble.com/Filesystem-do-not-start-on-Pacemaker-Cluster-tp31530410p31530410.html
 Sent from the Linux-HA mailing list archive at Nabble.com.

 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] get haresources2cib.py

2011-05-03 Thread Andrew Beekhof
On Mon, May 2, 2011 at 9:33 PM, Vinay Nagrik vnag...@gmail.com wrote:
 Thank you Andrew.

 Could you please tell me where to get the DTD for cib.xml and where I can
 download the crm shell.

Both get installed with the rest of pacemaker


 thanks in anticipation.

 With best regards.

 nagrik

 On Mon, May 2, 2011 at 12:56 AM, Andrew Beekhof and...@beekhof.net wrote:

 On Sun, May 1, 2011 at 9:26 PM, Vinay Nagrik vnag...@gmail.com wrote:
  Dear Andrew,
 
  I read your document clusters from scratch and found it very detailed.
  It
  gave lots of information, but I was looking for creating a cib.xml and
 could
  not decipher the language as to the syntax and different fields to be put
 in
  cib.xml.

 Don't look at the xml.  Use the crm shell.
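
A minimal first session looks something like this (the IP address is a
placeholder); the shell validates the configuration and writes the XML for you:

  crm configure
    primitive p_ip ocf:heartbeat:IPaddr2 \
        params ip="192.168.1.50" cidr_netmask="24" \
        op monitor interval="10s"
    verify
    commit
    quit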

 
  I am still looking for the haresources2cib.py script.

 Don't. It only creates configurations conforming to the older and now
 unsupported syntax.

   I searched the web
  but could not find anywhere.
 
  I have 2 more questions.
 
  Do I have to create the cib.xml file on the nodes I am running heartbeat
 v.2
  software.
  Does cib.xml have to reside in the /var/lib/crm directory or can it reside
  anywhere else.
 
  Kindly provide these answers.  I will greatly appreciate your help.
 
  Have a nice day.
 
  Thanks.
 
  nagrik
 
  On Sat, Apr 30, 2011 at 1:32 AM, Andrew Beekhof and...@beekhof.net
 wrote:
 
  Forget the conversion.
  Use the crm shell to create one from scratch.
 
  And look for the clusters from scratch doc relevant to your version
  - its worth the read.
 
  On Sat, Apr 30, 2011 at 1:19 AM, Vinay Nagrik vnag...@gmail.com
 wrote:
   Hello Group,
  
   Kindly tell me where can I download
  
   haresources2cib.py file
  
   from.
  
   Please also tell me can I convert haresources file on a node where I
 am
  not
   running high availability service and then can I copy the converted
 ..xml
   file in
  
   /var/lib/heartbeat
  
   directory on which I am running the high availability.
  
   Also does
  
   cib file
  
   must reside under
  
   /var/lib/heartbeat
  
   directory or can it reside under any directory like under
  
   /etc.
  
   please let me know.  I am just a beginner.
  
   Thanks in advance.
  
   --
   Thanks
  
   Nagrik
   ___
   Linux-HA mailing list
   Linux-HA@lists.linux-ha.org
   http://lists.linux-ha.org/mailman/listinfo/linux-ha
   See also: http://linux-ha.org/ReportingProblems
  
  ___
  Linux-HA mailing list
  Linux-HA@lists.linux-ha.org
  http://lists.linux-ha.org/mailman/listinfo/linux-ha
  See also: http://linux-ha.org/ReportingProblems
 
 
 
 
  --
  Thanks
 
  Nagrik
  ___
  Linux-HA mailing list
  Linux-HA@lists.linux-ha.org
  http://lists.linux-ha.org/mailman/listinfo/linux-ha
  See also: http://linux-ha.org/ReportingProblems
 
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems




 --
 Thanks

 Nagrik
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Antw: Re: ocf:pacemaker:ping: dampen

2011-05-03 Thread Andrew Beekhof
On Mon, May 2, 2011 at 5:29 PM, Lars Ellenberg
lars.ellenb...@linbit.com wrote:
 On Mon, May 02, 2011 at 04:04:56PM +0200, Andrew Beekhof wrote:
  Still, we may get a spurious failover in this case:
 
  reachability:
    +__
  Node A monitoring intervals:
         +    -    +    +    +    -    -    -    -    -
  Node B monitoring intervals:
      +    +    -    +    +    -    -    -    -    -
  dampening interval:         |-|
 
  Note how the dampening helps to ignore the first network glitch.
 
  But for the permanent network problem, we may get spurious failover:

 Then your dampen setting is too short or interval too long :-)

 No.
 Regardless of dampen and interval setting.

 Unless both nodes notice the change at the exact same time,
 expire their dampen at the exact same time,

This is where you've diverged.
Once dampen expires on one node, _all_ nodes write their current value.

 and place their updated
 values into the CIB at exactly the same time.

 If a ping node just dies, then one node will always notice it first.
 And regardless of dampen and interval settings,
 one will reach the CIB first, and therefor the PE will see the
 connectivity change first for only one of the nodes, and only later for
 the other (once it noticed, *and* expired its dampen interval, too).

 Show me how you can work around that using dampen or interval settings.

 --
 : Lars Ellenberg
 : LINBIT | Your Way to High Availability
 : DRBD/HA support and consulting http://www.linbit.com
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] get haresources2cib.py

2011-05-02 Thread Andrew Beekhof
On Sun, May 1, 2011 at 9:26 PM, Vinay Nagrik vnag...@gmail.com wrote:
 Dear Andrew,

 I read your document clusters from scratch and found it very detailed.  It
 gave lots of information, but I was looking for creating a cib.xml and could
 not decipher the language as to the syntax and different fields to be put in
 cib.xml.

Don't look at the xml.  Use the crm shell.


 I am still looking for the haresources2cib.py script.

Don't. It only creates configurations conforming to the older and now
unsupported syntax.

  I searched the web
 but could not find anywhere.

 I have 2 more questions.

 Do I have to create the cib.xml file on the nodes I am running heartbeat v.2
 software.
 Does cib.xml have to reside in the /var/lib/crm directory or can it reside
 anywhere else.

 Kindly provide these answers.  I will greatly appreciate your help.

 Have a nice day.

 Thanks.

 nagrik

 On Sat, Apr 30, 2011 at 1:32 AM, Andrew Beekhof and...@beekhof.net wrote:

 Forget the conversion.
 Use the crm shell to create one from scratch.

 And look for the clusters from scratch doc relevant to your version
 - its worth the read.

 On Sat, Apr 30, 2011 at 1:19 AM, Vinay Nagrik vnag...@gmail.com wrote:
  Hello Group,
 
  Kindly tell me where can I download
 
  haresources2cib.py file
 
  from.
 
  Please also tell me can I convert haresources file on a node where I am
 not
  running high availability service and then can I copy the converted ..xml
  file in
 
  /var/lib/heartbeat
 
  directory on which I am running the high availability.
 
  Also does
 
  cib file
 
  must reside under
 
  /var/lib/heartbeat
 
  directory or can it reside under any directory like under
 
  /etc.
 
  please let me know.  I am just a beginner.
 
  Thanks in advance.
 
  --
  Thanks
 
  Nagrik
  ___
  Linux-HA mailing list
  Linux-HA@lists.linux-ha.org
  http://lists.linux-ha.org/mailman/listinfo/linux-ha
  See also: http://linux-ha.org/ReportingProblems
 
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems




 --
 Thanks

 Nagrik
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Antw: Re: ocf:pacemaker:ping: dampen

2011-05-02 Thread Andrew Beekhof
On Mon, May 2, 2011 at 8:27 AM, Ulrich Windl
ulrich.wi...@rz.uni-regensburg.de wrote:
 Andrew Beekhof and...@beekhof.net wrote on 29.04.2011 at 09:31 in message
 BANLkTi=-ftyk9uxcgu0m2wqhquu_rt8...@mail.gmail.com:
 On Fri, Apr 29, 2011 at 9:27 AM, Dominik Klein d...@in-telegence.net wrote:
  It waits $dampen before changes are pushed to the cib. So that
  occasionally occurring ICMP hiccups do not produce an unintended failover.
 
  At least that's my understanding.

 correcto

 Hi!

 Strange: So the update is basically just delayed by that amount of time? I 
 see no advantage: If you put a bad value to the CIB immediately or after some 
  delay, the value won't get better by that. Damping suggests some filtering
  to me, but you are saying you are not filtering the values, but just
 delaying them. Right?

Only the current value is written.
So the cluster will tolerate minor outages provided they last for
less than the dampen interval and the monitor frequency is high
enough.
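
For illustration only (host and timing values are placeholders, not a
recommendation), a ping clone whose monitor interval is well below the dampen
delay could look like:

primitive p_ping ocf:pacemaker:ping \
        params host_list="192.168.1.1" dampen="15s" multiplier="1000" \
        op monitor interval="5s" timeout="20s"
clone cl_ping p_ping \
        meta interleave="true"

With numbers like these, a brief glitch seen by one probe is overwritten by a
later successful probe before the 15s dampen window expires, so nothing ever
reaches the CIB.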
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Antw: Re: ocf:pacemaker:ping: dampen

2011-05-02 Thread Andrew Beekhof
On Mon, May 2, 2011 at 3:51 PM, Lars Ellenberg
lars.ellenb...@linbit.com wrote:
 On Mon, May 02, 2011 at 01:20:16PM +0200, Andrew Beekhof wrote:
 On Mon, May 2, 2011 at 8:27 AM, Ulrich Windl
 ulrich.wi...@rz.uni-regensburg.de wrote:
   Andrew Beekhof and...@beekhof.net wrote on 29.04.2011 at 09:31 in message
   BANLkTi=-ftyk9uxcgu0m2wqhquu_rt8...@mail.gmail.com:
  On Fri, Apr 29, 2011 at 9:27 AM, Dominik Klein d...@in-telegence.net 
  wrote:
   It waits $dampen before changes are pushed to the cib. So that
   occasionally occurring ICMP hiccups do not produce an unintended failover.
  
   At least that's my understanding.
 
  correcto
 
  Hi!
 
  Strange: So the update is basically just delayed by that amount of
  time? I see no advantage: If you put a bad value to the CIB
  immediately or after some delay, the value won't get better by that.
  Damping suggests some filtering to me, but you are saying you are
  not filtering the values, but just delaying them. Right?

 Only the current value is written.
 So the cluster will tolerate minor outages provided they last for
 less than the dampen interval and the monitor frequency is high
 enough.

 Still, we may get a spurious failover in this case:

 reachability:
   +__
 Node A monitoring intervals:
        +    -    +    +    +    -    -    -    -    -
 Node B monitoring intervals:
     +    +    -    +    +    -    -    -    -    -
 dampening interval:         |-|

 Note how the dampening helps to ignore the first network glitch.

 But for the permanent network problem, we may get spurious failover:

Then your dampen setting is too short or interval too long :-)


 One dampening interval after node B notices loss of reachability,
 it will trigger a PE run, potentially moving things from B to
 A, because on A, the reachability (in the CIB) is still ok.

 Shortly thereafter, the dampening interval on A also expires, and the
 CIB will be updated with A cannot reach out there either.

 Any resource migrations triggered by B cannot reach out there
 are now recognized as spurious.

 Question is, how could we avoid them?

 ipfail used to ask the peer, wait for the peer to notice the new
 situation as well, and only then trigger actions.

 We could possibly store a short history of values, and actually do some
 filtering arithmetic with them.  Not sure if this should be done
 inside or outside of the CIB. Probably outside.

Yes, outside :-)

One of these attrd needs a rewrite :-(
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] get haresources2cib.py

2011-04-30 Thread Andrew Beekhof
Forget the conversion.
Use the crm shell to create one from scratch.

And look for the clusters from scratch doc relevant to your version
- its worth the read.

On Sat, Apr 30, 2011 at 1:19 AM, Vinay Nagrik vnag...@gmail.com wrote:
 Hello Group,

 Kindly tell me where can I download

 haresources2cib.py file

 from.

 Please also tell me can I convert haresources file on a node where I am not
 running high availability service and then can I copy the converted ..xml
 file in

 /var/lib/heartbeat

 directory on which I am running the high availability.

 Also does

 cib file

 must reside under

 /var/lib/heartbeat

 directory or can it reside under any directory like under

 /etc.

 please let me know.  I am just a beginner.

 Thanks in advance.

 --
 Thanks

 Nagrik
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] ocf:pacemaker:ping: dampen

2011-04-29 Thread Andrew Beekhof
On Fri, Apr 29, 2011 at 9:27 AM, Dominik Klein d...@in-telegence.net wrote:
 It waits $dampen before changes are pushed to the cib. So that
 occasionally occurring ICMP hiccups do not produce an unintended failover.

 At least that's my understanding.

correcto
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Pingd does not react as expected = split brain

2011-04-28 Thread Andrew Beekhof
On Wed, Apr 27, 2011 at 7:18 PM, Stallmann, Andreas astallm...@conet.de wrote:
 Hi Andrew,

 According to your configuration, it can be up to 60s before we'll detect a 
 change in external connectivity.
  That's plenty of time for the cluster to start resources.
 Maybe shortening the monitor interval will help you.

 TNX for the suggestion, I'll try that. Any suggestions on recommended monitor 
 intervals for pingd?

 Couldn't hurt.
 Hm... if I - for example, set the monitor interval to 10s, I'd have to adjust 
 the timeout for monitor to 10s as well, right?

Right.

 Ping is quite sluggish, it takes up to 30s to check the three nodes.

Sounds like something is misconfigured.

  If I now adjust the interval to 10s, the next check might be triggered before 
 the last one is complete. Will this confuse pacemaker?

No. The next op will happen 10s after the last finishes.


 Yes, and there is no proper way to use DRBD in a three node cluster.
 How is one related to the other?
 No-one said the third node had to run anything.

 Ok, thanks for the info; I thought all members of the cluster had to be able 
 to run cluster resources. I would have to keep resources from trying to run 
  on the third node then via a location constraint, right?

Or node standby.
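
For example (resource and node names are placeholders), the third node can be
kept free of resources either by putting it in standby:

        crm node standby node3

or, per resource, with a location constraint such as:

        location l_not_on_arbitrator g_services -inf: node3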


 TNX for your input!

 Andreas

 
 CONET Solutions GmbH, Theodor-Heuss-Allee 19, 53773 Hennef.
 Registergericht/Registration Court: Amtsgericht Siegburg (HRB Nr. 9136)
 Geschäftsführer/Managing Directors: Jürgen Zender (Sprecher/Chairman), Anke 
 Höfer
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-ha-dev] Translate crm_cli.txt to Japanese

2011-04-27 Thread Andrew Beekhof
On Wed, Apr 27, 2011 at 12:54 PM, Dejan Muhamedagic de...@suse.de wrote:
 Hi Junko-san,

 On Wed, Apr 27, 2011 at 06:42:52PM +0900, Junko IKEDA wrote:
 Hi,

  May I suggest that you go with the devel version, because
  crm_cli.txt was converted to crm.8.txt. There are not many
  textual changes, just some obsolete parts removed.

 OK, I got crm.8.txt from devel.

 Each directory structure for Pacemaker 1.0,1.1 and devel is just a bit
 different.
 Does 1.0 keep its doc dir structure for now?

 Until the next release I guess.

 If so, it seems that just create html file is not so difficult when
 asciidoc is available.

 No, not difficult. It just depends on the build environment. If
 asciidoc is found by configure, then it is going to be used to
 produce the html files.

Do any distros _not_ ship asciidoc?
___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


Re: [Linux-ha-dev] Translate crm_cli.txt to Japanese

2011-04-27 Thread Andrew Beekhof
On Wed, Apr 27, 2011 at 3:47 PM, Dejan Muhamedagic de...@suse.de wrote:
 On Wed, Apr 27, 2011 at 02:01:40PM +0200, Andrew Beekhof wrote:
 On Wed, Apr 27, 2011 at 12:54 PM, Dejan Muhamedagic de...@suse.de wrote:
  Hi Junko-san,
 
  On Wed, Apr 27, 2011 at 06:42:52PM +0900, Junko IKEDA wrote:
  Hi,
 
   May I suggest that you go with the devel version, because
   crm_cli.txt was converted to crm.8.txt. There are not many
   textual changes, just some obsolete parts removed.
 
  OK, I got crm.8.txt from devel.
 
  Each directory structure for Pacemaker 1.0,1.1 and devel is just a bit
  different.
  Does 1.0 keep its doc dir structure for now?
 
  Until the next release I guess.
 
  If so, it seems that just create html file is not so difficult when
  asciidoc is available.
 
  No, not difficult. It just depends on the build environment. If
  asciidoc is found by configure, then it is going to be used to
  produce the html files.

 Do any distros _not_ ship asciidoc?

 AFAIK none of contemporary distributions. And going back three
 years or so, it's the other way around.


How quickly we forget.

Anyway, I advocate that the project makes decisions based on it being
around (but fails gracefully when its not) and leaves it up to older
distros to ship a pre-generated copy if they so desire.  I can't
imagine lack of HTML versions being a deal breaker.

And by fail gracefully, I mean the current behavior of just not
building those versions of the doc.
___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


Re: [Linux-HA] Problem using Stonith external/ipmi device

2011-04-27 Thread Andrew Beekhof
On Tue, Apr 26, 2011 at 9:07 PM, Dejan Muhamedagic deja...@fastmail.fm wrote:
 On Tue, Apr 19, 2011 at 02:46:06PM +0200, Andrew Beekhof wrote:
 On Tue, Apr 19, 2011 at 12:43 PM, Dejan Muhamedagic deja...@fastmail.fm 
 wrote:
  Hi,
 
  On Mon, Apr 11, 2011 at 09:41:12AM +0200, Andrew Beekhof wrote:
  On Fri, Apr 8, 2011 at 11:07 AM, Matthew Richardson
  m.richard...@ed.ac.uk wrote:
   On 07/04/11 16:36, Dejan Muhamedagic wrote:
   For whatever reason stonith-ng doesn't think that
   stonithipmidisk1 can manage this node. Which version of
   Pacemaker do you run? Perhaps this has been fixed in the
   meantime. I cannot recall right now if there has been such a
   problem, but it's possible. You can also try to turn debug on
   and see if there are more clues.
  
   I'm using Pacemaker 1.1.5 from the clusterlabs rpm-next repositories on 
   el5.
  
   I've tried turning on debug, but there's no more information coming out
   in the logs.
 
  man stonithd has the bits you need.
  start with pcmk_host_check
 
  That defaults to dynamic-list which should query the resource.
  Right?

 Right.

  Apparently, something's not quite ok there.

 the list command doesn't work perhaps?

 Yes, it does work. And it's been working since forever, as you
 know.

I'm not sure how I would know this, I've never used an ipmi device.

 Unless there's something wrong with the installation.

 Whatever happened here? Matthew?

 Thanks,

 Dejan

  BTW, I've
  been doing tests with external/ssh and it did work fine.

 also fine with fence_xvm
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Pingd does not react as expected = split brain

2011-04-27 Thread Andrew Beekhof
On Wed, Apr 27, 2011 at 11:52 AM, Stallmann, Andreas
astallm...@conet.de wrote:
 Hi Lars,

 Hi Lars!

 You are exercising complete cluster communication loss.
 Which is cluster split brain.
 Correct, yes.

 If you are specifically exercising cluster split brain, why are you 
 surprised that you get exactly that?

  Because ping(d) is supposed to keep resources from starting on nodes which 
 are not properly connected to the network. Thus: Still split brain, but no 
 possibility for concurrent (and possibly damaging) access to resources.

According to your configuration, it can be up to 60s before we'll
detect a change in external connectivity.
That's plenty of time for the cluster to start resources.

Maybe shortening the monitor interval will help you.
Couldn't hurt.


 You need to reduce the probability to run into complete communication loss, 
 by
 - using multiple communication links.

 There will be *one* dedicated (mpls) line between the two sites. No 
 possibility for any real redundant links; honestly, believe me. The only 
 way would be the usage of GSM modems or other wireless links, which is not 
 possible for several other reasons (which I can't discuss here).

 - using a real quorum
  (there is no quorum in a two node failover cluster)

 Yes, and there is no proper way to use DRBD in a three node cluster.

How is one related to the other?
No-one said the third node had to run anything.

 Until then (unless we have a dedicated, replicated, shared storage, which we 
 don't have, unfortunately), it's a two node cluster or nothing.  This - 
 inevitably - leads to the need for an external quorum, and ping(d) seems to 
 do that, as far as I understood the docs. Please correct me if I'm wrong.

 You may want to still guard against the ugly effects of cluster split brain, 
 by
 - implementing stonith
 - configuring stonith properly

 There's no proper way for doing stonith in a split-site scenario, besides 
 meatware. If the link is down between the two sites, you won't be able to 
 access any ILO, UPS or other stonith device.

 - additionally configuring fencing in DRBD

 Yes, I'm going to try that.

 Still: Please tell me if ping(d) is behaving properly or if it isn't. You've 
 seen my configuration. I think it should work (and, indeed, it did a while 
 ago; it could well be that we misconfigured something after that, but I just 
  can't find what it is...)

 THANKS,

 Andreas

 --
 : Lars Ellenberg
 : LINBIT | Your Way to High Availability
 : DRBD/HA support and consulting http://www.linbit.com

 DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems

 
 CONET Solutions GmbH, Theodor-Heuss-Allee 19, 53773 Hennef.
 Registergericht/Registration Court: Amtsgericht Siegburg (HRB Nr. 9136)
 Geschäftsführer/Managing Directors: Jürgen Zender (Sprecher/Chairman), Anke 
 Höfer
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-ha-dev] Bug in crm shell or pengine

2011-04-19 Thread Andrew Beekhof
On Mon, Apr 18, 2011 at 11:38 PM, Serge Dubrouski serge...@gmail.com wrote:
 Ok, I've read the documentation. It's not a bug, it's a feature :-)

Might be nice if the shell could somehow prevent such configs, but it
would be non-trivial to implement.
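
For reference, the usual workaround (not a shell-side fix) is to give the two
monitor operations distinct intervals so that each (name, interval) pair is
unique; a sketch with example intervals:

primitive pg_drbd ocf:linbit:drbd \
        params drbd_resource="drbd0" \
        op monitor interval="29s" role="Master" timeout="10s" \
        op monitor interval="31s" role="Slave" timeout="10s"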


 On Mon, Apr 18, 2011 at 3:01 PM, Serge Dubrouski serge...@gmail.com wrote:
 Hello -

 Looks like there is a bug in crm shell Pacemaker version 1.1.5 or in pengine.


 primitive pg_drbd ocf:linbit:drbd \
        params drbd_resource=drbd0 \
        op monitor interval=60s role=Master timeout=10s \
        op monitor interval=60s role=Slave timeout=10s

 Log file:

 Apr 17 04:05:29 cs51 pengine: [5534]: ERROR: is_op_dup: Operation
 pg_drbd-monitor-60s-0 is a duplicate of pg_drbd-monitor-60s
 Apr 17 04:05:29 cs51 crmd: [5535]: info: do_state_transition: Starting
 PEngine Recheck Timer
 Apr 17 04:05:29 cs51 pengine: [5534]: ERROR: is_op_dup: Do not use the
 same (name, interval) combination more than once per resource
 Apr 17 04:05:29 cs51 pengine: [5534]: ERROR: is_op_dup: Operation
 pg_drbd-monitor-60s-0 is a duplicate of pg_drbd-monitor-60s
 Apr 17 04:05:29 cs51 pengine: [5534]: ERROR: is_op_dup: Do not use the
 same (name, interval) combination more than once per resource
 Apr 17 04:05:29 cs51 pengine: [5534]: ERROR: is_op_dup: Operation
 pg_drbd-monitor-60s-0 is a duplicate of pg_drbd-monitor-60s

 Plus strange behavior of the cluster like inability to mover resources
 from one node to another.

 --
 Serge Dubrouski.




 --
 Serge Dubrouski.
 ___
 Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
 Home Page: http://linux-ha.org/

___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


Re: [Linux-HA] Problem using Stonith external/ipmi device

2011-04-19 Thread Andrew Beekhof
On Tue, Apr 19, 2011 at 12:43 PM, Dejan Muhamedagic deja...@fastmail.fm wrote:
 Hi,

 On Mon, Apr 11, 2011 at 09:41:12AM +0200, Andrew Beekhof wrote:
 On Fri, Apr 8, 2011 at 11:07 AM, Matthew Richardson
 m.richard...@ed.ac.uk wrote:
  On 07/04/11 16:36, Dejan Muhamedagic wrote:
  For whatever reason stonith-ng doesn't think that
  stonithipmidisk1 can manage this node. Which version of
  Pacemaker do you run? Perhaps this has been fixed in the
  meantime. I cannot recall right now if there has been such a
  problem, but it's possible. You can also try to turn debug on
  and see if there are more clues.
 
  I'm using Pacemaker 1.1.5 from the clusterlabs rpm-next repositories on 
  el5.
 
  I've tried turning on debug, but there's no more information coming out
  in the logs.

 man stonithd has the bits you need.
 start with pcmk_host_check

 That defaults to dynamic-list which should query the resource.
 Right?

Right.

 Apparently, something's not quite ok there.

the list command doesn't work perhaps?

 BTW, I've
 been doing tests with external/ssh and it did work fine.

also fine with fence_xvm
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] How can I add more options to current reset, on, off options?

2011-04-18 Thread Andrew Beekhof
On Sun, Apr 17, 2011 at 11:23 PM, Avestan babak_khoram...@hotmail.com wrote:

 Hello,

 I am using STONITH Device AP9225EXP with AP9617 Network Management card.  I
  have generated my own patch to change the apcmaster.c file to work with my
 setup.

 The stonith appears to allow only three commands (reset - on - off) to be
 passed on to the STONITH Device using numbers 1, 2, 3 which reflected the
  selection of option 1, 2, and 3 on the outlet control console as shown here:


 [root@dizin stonith]# stonith
 usage:
         stonith [-svh] -L
         stonith -n -t stonith-device-type
         stonith [-svh] -t stonith-device-type [-p stonith-device-parameters
 | -F stonith-device-parameters-file] -lS
         stonith [-svh] -t stonith-device-type [-p stonith-device-parameters
 | -F stonith-device-parameters-file] -T {reset|on|off} nodename

 where:
        -L      list supported stonith device types
        -l      list hosts controlled by this stonith device
        -S      report stonith device status
        -s      silent
        -v      verbose
        -n      output the config names of stonith-device-parameters
        -h      display detailed help message with stonith device
 desriptions
 [root@dizin stonith]#

 The question is how I can add more possibilities to the list as the
 AP9225EXP is capable of doing more and I would like to take advantage of it.

You mostly can't.
But you can change what action your stonith agent performs when it
receives one of the allowed values.

  Here is what the AP9225EXP offers at its own outlet control console:

 1

 --- Outlet Control 1:5
 

        Outlet Name : monitor
        Outlet State: ON
        Control Mode: Graceful Shutdown

     1- Immediate On
     2- Delayed On
     3- Immediate Off
     4- Immediate Reboot
     5- Graceful Reboot
     6- Shutdown
     7- Override
     8- Cancel

     ?- Help, ESC- Back, ENTER- Refresh, CTRL-L- Event Log

 Can someone tell me which file it is that I need to manipulate in order to
 increase the available options to the Ap9225EXP possibilities?

 Thanks,

 Avestan




 --
 View this message in context: 
 http://old.nabble.com/How-can-I-add-more-options-to-current-reset%2C-on%2C-off-options--tp31419415p31419415.html
 Sent from the Linux-HA mailing list archive at Nabble.com.

 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Shutdown Escalation

2011-04-16 Thread Andrew Beekhof
On Sat, Apr 16, 2011 at 12:30 PM, yash er.bhara...@yahoo.in wrote:
 yash er.bharat09 at yahoo.in writes:


 Andrew Beekhof andrew at beekhof.net writes:

   I am facing problem during heartbeat stop command as it hangs and
   returns after long 20 min, through google i came to know about shutdown
   escalation parameter of crmd but when i try reducing this parameter it
     does not read the configuration right and again uses the 20 min default value.
 
  IIRC the name of some of the parameters changed slightly in the last 5
 years.
  But I don't have such versions around to say for sure.
 
  Try replacing any '-' characters with '_'
 thanks for reply
 i will try this parameter and let u know...

 it is still taking the same value 20 min.
 in CIB i am using this parameter as

 part of cib file:
 <crm_config>
       <cluster_property_set id="4e816a85-e6a7-4844-af58-e16f595f1885">
         <attributes>
           <nvpair id="1" name="default_resource_stickiness" value="INFINITY"/>
           <nvpair name="no_quorum_policy" id="a6eb4bbe-c1e2-4ac4-928c-a0f881a6f46c" value="ignore"/>
         </attributes>
       </cluster_property_set>
       <cluster_property_set id="cib-bootstrap-options">
         <attributes>
           <nvpair id="cib-bootstrap-options-shutdown_escalation" name="shutdown_escalation" value="5min"/>
         </attributes>
       </cluster_property_set>
 </crm_config>

 is this the right way to use this parameter?

Looks right. You might be experiencing a 5 year old bug.
Definitely time to upgrade.
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-ha-dev] Dovecot OCF Resource Agent

2011-04-15 Thread Andrew Beekhof
On Fri, Apr 15, 2011 at 12:53 PM, Raoul Bhatia [IPAX] r.bha...@ipax.at wrote:
 On 04/15/2011 11:10 AM, jer...@intuxicated.org wrote:

 Yes, it does the same thing but contains some additional features, like
 logging into a mailbox.

 first of all, i do not know how the others think about a ocf ra
 implemented in c. i'll suggest waiting for comments from dejan or
 fghass.

the ipv6addr agent was written in C too
the OCF standard does not dictate the language to be used - its really
a matter of whether C is the best tool for this job


 you could then create a fork on github and make sure it integrates
 well with the current build environment.


 second, what do you think about extending this ra to be able to handle
 multiple email MDAs? deep probing routines would also be needed for
 other MDAs.


 i'm thinking about giving this ra a shot but would like to hear some
 comments on my first remark before doing so.


 thanks for your work!
 raoul
 --
 
 DI (FH) Raoul Bhatia M.Sc.          email.          r.bha...@ipax.at
 Technischer Leiter

 IPAX - Aloy Bhatia Hava OG          web.          http://www.ipax.at
 Barawitzkagasse 10/2/2/11           email.            off...@ipax.at
 1190 Wien                           tel.               +43 1 3670030
 FN 277995t HG Wien                  fax.            +43 1 3670030 15
 
 ___
 Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
 Home Page: http://linux-ha.org/

___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


Re: [Linux-HA] Shutdown Escalation

2011-04-15 Thread Andrew Beekhof
On Fri, Apr 15, 2011 at 8:44 AM, yash er.bhara...@yahoo.in wrote:
 Hello list,

 I am facing problem during heartbeat stop command as it hangs and
 returns after long 20 min, through google i came to know about shutdown
 escalation parameter of crmd but when i try reducing this parameter it
   does not read the configuration right and again uses the 20 min default value.

IIRC the name of some of the parameters changed slightly in the last 5 years.
But I don't have such versions around to say for sure.

Try replacing any '-' characters with '_'


 I have a 3 node cluster and to reproduce the heartbeat hang issue i have
 used the following script:

 sleep 50
 /etc/init.d/heartbeat stop
 sleep 30
 /etc/init.d/heartbeat start

 is this the right behaviour for heartbeat 2.0.5 using crm mode, or is
 there any other option to stop this hang issue?

 any help..

 Regards
 Yash


 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] question about lsb init script

2011-04-14 Thread Andrew Beekhof
Probably not lsb compliant.

http://www.clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/ap-lsb.html

On Wed, Apr 13, 2011 at 10:58 PM, Gerry Kernan
gerry.ker...@infinityit.ie wrote:
 Hi

 I am setting up an asterisk HA solution using a redfone device for the PRI
 lines. To start the device i need to run /etc/init.d/fonulator.init; i have
 added this as a resource but it won't start and gives an error as below in the
 crm status output.
 The config is below, hopefully someone can point out where i am going wrong.

 res_fonulator.init_fonulator   (lsb:fonulator.init):   Started 
 ho-asterisk2-11314.interlink.local (unmanaged) FAILED

 out of crm configure show

 node $id=a7314e15-8bb1-4b2e-a732-888db0c7b7d7 
 ho-asterisk1-11315.interlink.local
 node $id=c0630b83-3c16-49c7-a55a-2c65ea0155ed 
 ho-asterisk2-11314.interlink.local
 primitive res_Filesystem_1 ocf:heartbeat:Filesystem \
                params device=/dev/drbd0 directory=/rep/ fstype=ext3 \
                operations $id=res_Filesystem_1-operations \
                op start interval=0 timeout=60 \
                op stop interval=0 timeout=60 \
                op monitor interval=20 timeout=40 start-delay=0
 primitive res_IPaddr2_IPaddr ocf:heartbeat:IPaddr2 \
                params ip=10.1.2.98 nic=eth0 cidr_netmask=24 \
                operations $id=res_IPaddr2_IPaddr-operations \
                op start interval=0 timeout=20 \
                op stop interval=0 timeout=20 \
                op monitor interval=10 timeout=20 start-delay=0
 primitive res_dahdi_dahdi lsb:dahdi \
                operations $id=res_dahdi_dahdi-operations \
                op start interval=0 timeout=15 \
                op stop interval=0 timeout=15 \
                op monitor interval=15 timeout=15 start-delay=15
 primitive res_drbd_1 ocf:linbit:drbd \
                params drbd_resource=asterisk \
                operations $id=res_drbd_1-operations \
                op start interval=0 timeout=240 \
                op promote interval=0 timeout=90 \
                op demote interval=0 timeout=90 \
                op stop interval=0 timeout=100 \
                op monitor interval=10 timeout=20 start-delay=0
 primitive res_fonulator.init_fonulator lsb:fonulator.init \
                operations $id=res_fonulator.init_fonulator-operations \
                op start interval=0 timeout=15 \
                op stop interval=0 timeout=15 \
                op monitor interval=15 timeout=15 start-delay=15
 primitive res_httpd_httpd lsb:httpd \
                operations $id=res_httpd_httpd-operations \
                op start interval=0 timeout=15 \
                op stop interval=0 timeout=15 \
                op monitor interval=15 timeout=15 start-delay=15
 primitive res_mysqld_mysql lsb:mysqld \
                operations $id=res_mysqld_mysql-operations \
                op start interval=0 timeout=15 \
                op stop interval=0 timeout=15 \
                op monitor interval=15 timeout=15 start-delay=15
 ms ms_drbd_1 res_drbd_1 \
                meta clone-max=2 notify=true
 colocation col_res_Filesystem_1_ms_drbd_1 inf: res_Filesystem_1 
 ms_drbd_1:Master
 order ord_ms_drbd_1_res_Filesystem_1 inf: ms_drbd_1:promote 
 res_Filesystem_1:start
 property $id=cib-bootstrap-options \
                default-resource-stickiness=100 \
                stonith-enabled=false \
                stonith-action=poweroff \
                dc-version=1.0.10-da7075976b5ff0bee71074385f8fd02f296ec8a3 \
                default-resource-failure-stickiness=100 \
                no-quorum-policy=ignore \
                cluster-infrastructure=Heartbeat \
                last-lrm-refresh=1302723679


 /etc/init.d/fonulator.init


 #!/bin/bash
 #
 # fonulator   Starts and Stops the Redfone fonulator utility
 #
 # chkconfig: - 60 50
 # description: Utility for configuring the Redfone fonebridge
 #
 # processname: fonulator
 # config: /etc/redfone.conf

 # Source function library.
 . /etc/rc.d/init.d/functions

 # Source networking configuration.
 . /etc/sysconfig/network

 # Check that networking is up.
 [ "${NETWORKING}" = "no" ] && exit 0

 [ -x /usr/local/bin/fonulator ] || exit 0

 RETVAL=0
 prog=fonulator

 start() {
        # Start daemons.

        if [ -d /etc/ ] ; then
                for i in `ls /etc/redfone.conf`; do
                        site=`basename $i .conf`
                        echo -n $"Starting $prog for $site: "
                        /usr/local/bin/fonulator $i
                        RETVAL=$?
                        [ $RETVAL -eq 0 ] && {
                           touch /var/lock/subsys/$prog
                           success $"$prog $site"
                        }
                        echo
                done
        else
                RETVAL=1
        fi
        return $RETVAL
 }

 stop() {
        # Stop daemons.
        echo -n $"Shutting down $prog: "
        killproc $prog
        RETVAL=$?
        echo
        [ $RETVAL -eq 0 ] && 

Re: [Linux-HA] question about lsb init script

2011-04-14 Thread Andrew Beekhof
On Thu, Apr 14, 2011 at 11:18 AM, Gerry Kernan
gerry.ker...@infinityit.ie wrote:
 Andrew,

  Thanks, I've done some checking and it doesn't appear to be. Can I add a
  resource that runs a command and doesn't look for a status for the resource?

No. The status operation is required to be implemented by the script.
Any script that does not is also not an LSB init script.
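
A minimal status implementation for a one-shot configuration tool like this,
assuming the lock file touched in start() is the only state to check, might be
sketched as:

status() {
        # LSB: 0 = running/configured, 3 = not running
        if [ -f /var/lock/subsys/$prog ]; then
                echo "$prog: configuration has been applied (lock file present)"
                return 0
        fi
        return 3
}

and the script's case statement would need a matching status) branch that
calls it and exits with its return code.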
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Resource-Group won't start - crm_mon does not react - no failures shown

2011-04-13 Thread Andrew Beekhof
On Wed, Apr 13, 2011 at 12:23 AM, Stallmann, Andreas
astallm...@conet.de wrote:
 Hi!

 We've got a pretty straightforward and easy configuration:

 Corosync 1.2.1 / Pacemaker 2.0.0 on OpenSuSE 11.3 running DRBD (M/S), Ping 
 (clone), and a resource-group, containing a shared IP, tomcat and mysql 
 (where the datafiles of mysql reside on the DRBD). The cluster consists of 
 two virtual machines running on VMware ESXi 4.

 Since we moved the cluster to an other vmware esxi, strange things happen:

 While DRBD and the ping resource come up on both nodes, the resource group 
 appl_grp (see below) doesn't. No failures are shown in crm_mon and the 
 failcount is zero.

Is host_list=191.224.111.1 191.224.111.78 194.25.2.129 still valid?
If no value is being set for pingd then I can imagine this would be the result.


 Output of crm_mon:
 ~
 
 Last updated: Tue Apr 12 23:39:39 2011
 Stack: openais
 Current DC: cms-appl02 - partition with quorum
 Version: 1.1.2-8b9ec9ccc5060457ac761dce1de719af86895b10
 2 Nodes configured, 2 expected votes
 3 Resources configured.
 

 Online: [ cms-appl01 cms-appl02 ]

 Master/Slave Set: ms_drbd_r0
     Masters: [ cms-appl01 ]
     Slaves: [ cms-appl02 ]
 Clone Set: pingy_clone
     Started: [ cms-appl01 cms-appl02 ]
 ~~
  Normally, I'd at least see the resource group as stopped, but now it doesn't
  even turn up in the crm_mon display!

 The crm-Tool at least shows, that the resources still exist:

 ~~
 crm(live)# resource
 crm(live)resource# show
 Resource Group: appl_grp
     fs_r0      (ocf::heartbeat:Filesystem) Stopped
     sharedIP   (ocf::heartbeat:IPaddr2) Stopped
     tomcat_res (ocf::heartbeat:tomcat) Stopped
     database_res       (ocf::heartbeat:mysql) Stopped
 Master/Slave Set: ms_drbd_r0
     Masters: [ cms-appl01 ]
     Slaves: [ cms-appl02 ]
 Clone Set: pingy_clone
     Started: [ cms-appl01 cms-appl02 ]
 ~~~

 And finally, here's our configuration:

 ~~output of crm configure show
 node cms-appl01
 node cms-appl02
 primitive database_res ocf:heartbeat:mysql \
        params binary="/usr/bin/mysqld_safe" config="/etc/my.cnf" \
               datadir="/drbd/mysql" user="mysql" log="/var/log/mysql/mysqld.log" \
               pid="/var/run/mysql/mysqld.pid" socket="/drbd/run/mysql/mysql.sock" \
        op start interval=0 timeout=120s \
        op stop interval=0 timeout=120s \
        op monitor interval=10s timeout=30s \
        op notify interval=0 timeout=90s
 primitive drbd_r0 ocf:linbit:drbd \
        params drbd_resource=r0 \
        op monitor interval=15s \
        op start interval=0 timeout=240s \
        op stop interval=0 timeout=100s
 primitive fs_r0 ocf:heartbeat:Filesystem \
        params device=/dev/drbd0 directory=/drbd fstype=ext4 \
        op start interval=0 timeout=60s \
        op stop interval=0 timeout=60s
 primitive pingy_res ocf:pacemaker:ping \
        params dampen="5s" multiplier="1000" \
               host_list="191.224.111.1 191.224.111.78 194.25.2.129" \
        op monitor interval=60s timeout=60s \
        op start interval=0 timeout=60s
 primitive sharedIP ocf:heartbeat:IPaddr2 \
        params ip=191.224.111.50 cidr_netmask=255.255.255.0 nic=eth0:0
 primitive tomcat_res ocf:heartbeat:tomcat \
        params java_home=/etc/alternatives/jre \
        params catalina_home=/usr/share/tomcat6 \
        op start interval=0 timeout=60s \
        op stop interval=0 timeout=120s \
        op monitor interval=10s timeout=30s
 group appl_grp fs_r0 sharedIP tomcat_res database_res \
        meta target-role=Started
 ms ms_drbd_r0 drbd_r0 \
        meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
 clone pingy_clone pingy_res
 location appl_loc appl_grp 100: cms-appl01
 location only-if-connected appl_grp \
        rule $id=only-if-connected-rule -inf: not_defined pingd or pingd lte 2000
 colocation appl_grp-only-on-master inf: appl_grp ms_drbd_r0:Master
 order appl_grp-after-drbd inf: ms_drbd_r0:promote appl_grp:start
 order mysql-after-fs inf: fs_r0 database_res
 property $id=cib-bootstrap-options \
        stonith-enabled=false \
        no-quorum-policy=ignore \
        stonith-action=poweroff \
        default-resource-stickiness=100 \
        dc-version=1.1.2-8b9ec9ccc5060457ac761dce1de719af86895b10 \
        cluster-infrastructure=openais \
       expected-quorum-votes=2 \
        last-lrm-refresh=1302643565
 ~

 When I (re)activate the appl_grp, literarily nothing happens:

 crm(live)resource# start nag_grp

 No new entries in /var/log/messages, no visible changes in crm_mon. It is as 
 if the resource didn't exist.

 Any ideas? You'll find the logs below.

 Cheers and good night,

 Andreas

 I found only one error message in /var/log/messages:
 

Re: [Linux-HA] Filesystem thinks it is run as a clone

2011-04-13 Thread Andrew Beekhof
On Tue, Apr 12, 2011 at 5:17 PM, Christoph Bartoschek
bartosc...@or.uni-bonn.de wrote:
 Hi,

 today we tested some NFS cluster scenarios and the first test failed.
 The first test was to put the current master node into standby. Stopping
 the services worked but then starting it on the other node failed. The
 ocf:hearbeat:Filesystem resource failed to start. In the logfile we see:

 Apr 12 14:08:42 laplace Filesystem[10772]: [10820]: INFO: Running start
 for /dev/home-data/afs on /srv/nfs/afs
 Apr 12 14:08:42 laplace Filesystem[10772]: [10822]: ERROR: DANGER! ext4
 on /dev/home-data/afs is NOT cluster-aware!
 Apr 12 14:08:42 laplace Filesystem[10772]: [10824]: ERROR: DO NOT RUN IT
 AS A CLONE!

To my eye the Filesystem agent looks confused

 The message comes from the following code in ocf:hearbeat:Filesystem:

 case $FSTYPE in
 ocfs2)  ocfs2_init
         ;;
 nfs|smbfs|none|gfs2)    : # this is kind of safe too
         ;;
*)      if [ -n "$OCF_RESKEY_CRM_meta_clone" ]; then
                ocf_log err "DANGER! $FSTYPE on $DEVICE is NOT cluster-aware!"
                ocf_log err "DO NOT RUN IT AS A CLONE!"
                ocf_log err "Politely refusing to proceed to avoid data corruption."
                 exit $OCF_ERR_CONFIGURED
         fi
         ;;
 esac


 The message is only printed if the variable OCF_RESKEY_CRM_meta_clone is
 non-zero. Our configuration however does not run the filesystem as a
 clone. Somehow the OCF_RESKEY_CRM_meta_clone variable leaked into the
 start of the Filesystem resource. Is this a known bug?

 Or is there a configuration error on our side? Here is the current
 configuration:


 node laplace \
         attributes standby=off
 node ries \
         attributes standby=off
 primitive ClusterIP ocf:heartbeat:IPaddr2 \
         params ip=192.168.143.228 cidr_netmask=24 \
         op monitor interval=30s \
         meta target-role=Started
 primitive p_drbd_nfs ocf:linbit:drbd \
         params drbd_resource=home-data \
         op monitor interval=15 role=Master \
         op monitor interval=30 role=Slave
 primitive p_exportfs_afs ocf:heartbeat:exportfs \
         params fsid=1 directory=/srv/nfs/afs options=rw,no_root_squash,mountpoint \
         clientspec=192.168.143.0/255.255.255.0 \
         wait_for_leasetime_on_stop=false \
         op monitor interval=30s
 primitive p_fs_afs ocf:heartbeat:Filesystem \
         params device=/dev/home-data/afs directory=/srv/nfs/afs \
        fstype=ext4 \
         op monitor interval=10s \
         meta target-role=Started
 primitive p_lsb_nfsserver lsb:nfs-kernel-server \
         op monitor interval=30s
 primitive p_lvm_nfs ocf:heartbeat:LVM \
         params volgrpname=home-data \
         op monitor interval=30s
 group g_nfs p_lvm_nfs p_fs_afs p_exportfs_afs ClusterIP
 ms ms_drbd_nfs p_drbd_nfs \
         meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true target-role=Started
 clone cl_lsb_nfsserver p_lsb_nfsserver
 colocation c_nfs_on_drbd inf: g_nfs ms_drbd_nfs:Master
 order o_drbd_before_nfs inf: ms_drbd_nfs:promote g_nfs:start
 property $id=cib-bootstrap-options \
         dc-version=1.0.9-unknown \
         cluster-infrastructure=openais \
         expected-quorum-votes=2 \
         stonith-enabled=false \
         no-quorum-policy=ignore \
         last-lrm-refresh=1302610197
 rsc_defaults $id=rsc-options \
         resource-stickiness=200


 Thanks
 Christoph
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Filesystem thinks it is run as a clone

2011-04-13 Thread Andrew Beekhof
On Wed, Apr 13, 2011 at 10:57 AM, Christoph Bartoschek
bartosc...@or.uni-bonn.de wrote:
 Am 13.04.2011 08:26, schrieb Andrew Beekhof:
 On Tue, Apr 12, 2011 at 5:17 PM, Christoph Bartoschek
 bartosc...@or.uni-bonn.de  wrote:
 Hi,

 today we tested some NFS cluster scenarios and the first test failed.
 The first test was to put the current master node into standby. Stopping
 the services worked but then starting it on the other node failed. The
 ocf:hearbeat:Filesystem resource failed to start. In the logfile we see:

 Apr 12 14:08:42 laplace Filesystem[10772]: [10820]: INFO: Running start
 for /dev/home-data/afs on /srv/nfs/afs
 Apr 12 14:08:42 laplace Filesystem[10772]: [10822]: ERROR: DANGER! ext4
 on /dev/home-data/afs is NOT cluster-aware!
 Apr 12 14:08:42 laplace Filesystem[10772]: [10824]: ERROR: DO NOT RUN IT
 AS A CLONE!

 To my eye the Filesystem agent looks confused

 The agent is confused because OCF_RESKEY_CRM_meta_clone is non-zero. Is
 this something that can happen?

Not unless the resource has been cloned - and looking at the config
this did not seem to be the case.
Or did I miss something?



 The message comes from the following code in ocf:hearbeat:Filesystem:

 case $FSTYPE in
 ocfs2)  ocfs2_init
          ;;
 nfs|smbfs|none|gfs2)    : # this is kind of safe too
          ;;
*)      if [ -n "$OCF_RESKEY_CRM_meta_clone" ]; then
                  ocf_log err "DANGER! $FSTYPE on $DEVICE is NOT cluster-aware!"
                  ocf_log err "DO NOT RUN IT AS A CLONE!"
                  ocf_log err "Politely refusing to proceed to avoid data corruption."
                  exit $OCF_ERR_CONFIGURED
          fi
          ;;
 esac


 The message is only printed if the variable OCF_RESKEY_CRM_meta_clone is
 non-zero. Our configuration however does not run the filesystem as a
 clone. Somehow the OCF_RESKEY_CRM_meta_clone variable leaked into the
 start of the Filesystem resource. Is this a known bug?

 Or is there a configuration error on our side? Here is the current
 configuration:


 node laplace \
          attributes standby=off
 node ries \
          attributes standby=off
 primitive ClusterIP ocf:heartbeat:IPaddr2 \
          params ip=192.168.143.228 cidr_netmask=24 \
          op monitor interval=30s \
          meta target-role=Started
 primitive p_drbd_nfs ocf:linbit:drbd \
          params drbd_resource=home-data \
          op monitor interval=15 role=Master \
          op monitor interval=30 role=Slave
 primitive p_exportfs_afs ocf:heartbeat:exportfs \
          params fsid=1 directory=/srv/nfs/afs
 options=rw,no_root_squash,mountpoint \
          clientspec=192.168.143.0/255.255.255.0 \
          wait_for_leasetime_on_stop=false \
          op monitor interval=30s
 primitive p_fs_afs ocf:heartbeat:Filesystem \
          params device=/dev/home-data/afs directory=/srv/nfs/afs \
         fstype=ext4 \
          op monitor interval=10s \
          meta target-role=Started
 primitive p_lsb_nfsserver lsb:nfs-kernel-server \
          op monitor interval=30s
 primitive p_lvm_nfs ocf:heartbeat:LVM \
          params volgrpname=home-data \
          op monitor interval=30s
 group g_nfs p_lvm_nfs p_fs_afs p_exportfs_afs ClusterIP
 ms ms_drbd_nfs p_drbd_nfs \
          meta master-max=1 master-node-max=1 clone-max=2
 clone-node-max=1 notify=true target-role=Started
 clone cl_lsb_nfsserver p_lsb_nfsserver
 colocation c_nfs_on_drbd inf: g_nfs ms_drbd_nfs:Master
 order o_drbd_before_nfs inf: ms_drbd_nfs:promote g_nfs:start
 property $id=cib-bootstrap-options \
          dc-version=1.0.9-unknown \
          cluster-infrastructure=openais \
          expected-quorum-votes=2 \
          stonith-enabled=false \
          no-quorum-policy=ignore \
          last-lrm-refresh=1302610197
 rsc_defaults $id=rsc-options \
          resource-stickiness=200


 Thanks
 Christoph
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems


 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-ha-dev] Resource agent implementing SPC-3 Persistent Reservations (contribution from Evgeny Nifontov)

2011-04-12 Thread Andrew Beekhof
Awesome. I was wondering if someone would ever write one of these :)

On Tue, Apr 12, 2011 at 10:29 AM, Florian Haas florian.h...@linbit.com wrote:
 Hi everyone,

 Evgeny Nifontov has started to implement sg_persist, a resource agent
 managing SPC-3 Persistent Reservations (PRs) using the sg_persist
 binary. He's put up a personal repo on Github and the initial commit is
 here:

 https://github.com/nif/ClusterLabs__resource-agents/commit/d0c46fb35338d28de3e2c20c11d0ad01dded13fd

 I've added some comments for an initial review. Everyone interested
 please pitch in.

  Thanks to Evgeny for the contribution!

 Cheers,
 Florian



 ___
 Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
 Home Page: http://linux-ha.org/


___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


Re: [Linux-HA] Problem using Stonith external/ipmi device

2011-04-11 Thread Andrew Beekhof
On Fri, Apr 8, 2011 at 11:07 AM, Matthew Richardson
m.richard...@ed.ac.uk wrote:
 On 07/04/11 16:36, Dejan Muhamedagic wrote:
 For whatever reason stonith-ng doesn't think that
 stonithipmidisk1 can manage this node. Which version of
 Pacemaker do you run? Perhaps this has been fixed in the
 meantime. I cannot recall right now if there has been such a
 problem, but it's possible. You can also try to turn debug on
 and see if there are more clues.

 I'm using Pacemaker 1.1.5 from the clusterlabs rpm-next repositories on el5.

 I've tried turning on debug, but there's no more information coming out
 in the logs.

man stonithd has the bits you need.
start with pcmk_host_check



 Thanks,

 Matthew

 --
 The University of Edinburgh is a charitable body, registered in
 Scotland, with registration number SC005336.


 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] HA software download

2011-04-07 Thread Andrew Beekhof
On Wed, Apr 6, 2011 at 1:28 PM, Ajaykumar Narayanaswamy
ajaykumar_narayanasw...@mindtree.com wrote:
 Hi All,

 I would like to know whether Linux OS has any inbuilt HA/Failover software or 
 should we procure some third-party HA s/w.

 I came to know about heartbeat package which is an Open source application 
 and also have downloaded the same, but does this help in providing failover 
 for LDAP Server running on Linux OS for about 2000 SAP Users who would be 
 using it for authentication.

Heartbeat and/or Pacemaker (the bit that used to be the crm in
heartbeat v2) are shipped by most major distributions and can handle
this kind of task.
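
A very rough sketch of what such an active/passive pair could look like in crm
syntax (the IP address is a placeholder, and lsb:ldap assumes the distribution
ships an LSB-compliant init script for the directory server):

primitive ldap_ip ocf:heartbeat:IPaddr2 \
        params ip="192.168.1.10" cidr_netmask="24" \
        op monitor interval="10s"
primitive ldap_srv lsb:ldap \
        op monitor interval="30s"
group ldap_grp ldap_ip ldap_srv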


 Looking forward to hearing from you.

 Regards,
 Ajay

 

 http://www.mindtree.com/email/disclaimer.html
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] When is the next release for resource agents?

2011-04-07 Thread Andrew Beekhof
On Wed, Apr 6, 2011 at 4:55 PM, Serge Dubrouski serge...@gmail.com wrote:
 Hello -

 When is the next release for resource agents? Agents that come with
 resource-agents-1.0.3-2.6.el5 form clusterlabs repository are very
 outdated.pgsql is at least one year old or so.

in most cases there's not really a need for clusterlabs to ship the
entire stack anymore (plus its a heap of work for me).
instead the pacemaker packages simply build against whatever the
distro provides.

 el5 is the exception, i'll try to update it there in the coming days.


 --
 Serge Dubrouski.
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] HA software download

2011-04-07 Thread Andrew Beekhof
On Thu, Apr 7, 2011 at 12:12 PM, Ajaykumar Narayanaswamy
ajaykumar_narayanasw...@mindtree.com wrote:
 Hi Andrew,

  I have a small query: is it okay to have two LDAP servers running on Linux OS 
  configured in Active/Active mode, or should they be configured in 
  Active/Passive mode? Because as per my knowledge..

  LDAP HA should be configured in Active/Passive: one server should be enabled 
  in R/W mode and the other in R/O mode, so that we can sync the R/O copy, 
  otherwise syncing will be a problem; and in case the active goes down we can 
  make the passive live by changing the IP address.

 Could you please throw some light on this query???

No. I've never run an ldap server. Sorry.


 Thx for lending help..

 Regards,
 Ajaykumar

 -Original Message-
 From: Ajaykumar Narayanaswamy
 Sent: Thursday, April 07, 2011 12:41 PM
 To: 'Andrew Beekhof'
 Subject: RE: [Linux-HA] HA software download

 Thx a lot Andrew..

 Indeed a great help, thx a lot once again.

 Regards,
 Ajaykumar

 -Original Message-
 From: Andrew Beekhof [mailto:and...@beekhof.net]
 Sent: Thursday, April 07, 2011 12:30 PM
 To: General Linux-HA mailing list
 Cc: Ajaykumar Narayanaswamy
 Subject: Re: [Linux-HA] HA software download

 On Wed, Apr 6, 2011 at 1:28 PM, Ajaykumar Narayanaswamy
 ajaykumar_narayanasw...@mindtree.com wrote:
 Hi All,

 I would like to know whether Linux OS has any inbuilt HA/Failover software 
 or should we procure some third-party HA s/w.

 I came to know about heartbeat package which is an Open source application 
 and also have downloaded the same, but does this help in providing failover 
 for LDAP Server running on Linux OS for about 2000 SAP Users who would be 
 using it for authentication.



 Heartbeat and/or Pacemaker (the bit that used to be the crm in
 heartbeat v2) are shipped by most major distributions and can handle
 this kind of task.


 Looking forward to hearing from you.

 Regards,
 Ajay

 

 http://www.mindtree.com/email/disclaimer.html
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] heartbeat ordering

2011-04-06 Thread Andrew Beekhof
On Tue, Apr 5, 2011 at 11:58 AM, Maxim Ianoglo dot...@gmail.com wrote:
 Hello,

 I have four serves in a HA cluster:
 NodeA
 NodeB
 NodeC
 NodeD

 There are defined three groups of resources and one inline resource:
 1. group_storage ( NFS VIP, NFS Server, DRBD )
 2. group_apache_www (Domains VIPs and Apache)
 3. group_nginx_www (Static files with nginx)
 4. inline_nfs_client ( NFS client )

 (1) should run only on NodeC or NodeD. NodeC is preferable. NodeD for backup.
 (2) should run on NodeC and NodeD. NodeD is preferable. NodeC for backup.
 (3) should run on NodeC and NodeD. NodeC is preferable. NodeD for backup.
 (4) should run on every node except for node on which (1) is located.

 I have following orders:
 (2) depends on (1) and (4)
 (3) depends on (1) and (4)
 (4) depends on (1)

 Collocations:
 (4) and (1) should not run on same node.

 The issue is that resource (4) chooses NodeC, which is the default node for
 (1), so (1) has to choose a node other than NodeC and goes to NodeD.
 How can I make resource (1) choose its node earlier than (4) and any
 other resource?

Swap the order resources are listed in the colocation constraint.
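
(In crm shell terms, using the poster's resource names, the suggestion amounts to something like the sketch below. The original constraint isn't shown, so this is only illustrative: the resource listed second is placed first, so the storage group picks its node before the NFS client has to avoid it.)

    colocation nfs-client-not-with-storage -inf: inline_nfs_client group_storage
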
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] why Cluster restarts A, before starting B on surviving node.

2011-04-06 Thread Andrew Beekhof
I meant in the form of a hb_report which contains the necessary logs
and status information necessary to diagnose your issue.

On Mon, Apr 4, 2011 at 12:11 PM, Muhammad Sharfuddin
m.sharfud...@nds.com.pk wrote:

 On Mon, 2011-04-04 at 10:42 +0200, Andrew Beekhof wrote:
 On Thu, Mar 24, 2011 at 7:42 PM, Muhammad Sharfuddin
 m.sharfud...@nds.com.pk wrote:
  we have two resources A and B
  Cluster starts A on node1, and B on node2, while failover node for A is
  node2 and failover node for B is node1
 
  B cant start without A, so I have following location rules:
 
           order first_A_then_B : A  B
 
  Problem/Question
  
  now if B fails due to node failure, Cluster restarts A, before
  starting B on surviving node(node1).
 
  my question/problem, is why Cluster restarts A.

 my question/problem, is that you've given us no information on which
 to base a reply.
 SLES 11 SP1 updated
 SLE HAE SP1 + updated
 node1 hostname:

 this is a 'distributed' and/or 'Active/Active', two nodes Cluster.

 Scenario:
 Cluster starts resource A on node1, and resource B on node2, due to
 following location constraints:

  location PrimaryLoc-of-A A +inf: node1
  location PrimaryLoc-of-B B +inf: node2


 B is a resource that is dependent on resource A, therefore I have an order
 constraint:

   order first_A_then_B : A  B

 Now node2 has gone down, so the cluster starts moving resource B (i.e. resource 'B'
 fails over) to node1 (where resource A is already running).. but during
 this process the cluster first stops and starts (restarts) resource A, and
 then starts B.

 Problem/Question:

 Why Cluster restarts resource 'A' during failover process of resource B



___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] crm commands : how to reduce the delay between two commands

2011-04-06 Thread Andrew Beekhof
On Fri, Mar 25, 2011 at 2:07 PM, Alain.Moulle alain.mou...@bull.net wrote:
 Hi,
 I tried but it does not work :
 crm_resource -r resname -p target-role -v started
 because it adds a target-role=started as params
 whereas I already have a meta target-role=Stopped
 so resource does not start.
 So I tried :
 crm_resource -r resname -m -p target-role -v started
 then resource starts successfully.
 But with a loop:
 for i in {1..20}; do echo resname$i ; crm_resource -r resname$i -m -p
 target-role -v started; done
 The first one is started immediately, and the 19 other ones are
 started ~20s after the first one,
 but all in one salvo.
 So it seems to be quite the same behavior as successive crm resource
 start resname$i commands.
 The first command is taken into account immediately, then there is a delay,
 perhaps before polling for eventual
 other crm commands; but as during this delay my loop has already sent
 19 commands, these are
 taken into account in one shot when the new polling occurs.

 Meaning that, manually, if you wait until the expected result of your crm
 command is displayed in crm_mon
 before sending the second one etc., there is always this 10 to 20s
 latency between commands.
 (Same behavior inside scripts if the script waits for the command to be
 really completed by testing ...)

 Hope my description is clear enough ...

Yes.  Looks like something in core pacemaker.
Could you file a bug for this and include the output of your above
testcase but with - added to the crm_resource command line please?
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] DRBD and pacemaker interaction

2011-04-05 Thread Andrew Beekhof
On Mon, Apr 4, 2011 at 10:14 PM, Lars Ellenberg
lars.ellenb...@linbit.com wrote:
 On Mon, Apr 04, 2011 at 09:43:27AM +0200, Andrew Beekhof wrote:
  I am missing the state: running degraded or suboptimal.
 
  Yep, degraded is not a state available for pacemaker.
  Pacemaker cannot do much about suboptimal.


 Maybe we need to add OCF_RUNNING_BUT_DEGRADED to the OCF spec (and the PE).

 And, of course, OCF_MASTER_BUT_ONLY_ONE_FAILURE_AWAY_FROM_COMPLETE_DATA_LOSS

Feeling quite alright there?

 If it makes people happy to see

  Master/Slave Set: ms_drbd_data (DEGRADED)
     p_drbd_data:0      (ocf::linbit:drbd):     Master bk1 (DEGRADED)
     p_drbd_data:1      (ocf::linbit:drbd):     Slave bk2 (DEGRADED)

 in crm_mon, then sure, go for it.

 Other than that, I don't think that pacemaker can do much about
 degraded resources.

The intention was that the PE would treat it the same as OCF_RUNNING -
hence the name.
It would exist purely to give admin tools the ability to provide
additional feedback to users - like you outlined above.

Essentially it would be a way for the RA to say "something isn't
right, but you (i.e. pacemaker) shouldn't do anything about it other
than let a human know."
Anything more complex is WAY out of scope.
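
(A purely hypothetical sketch of how an RA's monitor action might use such a code; OCF_RUNNING_DEGRADED does not exist in the shipped OCF headers, and app_is_running/app_is_degraded are placeholder helpers.)

    monitor() {
        app_is_running || return $OCF_NOT_RUNNING
        if app_is_degraded; then
            # hypothetical: surface the state to admin tools,
            # while the PE treats it the same as running
            return $OCF_RUNNING_DEGRADED
        fi
        return $OCF_SUCCESS
    }
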
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] DRBD and pacemaker interaction

2011-04-05 Thread Andrew Beekhof
On Tue, Apr 5, 2011 at 9:42 AM, Christoph Bartoschek
bartosc...@or.uni-bonn.de wrote:
 Am 04.04.2011 22:14, schrieb Lars Ellenberg:
 On Mon, Apr 04, 2011 at 09:43:27AM +0200, Andrew Beekhof wrote:
 I am missing the state: running degraded or suboptimal.

 Yep, degraded is not a state available for pacemaker.
 Pacemaker cannot do much about suboptimal.


 Maybe we need to add OCF_RUNNING_BUT_DEGRADED to the OCF spec (and the PE).

 And, of course, OCF_MASTER_BUT_ONLY_ONE_FAILURE_AWAY_FROM_COMPLETE_DATA_LOSS

 What about using the standard output of the monitor operation as a
 status string that is displayed by crm_mon if available?

 I can imagine that such a change is less intrusive.

Far from it, we'd need to start storing the stdout result in the CIB.
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] DRBD and pacemaker interaction

2011-04-04 Thread Andrew Beekhof
On Fri, Apr 1, 2011 at 8:13 PM, Christoph Bartoschek
bartosc...@or.uni-bonn.de wrote:
 Am 01.04.2011 16:38, schrieb Lars Ellenberg:
 On Fri, Apr 01, 2011 at 11:35:19AM +0200, Christoph Bartoschek wrote:
 Am 01.04.2011 11:27, schrieb Florian Haas:
 On 2011-04-01 10:49, Christoph Bartoschek wrote:
 Am 01.04.2011 10:27, schrieb Andrew Beekhof:
 On Sat, Mar 26, 2011 at 12:10 AM, Lars Ellenberg
 lars.ellenb...@linbit.com    wrote:
 On Fri, Mar 25, 2011 at 06:18:07PM +0100, Christoph Bartoschek wrote:
 I am missing the state: running degraded or suboptimal.

 Yep, degraded is not a state available for pacemaker.
 Pacemaker cannot do much about suboptimal.

 I wonder what it would take to change that.  I suspect either a
 crystal ball or way too much knowledge of drbd internals.

 The RA would be responsible to check this. For drbd any diskstate
 different from UpToDate/UpToDate is suboptimal.

 Have you actually looked at the resource agent? It does already evaluate
 the disk state and adjusts the master preference accordingly. What else
 is there to do?

 Maybe I misunderstood Andrew's comment. I read it this way:  If we
 introduce a new state suboptimal, would it be hard to detect it?

 I just wanted to express that detecting suboptimality seems not to be
 that hard.

 But that state is useless for pacemaker,
 since it cannot do anything about it.

 I thought I made that clear.


 You made clear that pacemaker cannot do anything about it. However
 crm_mon could report it. One may think that is can be neglected. But the
 current output of crm_mon is unexpected for me.

Maybe we need to add OCF_RUNNING_BUT_DEGRADED to the OCF spec (and the PE).
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] NFS cluster after node crash

2011-04-04 Thread Andrew Beekhof
On Thu, Mar 24, 2011 at 9:58 PM, Christoph Bartoschek
bartosc...@or.uni-bonn.de wrote:
 It seems as if the g_nfs service is stopped on the surviving node when
 the other one comes up again.

To me it looks like the service gets stopped after it fails:

p_exportfs_root:0_monitor_3 (node=laplace, call=12, rc=7,
status=complete): not running



 Does anyone see a reason why the service does not continue to run?

 Christoph

 Am 22.03.2011 22:37, schrieb Christoph Bartoschek:
 Hi,

 I've created a NFS cluster after the linbit tutorial Highly available
 NFS storage with DRBD and Pacemaker.  Generally it seems to work fine.
 Today I simulated a node crash by just turning a machine off. Failover
 went fine. After 17 seconds the second node was able to serve the clients.

 But when I started the crashed node again the service went down. I
 wonder why the cluster did not just restart the services on the new
 node? Instead it tried to change status on the surviving node. What is
 going wrong?

 The resulting status is:


 Online: [ ries laplace ]

    Master/Slave Set: ms_drbd_nfs [p_drbd_nfs]
        Masters: [ ries ]
        Slaves: [ laplace ]
    Clone Set: cl_lsb_nfsserver [p_lsb_nfsserver]
        Started: [ ries laplace ]
    Resource Group: g_nfs
        p_lvm_nfs  (ocf::heartbeat:LVM):   Started ries
        p_fs_afs   (ocf::heartbeat:Filesystem):    Started ries
 (unmanaged) FAILED
        p_ip_nfs   (ocf::heartbeat:IPaddr2):       Stopped
    Clone Set: cl_exportfs_root [p_exportfs_root]
        p_exportfs_root:0  (ocf::heartbeat:exportfs):      Started laplace
 FAILED
        Started: [ ries ]

 Failed actions:
       p_exportfs_root:0_monitor_3 (node=laplace, call=12, rc=7,
 status=complete): not running
       p_fs_afs_stop_0 (node=ries, call=37, rc=-2, status=Timed Out):
 unknown exec error


 My configuration is:


 node laplace \
           attributes standby=off
 node ries \
           attributes standby=off
 primitive p_drbd_nfs ocf:linbit:drbd \
           params drbd_resource=afs \
           op monitor interval=15 role=Master \
           op monitor interval=30 role=Slave
 primitive p_exportfs_root ocf:heartbeat:exportfs \
           params fsid=0 directory=/srv/nfs
 options=rw,no_root_squash,crossmnt
 clientspec=192.168.1.0/255.255.255.0 wait_for_leasetime_on_stop=1 \
           op monitor interval=30s \
           op stop interval=0 timeout=100s
 primitive p_fs_afs ocf:heartbeat:Filesystem \
           params device=/dev/afs/afs directory=/srv/nfs/afs
 fstype=ext4 \
           op monitor interval=10s
 primitive p_ip_nfs ocf:heartbeat:IPaddr2 \
           params ip=192.168.1.100 cidr_netmask=24 \
           op monitor interval=30s \
           meta target-role=Started
 primitive p_lsb_nfsserver lsb:nfsserver \
           op monitor interval=30s
 primitive p_lvm_nfs ocf:heartbeat:LVM \
           params volgrpname=afs \
           op monitor interval=30s
 group g_nfs p_lvm_nfs p_fs_afs p_ip_nfs \
           meta target-role=Started
 ms ms_drbd_nfs p_drbd_nfs \
           meta master-max=1 master-node-max=1 clone-max=2
 clone-node-max=1 notify=true target-role=Started
 clone cl_exportfs_root p_exportfs_root \
           meta target-role=Started
 clone cl_lsb_nfsserver p_lsb_nfsserver \
           meta target-role=Started
 colocation c_nfs_on_drbd inf: g_nfs ms_drbd_nfs:Master
 colocation c_nfs_on_root inf: g_nfs cl_exportfs_root
 order o_drbd_before_nfs inf: ms_drbd_nfs:promote g_nfs:start
 order o_nfs_server_before_exportfs inf: cl_lsb_nfsserver
 cl_exportfs_root:start
 order o_root_before_nfs inf: cl_exportfs_root g_nfs:start
 property $id=cib-bootstrap-options \
           dc-version=1.1.5-ecb6baaf7fc091b023d6d4ba7e0fce26d32cf5c8 \
           cluster-infrastructure=openais \
           expected-quorum-votes=2 \
           stonith-enabled=false \
           no-quorum-policy=ignore \
           last-lrm-refresh=1300828539
 rsc_defaults $id=rsc-options \
           resource-stickiness=200


 Christoph
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems

 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Stonith resource appears to be active on 2 nodes ...

2011-04-04 Thread Andrew Beekhof
On Mon, Apr 4, 2011 at 9:03 AM, Alain.Moulle alain.mou...@bull.net wrote:
 Hi,
 I got this error :
 1301591983 2011 Mar 31 19:19:43 berlin5 daemon err crm_resource [36968]:
 ERROR: native_add_running: Resource
 stonith::fence_ipmilan:restofenceberlin4 appears to be active on 2 nodes.
 1301591983 2011 Mar 31 19:19:43 berlin5 daemon warning crm_resource
 [36968]: WARN: See
 http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information
 I check on this URL, and there are two listed potential causes :
 1. the resource is started at boot time : this is for sure not the case.
 2. the monitor op in fence_ipmilan might not be implemented correctly?
       Is a stonith resource required to be an OCF script?

This should be a higher level issue.  Possibly in stonith-ng.
Logs?


 I check the fence_ipmilan source :
 else if (!strcasecmp(op, "status") || !strcasecmp(op, "monitor")) {
                printf("Getting status of IPMI:%s...", ip);
                fflush(stdout);
                ret = ipmi_op(i, ST_STATUS, power_status);
                switch(ret) {
                case STATE_ON:
                  if (!strcasecmp(op, "status"))
                            printf("Chassis power = On\n");
                        translated_ret = ERR_STATUS_ON;
                        ret = 0;
                        break;
                case STATE_OFF:
                  if (!strcasecmp(op, "status"))
                            printf("Chassis power = Off\n");
                        translated_ret = ERR_STATUS_OFF;
                        ret = 0;
                        break;
                default:
                  if (!strcasecmp(op, "status"))
                            printf("Chassis power = Unknown\n");
                        translated_ret = ERR_STATUS_FAIL;
                        ret = 1;
                        break;
                }

 Any idea about where could be potentially the problem ?

 (knowing that I think that fence_ipmilan is NOT an OCF script, but a stonith
 script delivered by RH as fence_agents)

 Thanks a lot
 Alain
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] why Cluster restarts A, before starting B on surviving node.

2011-04-04 Thread Andrew Beekhof
On Thu, Mar 24, 2011 at 7:42 PM, Muhammad Sharfuddin
m.sharfud...@nds.com.pk wrote:
 we have two resources A and B
 Cluster starts A on node1, and B on node2, while failover node for A is
 node2 and failover node for B is node1

 B cant start without A, so I have following location rules:

          order first_A_then_B : A  B

 Problem/Question
 
 now if B fails due to node failure, Cluster restarts A, before
 starting B on surviving node(node1).

 my question/problem, is why Cluster restarts A.

my question/problem, is that you've given us no information on which
to base a reply.
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] update question

2011-04-01 Thread Andrew Beekhof
On Mon, Mar 28, 2011 at 9:08 PM, Miles Fidelman
mfidel...@meetinghouse.net wrote:
 Hi Folks,

 I'm getting ready to upgrade a 2-node HA cluster from Debian Etch to
 Squeeze.  I'd very much appreciate any suggestions regarding gotchas
 to avoid, and so forth.

 Basic current configuration:
 - 2 vanilla Intel-based servers, 4 SATA drives each
 - each machine: disks set up as 4-drive md10, LVM
 - xen 3.2 hypervisor, Debian Etch Dom0
 - DRBD 8.2, Pacemaker linking the two nodes
 - several Debian Etch DomUs

 Target configuration:
 - update to xen 4.1, Debian Squeeze Dom0
 - update to latest DRBD, Pacemaker
 - update DomUs on a case-by-case basis

 The most obvious question:  Are later versions of Xen, DRBD, Pacemaker
 compatible with the older ones?  I.e., can I take the simple approach:
 - migrate all DomUs to one machine
 - take the other machine off-line
 - upgrade Debian, Xen, DRBD, Pacemaker on the off-line node
 - bring that node back on-line (WILL THE NEW VERSIONS OF XEN, DRBD,
 PACEMAKER SYNC WITH THE PREVIOUS RELEASES ON THE OTHER NODE???)

I can only comment on Pacemaker:
  I think so, but you haven't indicated which versions

 - migrate stuff to the updated node
 - take the 2nd node off-line, update everything, bring it back up, resync

 1. Will that work?  (If not: Alternative suggestions?)
 2. Anything to watch out for?
 3. As an alternative, does it make sense to install Ganeti and use it to
 manage the cluster?  If so, any suggestions on a migration path?

 (Yes, it would be easier if I had one or two additional servers to use
 for intermediate staging, but such is life.)

 Thanks Very Much,

 Miles Fidelman

 --
 In theory, there is no difference between theory and practice.
 In<fnord> practice, there is.    Yogi Berra


 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] DRBD and pacemaker interaction

2011-04-01 Thread Andrew Beekhof
On Fri, Apr 1, 2011 at 4:38 PM, Lars Ellenberg
lars.ellenb...@linbit.com wrote:
 On Fri, Apr 01, 2011 at 11:35:19AM +0200, Christoph Bartoschek wrote:
 Am 01.04.2011 11:27, schrieb Florian Haas:
  On 2011-04-01 10:49, Christoph Bartoschek wrote:
  Am 01.04.2011 10:27, schrieb Andrew Beekhof:
  On Sat, Mar 26, 2011 at 12:10 AM, Lars Ellenberg
  lars.ellenb...@linbit.com   wrote:
  On Fri, Mar 25, 2011 at 06:18:07PM +0100, Christoph Bartoschek wrote:
  I am missing the state: running degraded or suboptimal.
 
  Yep, degraded is not a state available for pacemaker.
  Pacemaker cannot do much about suboptimal.
 
  I wonder what it would take to change that.  I suspect either a
  crystal ball or way too much knowledge of drbd internals.
 
  The RA would be responsible to check this. For drbd any diskstate
  different from UpToDate/UpToDate is suboptimal.
 
  Have you actually looked at the resource agent? It does already evaluate
  the disk state and adjusts the master preference accordingly. What else
  is there to do?

 Maybe I misunderstood Andrew's comment. I read it this way:  If we
 introduce a new state suboptimal, would it be hard to detect it?

No, detecting is the easy part.

 I just wanted to express that detecting suboptimality seems not to be
 that hard.

 But that state is useless for pacemaker,
 since it cannot do anything about it.

This was the part I was wondering about - if pacemaker _could_ do
something intelligent.
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Sort of crm commandes but off line ?

2011-03-25 Thread Andrew Beekhof
On Thu, Mar 24, 2011 at 2:32 PM, Alain.Moulle alain.mou...@bull.net wrote:
 Hi,
 Ok, I think my question was not clear: in fact, the problem is not whether to
 do ssh node crm ... or not, the problem is just to know the
 hostname of the node to ssh to, in another way than parsing
 the cib.xml to know which other nodes are in the same HA cluster
 as the node where I am (knowing that corosync is stopped on this
 local node).

Add a floating IP with no quorum requirement (so that its always
running as long as at least one node is) and set up an A record
pointing clusterX.bull.net to it?
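
(A minimal sketch of such an address, with invented values; whether it stays up without quorum depends on the cluster's no-quorum-policy setting, not on the resource itself.)

    primitive cluster-access-ip ocf:heartbeat:IPaddr2 \
        params ip=10.0.0.250 cidr_netmask=24 \
        op monitor interval=30s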

 Thanks
 Regards.
 Alain
 This might sound obvious but is an ssh call acceptable?

 On 3/23/2011 8:38 AM, Alain.Moulle wrote:

 Hi,

 I'm looking for a command which will give me information about the HA
 cluster,
 such as for example the hostnames of all nodes which are in the same HA cluster, BUT
 from a node where Pacemaker is not active.

 For example: I have a cluster with node1, node2, node3
 Pacemaker is running on node2 and node3
 Pacemaker is not running on node1 , so any crm command returns
     Signon to CIB failed: connection failed
     Init failed, could not perform requested operations
 I'm on node1 : I want to know (by script) if Pacemaker is active
 on at least another node in the HA cluster including the node
 where I am (so node1)

 Is there a command which could give me such information offline,
 or do I have to scan the uname fields in the nodes section of the CIB
 and ssh to the other nodes to get the information ...

 Thanks
 Alain
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems



 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems




 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] comments required on location rules

2011-03-25 Thread Andrew Beekhof
should work.  depends what stickiness value you're using
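
(Illustration with invented scores: with a finite location score such as

    location PrimaryLoc-of-grpSAPDatabase grp-SAPDatabase 200: node1
    rsc_defaults resource-stickiness=500

a group that has failed over stays put, because 500 beats 200; with a stickiness below the location score, or with an INFINITY score as in the rules quoted below, it moves back to its preferred node once that node returns.)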

On Sat, Mar 19, 2011 at 11:48 AM, Muhammad Sharfuddin
m.sharfud...@nds.com.pk wrote:
 we have two resource groups 'grp-SAPDatabase' and 'grp-SAPInstances'

 To better utilize both of our machines, I want to run the 'grp-SAPDatabase'
 and 'grp-SAPInstances' resources on different nodes, so that we can use
 both nodes simultaneously; otherwise, if both resources run on a single
 node, the other node remains idle/passive.

 I want the grp-SAPDatabase resource to always run on node1, with
 node2 as its failover node,
 while the grp-SAPInstances resource should always run on node2, with
 node1 as its failover node.

 and for that I have created following location rules:

 location PrimaryLoc-of-grpSAPDatabase grp-SAPDatabase +inf: node1
 location PrimaryLoc-of-grpSAPInstances grp-SAPInstances +inf: node2

 please provide your comments/feedbacks on above rules.

 --
 Regards,
 Muhammad Sharfuddin | NDS Technologies Pvt Ltd | cell: +92-333-2144823 |
 UAN: +92-21-111-111-142 ext: 113

 The London Stock Exchange moves to SUSE Linux
 http://www.computerworlduk.com/news/open-source/3260727/london-stock-exchange-in-historic-linux-go-live/
 http://www.zdnet.com/blog/open-source/the-london-stock-exchange-moves-to-novell-linux/8285

 Your Linux is Ready
 http://www.novell.com/linux

 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] need help to configure the fence_ifmib for stonith

2011-03-17 Thread Andrew Beekhof
On Thu, Mar 17, 2011 at 11:25 AM, Amit Jathar amit.jat...@alepo.com wrote:
 Hi,

 I would like to try the fence_ifmib as the fencing agent.

 I can see it is present in my machine.
 [root@OEL6_VIP_1 fence]# ls /usr/sbin/fence_ifmib
 /usr/sbin/fence_ifmib

 Also, I can see some python scripts present on my machine :-
 [root@OEL6_VIP_1 fence]# pwd
 /usr/share/fence
 [root@OEL6_VIP_1 fence]# ls
 fencing.py  fencing.pyc  fencing.pyo  fencing_snmp.py  fencing_snmp.pyc  
 fencing_snmp.pyo
 [root@OEL6_VIP_1 fence]#

 Is there any chance I can configure the if_mib as the stonith agent.

yes, but only if you have pacemaker 1.1.x

 If yes, then which MIB files shall I need ?

no idea
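
(Relating to the first answer: a rough sketch of what such a primitive might look like under 1.1.x. The parameter names and values below are guesses and should be checked against the agent's own metadata, e.g. fence_ifmib -o metadata.)

    primitive fence-switch stonith:fence_ifmib \
        params ipaddr=10.0.0.254 community=private port=Gi0/1 \
              pcmk_host_list=node1
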
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-ha-dev] new resource agents repository commit policy

2011-03-14 Thread Andrew Beekhof
On Mon, Mar 14, 2011 at 6:07 PM, Dejan Muhamedagic de...@suse.de wrote:
 Hello everybody,

 It's time to figure out how to maintain the new Resource Agents
 repository. Fabio and I already discussed this a bit in IRC.
 There are two options:

 a) everybody gets an account at github.com and commit rights,
   where everybody is all people who had commit rights to
   linux-ha.org and rgmanager agents repositories.

 b) several maintainers have commit rights and everybody else
   sends patches to a ML; then one of the maintainers does a
   review and commits the patch (or pulls it from the author's
   repository).

I suspect you want b) with maybe 6 people for redundancy.
The pull request workflow should be well suited to a project like this
and impose minimal overhead.

The ability to comment on patches in-line before merging them should
be pretty handy.


You're also welcome to put a copy at http://www.clusterlabs.org/git/
Its pretty easy to keep the two repos in sync, for example I have this
in .git/config for matahari:

[remote origin]
fetch = +refs/heads/*:refs/remotes/origin/*
url = g...@github.com:matahari/matahari.git
pushurl = g...@github.com:matahari/matahari.git
pushurl = ssh://beek...@git.fedorahosted.org/git/matahari.git

git push then sends to both locations


 Option a) incurs a bit less overhead and that's how our old
 repositories worked. Option b) gives, at least nominally, more
 control to the select group of maintainers, but also places even
 more burden on them.

 We are open for either of these.

 Cheers,

 Fabio and Dejan
 ___
 Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
 Home Page: http://linux-ha.org/

___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


Re: [Linux-HA] question on Creating an Active/Passive iSCSI configuration

2011-03-14 Thread Andrew Beekhof
On Fri, Mar 11, 2011 at 7:23 PM, Randy Katz rk...@simplicityhosting.com wrote:
 On 3/11/2011 3:29 AM, Dejan Muhamedagic wrote:
 Hi,

 On Fri, Mar 11, 2011 at 01:36:25AM -0800, Randy Katz wrote:
 On 3/11/2011 12:50 AM, RaSca wrote:
 Il giorno Ven 11 Mar 2011 07:32:32 CET, Randy Katz ha scritto:
 ps - in /var/log/messages I find this:

 Mar 10 22:31:45 drbd1 lrmd: [3274]: ERROR: get_resource_meta: pclose
 failed: Interrupted system call
 Mar 10 22:31:45 drbd1 lrmd: [3274]: WARN: on_msg_get_metadata: empty
 metadata for ocf::linbit::drbd.
 Mar 10 22:31:45 drbd1 lrmadmin: [3481]: ERROR:
 lrm_get_rsc_type_metadata(578): got a return code HA_FAIL from a reply
 message of rmetadata with function get_ret_from_msg.
 [...]

 Hi,
 I think that the message no such resource agent is explaining what's
 the matter.
 Does the file /usr/lib/ocf/resource.d/linbit/drbd exists? Is the drbd
 file executable? Have you correctly installed the drbd packages?

 Check those things, you can try to reinstall drbd.

 Hi

 # ls -l /usr/lib/ocf/resource.d/linbit/drbd
 -rwxr-xr-x 1 root root 24523 Jun  4  2010
 /usr/lib/ocf/resource.d/linbit/drbd
 Which cluster-glue version do you run?
 Try also:

 # lrmadmin -C
 # lrmadmin -P ocf drbd
 # export OCF_ROOT=/usr/lib/ocf
 # /usr/lib/ocf/resource.d/linbit/drbd meta-data
 I am running from a source build/install as per clusterlabs.org as the
 rpm's had broken dependencies and
 would not install.

Its a good idea to report that, with details, so that it can get fixed.

 I have now blown away that CentOS (one of them)
 machine and installed openSUSE as they
 said everything was included but it seems on 11.3 not on 11.4, on 11.4
 the install is broken and so now
 running some later versions and running into some other issues,
 will report back with findings. What
 OS distro is the least of the problems to get this stuff running on? I
 just want to get it running, run a few tests,
 and then figure out where to go from there.

 Thanks,
 Randy
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] GFS2 mounting hangs

2011-03-11 Thread Andrew Beekhof
On Thu, Mar 10, 2011 at 5:53 PM, Jonathan Schaeffer
jonathan.schaef...@univ-brest.fr wrote:
 Hi,

 I'm trying to setup a pacemaker cluster based on DRBD Active/Active and
 GFS2.

 Everything is working fine on normal startup. But when I try to mess
 around with the cluster, I come across unrecoverable problems with the
 GFS2 partition mounting.

 Here is what I did and what happens :

  - Remove the network link between the two nodes.
  - Show how the cluster behaves for a while
  - Get the network interface up again
  - As one machine was stonithed by the other (meatware for the tests),
 I restarted the node.

Did you run the meatware confirmation command too?
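
(For reference, and from memory rather than from the original thread: the meatware plugin waits for a manual acknowledgement along the lines of

    meatclient -c <nodename>

run on the node that requested the fencing; the exact instruction is printed in that node's logs.)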

  - on reboot, the cluster can't get the Filesystem resource up and hits the
 timeout.

 This is what I did to show details of the mounting operation :

    # strace /sbin/mount.gfs2 /dev/drbd0 /data -o rw
    ...
    socket(PF_FILE, SOCK_STREAM, 0)         = 3
    connect(3, {sa_family=AF_FILE, path=@gfsc_sock}, 12) = 0
    write(3,
 \\o\\o\1\0\1\0\7\0\0\0\0\0\0\0`p\0\0\0\0\0\0\0\0\0\0\0\0\0\0...,
 28768) = 28768
    read(3,

 I suspect there is a problem with the DLM holding one more lock than
 necessary. The GFS partition was created with 2 journals (and has to run
 on 2 nodes).

 Does someone rely on such a setup for production use?
 Really?
 If so, can you help me debug my problem ? The pacemaker config is pretty
 much as in the docs (DRBD+GFS2). In case it matters, the config is shown
 below.

 Thank you !

 node orque \
        attributes standby=false
 node orque2 \
        attributes standby=off
 primitive drbd-data ocf:linbit:drbd \
        params drbd_resource=orque-raid \
        op start interval=0 timeout=240s start-delay=5s \
        op stop interval=0 timeout=100s \
        op monitor interval=30s timeout=30s start-delay=5s
 primitive dlm ocf:pacemaker:controld \
        op monitor interval=120s \
        op start interval=0 timeout=90s \
        op stop interval=0 timeout=100s
 primitive gfs-control ocf:pacemaker:controld \
        params daemon=gfs_controld.pcmk args=-g 0 \
        op monitor interval=120s \
        op start interval=0 timeout=90s \
        op stop interval=0 timeout=100s
 primitive orque-fs ocf:heartbeat:Filesystem \
        params device=/dev/drbd/by-res/orque-raid directory=/data
 fstype=gfs2 \
        op start interval=0 timeout=60s \
        op stop interval=0 timeout=60s
 primitive kvm-adonga ocf:heartbeat:VirtualDomain \
        params config=/etc/libvirt/qemu/adonga.xml
 hypervisor=qemu:///system migration_transport=ssh \
        meta allow-migrate=true target-role=Started is-managed=true \
        op start interval=0 timeout=200s \
        op stop interval=0 timeout=200s \
        op monitor interval=10 timeout=200s on-fail=restart depth=0
 primitive kvm-observatoire-test ocf:heartbeat:VirtualDomain \
        params config=/etc/libvirt/qemu/observatoire-test.xml
 hypervisor=qemu:///system migration_transport=ssh \
        meta allow-migrate=true target-role=Started is-managed=true \
        op start interval=0 timeout=200s \
        op stop interval=0 timeout=200s \
        op monitor interval=10 timeout=200s on-fail=restart depth=0
 primitive kvm-testVM ocf:heartbeat:VirtualDomain \
        params config=/etc/libvirt/qemu/testVM.xml
 hypervisor=qemu:///system migration_transport=ssh \
        meta allow-migrate=true target-role=Stopped is-managed=true \
        op start interval=0 timeout=200s \
        op stop interval=0 timeout=200s \
        op monitor interval=10 timeout=200s on-fail=restart depth=0
 primitive orque-fencing stonith:meatware \
        params hostlist=orque \
        meta is-managed=true
 primitive orque2-fencing stonith:meatware \
        params hostlist=orque2 \
        meta is-managed=true target-role=Started
 ms drbd-data-clone drbd-data \
        meta master-max=2 master-node-max=1 clone-max=2
 clone-node-max=1 notify=true
 clone dlm-clone dlm \
        meta interleave=true target-role=Started
 clone gfs-clone gfs-control \
        meta interleave=true target-role=Started
 clone orque-fs-clone orque-fs \
        meta is-managed=true target-role=Started interleave=true
 ordered=true
 location kvm-testVM-prefers-orque kvm-testVM 50: orque
 location loc-orque-fencing orque-fencing -inf: orque
 location loc-orque2-fencing orque2-fencing -inf: orque2
 colocation gfs-with-dlm inf: gfs-clone dlm-clone
 colocation kvm-adonga-with-orque-fs inf: kvm-adonga orque-fs-clone
 colocation kvm-observatoire-test-with-orque-fs inf:
 kvm-observatoire-test orque-fs-clone
 colocation kvm-testVM-with-orque-fs inf: kvm-testVM orque-fs-clone
 colocation orque-fs-with-gfs-control inf: orque-fs-clone gfs-clone
 order gfs-after-dlm inf: dlm-clone gfs-clone
 order kvm-adonga-after-orque-fs inf: orque-fs-clone kvm-adonga
 order kvm-observatoire-test-after-orque-fs inf: orque-fs-clone
 kvm-observatoire-test
 order kvm-testVM-after-orque-fs inf: orque-fs-clone kvm-testVM
 order orque-fs-after-drbd-data inf: 

Re: [Linux-HA] Active/active cluster with connectivity check

2011-03-11 Thread Andrew Beekhof
On Thu, Mar 10, 2011 at 3:54 PM, Artur linu...@netdirect.fr wrote:
 Hello,

 I'm currently switching to Heartbeat (3.0.3) and Pacemaker (1.0.9.1) on
 Debian Squeeze with CRM/CIB setup.
 This is the first time I have tried to configure it, so please be kind to a
 newbie. :)

 I would like to setup an active/active cluster with 2 nodes, with sticky
 resources and a connectivity check.
 At this time i have a basic working solution without connectivity check.
 The configuration follows at the bottom of the email.
 But i am unable to configure the connectivity check with the pingd resource.

 The (basic) configuration explained :
 - on node p01 there are 2 sticky resources called WWW1 (apache web
 server) and VIP1 (virtual IP)
 - on node p17 there is 1 sticky resource called VIP2 (virtual IP)

 In my mind this is how it should work :
 - if a node is down its sticky resources go to the other node (this works)
 - if a node has no connectivity the resources go to the other node
 (unable to make it work)
 - if a down node goes up, the sticky resources are migrated back on it
 (this works in the current setup with no connectivity check)

 I created a pingd primitive and cloned it as explained in tutorials and
 tried some rules but with no success.

 I tried to add the following rules with no success :

 location www1-on-p01-connected WWW1 \
    rule $id=www1-on-p01-connected-rule pingd: defined pingd \
    rule $id=www1-on-p01-connected-rule-0 -1000: not_defined pingd or
 pingd lte 0 \

change -1000 to -INFINITY

    rule $id=www1-on-p01-connected-rule-1 1000: #uname eq p01

 location vip2-on-p17-connected VIP2 \
    rule $id=vip2-on-p17-connected-rule pingd: defined pingd \
    rule $id=vip2-on-p17-connected-rule-0 -1000: not_defined pingd or
 pingd lte 0 \

here too

    rule $id=vip2-on-p17-connected-rule-1 1000: #uname eq p17
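
(Putting the suggestion together, the first constraint would end up roughly like this; same names as above, only the -1000 score changed, and likewise for vip2-on-p17-connected.)

    location www1-on-p01-connected WWW1 \
        rule $id=www1-on-p01-connected-rule pingd: defined pingd \
        rule $id=www1-on-p01-connected-rule-0 -inf: not_defined pingd or pingd lte 0 \
        rule $id=www1-on-p01-connected-rule-1 1000: #uname eq p01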

 This is the current setup without active connectivity check but with
 cloned pingd primitive :

 node $id=52dadc12-ada6-46a0-8474-639a62dfa3ad p17
 node $id=6d96beed-abd9-4ad1-9a92-b6560abc0475 p01
 primitive VIP1 ocf:heartbeat:IPaddr2 \
        params ip=192.168.1.201 cidr_netmask=32 iflabel=vip1 \
        op monitor interval=30s
 primitive VIP2 ocf:heartbeat:IPaddr2 \
        params ip=192.168.1.202 cidr_netmask=32 iflabel=vip2 \
        op monitor interval=30s
 primitive WWW1 ocf:heartbeat:apache \
        params configfile=/etc/apache2/apache2.conf
 primitive pingd ocf:pacemaker:pingd \
        params host_list=192.168.1.254 multiplier=1000 \
        op monitor interval=15s timeout=5s
 clone pingdclone pingd \
        meta globally-unique=false
 location vip2-on-p17 VIP2 250: p17
 location www1-on-p01 WWW1 250: p01
 colocation www1-with-vip1 inf: WWW1 VIP1
 order www1-after-vip1 inf: VIP1 WWW1
 property $id=cib-bootstrap-options \
        stonith-enabled=false \
        no-quorum-policy=ignore \
        dc-version=1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b \
        cluster-infrastructure=Heartbeat

 Any ideas about how to make it work ?

 --

 Best regards,

 Artur.

 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] resource not restarted due to score value

2011-03-11 Thread Andrew Beekhof
Happily this appears to be fixed in 1.1.5 (which I believe should be
available for SLES soon).

On Mon, Feb 7, 2011 at 9:17 AM, Haussecker, Armin
armin.haussec...@ts.fujitsu.com wrote:
 Hi,

 we have sles11 sp1 with pacemaker 1.1.2-0.7.1 and corosync 1.2.6-0.2.2.

 Attached please find
 cibadmin -Ql before stopping StorGr1 on node goat1 (diag.before)
 cibadmin -Ql after stopping StorGr1 on node goat1  (diag.after)
 crm_mon      after stopping StorGr1 on node goat1  (diag.crm_mon)
 ptest -sL    after stopping StorGr1 on node goat1  (diag.ptest)

 Regards,
 Armin Haussecker


 -Original Message-
 From: linux-ha-boun...@lists.linux-ha.org 
 [mailto:linux-ha-boun...@lists.linux-ha.org] On Behalf Of Andrew Beekhof
 Sent: Monday, February 07, 2011 8:46 AM
 To: General Linux-HA mailing list
 Subject: Re: [Linux-HA] resource not restarted due to score value

 On Fri, Feb 4, 2011 at 12:02 PM, Haussecker, Armin
 armin.haussec...@ts.fujitsu.com wrote:
 Hi,

 in our 2-node-cluster we have a clone resource StorGr1 and two primitive 
 resources
 DummyVM1 and DummyVM2.
 StorGr1 should be started before DummyVM1 and DummyVM2 due to order 
 constraints.
 StorGr1 clone was started on both cluster nodes goat1 and sheep1.
 DummyVM1 and DummyVM2 were both started on node goat1.

 Then we stopped StorGr1 on node goat1. We expected a restart of DummyVM1 and
 DummyVM2 on the second node sheep1 due to the order constraints.
 But only DummyVM2 was restarted on the second node sheep1.
 DummyVM1 was stopped and remained in the stopped state:

 Clone Set: StorGr1-clone [StorGr1]
     Started: [ sheep1 ]
     Stopped: [ StorGr1:1 ]
 DummyVM1        (ocf::pacemaker:Dummy): Stopped
 DummyVM2        (ocf::pacemaker:Dummy): Started sheep1

 Difference: DummyVM1 has a higher allocation score value for goat1 and
 DummyVM2 has a higher allocation score value for sheep1.

 How can we achieve a restart of the primitive resources independently of the
 allocation score value ?
 Do we need other or additional constraints ?

 Shouldn't need to.
 Please attach the result of cibadmin -Ql when the cluster is in this state.

 Also some indication of what version you're running would be helpful.


 Best regards,
 Armin Haussecker

 Extract from CIB:
 primitive DummyVM1 ocf:pacemaker:Dummy \
        op monitor interval=60s timeout=60s \
        op start on-fail=restart interval=0 \
        op stop on-fail=ignore interval=0 \
        meta is-managed=true resource-stickiness=1000 
 migration-threshold=2
 primitive DummyVM2 ocf:pacemaker:Dummy \
        op monitor interval=60s timeout=60s \
        op start on-fail=restart interval=0 \
        op stop on-fail=ignore interval=0 \
        meta is-managed=true resource-stickiness=1000 
 migration-threshold=2
 primitive StorGr1 ocf:heartbeat:Dummy \
        op monitor on-fail=restart interval=60s \
        op start on-fail=restart interval=0 \
        op stop on-fail=ignore interval=0 \
        meta is-managed=true resource-stickiness=1000 
 migration-threshold=2
 clone StorGr1-clone StorGr1 \
        meta target-role=Started interleave=true ordered=true

 location score-DummyVM1 DummyVM1 400: goat1
 location score-DummyVM2 DummyVM2 400: sheep1

 order start-DummyVM1-after-StorGr1-clone inf: StorGr1-clone DummyVM1
 order start-DummyVM2-after-StorGr1-clone inf: StorGr1-clone DummyVM2









 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems

 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems

 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Not able to stop services in group individually.

2011-03-09 Thread Andrew Beekhof
On Wed, Mar 2, 2011 at 9:31 AM, Caspar Smit c.s...@truebit.nl wrote:
 Hi,

 I have the following (simple) configuration:

 primitive iscsi0 ocf:heartbeat:iscsi \
 params portal=172.20.250.5 target=iqn.2010-10.nl.nas:nas.storage0
 primitive iscsi1 ocf:heartbeat:iscsi \
 params portal=172.20.250.21 target=iqn.2010-10.nl.nas:nas.storage1

 primitive failover-ip0 ocf:heartbeat:IPaddr2 \
 params ip=172.20.60.13 iflabel=0

 primitive lvm0 ocf:heartbeat:LVM \
 params volgrpname=vg0 exclusive=yes

 primitive filesystem0 ocf:heartbeat:Filesystem \
 params device=/dev/vg0/lv0 directory=/mnt/storage fstype=xfs

 primitive filesystem1 ocf:heartbeat:Filesystem \
 params device=/dev/vg0/lv1 directory=/mnt/storage2 fstype=xfs

 primitive nfs-server lsb:nfs-kernel-server

 primitive samba-server lsb:samba

 group nfs-and-samba-group iscsi0 iscsi1 failover-ip0 lvm0 filesystem0
 filesystem1 nfs-server samba-server

 location nfs-and-samba-group-prefer-node01 nfs-and-samba-group 100: node01


 So two iscsi initiators, then LVM on top of those, two filesystems (one for
 nfs exports and one for a samba share).

 What I noticed is that when I want to only stop the nfs-server (for doing
 some maintenance for instance) the samba-server is stopped also (because it
 is in a group and the order in the group seems like every primitive is
 required for the next primitive)

Right, that's how groups are supposed to work.

 (pacemaker reads the group from left to
 right)

 How would I be able to stop nfs-server and/or samba-server without
 interupting anything else in the group?

set is-managed=false for the group perhaps?
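
(One way to do that with the crm shell, sketched from memory: put the group into unmanaged mode, do the maintenance by hand, then hand control back.)

    crm resource unmanage nfs-and-samba-group
    /etc/init.d/nfs-kernel-server stop    # do the maintenance ...
    /etc/init.d/nfs-kernel-server start
    crm resource manage nfs-and-samba-group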


 Should I split those two from the group? But then I would need more
 constraints telling that the nfs-server and samba server can only start at
 the node where the iscsi initiator/LVM is up.
 And when I want to migrate the nfs-server to the other node, the
 samba-server and iscsi/LVM need to migrate also because of the large vg0?

 Can anyone tell me how to accomplish this?

 Thanks you very much in advance,

 Caspar Smit
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Load CRM-Konfiguration from file

2011-03-09 Thread Andrew Beekhof
try:

   crm configure < filename
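
(As an aside, not part of the original reply: if the goal is to swap in a whole prepared CIB rather than feed crm shell commands, the low-level route is usually something like

    cibadmin --replace --xml-file new-cib.xml

on a running cluster; new-cib.xml is a placeholder name.)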

On Wed, Mar 9, 2011 at 4:34 PM, Stallmann, Andreas astallm...@conet.de wrote:
 Hi there,

 is it possible to exchange a complete CIB with an other CIB?

 The background is that we have to roll out the same cluster in different
 customer environments with different IPs / networks.
 Instead of manipulating the CIB by hand via CRM, I'd rather replace 
 placeholders in a template cib via a script.

 I tried crm -f filename and crm < filename to no avail. crm then commits
 the changes line-by-line immediately, which can lead to undesirable
 side effects (because some primitives start at once, where I actually wanted
 the group to start instead etc.). Can one force crm into a batch mode, where
 the commit happens only when I want it to happen?

 Could I instead exchange the CIB XML file? If yes, which prerequisites do I
 have to take care of (I guess the cluster should be stopped, including
 corosync, right?)? And how do I generate a CIB file without the
 status information of a running system?  Would you be so kind as to point me to
 the right source of information (yes, that's a request for an RTFM *grin*).

 Thanks in advance,

 Andreas

 --
 CONET Solutions GmbH
 Andreas Stallmann,
 Theodor-Heuss-Allee 19, 53773 Hennef
 Tel.: +49 2242 939-677, Fax: +49 2242 939-393
 Mobil: +49 172 2455051
 Internet: http://www.conet.de, mailto: 
 astallm...@conet.demailto:astallm...@conet.de

 
 CONET Solutions GmbH, Theodor-Heuss-Allee 19, 53773 Hennef.
 Registergericht/Registration Court: Amtsgericht Siegburg (HRB Nr. 9136)
 Geschäftsführer/Managing Directors: Jürgen Zender (Sprecher/Chairman), Anke
 Höfer
 Vorsitzender des Aufsichtsrates/Chairman of the Supervisory Board: Hans
 Jürgen Niemeier

 CONET Technologies AG, Theodor-Heuss-Allee 19, 53773 Hennef.
 Registergericht/Registration Court: Amtsgericht Siegburg (HRB Nr. 10328 )
 Vorstand/Member of the Management Board: Rüdiger Zeyen (Sprecher/Chairman),
 Wilfried Pütz
 Vorsitzender des Aufsichtsrates/Chairman of the Supervisory Board: Dr. Gerd 
 Jakob
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Ping resource goes down and never comes up

2011-03-09 Thread Andrew Beekhof
On Thu, Feb 17, 2011 at 10:00 PM, RaSca ra...@miamammausalinux.org wrote:
 Hi all,
 is it possible that a ping_clone goes down on a node because there is no
 connectivity and never comes up again when the connectivity returns?
 The ping and clone resource is declared like this:

 primitive ping ocf:pacemaker:ping \
        params host_list=192.168.100.1 name=ping \
        op monitor interval=10s timeout=60s \
        op start interval=0 timeout=60s \
        op stop interval=0 timeout=60s
 clone ping_clone ping meta globally-unique=false

 I had to force a cleanup on this resource to make it up again. Also if
 there are some resources connected by a location like this:

 location tomcat_on_connected_node tomcat_clone \
        rule $id=tomcat_on_connected_node-rule -inf: not_defined ping or ping
 lte 0

 to the ping status those went down when the ping dies and obviously
 never comes up again when the connection returns.

we're not running ping as a daemon, it's spawned every time monitor is called.
hard to say much without logs


 Are there some parameters to force the cleanup or to manage these kind
 of situations?

 Thanks a lot!

 --
 RaSca
 Mia Mamma Usa Linux: Nothing is impossible to understand, if you explain it well!
 ra...@miamammausalinux.org
 http://www.miamammausalinux.org

 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] after configuring dlm resource, pacemaker on cman stack fails

2011-03-09 Thread Andrew Beekhof
Check that the correct daemon is being started (it looks like it's still
starting the Pacemaker-specific one).
Check what happens when you start the daemon manually.

On Fri, Feb 25, 2011 at 9:10 AM, Pieter Baele pieter.ba...@gmail.com wrote:
 I try to get clvmd (+cmirror) running on top of pacemaker - cman.

 After the initial setup, I defined a dlm resource
 primitive dlm ocf:pacemaker:controld op monitor interval=60 timeout=60

 Maybe this is the wrong way or I missed a step?
 What else is required?

 (no instructions in previous [Pacemaker] CMAN integration questions)


 Feb 25 08:55:45 node01 crmd: [5859]: info: update_dc: Unset DC node02
 Feb 25 08:55:45 node01 crmd: [5859]: info: do_state_transition: State
 transition S_NOT_DC - S_PENDING [ input=I_PENDING
 cause=C_FSA_INTERNAL origin=do_election_count_vote ]
 Feb 25 08:55:45 node01 crmd: [5859]: info: update_dc: Set DC to 1node02 
 (3.0.2)
 Feb 25 08:55:45 node01 cib: [5855]: info: write_cib_contents: Archived
 previous version as /var/lib/heartbeat/crm/cib-38.raw
 Feb 25 08:55:45 node01 cib: [5855]: info: write_cib_contents: Wrote
 version 0.24.0 of the CIB to disk (digest:
 4bcc6b0560ade75509e811d2cb89e3fa)
 Feb 25 08:55:45 node01 cib: [5855]: info: retrieveCib: Reading cluster
 configuration from: /var/lib/heartbeat/crm/cib.dZXxI6 (digest:
 /var/lib/heartbeat/crm/cib.KnMxRG)
 Feb 25 08:55:45 node01 crmd: [5859]: info: do_state_transition: State
 transition S_PENDING - S_NOT_DC [ input=I_NOT_DC cause=C_HA_MESSAGE
 origin=do_cl_join_finalize_respond ]
 Feb 25 08:55:45 node01 attrd: [5857]: info: attrd_local_callback:
 Sending full refresh (origin=crmd)
 Feb 25 08:55:45 node01 attrd: [5857]: info: attrd_trigger_update:
 Sending flush op to all hosts for: shutdown (null)
 Feb 25 08:55:45 node01 attrd: [5857]: info: attrd_trigger_update:
 Sending flush op to all hosts for: fail-count-dlm (null)
 Feb 25 08:55:45 node01 attrd: [5857]: info: attrd_trigger_update:
 Sending flush op to all hosts for: fail-count-ClusteredIP (null)
 Feb 25 08:55:45 node01 attrd: [5857]: info: attrd_trigger_update:
 Sending flush op to all hosts for: terminate (null)
 Feb 25 08:55:45 node01 attrd: [5857]: info: attrd_trigger_update:
 Sending flush op to all hosts for: last-failure-dlm (null)
 Feb 25 08:55:45 node01 attrd: [5857]: info: attrd_trigger_update:
 Sending flush op to all hosts for: last-failure-ClusteredIP (null)
 Feb 25 08:55:45 node01 attrd: [5857]: info: attrd_trigger_update:
 Sending flush op to all hosts for: probe_complete (true)
 Feb 25 08:55:46 node01 crmd: [5859]: info: do_lrm_rsc_op: Performing
 key=9:13:0:fb5708b9-4afe-41d3-b3f1-b9a47a7f29c6 op=dlm_start_0 )
 Feb 25 08:55:46 node01 lrmd: [5856]: info: rsc:dlm:6: start
 Feb 25 08:55:46 node01 lrmd: [5856]: info: RA output:
 (dlm:start:stderr) dlm_controld.pcmk: no process killed
 Feb 25 08:55:46 node01 cluster-dlm: [9561]: info: get_cluster_type:
 Cluster type is: 'cman'.
 Feb 25 08:55:46 node01 cluster-dlm: [9561]: info: get_local_node_name:
 Using CMAN node name: node01
 Feb 25 08:55:46 node01 cluster-dlm: [9561]: info:
 init_ais_connection_once: Connection to 'cman': established
 Feb 25 08:55:46 node01 cluster-dlm: [9561]: info: crm_new_peer: Node
 node01 now has id: 16847020
 Feb 25 08:55:46 node01 cluster-dlm: [9561]: info: crm_new_peer: Node
 16847020 is now known as node01
 Feb 25 08:55:46 node01 cluster-dlm: [9561]: ERROR: crm_abort:
 send_ais_text: Forked child 9565 to record non-fatal assert at
 ais.c:345 : dest != crm_msg_ais
 Feb 25 08:55:46 node01 cluster-dlm: [9561]: ERROR: send_ais_text:
 Sending message 0 via cpg: FAILED (rc=22): Message error: Success (0)
 Feb 25 08:55:46 node01 cluster-dlm: [9561]: ERROR: crm_abort:
 send_ais_text: Forked child 9566 to record non-fatal assert at
 ais.c:345 : dest != crm_msg_ais
 Feb 25 08:55:46 node01 cluster-dlm: [9561]: ERROR: send_ais_text:
 Sending message 1 via cpg: FAILED (rc=22): Message error: Success (0)
 Feb 25 08:55:46 node01 dlm_controld.pcmk: [9561]: notice:
 terminate_ais_connection: Disconnecting from AIS
 Feb 25 08:55:47 node01 lrmd: [5856]: info: RA output:
 (dlm:start:stderr) dlm_controld.pcmk: no process killed
 Feb 25 08:55:47 node01 crmd: [5859]: info: process_lrm_event: LRM
 operation dlm_start_0 (call=6, rc=7, cib-update=17, confirmed=true)
 not running
 Feb 25 08:55:47 node01 attrd: [5857]: info: attrd_ais_dispatch: Update
 relayed from node02
 Feb 25 08:55:47 node01 attrd: [5857]: info: attrd_trigger_update:
 Sending flush op to all hosts for: fail-count-dlm (INFINITY)
 Feb 25 08:55:47 node01 crmd: [5859]: info: do_lrm_rsc_op: Performing
 key=2:14:0:fb5708b9-4afe-41d3-b3f1-b9a47a7f29c6 op=dlm_stop_0 )
 Feb 25 08:55:47 node01 lrmd: [5856]: info: rsc:dlm:7: stop
 Feb 25 08:55:47 node01 attrd: [5857]: info: attrd_perform_update: Sent
 update 81: fail-count-dlm=INFINITY
 Feb 25 08:55:47 node01 attrd: [5857]: info: attrd_ais_dispatch: Update
 relayed from node02
 Feb 25 08:55:47 node01 attrd: [5857]: info: attrd_trigger_update:
 Sending 

Re: [Linux-HA] Looking for a suitable Stonith Solution

2011-03-03 Thread Andrew Beekhof
On Wed, Mar 2, 2011 at 9:05 AM, Stallmann, Andreas astallm...@conet.de wrote:
 Hi Andrew,

 If suicide is not a supported fencing option, why is it still included with
 stonith?
 Left over from heartbeat v1 days I guess.
 Could also be a testing-only device like ssh.

 www.clusterlabs.org tells me, you're the Pacemaker project leader.

Yes, but the stonith devices come from cluster-glue.
So I guess Dejan or Florian are nominally in charge of those, but
they've not been changed in forever.

 Would you, by chance, know who maintains or maintained the 
 suicide-stonith-plugin? It maybe testing-only, yes. But at least, ssh is 
 working as intended.

 It's badly documented, and I didn't find a single (official) document
 on howto implement a (stable!) suicide-stonith,
 Because you can't.  Suicide is not, will not, can not be reliable.
 Yes, you're right. But under certain circumstances (1. nodes are still alive, 
 2. both redundant communication channels [networks] are down, 3. policy 
 requires no node to be up, which has no quorum) it might be a good addition 
 to a regular stonith (because if [2] happens, pacemaker/stonith will 
 probably not be able to control a network power switch etc.) Could we agree 
 on that?

Sure. But even if you have a functioning suicide plugin, Pacemaker
cannot ever make decisions that assume it worked.
Because for all it knows the other side might consider itself to be
perfectly healthy.

 If not: What's your recommended setup for (resp. against) such situations? 
 Think of split sites here!

You still need reliable fencing; if you can't provide that, there needs
to be a human in the loop.

 The whole point of stonith is to create a known node state (off) in 
 situations where you cannot be sure if your peer is alive, dead  or some 
 state in-between.
 Yes, so don't file suicide under stonith! We implemented a different 
 approach in a two node cluster: We wrote a script that checks (by means of 
 cron) the connectivity (by means of ping) to the peer (if connected, 
 everything fine) and then (if the peer is not reachable) to some quorum nodes. 
 If either the peer or a majority of the quorum nodes are alive, nothing 
 happens. If quorum is lost, the node shuts itself down.

Wonderful, but the healthy side still can't do anything, because it
can't know that the bad side is down.
So what have you gained over no-quorum-policy=stop (which is the default) ?
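For reference, the policy Andrew mentions is an ordinary cluster property; a minimal crm shell sketch (Pacemaker 1.x syntax, as used elsewhere in this thread):

   crm configure property no-quorum-policy=stop   # the default; other accepted values: ignore, freeze, suicide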


 We did that, because drbd tended to misbehave in situations, where all 
 network connectivity was lost. We'd rather have a clean shutdown on both 
 sides, than a corrupt filesystem. I always consider this solution as 
 inelegant, mainly because it wasn't controllable via crm. Thus I hoped I 
 could forget this solution when using pacemaker. It seems, I can not.

 If there's any interest from the community in our suicide by cron-solution, 
 tell me if and how to contribute.

 It requires a sick node to suddenly start functioning correctly - so 
 attempting to self-terminate makes some sense, relying on it to succeed does 
 not seem prudent.

 Yes! But it's not always the node that's sick. Sometimes (even with the 
 best and most redundant network), the connectivity between the nodes is the 
 problem, not a marauding pacemaker or openais! Again: Please tell me, what's 
 your solution in that case?

Again, tell me how the other side is supposed to know and what you gain?


 On the other hand, it doesn't make any other sense to name a 
 no-quorum-policy suicide, if it's anything but a suicide (if, at all, 
 one could name it assisted suicide).

 This question is still unanswered. Does no-quorum-policy suicide really 
 have a meaning?

yes, for N > 2, it is a faster version of stop

 Or is it as well a leftover from the times of heartbeat.

no

 Is it still functional?

yes
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Looking for a suitable Stonith Solution

2011-02-28 Thread Andrew Beekhof
On Fri, Feb 25, 2011 at 12:51 PM, Stallmann, Andreas
astallm...@conet.de wrote:
 Hi!

 I'm combining both your answers into one mail; I hope that's all right for you.

 For now, I need an interim solution, which is, as of now, stonith via 
 suicide.
 Doesn't work as suicide is not considered reliable - by definition the 
 remaining nodes have no way to verify that the fencing operation was 
 successful.
 Suspect it will still fail though, suicide isn't a supported fencing option - 
 since obviously the other nodes can't confirm it happened.

 Ok then, I know I'm a little bit provocative right now:

 If suicide is no supported fencing option, why is it still included with 
 stonith?

Left over from heartbeat v1 days I guess.
Could also be a testing-only device like ssh.

 It's badly documented, and I didn't find a single (official) document on 
 howto implement a (stable!) suicide-stonith,

Because you can't.  Suicide is not, will not, can not be reliable.
The whole point of stonith is to create a known node state (off) in
situations where you cannot be sure if your peer is alive, dead or
some state in-between.

Suicide does not achieve this in any way, shape or form.
It requires a sick node to suddenly start functioning correctly - so
attempting to self-terminate makes some sense, relying on it to
succeed does not seem prudent.

 but it's there, and thus it should be usable. If it isn't, the maintainer 
 should please (please!) remove it or supply something that's working. I do 
 know, that's quite demanding, because the maintainer will probably do the 
 development in his (or her) free time. Still...

 I do as well agree, that suicide is a very special way of keeping a cluster 
 consistent, very different from the other stonith methods. I wouldn't expect 
 it under stonith, I'd rather think...

 Yes no-quorum-policy=suicide means that all nodes in the partition will end 
 up being shot, but you still require a real stonith device so that 
 _someone_else_ can perform it.
 ...that if you set no-quorum-policy=suicide, the suicide script is executed 
 by the node itself. It should be an *extra* feature *besides* stonith. The 
 procedure should be something like:

 1) node1: Allright, I have no quorum anymore. Let's wait for a while...
 2)... a while passes
 3) node1: OK, I'm still without quorum, no contact to my peers, whatsoever. 
 I'd rather shut myself down, before I cause a mess.

 If, during (2), the other nodes find a way to shut down the node externally 
 (be it through ssh, a power switch, a virtualisation host...), that's even 
 better, because then the cluster knows, that it's still consistent. I'm 
 with you, here.

 If a split brain happens in a split site scenario, a suicide might be the 
 only way to keep up consistency,  because no one will be able to reach any 
 device on the other site... Please correct me if I'm wrong. What do you do in 
 such a case? What's your exemplary implementation of Linux-HA then?

 On the other hand, it doesn't make any other sense to name a 
 no-quorum-policy suicide, if it's anything but a suicide (if, at all, 
 one could name it assisted suicide).

 Please correct me: Do I have an utterly wrong understanding of the whole 
 process (that could be very well the case), is the implementation not 
 entirely thought through, or is the naming of certain components not as good 
 as it could be?

 I might point you to 
 http://osdir.com/ml/linux.highavailability.devel/2007-11/msg00026.html, 
 because the same thing has been discussed then, and I very much do think, 
 that Lars was right with what he wrote. Has anything changed in the concept 
 of suicide/quorum-loss/stonith since then? That's not a provocative question, 
 well, maybe it is, but it's not meant to be.

 In addition: Something that's missing from the manuals is a case study (or 
 something of the like) on how to implement a split-site scenario. How should the 
 cluster be built then? If you have two sites? If you have one? How should the 
 storage replication be set up? Is synchronous replication like in drbd really 
 a good idea then, performance-wise? I think I'll finally have to buy a book. 
 :-) Any recommendations (either English or German preferred).

 Well, thanks a lot again, my brain didn't explode (that's something good, I 
 feel), but I'm not entirely happy, though.

 Cheers and have a nice weekend,

 Andreas


 
 CONET Solutions GmbH, Theodor-Heuss-Allee 19, 53773 Hennef.
 Registergericht/Registration Court: Amtsgericht Siegburg (HRB Nr. 9136)
 Geschäftsführer/Managing Directors: Jürgen Zender (Sprecher/Chairman), Anke 
 Höfer
 Vorsitzender des Aufsichtsrates/Chairman of the Supervisory Board: Hans 
 Jürgen Niemeier

 CONET Technologies AG, Theodor-Heuss-Allee 19, 53773 Hennef.
 Registergericht/Registration Court: Amtsgericht Siegburg (HRB Nr. 10328 )
 Vorstand/Member of the Managementboard: Rüdiger Zeyen (Sprecher/Chairman), 
 Wilfried Pütz
 Vorsitzender des Aufsichtsrates/Chairman 

Re: [Linux-ha-dev] new resource agents repository

2011-02-24 Thread Andrew Beekhof
On Thu, Feb 24, 2011 at 4:10 PM, Dejan Muhamedagic deja...@fastmail.fm wrote:
 On Thu, Feb 24, 2011 at 03:56:27PM +0100, Andrew Beekhof wrote:
 On Thu, Feb 24, 2011 at 2:59 PM, Dejan Muhamedagic deja...@fastmail.fm 
 wrote:
  Hello,
 
  There is a new repository for Resource Agents which contains RA
  sets from both Linux HA and Red Hat projects:
 
         git://github.com/ClusterLabs/resource-agents.git
 
  The purpose of the common repository is to share maintenance load
  and try to consolidate resource agents.
 
  There were no conflicts with the rgmanager RA set and both source
  layouts remain the same. It is only that autoconf bits were
  merged. The only difference is that if you want to get Linux HA
  set of resource agents installed, configure should be run like
  this:
 
         configure --with-ras-set=linux-ha ...
 
  The new repository is git but the existing history is preserved.
  People used to Mercurial shouldn't have hard time working with
  git.
 
  We need to retire the existing repository hg.linux-ha.org. Are
  there any objections or concerns that still need to be addressed?

 Might not hurt to leave it around - there might be various URLs that
 point there.

 Yes, it will definitely remain there. What I meant with retire,
 is that the developers then start using the git repository
 exclusively.


Yes, and making it read-only on the server is probably a good idea (to
avoid pushes).
___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


Re: [Linux-HA] Problems starting apache

2011-02-24 Thread Andrew Beekhof
On Thu, Feb 24, 2011 at 3:31 PM, Dan Frincu df.clus...@gmail.com wrote:
 Hi,

 On 02/24/2011 04:24 PM, Stallmann, Andreas wrote:
 Hi!

 First: I set up my configuration anew, and it works. I didn't change that 
 much, just set the monitor-action differently from before.
 Instead of:

 webserver_ressource ocf:heartbeat:apache \
          params httpd=/usr/sbin/httpd2-prefork \
          op start interval=0 timeout=40s \
          op stop interval=0 timeout=60s \
          op monitor interval=10 timeout=20s depth=0 \
          meta target-role=Started
 I have now:

 primitive web_res ocf:heartbeat:apache \
          params configfile=/etc/apache2/httpd.conf \
          params httpd=/usr/sbin/httpd2-prefork \
          op start interval=0 timeout=40s \
          op stop interval=0 timeout=60s \
          op monitor interval=1min

 As you can see, I added the configfile.

 This obviously did it, because when I gave the logs a closer look I found:

 Feb 24 10:43:22 mgmt-01 apache[1191]: ERROR: httpd2-prefork: option requires 
 an argument -- f

 Somehow the ocf:heartbeat:apache did not supply the default-configfile. 
 Thus, you have to supply the configfile. Well...
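For reference, the defaults an agent advertises (including configfile) can be read from its metadata; a quick check with the crm shell, or by calling the agent directly (paths as typically installed, so treat them as an assumption):

   crm ra info ocf:heartbeat:apache
   # or query the script itself:
   OCF_ROOT=/usr/lib/ocf /usr/lib/ocf/resource.d/heartbeat/apache meta-data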

 The answer should be in the logs.
 You're right, that's where it was. I somehow got lost in the vast amount of 
 logs...

 grep -i error | grep -i apache

 That did it.

 Is it, by the way possible, to influence the logging in any way? Make it 
 more verbose, or redirect the logs to a different file (without using 
 filtering in syslog)?
 If you're using corosync, then in /etc/corosync/corosync.conf.

 logging {
         debug: off
         fileline: off
         to_syslog: yes
         to_stderr: no
         syslog_facility: local7
         timestamp: on
                to_logfile: yes
                logfile: /var/log/cluster/corosync.log
                logger_subsys {
                        subsys: AMF
                        debug: off
                }
 }

 And in syslog.conf you put: local7.*  /var/log/cluster/corosync.log

 And it logs to a file. For anything else other than corosync, someone
 else can reply.

This will tell Pacemaker to also use the same logging setup, so no
additional steps are needed there.
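Spelled out, the syslog side Dan describes is a single facility filter matching the corosync.conf block above; a sketch for /etc/syslog.conf (or rsyslog.conf):

   # route the cluster facility chosen above to its own file
   local7.*    /var/log/cluster/corosync.log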
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Looking for a suitable Stonith Solution

2011-02-24 Thread Andrew Beekhof
On Thu, Feb 24, 2011 at 2:49 PM, Stallmann, Andreas astallm...@conet.de wrote:
 Hi!

 TNX for your answer. We will switch to sbd after the shared storage has been 
 set up.

 For now, I need an interim solution, which is, as of now, stonith via suicide.

Doesn't work as suicide is not considered reliable - by definition the
remaining nodes have no way to verify that the fencing operation was
successful.

Yes no-quorum-policy=suicide means that all nodes in the partition
will end up being shot, but you still require a real stonith device so
that _someone_else_ can perform it.
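A rough sketch of what that combination looks like: a real fencing device (here the testing-only ssh plugin mentioned earlier in the thread) plus the suicide policy. The host names follow the mgmt0x nodes shown below and are illustrative:

   primitive st-ssh stonith:external/ssh \
           params hostlist="mgmt01 mgmt02 mgmt03"
   clone fencing_clone st-ssh
   property stonith-enabled=true \
           no-quorum-policy=suicide

As noted elsewhere in this thread, external/ssh is only suitable for testing; a production setup needs a power switch, a lights-out board, or sbd instead.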


 My configuration doesn't work, though.

 I tried:

 ~~Output from crm configure show~~
 primitive suicide_res stonith:suicide
 ...
 clone fenc_clon suicide_res
 ...
 property $id=cib-bootstrap-options \
        dc-version=1.1.2-8b9ec9ccc5060457ac761dce1de719af86895b10 \
        cluster-infrastructure=openais \
        expected-quorum-votes=3 \
        stonith-enabled=true \
        no-quorum-policy=suicide \
        stonith-action=poweroff
 


 If I disconnect one node from the network, crm_mon shows:

 
 Current DC: mgmt03 - partition WITHOUT quorum
 ...
 Node mgmt01: UNCLEAN (offline)
 Node mgmt02: UNCLEAN (offline)
 Online: [ mgmt03 ]

 Clone Set: fenc_clon
        Started: [ ipfuie-mgmt03 ]
        Stopped: [ suicide_res:0 suicide_res:1 ]
 ~~~

 No action, neither reboot nor poweroff is taken.

 1. What did I do wrong here?
 2. OK, let's be more precise: I have the feeling that the suicide 
 resource should be in a default state of stopped (on all nodes) and should 
 only be started on the node, which has to fence itself. Am I right? And, if 
 yes, how is that accomplished?
 3. How does the no-quorum-policy relate to the stonith resources? I didn't 
 find any documentation, if the two have any connection at all.
 4. Am I correct, that the no-quorum-policy is what a node (or a cluster 
 partition) should do to itself when it loses quorum (for example, shut down 
 itself), and stonith is what the nodes with quorum try to do to the nodes 
 without?
 5. Shouldn't then no-quorum-policy=suicide be obsolete in case of suicide as 
 stonith-method?

 TNX for your help (again),

 Andreas



 
 CONET Solutions GmbH, Theodor-Heuss-Allee 19, 53773 Hennef.
 Registergericht/Registration Court: Amtsgericht Siegburg (HRB Nr. 9136)
 Geschäftsführer/Managing Directors: Jürgen Zender (Sprecher/Chairman), Anke 
 Höfer
 Vorsitzender des Aufsichtsrates/Chairman of the Supervisory Board: Hans 
 Jürgen Niemeier

 CONET Technologies AG, Theodor-Heuss-Allee 19, 53773 Hennef.
 Registergericht/Registration Court: Amtsgericht Siegburg (HRB Nr. 10328 )
 Vorstand/Member of the Managementboard: Rüdiger Zeyen (Sprecher/Chairman), 
 Wilfried Pütz
 Vorsitzender des Aufsichtsrates/Chairman of the Supervisory Board: Dr. Gerd 
 Jakob
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] CLVM cmirror using Pacemaker DLM integration on rhel 6

2011-02-21 Thread Andrew Beekhof
On Thu, Feb 17, 2011 at 3:34 PM, Pieter Baele pieter.ba...@gmail.com wrote:
 Hi,

 With our last cluster experiments we try to set up Pacemaker with CLVM 
 mirroring
 on RHEL 6.0

 I added a DLM resource, but when I try to add clvm in crm, I get the
 following error:
 crm(live)configure# primitive clvm ocf:lvm2:clvmd params
 daemon_timeout=30 op monitor interval=60 timeout=60
 ERROR: ocf:lvm2:clvmd: could not parse meta-data:
 ERROR: ocf:lvm2:clvmd: no such resource agent

 Any idea what's missing?

The ocf:lvm2:clvmd resource agent perhaps?
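One way to verify what is actually installed — the clvmd agent ships with the cluster LVM packages on some distributions and may simply be absent here; the commands below are a sketch and the path is the usual default:

   crm ra classes                      # providers known to the cluster
   crm ra list ocf lvm2                # agents in the lvm2 provider, if present
   ls /usr/lib/ocf/resource.d/lvm2/    # or check the filesystem directly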


 Is there a short guide/howto somewhere how to set this up?

I don't know of one personally


 Last updated: Thu Feb 17 15:32:12 2011
 Stack: openais
 Current DC: xxx - partition with quorum
 Version: 1.1.2-f059ec7ced7a86f18e5490b67ebf4a0b963bccfe
 2 Nodes configured, 2 expected votes
 2 Resources configured.
 

 Online: [ xyz xyz ]

 ClusterIP       (ocf::heartbeat:IPaddr2):       Started x
 dlm     (ocf::pacemaker:controld):      Started x
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-ha-dev] New ocft config file for IBM db2 resource agent

2011-02-15 Thread Andrew Beekhof
On Tue, Feb 15, 2011 at 10:50 AM, Dejan Muhamedagic deja...@fastmail.fm wrote:
 Hi Holger,

 On Tue, Feb 15, 2011 at 09:49:07AM +0100, Holger Teutsch wrote:
 Hi,
 please find enclosed an ocft config for db2 for review and inclusion
 into the project if appropriate.

 Wonderful! This is the first time somebody contributed an ocft
 testcase.

Looks like lmb owes somebody lunch :-)

 The current 1.0.4 agent passes the tests 8-) .

 I've never doubted that either.

 Cheers,

 Dejan


 Regards
 Holger




 # db2
 #
 # This test assumes a db2 ESE instance with two partitions and a database.
 # Default is instance=db2inst1, database=ocft
 # adapt this in set_testenv below
 #
 # Simple steps to generate a test environment (if you don't have one):
 #
 # A virtual machine with 1200MB RAM is sufficient
 #
 # - download an eval version of DB2 server from IBM
 # - create an user db2inst1 in group db2inst1
 #
 # As root
 # - install DB2 software in some location
 # - create instance
 #   cd this_location/instance
 #   ./db2icrt -s ese -u db2inst1 db2inst1
 # - adapt profile of db2inst1 as instructed by db2icrt
 #
 # As db2inst1
 #      # allow to run with small memory footprint
 #      db2set DB2_FCM_SETTINGS=FCM_MAXIMIZE_SET_SIZE:FALSE
 #      db2start
 #      db2start dbpartitionnum 1 add dbpartitionnum hostname $(uname -n) 
 port 1 without tablespaces
 #      db2stop
 #      db2start
 #      db2 create database ocft
 # Done
 # In order to install a real cluster refer to 
 http://www.linux-ha.org/wiki/db2_(resource_agent)

 CONFIG
         HangTimeout 40

 SETUP-AGENT
         # nothing

 CASE-BLOCK set_testenv
         Var OCFT_instance=db2inst1
         Var OCFT_db=ocft

 CASE-BLOCK crm_setting
         Var OCF_RESKEY_instance=$OCFT_instance
       Var OCF_RESKEY_CRM_meta_timeout=3

 CASE-BLOCK default_status
       AgentRun stop

 CASE-BLOCK prepare
         Include set_testenv
       Include crm_setting
       Include default_status

 CASE check base env
       Include prepare
       AgentRun start OCF_SUCCESS

 CASE check base env: invalid 'OCF_RESKEY_instance'
       Include prepare
       Var OCF_RESKEY_instance=no_such
       AgentRun start OCF_ERR_INSTALLED

 CASE invalid instance config
       Include prepare
       Bash eval mv ~$OCFT_instance/sqllib ~$OCFT_instance/sqllib-
       BashAtExit eval mv ~$OCFT_instance/sqllib- ~$OCFT_instance/sqllib
       AgentRun start OCF_ERR_INSTALLED

 CASE unimplemented command
       Include prepare
       AgentRun no_cmd OCF_ERR_UNIMPLEMENTED

 CASE normal start
       Include prepare
       AgentRun start OCF_SUCCESS

 CASE normal stop
       Include prepare
       AgentRun start
       AgentRun stop OCF_SUCCESS

 CASE double start
       Include prepare
       AgentRun start
       AgentRun start OCF_SUCCESS

 CASE double stop
       Include prepare
       AgentRun stop OCF_SUCCESS

 CASE started: monitor
       Include prepare
       AgentRun start
       AgentRun monitor OCF_SUCCESS

 CASE not started: monitor
       Include prepare
       AgentRun monitor OCF_NOT_RUNNING

 CASE killed instance: monitor
         Include prepare
         AgentRun start OCF_SUCCESS
         AgentRun monitor OCF_SUCCESS
         BashAtExit rm /tmp/ocft-helper1
         Bash echo "su $OCFT_instance -c '. ~$OCFT_instance/sqllib/db2profile; db2nkill 0 >/dev/null 2>&1'" > /tmp/ocft-helper1
         Bash sh -x /tmp/ocft-helper1
         AgentRun monitor OCF_NOT_RUNNING

 CASE overload param instance by admin
         Include prepare
         Var OCF_RESKEY_instance=no_such
         Var OCF_RESKEY_admin=$OCFT_instance
         AgentRun start OCF_SUCCESS

 CASE check start really activates db
         Include prepare
         AgentRun start OCF_SUCCESS

         BashAtExit rm /tmp/ocft-helper2
         Bash echo "su $OCFT_instance -c '. ~$OCFT_instance/sqllib/db2profile; db2 get snapshot for database on $OCFT_db >/dev/null'" > /tmp/ocft-helper2
         Bash sh -x /tmp/ocft-helper2

 CASE multipartition test
         Include prepare
         AgentRun start OCF_SUCCESS
         AgentRun monitor OCF_SUCCESS

         # start does not start partition 1
         Var OCF_RESKEY_dbpartitionnum=1
         AgentRun monitor OCF_NOT_RUNNING

         # now start 1
         AgentRun start OCF_SUCCESS
         AgentRun monitor OCF_SUCCESS

         # now stop 1
         AgentRun stop OCF_SUCCESS
         AgentRun monitor OCF_NOT_RUNNING

         # does not affect 0
         Var OCF_RESKEY_dbpartitionnum=0
         AgentRun monitor OCF_SUCCESS

 # fault injection does not work on the 1.0.4 client due to a hardcoded path
 CASE simulate hanging db2stop (not meaningful for 1.0.4 agent)
         Include prepare
         AgentRun start OCF_SUCCESS
         Bash [ ! -f /usr/local/bin/db2stop ]
         BashAtExit rm /usr/local/bin/db2stop
         Bash echo -e "#!/bin/sh\necho fake db2stop\nsleep 1" > /usr/local/bin/db2stop
         Bash chmod +x /usr/local/bin/db2stop
         AgentRun stop OCF_SUCCESS

 # 

Re: [Linux-ha-dev] [PATCH] manage PostgreSQL 9.0 streaming replication using Master/Slave

2011-02-14 Thread Andrew Beekhof
On Mon, Feb 14, 2011 at 8:46 PM, Serge Dubrouski serge...@gmail.com wrote:
 On Mon, Feb 14, 2011 at 1:28 AM, Takatoshi MATSUO matsuo@gmail.com 
 wrote:
 Ideally demote operation should stop a master node and then restart it
 in hot-standby mode. It's up to administrator to make sure that no
 node with outdated data gets promoted to the master role. One should
 follow standard procedures: cluster software shouldn't be configured
 for autostart at the boot time, administrator has to make sure that
 data was refreshed if the node was down for some prolonged time.

 Hmm..
 Do you mean that RA puts recovery.conf automatically at demote op to
 start hot standby?
 Please give me some time to think it over.


 Sorry, I got the wrong idea about restoring data.
 Starting as hot-standby always needs a restore,
 because the timeline ID of PostgreSQL is incremented.
 In addition, shutting down PostgreSQL with the immediate option causes
 inconsistent WAL between primary and hot-standby.

 So I think it's difficult to start slave automatically at demote.
 Still, do you think it's better to implement restoring ?

 I'm afraid it's not just better, but it's a must. We have to play by
 Pacemaker's rules and that means that we have to properly implement
 demote operation and that's switching from Master to Slave, not just
 stopping Master. I do appreciate your efforts, but implementation has
 to conform to Pacemaker standards, i.e. Master has to start where it's
 configured in Pacemaker, not just where recovery.conf file exists.

Thats the ideal at least.

Most of the time it should be possible to self-promote and let
pacemaker figure out the result.
But I can easily imagine there would also be situations where this is
going to blow up in your face.
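The usual way an agent "self-promotes" is by publishing a master preference that Pacemaker then acts on; a minimal sketch of what a master/slave-capable RA typically calls from its monitor/notify actions (the score values are illustrative):

   # advertise this node as a good promotion candidate
   crm_master -l reboot -v 100
   # withdraw the preference when the local copy is stale or broken
   crm_master -l reboot -D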

 Administrator has to be able to easily switch between node roles and
 so on.

 I still need some more time to learn PostgreSQL data replication and
 do some tests. Let's think if that's possible to implement real
 Master/Slave in Pacemaker sense of things.

 ___
 Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
 Home Page: http://linux-ha.org/




 --
 Serge Dubrouski.
 ___
 Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
 Home Page: http://linux-ha.org/

___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


Re: [Linux-HA] OCF_RESKEY_CRM_meta_timeout not matching monitor timeout meta-data

2011-02-14 Thread Andrew Beekhof
On Fri, Feb 4, 2011 at 11:23 AM, Brett Delle Grazie
brett.dellegra...@gmail.com wrote:
 Hi,

 Apologies for cross-posting but I'm not sure where this problem resides.

 I'm running:
 corosync-1.2.7-1.1.el5.x86_64
 corosynclib-1.2.7-1.1.el5.x86_64
 cluster-glue-1.0.6-1.6.el5.x86_64
 cluster-glue-libs-1.0.6-1.6.el5.x86_64
 pacemaker-1.0.10-1.4.el5.x86_64
 pacemaker-libs-1.0.10-1.4.el5.x86_64
 resource-agents-1.0.3-2.6.el5.x86_64

 on RHEL5.

 In one of my resource agents (tomcat) I'm directly outputting the result of:
 $((OCF_RESKEY_CRM_meta_timeout/1000))
 to an external file.
 and its coming up with a value of '100'

 Whereas the resource definition in pacemaker specifies timeout of '30'
 specifically:

 primitive tomcat_tc1 ocf:intact:tomcat \
        params tomcat_user=tomcat catalina_home=/opt/tomcat6
 catalina_pid=/home/tomcat/tc1/temp/tomcat.pid
 catalina_rotate_log=NO script_log=/home/tomcat/tc1/logs/tc1.log
 statusurl=http://127.0.0.1/version/; java_home=/usr/lib/jvm/java \
        op start interval=0 timeout=70 \
        op stop interval=0 timeout=20 \
        op monitor interval=60 timeout=30 start-delay=70

 Is this a known bug?

No.  Could you file a bug please?

 Does it affect all operation timeouts?

Unknown
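A quick way to see exactly what the LRM hands the agent is to log the relevant environment from inside the RA while reproducing the problem; a throwaway debugging sketch (the log path is illustrative):

   # temporary, near the top of the agent's action dispatch
   {
       echo "$(date '+%F %T') rsc=$OCF_RESOURCE_INSTANCE action=$1"
       echo "  OCF_RESKEY_CRM_meta_timeout=${OCF_RESKEY_CRM_meta_timeout:-unset} ms"
       echo "  OCF_RESKEY_CRM_meta_interval=${OCF_RESKEY_CRM_meta_interval:-unset} ms"
   } >> /tmp/ra-env.log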


 Thanks,

 --
 Best Regards,

 Brett Delle Grazie
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] A bunch of thoughts/questions about heartbeat network(s)

2011-02-14 Thread Andrew Beekhof
On Tue, Jan 25, 2011 at 8:15 AM, Alain.Moulle alain.mou...@bull.net wrote:
 Hi

 A bunch of thoughts/questions about heartbeat network(s) :

 In the following, when I talk about two heartbeat networks , I'm
 talking about two physically different networks set in the corosync.conf
 as two different ring-number (with rrp_mode set to active).

 1/ with a 2-nodes HA cluster, it is recommended to have two heartbeat
 networks
    to avoid the race for fencing, or even the dual-fencing in case of
 problem on
    this heartbeat network.

    But with a more-than-2-nodes HA cluster, is it always worthwhile to
 have two
    heartbeat networks ? My understanding is that if one node can't have
 contact
    from other nodes in the cluster due to a heartbeat network problem,
 as it is
    isolated, it does not have quorum and so is not authorized to
 fence any other node,
    whereas other nodes have quorum and so will decide to fence the node
 with problem.
    Right ?

Right, but wouldn't it be better to have no need to shoot anyone?
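For reference, the two-ring layout Alain describes looks roughly like this in corosync.conf; the addresses and ports are placeholders:

   totem {
           version: 2
           rrp_mode: active
           interface {
                   ringnumber: 0
                   bindnetaddr: 192.168.1.0
                   mcastaddr: 226.94.1.1
                   mcastport: 5405
           }
           interface {
                   ringnumber: 1
                   bindnetaddr: 10.0.0.0
                   mcastaddr: 226.94.1.2
                   mcastport: 5407
           }
   }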


    So is there any other advantage to have more than 2 heartbeats networks
    in a more-than-2-nodes HA cluster ?

 2/ if the future of the HA stack for Pacemaker is option 4 (corosync +
 cpg + cman + mcp),

Option 4 does not involve cman

 meaning
    that cluster manager configuration parameters will all be in
 cluster.conf and nothing more in
    corosync.conf (again that's my understanding...) ,

Other way around, cluster.conf is going away (like cman) not corosync.conf

 from memory there
 isn't any possibility
    to set two heartbeat networks in cluster.conf (Cluster Suite from RH
 was working only
    on 1 heartbeat network and if one wanted to work on 2 heartbeat
 networks he has to configure
    a bonding solution).

    Am I right when I write no possibility of 2 hb networks with stack
 option 4 ?

No


 Thanks a lot for your responses, and tell me if some of my understanding
 is not right ...

 Alain


 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] One-Node-Cluster

2011-02-14 Thread Andrew Beekhof
On Mon, Feb 14, 2011 at 12:40 PM, Ulrich Windl
ulrich.wi...@rz.uni-regensburg.de wrote:
 Andrew Beekhof and...@beekhof.net schrieb am 14.02.2011 um 10:08 in 
 Nachricht
 aanlktinuc9_oqpwjubxrdmqkncqvnqx68a_1kbqss...@mail.gmail.com:
 [...]
  The log just keeps on saying:
  Feb  8 16:01:03 dmcs2 pengine: [1480]: WARN: cluster_status: We do not have
  quorum - fencing and resource management disabled

 Exactly.
 Read that line again a couple of times, then read clusters from scratch.
 [...]

 Which makes me wonder: Can a one-node-cluster ever have a quorum?

Not really, which is why we have no-quorum-policy.

 I think a one-node-cluster is a completely valid construct. Also with 
 Linux-HA?

Yep.
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] One-Node-Cluster

2011-02-14 Thread Andrew Beekhof
On Tue, Feb 15, 2011 at 6:08 AM, Alan Robertson al...@unix.sh wrote:
 On 02/14/2011 04:45 AM, Andrew Beekhof wrote:
 On Mon, Feb 14, 2011 at 12:40 PM, Ulrich Windl
 ulrich.wi...@rz.uni-regensburg.de  wrote:
 Andrew Beekhofand...@beekhof.net  schrieb am 14.02.2011 um 10:08 in 
 Nachricht
 aanlktinuc9_oqpwjubxrdmqkncqvnqx68a_1kbqss...@mail.gmail.com:
 [...]
 The log just keeps on saying:
 Feb  8 16:01:03 dmcs2 pengine: [1480]: WARN: cluster_status: We do not 
 have
 quorum - fencing and resource management disabled
 Exactly.
 Read that line again a couple of times, then read clusters from scratch.
 [...]

 Which makes me wonder: Can a one-node-cluster ever have a quorum?
 Not really, which is why we have no-quorum-policy.

 I think a one-node-cluster is a completely valid construct. Also with 
 Linux-HA?
 Yep.

 If you're using the Heartbeat membership stack, then it is perfectly
 happy to give you quorum in a one-node cluster.

Or a two node cluster.  Which is not exactly ideal.

 In fact, at one time I wrote a script to create a cluster configuration
 from your /etc/init.d/ scripts - so that Pacemaker could be effectively
 a nice replacement for init - with a respawn that really works ;-)


 --
     Alan Robertsonal...@unix.sh

 Openness is the foundation and preservative of friendship...  Let me claim 
 from you at all times your undisguised opinions. - William Wilberforce

 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-ha-dev] ocft: status vs. monitor

2011-02-13 Thread Andrew Beekhof
On Sun, Feb 13, 2011 at 11:01 AM, Holger Teutsch holger.teut...@web.de wrote:
 Hi,
 to my knowledge OCF *requires* a method monitor while status is optional
 (or what is it really for? heritage, compatibility, ...)

 Shouldn't the ocft configs check for status ?

Yes, unless its trying to talk to an LSB resource.


 -holger

 diff -r 722c8a7a03e9 tools/ocft/apache
 --- a/tools/ocft/apache Fri Feb 11 18:49:09 2011 +0100
 +++ b/tools/ocft/apache Sun Feb 13 10:57:50 2011 +0100
 @@ -52,14 +52,14 @@
        Include prepare
        AgentRun stop OCF_SUCCESS

 -CASE running status
 +CASE running monitor
        Include prepare
        AgentRun start
 -       AgentRun status OCF_SUCCESS
 +       AgentRun monitor OCF_SUCCESS

 -CASE not running status
 +CASE not running monitor
        Include prepare
 -       AgentRun status OCF_NOT_RUNNING
 +       AgentRun monitor OCF_NOT_RUNNING

  CASE unimplemented command
        Include prepare
 diff -r 722c8a7a03e9 tools/ocft/mysql
 --- a/tools/ocft/mysql  Fri Feb 11 18:49:09 2011 +0100
 +++ b/tools/ocft/mysql  Sun Feb 13 10:57:50 2011 +0100
 @@ -46,14 +46,14 @@
        Include prepare
        AgentRun stop OCF_SUCCESS

 -CASE running status
 +CASE running monitor
        Include prepare
        AgentRun start
 -       AgentRun status OCF_SUCCESS
 +       AgentRun monitor OCF_SUCCESS

 -CASE not running status
 +CASE not running monitor
        Include prepare
 -       AgentRun status OCF_NOT_RUNNING
 +       AgentRun monitor OCF_NOT_RUNNING

  CASE check lib file
        Include prepare


 ___
 Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
 Home Page: http://linux-ha.org/

___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


Re: [Linux-HA] Documenting ocf:pacemaker:ping

2011-02-10 Thread Andrew Beekhof
On Thu, Feb 10, 2011 at 9:14 AM, Ulrich Windl
ulrich.wi...@rz.uni-regensburg.de wrote:
 Hi!

 I'm getting into Linux-HA, and it seems the documentation was made with a 
 very hot needle. For example, ocf:pacemaker:ping has the following 
 documentation (crm(live)ra# info ocf:ping in SLES11 SP3+Updates):
 crm(live)ra# info ocf:ping
     PID file

 dampen (integer, [5s]): Dampening interval
     The time to wait (dampening) further changes occur

 The sentence above is grammatically wrong.

At least its spelt correctly, which is more than people usually get from me.



 name (string, [pingd]): Attribute name
     The name of the attributes to set.  This is the name to be used in the 
 constraints.

 multiplier (integer): Value multiplier
     The number by which to multiply the number of connected ping nodes by

 Please explain the reason (semantics) for the multiplication!

Please read Pacemaker Explained



 host_list* (string): Host list
     The list of ping nodes to count.

 to count, or to ping, or both?

My, we are pedantic today.

To ping and count towards the number of active hosts
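Putting the parameters together, a typical configuration clones the ping resource and then uses the resulting pingd attribute in a location rule; this is only a sketch, and "some_resource" plus the addresses are placeholders:

   primitive p_ping ocf:pacemaker:ping \
           params host_list="192.168.1.1 192.168.1.254" multiplier="100" dampen="5s" \
           op monitor interval="10s" timeout="60s"
   clone cl_ping p_ping
   location loc_needs_connectivity some_resource \
           rule -inf: not_defined pingd or pingd lte 0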

 attempts (integer, [2]): no. of ping attempts
     Number of ping attempts, per host, before declaring it dead

 timeout (integer, [2]): ping timeout in seconds
     How long, in seconds, to wait before declaring a ping lost

 a ping lost == a host dead?

Yes

 Or is it the the monitor reports a failure?

No

 options (string): Extra Options
     A catch all for any other options that need to be passed to ping.

 What about Additional options to be passed to ping?

This text is too short, this text is too long... are you ever happy?



 debug (string, [false]): Verbose logging
     Enables to use default attrd_updater verbose logging on every call.

 What about `true' enables verbose logging?


 Operations' defaults (advisory minimum):

     start         timeout=60
     stop          timeout=20
     reload        timeout=100
     monitor_0     interval=10 timeout=60

 Is it really monitor_0? What is that _0?

I'm guessing a typo


 Regards,
 Ulrich


 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Documenting ocf:pacemaker:ping

2011-02-10 Thread Andrew Beekhof
On Thu, Feb 10, 2011 at 12:28 PM, Dejan Muhamedagic deja...@fastmail.fm wrote:
 On Thu, Feb 10, 2011 at 09:29:41AM +0100, Andrew Beekhof wrote:
 On Thu, Feb 10, 2011 at 9:14 AM, Ulrich Windl
 ulrich.wi...@rz.uni-regensburg.de wrote:
  Hi!
 
  I'm getting into Linux-HA, and it seems the documentation was made with 
  a very hot needle. For example, ocf:pacemaker:ping has the following 
  documentation (crm(live)ra# info ocf:ping in SLES11 SP3+Updates):
  crm(live)ra# info ocf:ping
      PID file
 
  dampen (integer, [5s]): Dampening interval
      The time to wait (dampening) further changes occur
 
  The sentence above is grammatically wrong.

 At least its spelt correctly, which is more than people usually get from me.

 
 
  name (string, [pingd]): Attribute name
      The name of the attributes to set.  This is the name to be used in 
  the constraints.
 
  multiplier (integer): Value multiplier
      The number by which to multiply the number of connected ping nodes by
 
  Please explain the reason (semantics) for the multiplication!

 Please read Pacemaker Explained

 
 
  host_list* (string): Host list
      The list of ping nodes to count.
 
  to count, or to ping, or both?

 My, we are pedantic today.

 To ping and count towards the number of active hosts

  attempts (integer, [2]): no. of ping attempts
      Number of ping attempts, per host, before declaring it dead
 
  timeout (integer, [2]): ping timeout in seconds
      How long, in seconds, to wait before declaring a ping lost
 
  a ping lost == a host dead?

 Yes

  Or is it the the monitor reports a failure?

 No

  options (string): Extra Options
      A catch all for any other options that need to be passed to ping.
 
  What about Additional options to be passed to ping?

 This text is too short, this text is too long... are you ever happy?

 
 
  debug (string, [false]): Verbose logging
      Enables to use default attrd_updater verbose logging on every call.
 
  What about `true' enables verbose logging?
 
 
  Operations' defaults (advisory minimum):
 
      start         timeout=60
      stop          timeout=20
      reload        timeout=100
      monitor_0     interval=10 timeout=60
 
  Is it really monitor_0? What is that _0?

 I'm guessing a typo

 Not a typo, but monitor with depth check 0. Well, perhaps
 appending depth could be skipped in this case.

Oh, I mistook it for a non-recurring monitor op
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-ha-dev] New master/slave resource agent for DB2 databases in HADR (High Availability Disaster Recovery) mode

2011-02-09 Thread Andrew Beekhof
On Wed, Feb 9, 2011 at 12:15 PM, Dejan Muhamedagic deja...@fastmail.fm wrote:
 On Wed, Feb 09, 2011 at 12:06:04PM +0100, Florian Haas wrote:
 On 2011-02-09 11:56, Dejan Muhamedagic wrote:
  It is plugin compatible to the old version of the agent.
 
  Great! Unfortunately, we can't replace the old db2 now, the
  number of changes is very large:
 
   db2 | 1076 
  +++-
   1 file changed, 687 insertions(+), 389 deletions(-)
 
  And the code is completely new (though I have no doubt that it is
  of excellent quality). So, I'd suggest to add this as another db2
  RA. Once it gets some field testing we can mark the old one as
  deprecated. What name would you suggest? db2db2?

 Just making sure: Is that a joke?

 A bit of a joke, yes. But the alternatives such as db22 or db2new
 looked a bit boring.

I think boring is the least of our problems with those names.
Are you going to change the name of every agent that gets a rewrite?

   IPaddr2-ng-ng-again-and-one-more-plus-one

Solicit feedback, like was done for kliend's new agent, and replace
the existing one if/when people respond positively.
Its not like the old one disappears from the face of the earth after
you merge the new one.

   wget -O /usr/lib/ocf/resource.d/heartbeat/db2
http://hg.linux-ha.org/agents/file/agents-1.0.3/heartbeat/db2


  HADR is a very different beast from non-HADR db, right? Why not
  then add the hadr boolean parameter and use that instead of
  checking if the resource has been configured as multi-state?

 I'll take responsibility for suggesting the use of ocf_is_ms(), and I'd
 be curious to find out what you think is wrong with that approach.

 There's nothing wrong in the sense whether it is going to work.
 But someday, db2 may sport say HADR2 or VHA or whatever else
 which may run as a ms resource. I just think that it's better to
 make it obvious in the configuration that the user runs HADR.
 Does that make sense?

 Because if anything is, then the mysql RA needs fixing too.

 No idea what's up with mysql.

 Cheers,

 Dejan

 Florian




 ___
 Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
 Home Page: http://linux-ha.org/

 ___
 Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
 Home Page: http://linux-ha.org/

___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


Re: [Linux-ha-dev] New master/slave resource agent for DB2 databases in HADR (High Availability Disaster Recovery) mode

2011-02-09 Thread Andrew Beekhof
On Wed, Feb 9, 2011 at 2:17 PM, Dejan Muhamedagic deja...@fastmail.fm wrote:
 Hi Andrew,

 On Wed, Feb 09, 2011 at 01:33:03PM +0100, Andrew Beekhof wrote:
 On Wed, Feb 9, 2011 at 12:15 PM, Dejan Muhamedagic deja...@fastmail.fm 
 wrote:
  On Wed, Feb 09, 2011 at 12:06:04PM +0100, Florian Haas wrote:
  On 2011-02-09 11:56, Dejan Muhamedagic wrote:
   It is plugin compatible to the old version of the agent.
  
   Great! Unfortunately, we can't replace the old db2 now, the
   number of changes is very large:
  
    db2 | 1076 
   +++-
    1 file changed, 687 insertions(+), 389 deletions(-)
  
   And the code is completely new (though I have no doubt that it is
   of excellent quality). So, I'd suggest to add this as another db2
   RA. Once it gets some field testing we can mark the old one as
   deprecated. What name would you suggest? db2db2?
 
  Just making sure: Is that a joke?
 
  A bit of a joke, yes. But the alternatives such as db22 or db2new
  looked a bit boring.

 I think boring is the least of our problems with those names.
 Are you going to change the name of every agent that gets a rewrite?

    IPaddr2-ng-ng-again-and-one-more-plus-one

 I don't think it is going to happen that often.

It happens often enough - its just normally by a core developer.
And realistically, almost every RA is going to get similar treatment
(over time) as they're merged with the Red Hat ones.


 Solicit feedback, like was done for kliend's new agent, and replace
 the existing one if/when people respond positively.

 That would be for the best, but it takes time. We may opt for it,
 but I wanted to add the this agent to the new release.

Understood - but I think the long-term pain that is created outweighs
any perceived benefit in the short-term.

 Also, it
 is very seldom that people test anything which is not contained
 in the release. Unless there's no alternative as was the case
 with conntrac.

 Its not like the old one disappears from the face of the earth after
 you merge the new one.

    wget -O /usr/lib/ocf/resource.d/heartbeat/db2
 http://hg.linux-ha.org/agents/file/agents-1.0.3/heartbeat/db2

 What do you suggest? That we add to the release announcement:

        The db2 RA has been rewritten and didn't get yet a lot of
        field testing. Please help test it.

So don't do that :-)
Put up a wiki page with instructions for how to download+use the new
agent and give feedback.

If the new version is significantly better, you're going to hear
people pleading for its inclusion pretty soon.

 But, if you want to keep
        the old agent, download the old one from the repository and
        use it instead of the new one. And don't forget to do the
        same when installing the next resource-agents release.

 At any rate, I wouldn't want to take responsibility for replacing
 the existing (and working RA) with a completely new and not yet
 tested code. Call me coward :)

I wouldn't either - which is why I keep saying test then replace :-)
Another alternative, create a testing provider... not sure if its a
good idea or not, just putting it out there.

 Finally, I expected that the new functionality is going to be
 added without many changes to the existing code. But it turned 
 out to be a rewrite.

 Cheers,

 Dejan

   HADR is a very different beast from non-HADR db, right? Why not
   then add the hadr boolean parameter and use that instead of
   checking if the resource has been configured as multi-state?
 
  I'll take responsibility for suggesting the use of ocf_is_ms(), and I'd
  be curious to find out what you think is wrong with that approach.
 
  There's nothing wrong in the sense whether it is going to work.
  But someday, db2 may sport say HADR2 or VHA or whatever else
  which may run as a ms resource. I just think that it's better to
  make it obvious in the configuration that the user runs HADR.
  Does that make sense?
 
  Because if anything is, then the mysql RA needs fixing too.
 
  No idea what's up with mysql.
 
  Cheers,
 
  Dejan
 
  Florian
 
 
 
 
  ___
  Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
  http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
  Home Page: http://linux-ha.org/
 
  ___
  Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
  http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
  Home Page: http://linux-ha.org/
 
 ___
 Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
 Home Page: http://linux-ha.org/
 ___
 Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
 Home Page: http://linux-ha.org/

___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http

Re: [Linux-ha-dev] New master/slave resource agent for DB2 databases in HADR (High Availability Disaster Recovery) mode

2011-02-09 Thread Andrew Beekhof
On Wed, Feb 9, 2011 at 3:35 PM, Lars Ellenberg
lars.ellenb...@linbit.com wrote:
 On Wed, Feb 09, 2011 at 02:43:17PM +0100, Andrew Beekhof wrote:
  Are you going to change the name of every agent that gets a rewrite?
 
     IPaddr2-ng-ng-again-and-one-more-plus-one
 
  I don't think it is going to happen that often.

 It happens often enough - its just normally by a core developer.
 And realistically, almost every RA is going to get similar treatment
 (over time) as they're merged with the Red Hat ones.

 
  Solicit feedback, like was done for kliend's new agent, and replace
  the existing one if/when people respond positively.


  Its not like the old one disappears from the face of the earth after
  you merge the new one.
 
     wget -O /usr/lib/ocf/resource.d/heartbeat/db2
  http://hg.linux-ha.org/agents/file/agents-1.0.3/heartbeat/db2
 
  What do you suggest? That we add to the release announcement:
 
         The db2 RA has been rewritten and didn't get yet a lot of
         field testing. Please help test it.

 So don't do that :-)
 Put up a wiki page with instructions for how to download+use the new
 agent and give feedback.

 How about a staging area?
 /usr/lib/ocf/resource.d/staging/

I was thinking along the same lines when I said testing.
Either name works for me :-)


 we can also add a
 /usr/lib/ocf/resource.d/deprecated/

 The thing in  .../heartbeat/ can become a symlink,
 and be given config file status by the package manager?
 Something like that.

 So we have it bundled with the release,
 it is readily available without much "go to that web page and download
 and save to there and make executable" and then blah.

 It would simply pop up in crm shell and DRBD-MC and so on.

 We can add "please give feedback" to the description,
 and this will replace the current RA with release + 2
 unless we get veto-ing feedback to the release notes.

 Once settled, we copy over the staging one to the real directory,
 replacing the original one, and add a "please fix your config" to the
 thing that remains in staging/, so we will be able to start a further
 rewrite with the next merge window.

  * does not break existing setups
  * new RAs and rewrites are readily available


 --
 : Lars Ellenberg
 : LINBIT | Your Way to High Availability
 : DRBD/HA support and consulting http://www.linbit.com

 DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
 ___
 Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
 Home Page: http://linux-ha.org/

___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


Re: [Linux-HA] how to configure heartbeat for polling MySql server?

2011-02-09 Thread Andrew Beekhof
On Wed, Feb 2, 2011 at 5:28 PM, Danilo danilo.abbasci...@gmail.com wrote:
 On 02/01/2011 09:32 AM, Cristian Mammoli - Apra Sistemi wrote:
 On 01/31/2011 10:08 AM, Danilo Abbasciano wrote:

 If the running node is rebooted the cluster works and the service will be
 switched to the other node. But if I stop MySql, the cluster doesn't
 switch or try to restart it.

 How to configure the cluster to check if MySql service is alive?


 Use the mysql ocf resource agent, not the lsb init script. Then
 configure monitoring operations.


 Hi!
 Thanks for your help. I found a good documentation here
  http://www.linux-ha.org/wiki/Mysql_%28resource_agent%29
 that sounds good. But I have another problem.

 I have an old version of heartbeat, my version is heartbeat-2.1.4-4.1. I don't
 have the crm management program to configure primitives. But I have crm_sh,
 could I use it instead of crm? And How?

 The cluster is on a critical system and I prefer don't update it.

If it really is a critical system, then, as the author of the crm, I
beg you to update it.
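Once on a current stack with the crm shell, the monitored setup Cristian suggests looks roughly like this; the binary, paths and timeouts are illustrative and need adjusting to the local installation:

   primitive p_mysql ocf:heartbeat:mysql \
           params binary="/usr/bin/mysqld_safe" config="/etc/my.cnf" \
                  datadir="/var/lib/mysql" user="mysql" \
           op start timeout="120s" \
           op stop timeout="120s" \
           op monitor interval="30s" timeout="30s"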

 Thanks in advance.
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] resource stickness question

2011-02-09 Thread Andrew Beekhof
On Wed, Feb 9, 2011 at 9:05 AM, Erik Dobák erik.do...@gmail.com wrote:
 i have 2 nodes which are running 2 resource groups in an active/passive
 cluster. my goal was to run 1 resource active on lc-cl1 and the other
 resource on node lc-cl2.

 this is how i configured it:

 group bamcluster ipaddr2 lcbam \
        meta target-role=Started is-managed=true
 group bamclusteruat ipaddr2uat lcbamuat

 location primarylc bamcluster 100: lc-cl1
 location primarylcuat bamclusteruat 100: lc-cl2
 location secondarylc bamcluster 50: lc-cl2
 location secondarylcuat bamclusteruat 50: lc-cl1
 rsc_defaults $id=rsc-options \
        resource-stickiness=75

 but after adding the second resource (bamclusteruat), it did also try to start
 at lc-cl1

of course, both resources prefer that host the most.
only if primarylcuat is down will they run elsewhere.


 what did i mess up?

check up on colocation constraints (with a negative score)
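A sketch of that suggestion, using the group names from the configuration above; the constraint id and score are illustrative, and the score's magnitude just needs to outweigh the location-score difference plus stickiness so the groups stay apart while both nodes are up, yet can still share the surviving node on failover:

   colocation bam_groups_apart -200: bamclusteruat bamcluster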


 E
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

