[Pacemaker] Configuring Dependencies and Groups in CRM

2014-11-16 Thread Stephan

Hello,

I have a two-node cluster with a master/slave DRBD and a filesystem. 
On top of the filesystem there are several OpenVZ containers, which are 
managed by ocf:heartbeat:ManageVE.


Currently the VEs are configured in one group together with DRBD and the 
filesystem. This setup runs fine, but it creates an unnecessary dependency 
between the VEs and makes it impossible to stop one VE without restarting 
all VEs that come later in the group.


Is there any way to configure the filesystem as a dependency for all VEs 
without defining an order between the VEs? I've tried to create separate 
groups for each VE, but then CRM warns that the filesystem resource is 
already in use.


Important parts from CRM:


primitive res_DRBD ocf:heartbeat:drbd
primitive res_Filesystem ocf:heartbeat:Filesystem

primitive res_ve100 ocf:heartbeat:ManageVE
primitive res_ve101 ocf:heartbeat:ManageVE
primitive res_ve102 ocf:heartbeat:ManageVE
primitive res_ve103 ocf:heartbeat:ManageVE

group group_VE res_Filesystem res_ve100 res_ve101 res_ve102 res_ve103

ms ms_DRBD res_DRBD

colocation col_FSDRBD inf: group_VE ms_DRBD:Master

order order_DRBDFS inf: ms_DRBD:promote group_VE:start


Thanks, Stephan


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] Hawk and runtime environment

2014-11-16 Thread LGL Extern
I managed to get Hawk running on Solaris.
I use /opt as the prefix instead of /usr.

The last problem seems to be the runtime environment of the ruby application.

For testing purposes I started lighttpd in a shell in the foreground.
In this shell crm works without any problem.
If a Hawk function that uses crm is invoked, the error occurs because
PYTHONPATH is not defined.
crmsh is located in /opt/lib/python2.6/site-packages/.
When I create links in /usr/lib/python2.6/site-packages, Hawk can use crm.

A similar problem exists with hb_report.
I checked hb_report in a shell and it works with only minor problems.

When used from Hawk I get error messages that egrep can't be used.
hb_report started via Hawk uses /usr/bin/grep, and that is the wrong version.
It should use /usr/gnu/bin, which is listed in PATH before /usr/bin.
The error may also stem from a process consuming the output of hb_report.

Hawk or ruby seems to ignore the environment of the starting process.
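
As a workaround I am considering starting lighttpd through a small wrapper that 
sets up the environment first, roughly like this (untested, and the lighttpd 
binary/config paths are only examples from my layout):

#!/bin/sh
# Start lighttpd (and therefore Hawk) with the environment that crmsh and
# hb_report expect: GNU tools first in PATH, crmsh on PYTHONPATH.
PATH=/usr/gnu/bin:/opt/bin:/opt/sbin:$PATH
PYTHONPATH=/opt/lib/python2.6/site-packages
export PATH PYTHONPATH
exec /opt/sbin/lighttpd -D -f /opt/etc/hawk/lighttpd.conf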

Does anyone have an idea how to solve the problem?

Andreas
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Configuring Dependencies and Groups in CRM

2014-11-16 Thread Andrew Beekhof

> On 17 Nov 2014, at 12:54 am, Stephan  wrote:
> 
> Hello,
> 
> I have a two-node cluster with a master/slave DRBD and a filesystem. On top 
> of the filesystem there are several OpenVZ containers, which are managed by 
> ocf:heartbeat:ManageVE.
> 
> Currently the VEs are configured in one group together with DRBD and the 
> filesystem. This setup runs fine, but it creates an unnecessary dependency 
> between the VEs and makes it impossible to stop one VE without restarting 
> all VEs that come later in the group.
> 
> Is there any way to configure the filesystem as a dependency for all VEs 
> without defining an order between the VEs? I've tried to create separate 
> groups for each VE, but then CRM warns that the filesystem resource is 
> already in use.

Don't use groups. Use colocation and ordering constraints instead.
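
For example, something along these lines (untested, names adjusted to taste):

colocation col_fs_drbd inf: res_Filesystem ms_DRBD:Master
order ord_drbd_fs inf: ms_DRBD:promote res_Filesystem:start

colocation col_ve100_fs inf: res_ve100 res_Filesystem
order ord_fs_ve100 inf: res_Filesystem:start res_ve100:start
# ...plus the same pair of constraints for res_ve101, res_ve102 and res_ve103

Each VE then depends on the filesystem (and, via the filesystem, on the DRBD 
master), but the VEs are not ordered with respect to each other, so stopping 
one VE does not touch the others. Resource sets (the parenthesised form in 
crmsh) can express the same thing more compactly if your crmsh supports them.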

> 
> Important parts from CRM:
> 
> 
> primitive res_DRBD ocf:heartbeat:drbd
> primitive res_Filesystem ocf:heartbeat:Filesystem
> 
> primitive res_ve100 ocf:heartbeat:ManageVE
> primitive res_ve101 ocf:heartbeat:ManageVE
> primitive res_ve102 ocf:heartbeat:ManageVE
> primitive res_ve103 ocf:heartbeat:ManageVE
> 
> group group_VE res_Filesystem res_ve100 res_ve101 res_ve102 res_ve103
> 
> ms ms_DRBD res_DRBD
> 
> colocation col_FSDRBD inf: group_VE ms_DRBD:Master
> 
> order order_DRBDFS inf: ms_DRBD:promote group_VE:start
> 
> 
> Thanks, Stephan
> 
> 


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Operation attribute change leads to resource restart

2014-11-16 Thread Andrew Beekhof

> On 15 Nov 2014, at 8:46 am, Vladislav Bogdanov  wrote:
> 
> On 14.11.2014 at 17:36, David Vossel wrote:
>> 
>> 
>> - Original Message -
>>> Hi!
>>> 
>>> Just noticed that deletion of a trace_ra op attribute forces resource
>>> to be restarted (that RA does not support reload).
>>> 
>>> Logs show:
>>> Nov 13 09:06:05 [6633] node01cib: info: cib_process_request:
>>> Forwarding cib_apply_diff operation for section 'all' to master
>>> (origin=local/cibadmin/2)
>>> Nov 13 09:06:05 [6633] node01cib: info: cib_perform_op: 
>>> Diff:
>>> --- 0.641.96 2
>>> Nov 13 09:06:05 [6633] node01cib: info: cib_perform_op: 
>>> Diff:
>>> +++ 0.643.0 98ecbda94c7e87250cf2262bf89f43e8
>>> Nov 13 09:06:05 [6633] node01cib: info: cib_perform_op: --
>>> /cib/configuration/resources/clone[@id='cl-test-instance']/primitive[@id='test-instance']/operations/op[@id='test-instance-start-0']/instance_attributes[@id='test-instance-start-0-instance_attributes']
>>> Nov 13 09:06:05 [6633] node01cib: info: cib_perform_op: +
>>> /cib:  @epoch=643, @num_updates=0
>>> Nov 13 09:06:05 [6633] node01cib: info: cib_process_request:
>>> Completed cib_apply_diff operation for section 'all': OK (rc=0,
>>> origin=node01/cibadmin/2, version=0.643.0)
>>> Nov 13 09:06:05 [6638] node01   crmd: info: abort_transition_graph:
>>> Transition aborted by deletion of
>>> instance_attributes[@id='test-instance-start-0-instance_attributes']:
>>> Non-status change (cib=0.643.0, source=te_update_diff:383,
>>> path=/cib/configuration/resources/clone[@id='cl-test-instance']/primitive[@id='test-instance']/operations/op[@id='test-instance-start-0']/instance_attributes[@id='test-instance-start-0-instance_attributes'],
>>> 1)
>>> Nov 13 09:06:05 [6638] node01   crmd:   notice: do_state_transition:
>>> State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC
>>> cause=C_FSA_INTERNAL origin=abort_transition_graph]
>>> Nov 13 09:06:05 [6634] node01 stonith-ng: info: xml_apply_patchset:
>>> v2 digest mis-match: expected 98ecbda94c7e87250cf2262bf89f43e8,
>>> calculated 0b344571f3e1bb852e3d10ca23183688
>>> Nov 13 09:06:05 [6634] node01 stonith-ng:   notice: update_cib_cache_cb:
>>> [cib_diff_notify] Patch aborted: Application of an update diff failed
>>> (-206)
>>> ...
>>> Nov 13 09:06:05 [6637] node01pengine: info: check_action_definition:
>>> params:reload   >> config_uri="http://192.168.168.10:8080/cgi-bin/manage_config.cgi?action=%a&resource=%n&instance=%i";
>>> start_vm="1" vlan_id_start="2" per_vlan_ip_prefix_len="24"
>>> base_img="http://192.168.168.10:8080/pre45-mguard-virt.x86_64.default.qcow2";
>>> pool_name="default" outer_phy="eth0" ip_range_prefix="10.101.0.0/16"/>
>>> Nov 13 09:06:05 [6637] node01pengine: info: check_action_definition:
>>> Parameters to test-instance:0_start_0 on rnode001 changed: was
>>> 6f9eb6bd1f87a2b9b542c31cf1b9c57e vs. now 02256597297dbb42aadc55d8d94e8c7f
>>> (reload:3.0.9) 0:0;41:3:0:95e66b6a-a190-4e61-83a7-47165fb0105d
>>> ...
>>> Nov 13 09:06:05 [6637] node01pengine:   notice: LogActions: Restart
>>> test-instance:0 (Started rnode001)
>>> 
>>> That is not what I'd expect to see.
>> 
>> Any time an instance attribute is changed for a resource, the resource is 
>> restarted/reloaded.
>> This is expected.
> 
> Maybe it makes sense to handle the very special 'trace_ra' attribute in a
> special way?

Never heard of it.  Why should it be treated specially?


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] HA setup of MySQL service using Pacemaker/DRBD

2014-11-16 Thread Sihan Goi
Hi,

I have a working 2 node HA setup running on CentOS 6.5 with a very simple
Apache webserver with replicated index.html using DRBD 8.4. The setup is
configured based on the "Clusters from Scratch" Edition 5 with Fedora 13.

I now wish to replace Apache with a MySQL database, or simply add it.
How can I do so? I'm guessing the following:
1. Add the MySQL service to the cluster with a "crm configure primitive"
command. I'm not sure what the params should be though, e.g. the config file
(see the sketch after this list).
2. Set the same colocation/order rules.
3. Create/initialize a separate DRBD partition for MySQL (or can I reuse
the same partition as Apache assuming I'll never exceed its capacity?)
4. Copy the database/table into the mounted DRBD partition.
5. Configure the cluster for DRBD as per Chapter 7.4 of the guide.
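
Something like this is what I have in mind for steps 1 and 2, assuming my 
filesystem resource ends up being called MysqlFS (untested; the parameter 
names are my guess from the ocf:heartbeat:mysql agent and may need adjusting):

primitive p_mysql ocf:heartbeat:mysql \
    params binary="/usr/bin/mysqld_safe" config="/etc/my.cnf" \
        datadir="/mnt/mysql-drbd/data" \
        pid="/var/run/mysqld/mysqld.pid" socket="/var/lib/mysql/mysql.sock" \
    op start timeout="120s" interval="0" \
    op stop timeout="120s" interval="0" \
    op monitor interval="20s" timeout="30s"
colocation mysql_with_fs inf: p_mysql MysqlFS
order fs_before_mysql inf: MysqlFS:start p_mysql:start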

Is this correct? Step-by-step instructions would be appreciated; I have
some experience with RHEL/CentOS but not with HA or MySQL.

Thanks!

-- 
- Goi Sihan
gois...@gmail.com
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] resource-stickiness not working?

2014-11-16 Thread Andrew Beekhof

> On 14 Nov 2014, at 5:52 am, Scott Donoho  wrote:
> 
> Here is a simple Active/Passive configuration with a single Dummy resource 
> (see end of message). The resource-stickiness default is set to 100. I was 
> assuming that this would be enough to keep the Dummy resource on the active 
> node as long as the active node stays healthy. However, stickiness is not 
> working as I expected in the following scenario:
> 
> 1) The node testnode1, which is running the Dummy resource, reboots or crashes
> 2) Dummy resource fails over to node testnode2
> 3) testnode1 comes back up after reboot or crash

When this happens, the cluster will check what state Dummy is in on testnode1.
My guess is that Dummy thinks it is still active (based on a stale lock file) 
and recovery is initiated quickly enough that it looks like a 'normal' migration.
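
You can see what the cluster is basing the decision on from the allocation 
scores, e.g. on the live cluster:

crm_simulate -sL | grep -i dummy

That should show whether a location preference or the probe result on 
testnode1 is pulling the resource back.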

> 4) Dummy resource fails back to testnode1 
> 
> I don't want the resource to fail back to the original node in step 4. That 
> is why resource-stickiness is set to 100. The only way I can get the resource 
> not to fail back is to set resource-stickiness to INFINITY. Is this the 
> correct behavior of resource-stickiness? What am I missing? This is not what 
> I understand from the documentation on clusterlabs.org. BTW, after reading 
> various postings on failback issues, I played with setting on-fail to 
> standby, but that doesn't seem to help either. Any help is appreciated!
> 
>Scott
> 
> node testnode1
> node testnode2
> primitive dummy ocf:heartbeat:Dummy \
> op start timeout="180s" interval="0" \
> op stop timeout="180s" interval="0" \
> op monitor interval="60s" timeout="60s" migration-threshold="5"
> xml  node="testnode2" score="INFINITY"/>
> property $id="cib-bootstrap-options" \
> dc-version="1.1.10-14.el6-368c726" \
> cluster-infrastructure="classic openais (with plugin)" \
> expected-quorum-votes="2" \
> stonith-enabled="false" \
> stonith-action="reboot" \
> no-quorum-policy="ignore" \
> last-lrm-refresh="1413378119"
> rsc_defaults $id="rsc-options" \
> resource-stickiness="100" \
> migration-threshold="5"
> 
> 
>   
> 


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Long failover

2014-11-16 Thread Andrew Beekhof

> On 14 Nov 2014, at 10:57 pm, Dmitry Matveichev  
> wrote:
> 
> Hello, 
>  
> We have a cluster configured via pacemaker+corosync+crm. The configuration is:
>  
> node master
> node slave
> primitive HA-VIP1 IPaddr2 \
> params ip=192.168.22.71 nic=bond0 \
> op monitor interval=1s
> primitive HA-variator lsb: variator \
> op monitor interval=1s \
> meta migration-threshold=1 failure-timeout=1s
> group HA-Group HA-VIP1 HA-variator
> property cib-bootstrap-options: \
> dc-version=1.1.10-14.el6-368c726 \
> cluster-infrastructure="classic openais (with plugin)" \

General advice, don't use the plugin. See:

http://blog.clusterlabs.org/blog/2013/pacemaker-and-rhel-6-dot-4/
http://blog.clusterlabs.org/blog/2013/pacemaker-on-rhel6-dot-4/

> expected-quorum-votes=2 \
> stonith-enabled=false \
>no-quorum-policy=ignore \
> last-lrm-refresh=1383871087
> rsc_defaults rsc-options: \
> resource-stickiness=100
>  
> First I take the variator service down on the master node (actually I 
> delete the service binary and kill the variator process, so the variator 
> fails to restart). Resources very quickly move to the slave node as expected. 
> Then I put the binary back on the master and restart the variator service. Now 
> I do the same thing with the binary and service on the slave node. The crm 
> status command quickly shows HA-variator (lsb: variator): Stopped, but it 
> takes too much time for us (around 1 minute) before the resources are switched 
> back to the master node.

I see what you mean:

2013-12-21T07:04:12.230827+04:00 master crmd[14267]:   notice: te_rsc_command: 
Initiating action 2: monitor HA-variator_monitor_1000 on slave.mfisoft.ru
2013-12-21T05:45:09+04:00 slave crmd[7086]:   notice: process_lrm_event: 
slave.mfisoft.ru-HA-variator_monitor_1000:106 [ variator.x is stopped\n ]

(1 minute goes by)

2013-12-21T07:05:14.232029+04:00 master crmd[14267]:error: print_synapse: 
[Action2]: In-flight rsc op HA-variator_monitor_1000 on slave.mfisoft.ru 
(priority: 0, waiting: none)
2013-12-21T07:05:14.232102+04:00 master crmd[14267]:  warning: 
cib_action_update: rsc_op 2: HA-variator_monitor_1000 on slave.mfisoft.ru timed 
out

Is there a corosync log file configured?  That would have more detail on slave.
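
I can't say from these two lines alone which timer expired, but one thing 
worth trying is giving the monitor an explicit timeout instead of relying on 
the default, roughly (untested):

primitive HA-variator lsb:variator \
    op monitor interval=1s timeout=10s \
    meta migration-threshold=1 failure-timeout=1s

A cluster-wide default can also be set (op_defaults timeout=... in recent 
crmsh, or the default-action-timeout property on older installs).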

> Then this line:
> Failed actions:
> HA-variator_monitor_1000 on slave 'unknown error' (1): call=-1, 
> status=Timed Out, last-rc-change='Sat Dec 21 03:59:45 2013', queued=0ms, 
> exec=0ms
> appears in crm status and the resources are switched.
> 
> What is that timeout? Where can I change it?
>  
> 
> Kind regards,
> Dmitriy Matveichev.
>  


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] doubt when cloning a resource

2014-11-16 Thread Andrew Beekhof

> On 15 Nov 2014, at 3:17 am, david escartin  wrote:
> 
> Hello all
> 
> we are trying to have one resource, TEST (LSB type), cloned in a 2-node cluster

That's your problem.
LSB resources cannot be cloned with globally-unique="true".

Why do you think you need globally-unique="true"?
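
If the aim is simply one copy per node that you can stop individually, an 
anonymous clone plus a temporary -INFINITY location constraint might be 
enough. A rough, untested sketch:

clone TEST-clone TEST meta clone-max=2 clone-node-max=1
# later, to stop only the copy running on node2:
location ban-TEST-on-node2 TEST-clone -inf: node2
# delete that location constraint to let the copy start again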

> but we would like to have it running on both nodes and be able to stop and 
> start it on each node independently.
> Then we would like to associate some virtual IPs (OCF type) with it, using a 
> location constraint or something like that.
> 
> but when trying to do that using globally-unique, I get an error and it seems 
> the system is not working well.
> crm configure clone TEST-clone TEST meta globally-unique="true" 
> Warnings found during check: config may not be valid 
> Do you still want to commit (y/n)? n
> 
> 
> and if I create the clone normally, then if I stop the resource on one node, 
> it also stops on the other.
> 
> Do you have any idea how to accomplish this?
> 
> thanks a lot and regards
> david


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Reset failcount for resources

2014-11-16 Thread Andrew Beekhof

> On 13 Nov 2014, at 10:08 pm, Arjun Pandey  wrote:
> 
> Hi 
> 
> I am running a 2 node cluster with this config
> 
> Master/Slave Set: foo-master [foo]
> Masters: [ bharat ]
> Slaves: [ ram ]
> AC_FLT (ocf::pw:IPaddr): Started bharat
> CR_CP_FLT (ocf::pw:IPaddr): Started bharat
> CR_UP_FLT (ocf::pw:IPaddr): Started bharat
> Mgmt_FLT (ocf::pw:IPaddr): Started bharat
> 
> where the IPaddr RA is just a modified IPaddr2 RA. Additionally I have a
> colocation constraint for the IP addresses to be colocated with the master.
> I have set migration-threshold to 2 for the VIP. I have also set the 
> failure-timeout to 15s.
> 
> 
> Initially I bring down the interface on bharat to force a switch-over to ram. 
> After this I fail the interfaces on bharat again. Now I bring the interface 
> up again on ram. However, the virtual IPs are now in the stopped state.
> 
> I don't get out of this unless I use crm_resource -C to reset the state of 
> the resources.
> However, if I check the failcount of the resources after this, it is still set 
> to INFINITY.

crm_resource didn't always reset the failcount. I'd encourage you to upgrade 
your pacemaker packages.

> Based on the documentation the failcount on a node should have expired after 
> the failure-timeout. That doesn't happen. However, why isn't the count reset 
> after the crm_resource -C command too? Is there any other command to actually 
> reset the failcount?

There should be a 'crm_failcount' tool that will do this.
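
For example (the option letters differ slightly between versions, so check 
crm_failcount --help):

crm_failcount -r AC_FLT -U bharat -G    # show the failcount for AC_FLT on bharat
crm_failcount -r AC_FLT -U bharat -D    # delete it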

> 
> Thanks in advance
> 
> Regards
> Arjun


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Intermittent Failovers: route_ais_message: Sending message to local.crmd failed: ipc delivery failed (rc=-2)

2014-11-16 Thread Andrew Beekhof

> On 11 Nov 2014, at 1:32 am, Zach Wolf  wrote:
> 
> Hey Team,
> 
> I'm seeing some strange intermittent failovers on a two-node cluster 
> (it happens once every week or two). When this happens, both nodes are 
> unavailable; one node will be marked offline and the other will be shown as 
> unclean. Any help on this would be massively appreciated. Thanks.
> 
> Running Ubuntu 12.04 (64-bit)
> Pacemaker 1.1.6-2ubuntu3.3
> Corosync 1.4.2-2ubuntu0.2
> 
> Here are the logs:
> Nov 08 14:26:26 corosync [pcmk  ] info: pcmk_ipc_exit: Client crmd 
> (conn=0x12bebe0, async-conn=0x12bebe0) left
> Nov 08 14:26:26 corosync [pcmk  ] WARN: route_ais_message: Sending message to 
> local.crmd failed: ipc delivery failed (rc=-2)
> Nov 08 14:26:27 corosync [pcmk  ] info: pcmk_ipc_exit: Client attrd 
> (conn=0x12d0230, async-conn=0x12d0230) left
> Nov 08 14:26:32 corosync [pcmk  ] info: pcmk_ipc_exit: Client cib 
> (conn=0x12c7d80, async-conn=0x12c7d80) left
> Nov 08 14:26:32 corosync [pcmk  ] info: pcmk_ipc_exit: Client stonith-ng 
> (conn=0x12c3a20, async-conn=0x12c3a20) left
> Nov 08 14:26:32 corosync [pcmk  ] WARN: route_ais_message: Sending message to 
> local.crmd failed: ipc delivery failed (rc=-2)
> Nov 08 14:26:32 corosync [pcmk  ] WARN: route_ais_message: Sending message to 
> local.cib failed: ipc delivery failed (rc=-2)

Nothing at all from the crmd, cib, attrd or stonith-ng processes?
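
If there is no log file configured, something like this in corosync.conf 
would at least capture the Pacemaker daemons' output next time (the path is 
only an example):

logging {
        to_logfile: yes
        logfile: /var/log/cluster/corosync.log
        to_syslog: yes
        timestamp: on
}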

> Nov 08 14:26:32 corosync [pcmk  ] info: pcmk_ipc: Recorded connection 
> 0x12bebe0 for stonith-ng/0
> Nov 08 14:26:32 corosync [pcmk  ] info: pcmk_ipc: Recorded connection 
> 0x12c2f40 for attrd/0
> Nov 08 14:26:33 corosync [pcmk  ] info: pcmk_ipc: Recorded connection 
> 0x12c72a0 for cib/0
> Nov 08 14:26:33 corosync [pcmk  ] info: pcmk_ipc: Sending membership update 
> 12 to cib
> Nov 08 14:26:33 corosync [pcmk  ] info: pcmk_ipc: Recorded connection 
> 0x12cb600 for crmd/0
> Nov 08 14:26:33 corosync [pcmk  ] info: pcmk_ipc: Sending membership update 
> 12 to crmd
> 
> Output of crm configure show:
> node p-sbc3 \
> attributes standby="off"
> node p-sbc4 \
> attributes standby="off"
> primitive fs lsb:FSSofia \
> op monitor interval="2s" enabled="true" timeout="10s" 
> on-fail="standby" \
> meta target-role="Started"
> primitive fs-ip ocf:heartbeat:IPaddr2 \
> params ip="10.100.0.90" nic="eth0:0" cidr_netmask="24" \
> op monitor interval="10s"
> primitive fs-ip2 ocf:heartbeat:IPaddr2 \
> params ip="10.100.0.99" nic="eth0:1" cidr_netmask="24" \
> op monitor interval="10s"
> group cluster_services fs-ip fs-ip2 fs \
> meta target-role="Started"
> property $id="cib-bootstrap-options" \
> dc-version="1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c" \
> cluster-infrastructure="openais" \
> expected-quorum-votes="2" \
> stonith-enabled="false" \
> last-lrm-refresh="1348755080" \
> no-quorum-policy="ignore"
> rsc_defaults $id="rsc-options" \
> resource-stickiness="100"


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Reset failcount for resources

2014-11-16 Thread Alexandre
On 13 Nov 2014 at 12:09, "Arjun Pandey"  wrote:
>
> Hi
>
> I am running a 2 node cluster with this config
>
> Master/Slave Set: foo-master [foo]
> Masters: [ bharat ]
> Slaves: [ ram ]
> AC_FLT (ocf::pw:IPaddr): Started bharat
> CR_CP_FLT (ocf::pw:IPaddr): Started bharat
> CR_UP_FLT (ocf::pw:IPaddr): Started bharat
> Mgmt_FLT (ocf::pw:IPaddr): Started bharat
>
> where the IPaddr RA is just a modified IPaddr2 RA. Additionally I have a
> colocation constraint for the IP addresses to be colocated with the master.
> I have set migration-threshold to 2 for the VIP. I have also set the
failure-timeout to 15s.
>
>
> Initially I bring down the interface on bharat to force a switch-over to
ram. After this I fail the interfaces on bharat again. Now I bring the
interface up again on ram. However, the virtual IPs are now in the stopped
state.
>
> I don't get out of this unless I use crm_resource -C to reset the state of
resources.
> However, if I check the failcount of the resources after this, it is still set
to INFINITY.
> Based on the documentation the failcount on a node should have expired
after the failure-timeout. That doesn't happen.

Expiration probably happens, meaning the failure is marked for expiration.
However, expired failures are only removed when the recheck timer fires, which
is controlled by cluster-recheck-interval (15 minutes by default).
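
If you need expired failures to be cleaned up sooner, you can lower it, e.g.:

crm configure property cluster-recheck-interval=2min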

> However, why isn't the count reset after the crm_resource -C
command too? Is there any other command to actually reset the failcount?
>
> Thanks in advance
>
> Regards
> Arjun
>
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Long failover

2014-11-16 Thread Andrei Borzenkov
On Mon, Nov 17, 2014 at 9:34 AM, Andrew Beekhof  wrote:
>
>> On 14 Nov 2014, at 10:57 pm, Dmitry Matveichev  
>> wrote:
>>
>> Hello,
>>
>> We have a cluster configured via pacemaker+corosync+crm. The configuration 
>> is:
>>
>> node master
>> node slave
>> primitive HA-VIP1 IPaddr2 \
>> params ip=192.168.22.71 nic=bond0 \
>> op monitor interval=1s
>> primitive HA-variator lsb: variator \
>> op monitor interval=1s \
>> meta migration-threshold=1 failure-timeout=1s
>> group HA-Group HA-VIP1 HA-variator
>> property cib-bootstrap-options: \
>> dc-version=1.1.10-14.el6-368c726 \
>> cluster-infrastructure="classic openais (with plugin)" \
>
> General advice, don't use the plugin. See:
>
> http://blog.clusterlabs.org/blog/2013/pacemaker-and-rhel-6-dot-4/
> http://blog.clusterlabs.org/blog/2013/pacemaker-on-rhel6-dot-4/
>
>> expected-quorum-votes=2 \
>> stonith-enabled=false \
>>no-quorum-policy=ignore \
>> last-lrm-refresh=1383871087
>> rsc_defaults rsc-options: \
>> resource-stickiness=100
>>
>> First I take the variator service down on the master node (actually I 
>> delete the service binary and kill the variator process, so the variator 
>> fails to restart). Resources very quickly move to the slave node as 
>> expected. Then I put the binary back on the master and restart the variator 
>> service. Now I do the same thing with the binary and service on the slave 
>> node. The crm status command quickly shows HA-variator (lsb: variator): 
>> Stopped, but it takes too much time for us (around 1 minute) before the 
>> resources are switched back to the master node.
>
> I see what you mean:
>
> 2013-12-21T07:04:12.230827+04:00 master crmd[14267]:   notice: 
> te_rsc_command: Initiating action 2: monitor HA-variator_monitor_1000 on 
> slave.mfisoft.ru
> 2013-12-21T05:45:09+04:00 slave crmd[7086]:   notice: process_lrm_event: 
> slave.mfisoft.ru-HA-variator_monitor_1000:106 [ variator.x is stopped\n ]
>
> (1 minute goes by)
>
> 2013-12-21T07:05:14.232029+04:00 master crmd[14267]:error: print_synapse: 
> [Action2]: In-flight rsc op HA-variator_monitor_1000 on slave.mfisoft.ru 
> (priority: 0, waiting: none)
> 2013-12-21T07:05:14.232102+04:00 master crmd[14267]:  warning: 
> cib_action_update: rsc_op 2: HA-variator_monitor_1000 on slave.mfisoft.ru 
> timed out
>

Is it possible that pacemaker is confused by the time difference between the
master and slave nodes?

> Is there a corosync log file configured?  That would have more detail on 
> slave.
>
>> Then this line:
>> Failed actions:
>> HA-variator_monitor_1000 on slave 'unknown error' (1): call=-1, 
>> status=Timed Out, last-rc-change='Sat Dec 21 03:59:45 2013', queued=0ms, 
>> exec=0ms
>> appears in crm status and the resources are switched.
>>
>> What is that timeout? Where can I change it?
>>
>> 
>> Kind regards,
>> Dmitriy Matveichev.
>>

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Long failover

2014-11-16 Thread Andrew Beekhof

> On 17 Nov 2014, at 6:17 pm, Andrei Borzenkov  wrote:
> 
> On Mon, Nov 17, 2014 at 9:34 AM, Andrew Beekhof  wrote:
>> 
>>> On 14 Nov 2014, at 10:57 pm, Dmitry Matveichev  
>>> wrote:
>>> 
>>> Hello,
>>> 
>>> We have a cluster configured via pacemaker+corosync+crm. The configuration 
>>> is:
>>> 
>>> node master
>>> node slave
>>> primitive HA-VIP1 IPaddr2 \
>>>params ip=192.168.22.71 nic=bond0 \
>>>op monitor interval=1s
>>> primitive HA-variator lsb: variator \
>>>op monitor interval=1s \
>>>meta migration-threshold=1 failure-timeout=1s
>>> group HA-Group HA-VIP1 HA-variator
>>> property cib-bootstrap-options: \
>>>dc-version=1.1.10-14.el6-368c726 \
>>>cluster-infrastructure="classic openais (with plugin)" \
>> 
>> General advice, don't use the plugin. See:
>> 
>> http://blog.clusterlabs.org/blog/2013/pacemaker-and-rhel-6-dot-4/
>> http://blog.clusterlabs.org/blog/2013/pacemaker-on-rhel6-dot-4/
>> 
>>>expected-quorum-votes=2 \
>>>stonith-enabled=false \
>>>   no-quorum-policy=ignore \
>>>last-lrm-refresh=1383871087
>>> rsc_defaults rsc-options: \
>>>resource-stickiness=100
>>> 
>>> First I take the variator service down on the master node (actually I 
>>> delete the service binary and kill the variator process, so the variator 
>>> fails to restart). Resources very quickly move to the slave node as 
>>> expected. Then I put the binary back on the master and restart the variator 
>>> service. Now I do the same thing with the binary and service on the slave 
>>> node. The crm status command quickly shows HA-variator (lsb: variator): 
>>> Stopped, but it takes too much time for us (around 1 minute) before the 
>>> resources are switched back to the master node.
>> 
>> I see what you mean:
>> 
>> 2013-12-21T07:04:12.230827+04:00 master crmd[14267]:   notice: 
>> te_rsc_command: Initiating action 2: monitor HA-variator_monitor_1000 on 
>> slave.mfisoft.ru
>> 2013-12-21T05:45:09+04:00 slave crmd[7086]:   notice: process_lrm_event: 
>> slave.mfisoft.ru-HA-variator_monitor_1000:106 [ variator.x is stopped\n ]
>> 
>> (1 minute goes by)
>> 
>> 2013-12-21T07:05:14.232029+04:00 master crmd[14267]:error: 
>> print_synapse: [Action2]: In-flight rsc op HA-variator_monitor_1000 on 
>> slave.mfisoft.ru (priority: 0, waiting: none)
>> 2013-12-21T07:05:14.232102+04:00 master crmd[14267]:  warning: 
>> cib_action_update: rsc_op 2: HA-variator_monitor_1000 on slave.mfisoft.ru 
>> timed out
>> 
> 
> Is it possible that pacemaker is confused by the time difference between the
> master and slave nodes?

Timeouts are all calculated locally, so it shouldn't be an issue (aside from 
trying to read the logs).

> 
>> Is there a corosync log file configured?  That would have more detail on 
>> slave.
>> 
>>> Then this line:
>>> Failed actions:
>>> HA-variator_monitor_1000 on slave 'unknown error' (1): call=-1, 
>>> status=Timed Out, last-rc-change='Sat Dec 21 03:59:45 2013', queued=0ms, 
>>> exec=0ms
>>> appears in crm status and the resources are switched.
>>> 
>>> What is that timeout? Where can I change it?
>>> 
>>> 
>>> Kind regards,
>>> Dmitriy Matveichev.
>>> 


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org