Re: [Pacemaker] stonith

2015-04-19 Thread Andreas Kurz
On 2015-04-17 12:36, Thomas Manninger wrote:
> Hi list,
>  
> i have a pacemaker/corosync2 setup with 4 nodes, stonith configured over
> ipmi interface.
>  
> My problem is, that sometimes, a wrong node is stonithed.
> As example:
> I have 4 servers: node1, node2, node3, node4
>  
> I start a hardware- reset on node node1, but node1 and node3 will be
> stonithed.

You have to tell Pacemaker explicitly which nodes each stonith resource
can fence if the stonith agent you are using does not support the "list"
action.

Do this by adding "pcmk_host_check=static-list" and "pcmk_host_list" to
every stonith resource, for example:

primitive p_stonith_node3 stonith:external/ipmi \
  op monitor interval=3s timeout=20s \
  params hostname=node3 ipaddr=10.100.0.6 passwd_method=file \
    passwd="/etc/stonith_ipmi_passwd" userid=stonith interface=lanplus \
    priv=OPERATOR \
    pcmk_host_check="static-list" pcmk_host_list="node3"

... see "man stonithd".

Best regards,
Andreas

>  
> In the cluster.log, i found following entry:
> Apr 17 11:02:41 [20473] node2   stonithd:debug:
> stonith_action_create:   Initiating action reboot for agent
> fence_legacy (target=node1)
> Apr 17 11:02:41 [20473] node2   stonithd:debug: make_args:  
> Performing reboot action for node 'node1' as 'port=node1'
> Apr 17 11:02:41 [20473] node2   stonithd:debug:
> internal_stonith_action_execute: forking
> Apr 17 11:02:41 [20473] node2   stonithd:debug:
> internal_stonith_action_execute: sending args
> Apr 17 11:02:41 [20473] node2   stonithd:debug:
> stonith_device_execute:  Operation reboot for node node1 on
> p_stonith_node3 now running with pid=113092, timeout=60s
>  
> node1 will be reset with the stonith primitive of node3?? Why??
>  
> my stonith config:
> primitive p_stonith_node1 stonith:external/ipmi \
> params hostname=node1 ipaddr=10.100.0.2 passwd_method=file \
> passwd="/etc/stonith_ipmi_passwd" userid=stonith interface=lanplus \
> priv=OPERATOR \
> op monitor interval=3s timeout=20s \
> meta target-role=Started failure-timeout=30s
> primitive p_stonith_node2 stonith:external/ipmi \
> op monitor interval=3s timeout=20s \
> params hostname=node2 ipaddr=10.100.0.4 passwd_method=file \
> passwd="/etc/stonith_ipmi_passwd" userid=stonith interface=lanplus \
> priv=OPERATOR \
> meta target-role=Started failure-timeout=30s
> primitive p_stonith_node3 stonith:external/ipmi \
> op monitor interval=3s timeout=20s \
> params hostname=node3 ipaddr=10.100.0.6 passwd_method=file \
> passwd="/etc/stonith_ipmi_passwd" userid=stonith interface=lanplus \
> priv=OPERATOR \
> meta target-role=Started failure-timeout=30s
> primitive p_stonith_node4 stonith:external/ipmi \
> op monitor interval=3s timeout=20s \
> params hostname=node4 ipaddr=10.100.0.8 passwd_method=file \
> passwd="/etc/stonith_ipmi_passwd" userid=stonith interface=lanplus \
> priv=OPERATOR \
> meta target-role=Started failure-timeout=30s
>  
> Somebody can help me??
> Thanks!
>  
> Regards,
> Thomas
> 
> 






Re: [Pacemaker] Problem with function ocf_local_nodename

2013-12-12 Thread Andreas Kurz
On 2013-12-12 17:53, Michael Böhm wrote:
> Hi @all,
> 
> i am new to pacemaker and currently setting up a test-environment for
> future production-use. Unfortunately i ran into a problem with using the
> mysql resource agent and i'm hoping someone here can enlighten me.
> 
> Server-Distro is Debian and the resource-agents are far from up-to-date.
> I wanted to use the current ra that comes with the master-slave function
> for mysql so i pulled it off the git-repo[1].
> 
> After changing the path to ocf-shellfuncs (debian obviously uses a
> different one) i ran into another problem: the mysql-ra makes use of the
> function "ocf_local_nodename" and i think this doesnt exist in the
> pacemaker version (1.0.9.1) that is installed.
> 
> How can i fix this? Its only a simple call in line 142:
>> NODENAME=$(ocf_local_nodename)

This function is part of the ocf-shellfuncs shipped with newer
resource-agents versions ... since the version you are using, quite a
few parts of the resource-agents package structure have changed.
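
If you just need a quick local workaround instead of upgrading, the
helper essentially resolves to the node's uname on pre-1.1.8 Pacemaker;
a sketch (not the upstream code):

NODENAME=$(uname -n)   # newer ocf-shellfuncs use "crm_node -n" on Pacemaker >= 1.1.8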

> 
> Is there a way to set this in the pacemaker-version i am using or should
> i consider upgrading to Version 1.1.7 or even newer? This would mean i
> also had to upgrade the Server i am planning to use this on, but with
> christmas coming i may have the (out)time to do this.

If you are on Squeeze you can also try the latest 1.1.10 Pacemaker; I'd
say that's a good idea if you are dealing with master/slave (MS) resources:

http://people.debian.org/~madkiss/pacemaker-1.1.10/

Regards,
Andreas

> 
> Thanks in advance
> 
> Mika
> 
> [1] https://github.com/ClusterLabs/resource-agents
> 
> 


-- 
Need help with Pacemaker?
http://www.hastexo.com/now






Re: [Pacemaker] restart resources in clone mode on a single node

2013-12-12 Thread Andreas Kurz
On 2013-12-12 18:06, ESWAR RAO wrote:
> Hi All,
> 
> Can someone please help me in restarting all resources of clone on
> single node.
> 
> On 3 node setup with HB+pacemaker.
> I have configured all 3 resources in clone mode with max as 2 to start
> only on node1 and node2.
> +++
> + crm configure primitive res_dummy_1 lsb::dummy_1 meta
> allow-migrate=false op monitor interval=5s
> + crm configure clone dummy_1_clone res_dummy_1 meta clone-max=2
> globally-unique=false
> + crm configure location dummy_1_clone_prefer_node dummy_1_clone -inf:
> node-3
> +++
> advisory ordering:
> + crm configure order 1-BEFORE-2 inf: dummy_1_clone dummy_2_clone
> + crm configure order 1-BEFORE-3 inf: dummy_1_clone dummy_3_clone
> 
> I expect if I kill dummy_1 on node1 , dummy_2 and dummy_3 also to be
> restarted on node-1 along with dummy_1.
> But dummy_2 and dummy_3 on node-2 are also getting restarted.
> 
> Is there any way so that I can achieve not restarting of dummy_2 and
> dummy_3 on node_2.

yes ... the meta-attribute you are looking for is: interleave=true
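
E.g. one way to set it on the existing clones (a sketch, untested here):

crm resource meta dummy_2_clone set interleave true
crm resource meta dummy_3_clone set interleave true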

Regards,
Andreas

> 
> 
> Thanks
> Eswar
> 
> 
> 
> 


-- 
Need help with Pacemaker?
http://www.hastexo.com/now






Re: [Pacemaker] Colocation constraint to External Managed Resource

2013-10-10 Thread Andreas Kurz
On 2013-10-10 18:20, Robert H. wrote:
> Hello,
> 
> Am 10.10.2013 16:18, schrieb Andreas Kurz:
> 
>> You configured a monitor operation for this unmanaged resource?
> 
> Yes, and some parts work as expected, however some behaviour is strange.
> 
> Config (relevant part only):
> 
> 
> primitive mysql-percona lsb:mysql \
> op start enabled="false" interval="0" \
> op stop enabled="false" interval="0" \
> op monitor enabled="true" timeout="20s" interval="10s" \
> meta migration-threshold="2" failure-timeout="30s"
> is-managed="false"
> clone CLONE-percona mysql-percona \
> meta clone-max="2" clone-node-max="1" is-managed="false"
> location clone-percona-placement CLONE-percona \
> rule $id="clone-percona-placement-rule" -inf: #uname ne NODE1
> and #uname ne NODE2
> colocation APP-dev2-private-percona-withip inf: IP CLONE-percona
> 
> 
> Test:
> 
> 
> I start by both Percona XtraDB machines running:
> 
>  IP-dev2-privatevip1(ocf::heartbeat:IPaddr2):   Started NODE2
>  Clone Set: CLONE-percona [mysql-percona] (unmanaged)
>  mysql-percona:0(lsb:mysql):Started NODE1 (unmanaged)
>  mysql-percona:1(lsb:mysql):Started NODE2 (unmanaged)
> 
> shell# /etc/init.d/mysql stop on NODE2

Have you verified the mysql script is LSB compliant? 
http://goo.gl/UqoHbv
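
In particular the exit code of the status action is what the cluster's
monitor relies on; a quick check (expected LSB codes in the comment):

/etc/init.d/mysql status; echo $?   # should print 0 when running, 3 when stopped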

Regards,
Andreas

> 
> ... Pacemaker reacts as expected 
> 
>  IP-dev2-privatevip1(ocf::heartbeat:IPaddr2):   Started NODE1
>  Clone Set: CLONE-percona [mysql-percona] (unmanaged)
>  mysql-percona:0(lsb:mysql):Started NODE1 (unmanaged)
>  mysql-percona:1(lsb:mysql):Started NODE2 (unmanaged) FAILED
> 
>  .. then I wait 
>  .. after some time (1 min), the resource is shown as running ...
> 
>  IP-dev2-privatevip1(ocf::heartbeat:IPaddr2):   Started NODE1
>  Clone Set: CLONE-percona [mysql-percona] (unmanaged)
>  mysql-percona:0(lsb:mysql):Started NODE1 (unmanaged)
>  mysql-percona:1(lsb:mysql):Started NODE2 (unmanaged)
> 
> But it is definitly not running:
> 
> shell# /etc/init.d/mysql status
> MySQL (Percona XtraDB Cluster) is not running  [FEHLGESCHLAGEN]
> 
> When I run probe "crm resource reprobe" it switches to:
> 
>  IP-dev2-privatevip1(ocf::heartbeat:IPaddr2):   Started NODE1
>  Clone Set: CLONE-percona [mysql-percona] (unmanaged)
>  mysql-percona:0(lsb:mysql):Started NODE1 (unmanaged)
>  Stopped: [ mysql-percona:1 ]
> 
> Then when I start it again:
> 
> /etc/init.d/mysql start on NODE2
> 
> It stays this way:
> 
>  IP-dev2-privatevip1(ocf::heartbeat:IPaddr2):   Started NODE1
>  Clone Set: CLONE-percona [mysql-percona] (unmanaged)
>  mysql-percona:0(lsb:mysql):Started NODE1 (unmanaged)
>  Stopped: [ mysql-percona:1 ]
> 
> Only a manual "reprobe" helps:
> 
>  IP-dev2-privatevip1(ocf::heartbeat:IPaddr2):   Started NODE1
>  Clone Set: CLONE-percona [mysql-percona] (unmanaged)
>  mysql-percona:0(lsb:mysql):Started NODE1 (unmanaged)
>  mysql-percona:1(lsb:mysql):Started NODE2 (unmanaged)
> 
> Same thing happens when I reboot NODE2 (or other way around).
> 
> ---
> 
> I would expect that crm_mon ALWAYS reflects the local state, however it
> looks like a bug for me.
> 
> Any hints whats missing ?
> 
> 
> 
>>
>> Regards,
>> Andreas
>>
> 


-- 
Need help with Pacemaker?
http://www.hastexo.com/now






Re: [Pacemaker] Colocation constraint to External Managed Resource

2013-10-10 Thread Andreas Kurz
On 2013-10-09 18:33, Robert H. wrote:
> Hello list,
> 
> I have a question regarding colocation.
> 
> I have an external managed resource (not part of pacemaker, but running
> on the pacemaker nodes as multi master application) - in this case
> XtraDB Cluster. I also want to keep this ressource manually managed.
> 
> For accessing the XtraDb Cluster I use a failover IP "DATABASE_IP"
> (IPaddr2) (I know, I'm actually using XtraDB as a MySQl Master / Slave
> replication alike with Master / Master handling).
> 
> Now I want to have a configuration, where Pacemaker migrates the
> DATABASE_IP away from the node if the externally managed XtraDB is
> stopped. However Pacemaker should not manage the XTraDB ressource for
> start / stopp.
> 
> I tried configuring as lsb:mysql resource with start and stop operations
> "enabled=false" -> did not work.
> I tried configuring as lsb:mysql resource with "is-managed=false" and
> failure-timeout="30s" -> did work, until I rebootet one of the machines.
> Then the resource was not discovered on the node after the reboot (it
> was shown as "stopped", while it was started already).

You configured a monitor operation for this unmanaged resource?

Regards,
Andreas

> 
> I'm using CentOS6 Pacemaker (1.1.8-7.el6-394e906) on corosync.
> 
> Any hints how to configure this ?
> 
> Regards,
> Robert
> 
> -- 
> Robert
> 


-- 
Need help with Pacemaker?
http://www.hastexo.com/now






Re: [Pacemaker] IPaddr2 between eth1 e bond1

2013-10-06 Thread Andreas Kurz
On 2013-10-05 02:04, Charles Mean wrote:
> Hello guys,
> 
> I have a cluster with 2 nginx sharing one VIP:
> 
> primitive VIP_AD_SRV ocf:heartbeat:IPaddr2 params ip="X.Y.Z.W"
> cidr_netmask="30" nic="eth1" op monitor interval="1s"
> 
> 
> The problem is that I have replaced on of those two server and the new
> one can't connect over eth1 just over bond1, so, when I move the
> resource to this new server it creates a virtual interface at eth1 and I
> never reach it.
> Is there a way to solve this situation ?

Simply don't specify "nic" unless you have several interfaces in the
same network. The resource agent will find the interface by looking for
one that already has an IP configured in the same network ... no matter
whether it is a plain, bond, bridge or VLAN interface.
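
I.e. something like this should be enough (your values, just without "nic"):

primitive VIP_AD_SRV ocf:heartbeat:IPaddr2 \
  params ip="X.Y.Z.W" cidr_netmask="30" \
  op monitor interval="1s"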

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

> 
> Thank you
> 
> 








Re: [Pacemaker] Corosync won't recover when a node fails

2013-10-04 Thread Andreas Kurz
> [quoted CIB status XML snipped: truncated lrm_rsc_op entries (start and
> monitor results) for the nfs, nfs_ip and nfs_fs resources on both nodes;
> the rest of this message is cut off in the archive]

Re: [Pacemaker] Corosync won't recover when a node fails

2013-10-03 Thread Andreas Kurz
On 2013-10-03 22:12, David Parker wrote:
> Thanks, Andrew.  The goal was to use either Pacemaker and Corosync 1.x
> from the Debain packages, or use both compiled from source.  So, with
> the compiled version, I was hoping to avoid CMAN.  However, it seems the
> packaged version of Pacemaker doesn't support CMAN anyway, so it's moot.
> 
> I rebuilt my VMs from scratch, re-installed Pacemaker and Corosync from
> the Debian packages, but I'm still having an odd problem.  Here is the
> config portion of my CIB:
> 
> 
>   
>  value="1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff"/>
>  name="cluster-infrastructure" value="openais"/>
>  name="expected-quorum-votes" value="2"/>
>  name="stonith-enabled" value="false"/>
>  name="no-quorum-policy" value="ignore"/>
>   
> 
> 
> I set no-quorum-policy=ignore based on the documentation example for a
> 2-node cluster.  But when Pacemaker starts up on the first node, the
> DRBD resource is in slave mode and none of the other resources are
> started (they depend on DRBD being master), and I see these lines in the
> log:
> 
> Oct 03 15:29:18 test-vm-1 pengine: [3742]: notice: unpack_config: On
> loss of CCM Quorum: Ignore
> Oct 03 15:29:18 test-vm-1 pengine: [3742]: notice: LogActions: Start  
> nfs_fs   (test-vm-1 - blocked)
> Oct 03 15:29:18 test-vm-1 pengine: [3742]: notice: LogActions: Start  
> nfs_ip   (test-vm-1 - blocked)
> Oct 03 15:29:18 test-vm-1 pengine: [3742]: notice: LogActions: Start  
> nfs  (test-vm-1 - blocked)
> Oct 03 15:29:18 test-vm-1 pengine: [3742]: notice: LogActions: Start  
> drbd_r0:0(test-vm-1)
> 
> I'm assuming the NFS resources show "blocked" because the resource they
> depend on is not in the correct state.
> 
> Even when the second node (test-vm-2) comes online, the state of these
> resources does not change.  I can shutdown and re-start Pacemaker over
> and over again on test-vm-2, but nothihg changes.  However... and this
> is where it gets weird... if I shut down Pacemaker on test-vm-1, then
> all of the resources immediately fail over to test-vm-2 and start
> correctly.  And I see these lines in the log:
> 
> Oct 03 15:44:26 test-vm-1 pengine: [5305]: notice: unpack_config: On
> loss of CCM Quorum: Ignore
> Oct 03 15:44:26 test-vm-1 pengine: [5305]: notice: stage6: Scheduling
> Node test-vm-1 for shutdown
> Oct 03 15:44:26 test-vm-1 pengine: [5305]: notice: LogActions: Start  
> nfs_fs   (test-vm-2)
> Oct 03 15:44:26 test-vm-1 pengine: [5305]: notice: LogActions: Start  
> nfs_ip   (test-vm-2)
> Oct 03 15:44:26 test-vm-1 pengine: [5305]: notice: LogActions: Start  
> nfs  (test-vm-2)
> Oct 03 15:44:26 test-vm-1 pengine: [5305]: notice: LogActions: Stop  
>  drbd_r0:0(test-vm-1)
> Oct 03 15:44:26 test-vm-1 pengine: [5305]: notice: LogActions: Promote
> drbd_r0:1(Slave -> Master test-vm-2)
> 
> After that, I can generally move the resources back and forth, and even
> fail them over by hard-failing a node, without any problems.  The real
> problem is that this isn't consistent, though.  Every once in a while,
> I'll hard-fail a node and the other one will go into this "stuck" state
> where Pacemaker knows it lost a node, but DRBD will stay in slave mode
> and the other resources will never start.  It seems to happen quite
> randomly.  Then, even if I restart Pacemaker on both nodes, or reboot
> them altogether, I run into the startup issue mentioned previously.
> 
> Any ideas?

Yes, share your complete resource configuration ;-)
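
E.g. the output of:

crm configure show

... or, if you prefer, the raw CIB:

cibadmin -Q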

Regards,
Andreas

> 
> Thanks,
> Dave
> 
> 
> 
> On Wed, Oct 2, 2013 at 1:01 AM, Andrew Beekhof wrote:
> 
> 
> > On 02/10/2013, at 5:24 AM, David Parker wrote:
> 
> > Thanks, I did a little Googling and found the git repository for pcs.
> 
> pcs won't help you rebuild pacemaker with cman support (or corosync
> 2.x support) turned on though.
> 
> 
> >  Is there any way to make a two-node cluster work with the stock
> Debian packages, though?  It seems odd that this would be impossible.
> 
> it really depends how the debian maintainers built pacemaker.
> by the sounds of it, it only supports the pacemaker plugin mode for
> corosync 1.x
> 
> >
> >
> > On Tue, Oct 1, 2013 at 3:16 PM, Larry Brigman
> > <larry.brig...@gmail.com> wrote:
> > pcs is another package you will need to install.
> >
> > On Oct 1, 2013 9:04 AM, "David Parker" wrote:
> > Hello,
> >
> > Sorry for the delay in my reply.  I've been doing a lot of
> experimentation, but so far I've had no luck.
> >
> > Thanks for the suggestion, but it seems I'm not able to use CMAN.
>  I'm running Debian Wheezy with Corosync and Pacemaker installed via
> apt-get.  When I installed CMAN and set up a cluster.conf file,
> Pacemaker refused to start and said that CMAN was not su

Re: [Pacemaker] Error when managing network with ping/pingd.

2013-09-18 Thread Andreas Kurz
On 2013-09-18 16:26, Francis SOUYRI wrote:
> Hello,
> 
> I take an example from Internet without monitor... Do you have a
> suggestion ?

There are a lot of examples ... like this one http://goo.gl/ZfKZeq

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

> 
> Best regards.
> 
> Francis
> On 09/18/2013 03:54 PM, Andreas Kurz wrote:
>> On 2013-09-18 15:44, Francis SOUYRI wrote:
>>> Hello Andreas,
>>>
>>> I do not see what is wrong in my config.
>>
>> You have no "monitor" operation defined for your "ping-gateway"
>> resource, so in fact it never executes the connectivity check but
>> only updates the "pingd" attribute value once on start.
>>
>> Regards,
>> Andreas
>>
>> -- 
>> Need help with Pacemaker?
>> http://www.hastexo.com/now
>>
>>
>>>
>>> [cluster CIB XML snipped: cluster options (Pacemaker 1.1.7, corosync,
>>> stonith-enabled=false) plus the IPaddr2, drbd.sh and Filesystem
>>> primitives for the named and dhcpd groups; truncated in the archive]

Re: [Pacemaker] Error when managing network with ping/pingd.

2013-09-18 Thread Andreas Kurz
On 2013-09-18 16:56, Francis SOUYRI wrote:
> Hello,
> 
> I updated ping-gateway like this but nothing change...

Hmm ... well, there is still no monitor operation; you only added some
more instance attributes.

Try:

crm configure monitor ping-gateway 20

... or the equivalent in your preferred cluster management tool.
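
If you are editing the CIB XML directly, the equivalent is an operations
block inside the ping-gateway primitive, roughly like this (the id is
just a placeholder):

<operations>
  <op id="ping-gateway-monitor-20s" name="monitor" interval="20s" timeout="60s"/>
</operations>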

Regards,
Andreas

> 
> 
> [updated ping-gateway primitive XML snipped: ocf:pacemaker:ping with
> host_list=192.168.1.1, multiplier=1000, depth=0, timeout=20s, interval=10
> and debug=true added as instance attributes]
> 
> 
> 
> Best regards.
> 
> Francis
> 
> On 09/18/2013 04:26 PM, Francis SOUYRI wrote:
>> Hello,
>>
>>  I take an example from Internet without monitor... Do you have a
>> suggestion ?
>>
>> Best regards.
>>
>> Francis
>> On 09/18/2013 03:54 PM, Andreas Kurz wrote:
>>> On 2013-09-18 15:44, Francis SOUYRI wrote:
>>>> Hello Andreas,
>>>>
>>>> I do not see what is wrong in my config.
>>>
>>> You have no "monitor" operation defined for your "ping-gateway"
>>> resource, so in fact it never executes the connectivity check but
>>> only updates the "pingd" attribute value once on start.
>>>
>>> Regards,
>>> Andreas
>>>
>>> -- 
>>> Need help with Pacemaker?
>>> http://www.hastexo.com/now
>>>
>>>
>>>>
>>>> [same cluster CIB XML as quoted in the previous message snipped;
>>>> truncated in the archive]

Re: [Pacemaker] Error when managing network with ping/pingd.

2013-09-18 Thread Andreas Kurz
   
> [remaining CIB XML snipped: IPaddr2, drbd.sh and Filesystem primitives
> for the samba group, a cloned ocf:pacemaker:ping "ping-gateway" resource
> (host_list=192.168.1.1, multiplier=1000), location constraints with score
> 50 for the dhcpd, named, named2 and samba groups plus pingd rules, and
> resource-stickiness=100; the beginning of the message is cut off in the
> archive]
> 
> 
> Best regards.
> 
> Francis
> 
> 
> On 09/18/2013 11:00 AM, Andreas Kurz wrote:
>> On 2013-09-17 16:25, Francis SOUYRI wrote:
>>> Hello,
>>>
>>> Thank for your reply Andreas, I set the multiplier to 1000, cut the
>>> netword nothing append.
>>
>> Well, the ping resource agent updates the value on every monitor event
>> ... looking again at your configuration, guess what operation is
>> missing? ;-)
>>
>> Regards,
>> Andreas
>>
> 
> 








Re: [Pacemaker] Howto test/simulate the reaction of the cluster to node up and down

2013-09-18 Thread Andreas Kurz
On 2013-09-18 15:08, Andreas Mock wrote:
> Hi all,
> 
> really nobody here with deeper experience of crm_simulate?
> Or with a hint for good documentation?

What Pacemaker version are you using? I did a quick test here on older
1.1.6 and 1.1.7 clusters and they show a nice output on "crm_simulate
-Ls -u testnode" with transitions and scores.
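
For example (the node name is a placeholder):

crm_simulate -Ls -u testnode    # simulate the node joining, show scores and transitions
crm_simulate -Ls -d testnode    # simulate the node going down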

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

> 
> Best regards
> Andreas Mock
> 
> 
> -Ursprüngliche Nachricht-
> Von: Andreas Mock [mailto:andreas.m...@web.de] 
> Gesendet: Dienstag, 17. September 2013 13:38
> An: 'The Pacemaker cluster resource manager'
> Betreff: [Pacemaker] Howto test/simulate the reaction of the cluster to node
> up and down
> 
> Hi all,
> 
> I have the problem that after a node rejoins the cluster some
> resources are move back to that node. 
> Now I want to see the calculated scores to see where I do
> have to adjust the stickyness to get the behaviour I like.
> 
> I'm not sure how to use crm_simulate to get these values.
> When both nodes are online I can simulate a node down
> by crm_simulate -Ls -d .
> But how do I simulate thr transition from a state where one
> node is down? When I bring down a node by 'service pacemaker stop'
> and try a crm_simulate -Ls -u  I don't see resource transitions.
> I only see:
> --8<
> Performing requested modifications
>  + Bringing node dis04 online
> --8<
> 
> Any hints appreciated.
> 
> Best regards
> Andreas Mock
> 
> 
> 








Re: [Pacemaker] Error when managing network with ping/pingd.

2013-09-18 Thread Andreas Kurz
On 2013-09-17 16:25, Francis SOUYRI wrote:
> Hello,
> 
> Thank for your reply Andreas, I set the multiplier to 1000, cut the
> netword nothing append.

Well, the ping resource agent updates the value on every monitor event
... looking again at your configuration, guess what operation is
missing? ;-)

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

> 
> I have these score:
> 
> Resource   Score Node   Stickiness #Fail
> IPaddr2_dhcpd  1000  noeud2.apec.fr 1000
> IPaddr2_dhcpd  1350  noeud1.apec.fr 1000
> IPaddr2_named  1000  noeud2.apec.fr 1000
> IPaddr2_named  1350  noeud1.apec.fr 1000
> IPaddr2_named2 1000  noeud1.apec.fr 1000
> IPaddr2_named2 1350  noeud2.apec.fr 1000
> IPaddr2_samba  1000  noeud1.apec.fr 1000
> IPaddr2_samba  1350  noeud2.apec.fr 1000
> ping-gateway:0 100   noeud2.apec.fr 1000
> ping-gateway:0 -INFINITY noeud1.apec.fr 1000
> ping-gateway:1 0 noeud2.apec.fr 1000
> ping-gateway:1 100   noeud1.apec.fr 1000
> 
> Best regards.
> 
> Francis
> 
> On 09/17/2013 10:21 AM, Andreas Kurz wrote:
>> On 2013-09-17 09:45, Francis SOUYRI wrote:
>>> Hello,
>>>
>>> Some help about my problem ?
>>>
>>> I have a corosync/pacemaker with 2 nodes and 2 nets by nodes,
>>> 192.168.1.0/24 for cluster access, 10.1.1.0/24 for drbd in bond, both
>>> used by corosync.
>>> I try to used ocf:pacemaker:ping to monitor the 192.168.1.0/24 I have
>>> the configuration below, but when I remove the cable of the noeud1 the
>>> named group resource do not migrate to noeud2.
>>
>> Looks like the extra score 100 from pingd of node2 is not high enough to
>> overrule the location constraints with score 50 and the
>> resource-stickiness of 100 ... you can e.g. increase the "multiplier"
>> value of your pingd resource to 1000 to be sure it overrules these
>> constraints and the stickiness.
>>
>> Regards,
>> Andreas
>>
> 
> 








Re: [Pacemaker] Error when managing network with ping/pingd.

2013-09-17 Thread Andreas Kurz
On 2013-09-17 09:45, Francis SOUYRI wrote:
> Hello,
> 
> Some help about my problem ?
> 
> I have a corosync/pacemaker with 2 nodes and 2 nets by nodes,
> 192.168.1.0/24 for cluster access, 10.1.1.0/24 for drbd in bond, both
> used by corosync.
> I try to used ocf:pacemaker:ping to monitor the 192.168.1.0/24 I have
> the configuration below, but when I remove the cable of the noeud1 the
> named group resource do not migrate to noeud2.

Looks like the extra score 100 from pingd of node2 is not high enough to
overrule the location constraints with score 50 and the
resource-stickiness of 100 ... you can e.g. increase the "multiplier"
value of your pingd resource to 1000 to be sure it overrules these
constraints and the stickiness.
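
E.g. a sketch, keeping your values otherwise:

primitive ping-gateway ocf:pacemaker:ping \
  params host_list="192.168.1.1" multiplier="1000" \
  op monitor interval="10s" timeout="60s"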

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now






Re: [Pacemaker] mysql ocf resource agent - resource stays unmanaged if binary unavailable

2013-05-17 Thread Andreas Kurz
On 2013-05-17 00:24, Vladimir wrote:
> Hi,
> 
> our pacemaker setup provides mysql resource using ocf resource agent.
> Today I tested with my colleagues forcing mysql resource to fail. I
> don't understand the following behaviour. When I remove the mysqld_safe
> binary (which path is specified in crm config) from one server and
> moving the mysql resource to this server, the resource will not fail
> back and stays in the "unmanaged" status. We can see that the function
> check_binary(); is called within the mysql ocf resource agent and
> exists with error code "5". The fail-count gets raised to INFINITY and
> pacemaker tries to "stop" the resource fails. This results in a
> "unmanaged" status.
> 
> How to reproduce:
> 
> 1. mysql resource is running on node1
> 2. on node2 mv /usr/bin/mysqld_safe{,.bak}
> 3. crm resource move group-MySQL node2
> 4. observe corosync.log and crm_mon
> 
> # cat /var/log/corosync/corosync.log
> [...]
> May 16 10:53:41 node2 lrmd: [1893]: info: operation start[119] on
> res-MySQL-IP1 for client 1896: pid 5137 exited with return code 0 May
> 16 10:53:41 node2 crmd: [1896]: info: process_lrm_event: LRM operation
> res-MySQL-IP1_start_0 (call=119, rc=0, cib-update=98, confirmed=true)
> ok May 16 10:53:41 node2 crmd: [1896]: info: do_lrm_rsc_op: Performing
> key=94:102:0:28dea763-d2a2-4b9d-b86a-5357760ed16e
> op=res-MySQL-IP1_monitor_3 ) May 16 10:53:41 node2 lrmd: [1893]:
> info: rsc:res-MySQL-IP1 monitor[120] (pid 5222) May 16 10:53:41 node2
> crmd: [1896]: info: do_lrm_rsc_op: Performing
> key=96:102:0:28dea763-d2a2-4b9d-b86a-5357760ed16e op=res-MySQL_start_0
> ) May 16 10:53:41 node2 lrmd: [1893]: info: rsc:res-MySQL start[121]
> (pid 5223) May 16 10:53:41 node2 lrmd: [1893]: info: RA output:
> (res-MySQL:start:stderr) 2013/05/16_10:53:41 ERROR: Setup problem:
> couldn't find command: /usr/bin/mysqld_safe
> 
> May 16 10:53:41 node2 lrmd: [1893]: info: operation start[121] on
> res-MySQL for client 1896: pid 5223 exited with return code 5 May 16
> 10:53:41 node2 crmd: [1896]: info: process_lrm_event: LRM operation
> res-MySQL_start_0 (call=121, rc=5, cib-update=99, confirmed=true) not
> installed May 16 10:53:41 node2 lrmd: [1893]: info: operation
> monitor[120] on res-MySQL-IP1 for client 1896: pid 5222 exited with
> return code 0 May 16 10:53:41 node2 crmd: [1896]: info:
> process_lrm_event: LRM operation res-MySQL-IP1_monitor_3 (call=120,
> rc=0, cib-update=100, confirmed=false) ok May 16 10:53:41 node2 attrd:
> [1894]: notice: attrd_ais_dispatch: Update relayed from node1 May 16
> 10:53:41 node2 attrd: [1894]: notice: attrd_trigger_update: Sending
> flush op to all hosts for: fail-count-res-MySQL (INFINITY) May 16
> 10:53:41 node2 attrd: [1894]: notice: attrd_perform_update: Sent update
> 44: fail-count-res-MySQL=INFINITY May 16 10:53:41 node2 attrd: [1894]:
> notice: attrd_ais_dispatch: Update relayed from node1 May 16 10:53:41
> node2 attrd: [1894]: notice: attrd_trigger_update: Sending flush op to
> all hosts for: last-failure-res-MySQL (1368694421) May 16 10:53:41
> node2 attrd: [1894]: notice: attrd_perform_update: Sent update 47:
> last-failure-res-MySQL=1368694421 May 16 10:53:41 node2 lrmd: [1893]:
> info: cancel_op: operation monitor[117] on res-DRBD-MySQL:1 for client
> 1896, its parameters: drbd_resource=[mysql] CRM_meta_role=[Master]
> CRM_meta_timeout=[2] CRM_meta_name=[monitor]
> crm_feature_set=[3.0.5] CRM_meta_notify=[true]
> CRM_meta_clone_node_max=[1] CRM_meta_clone=[1] CRM_meta_clone_max=[2]
> CRM_meta_master_node_max=[1] CRM_meta_interval=[29000]
> CRM_meta_globally_unique=[false] CRM_meta_master_max=[1]  cancelled May
> 16 10:53:41 node2 crmd: [1896]: info: send_direct_ack: ACK'ing resource
> op res-DRBD-MySQL:1_monitor_29000 from
> 3:104:0:28dea763-d2a2-4b9d-b86a-5357760ed16e:
> lrm_invoke-lrmd-1368694421-57 May 16 10:53:41 node2 crmd: [1896]: info:
> do_lrm_rsc_op: Performing
> key=8:104:0:28dea763-d2a2-4b9d-b86a-5357760ed16e op=res-MySQL_stop_0 )
> May 16 10:53:41 node2 lrmd: [1893]: info: rsc:res-MySQL stop[122] (pid
> 5278) [...]
> 
> I can not figure out why the fail-count gets raised to INFINITY and
> especially why pacemaker tries to stop the resource after failing.
> Shouldn't it be the best for the resource to fail back to another node
> instead of resulting in a "unmanaged" status on the node? is it
> possible to force this behavior in any way?

By default start failures are fatal and raising the fail-count to
INFINITY disallows future starts on this node until the resource, and
with it its fail-count, is cleaned up.

On a start failure Pacemaker tries to stop the resource to be sure it is
really not started or stuck somewhere in-between ... in your case the
stop fails as well, so the cluster gets stuck and puts the resource into
unmanaged mode.

Why? Because you obviously have no stonith configured that could make
sure the resource is really stopped by fencing that node.

Solution for your problem: correctly configure stonith and enable it in
your cluster
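
A rough sketch (the agent and its parameters are placeholders; adapt
them to your IPMI/iLO/... hardware):

crm configure primitive st-node2 stonith:external/ipmi \
  params hostname=node2 ipaddr=<ipmi-ip> userid=<user> passwd=<password> interface=lanplus \
  op monitor interval=60s
crm configure property stonith-enabled=true

... and once fencing works, clean up the failed resource so it becomes
managed again:

crm resource cleanup res-MySQL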

Best regards,
Andreas

Re: [Pacemaker] Stonith: How to avoid deathmatch cluster partitioning

2013-05-17 Thread Andreas Kurz
On 2013-05-16 11:01, Lars Marowsky-Bree wrote:
> On 2013-05-15T22:55:43, Andreas Kurz  wrote:
> 
>> start-delay is an option of the monitor operation ... in fact means
>> "don't trust that start was successful, wait for the initial monitor
>> some more time"
> 
> It can be used on start here though to avoid exactly this situation; and
> it works fine for that, effectively being equivalent to the "delay"
> option on stonith (since the start always precedes the fence).

Hmm ... looking at the configuration there are two stonith resources,
each one locked to a node and both started all the time, so I can't see
how that would help here in case of a split-brain ... but please correct
me if I'm missing something here.

> 
>> The problem is, this would only make sense for one single stonith
>> resource that can fence more nodes. In case of a split-brain that would
>> delay the start on that node where the stonith resource was not running
>> before and gives that node a "penalty".
> 
> Sure. In a split-brain scenario, one side will receive a penalty, that's
> the whole point of this exercise. In particular for the external/sbd
> agent.

So you are confirming my explanation, thanks ;-)

Best regards,
Andreas

> 
> Or by grouping all fencing resources to always run on one node; if you
> don't have access to RHT fence agents, for example.
> 
> external/sbd also has code to avoid a death-match cycle in case of
> persistent split-brain scenarios now; after a reboot, the node that was
> fenced will not join unless the fence is cleared first.
> 
> (The RHT world calls that "unfence", I believe.)
> 
> That should be a win for the fence_sbd that I hope to get around to
> sometime in the next few months, too ;-)
> 
>> In your example with two stonith resources running all the time,
>> Digimer's suggestion is a good idea: use one of the redhat fencing
>> agents, most of them have some sort of "stonith-delay" parameter that
>> you can use with one instance.
> 
> It'd make sense to have logic for this embedded at a higher level,
> somehow; the problem is all too common.
> 
> Of course, it is most relevant in scenarios where "split brain" is a
> significantly higher probability than "node down". Which is true for
> most test scenarios (admins love yanking cables), but in practice, it's
> mostly truly the node down.
> 
> 
> Regards,
> Lars
> 


-- 
Need help with Pacemaker?
http://www.hastexo.com/now






Re: [Pacemaker] Stonith: How to avoid deathmatch cluster partitioning

2013-05-17 Thread Andreas Kurz
On 2013-05-16 11:31, Klaus Darilion wrote:
> Hi Andreas!
> 
> On 15.05.2013 22:55, Andreas Kurz wrote:
>> On 2013-05-15 15:34, Klaus Darilion wrote:
>>> On 15.05.2013 14:51, Digimer wrote:
>>>> On 05/15/2013 08:37 AM, Klaus Darilion wrote:
>>>>> primitive st-pace1 stonith:external/xen0 \
>>>>>   params hostlist="pace1" dom0="xentest1" \
>>>>>   op start start-delay="15s" interval="0"
>>>>
>>>> Try;
>>>>
>>>> primitive st-pace1 stonith:external/xen0 \
>>>>   params hostlist="pace1" dom0="xentest1" delay="15" \
>>>>   op start start-delay="15s" interval="0"
>>>>
>>>> The idea here is that, when both nodes lose contact and initiate a
>>>> fence, 'st-pace1' will get a 15 second reprieve. That is, 'st-pace2'
>>>> will wait 15 seconds before trying to fence 'st-pace1'. If st-pace1 is
>>>> still alive, it will fence 'st-pace2' without delay, so pace2 will be
>>>> dead before it's timer expires, preventing a dual-fence. However, if
>>>> pace1 really is dead, pace2 will fence it and recovery, just with a 15
>>>> second delay.
>>>
>>> Sounds good, but pacemaker does not accept the parameter:
>>>
>>> ERROR: st-pace1: parameter delay does not exist
>>
>> start-delay is an option of the monitor operation ... in fact means
>> "don't trust that start was successful, wait for the initial monitor
>> some more time"
>>
>> The problem is, this would only make sense for one single stonith
>> resource that can fence more nodes. In case of a split-brain that would
>> delay the start on that node where the stonith resource was not running
>> before and gives that node a "penalty".
> 
> Thanks for the clarification. I already thought that the start-delay
> workaround is not useful in my setup.
> 
>> In your example with two stonith resources running all the time,
>> Digimer's suggestion is a good idea: use one of the redhat fencing
>> agents, most of them have some sort of "stonith-delay" parameter that
>> you can use with one instance.
> 
> I found it somehow confusing that a generic parameter (delay is useful
> for all stonith agents) is implemented in the agent, not in pacemaker.
> Further, downloading the RH source RPMS and extracting the agents is
> also quite cumbersome.

If you are on Ubuntu >= 12.04 or Debian Wheezy the fence-agents
package is available ... so no need for extra work ;-)
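
E.g.:

apt-get install fence-agents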

> 
> I think I will add the delay parameter to the relevant fencing agent
> myself. I guess I also have increase the stonith-timeout and add the
> configured delay.
> 
> Do you know how to submit patches for the stonith agents?

Sending them e.g. to the linux-ha-dev mailinglist is an option.

Best regards,
Andreas

> 
> Thanks
> Klaus
> 


-- 
Need help with Pacemaker?
http://www.hastexo.com/now






Re: [Pacemaker] pacemaker colocation after one node is down

2013-05-16 Thread Andreas Kurz
On 2013-05-16 13:42, Wolfgang Routschka wrote:
> Hi Andreas,
> 
> thank you for your answer.
> 
> solutions is one coloation with -score

ah, yes ... only _one_ of them with a non-negative value is needed.
Scores of all constraints are added up.

Regards,
Andreas

> 
> colocation cl_g_ip-address_not_on_r_postfix -1: g_ip-address r_postfix
> 
> Greetings Wolfgang
> 
> 
> On 2013-05-15 21:30, Wolfgang Routschka wrote:
>> Hi everybody,
>>  
>> one question today about colocation rule on a 2-node cluster on
>> scientific linux 6.4 and pacemaker/cman.
>>  
>> 2-Node Cluster
>>  
>> first node haproxy load balancer proxy service - second node with
>> postfix service.
>>  
>> colocation for running a group called g_ip-address (haproxy lsb-resouce
>> and ipaddress resource) on the other node of the postfix server is
>>  
>> cl_g_ip-address_not_on_r_postfix -inf: g_ip-address r_postfix
> 
> -INF == never-ever ;-)
> 
>>  
>> The problem is now that the node with haproxy is down pacemaker cannot
>> move/migrate the services to the other node -ok second colocation with
> lower score but it doesn't work for me
>  
> colocation cl_g_ip-address_on_r_postfix -1: g_ip-address r_postfix
>  
> What's my fault in this section?
> 
> Hard to say without seeing the rest of your configuration, but you can
> run "crm_simulate -s -L" to see all the scores taken into account.
> 
> Regards,
> Andreas
> 


-- 
Need help with Pacemaker?
http://www.hastexo.com/now






Re: [Pacemaker] Loss of ocf:pacemaker:ping target forces resources to restart?

2013-05-15 Thread Andreas Kurz
On 2013-05-15 20:44, Andrew Widdersheim wrote:
> Sorry to bring up old issues but I am having the exact same problem as the 
> original poster. A simultaneous disconnect on my two node cluster causes the 
> resources to start to transition to the other node but mid flight the 
> transition is aborted and resources are started again on the original node 
> when the cluster realizes connectivity is same between the two nodes.
> 
> I have tried various dampen settings without having any luck. Seems like the 
> nodes report the outages at slightly different times which results in a 
> partial transition of resources instead of waiting to know the connectivity 
> of all of the nodes in the cluster before taking action which is what I would 
> have thought dampen would help solve.
> 

Do you have some logs for us?

> Ideally the cluster wouldn't start the transition if another cluster node is 
> having a connectivity issue as well and connectivity status is shared between 
> all cluster nodes. Find my configuration below. Let me know there is 
> something I can change to fix or if this behavior is expected.
> 
> primitive p_drbd ocf:linbit:drbd \
> params drbd_resource="r1" \
> op monitor interval="30s" role="Slave" \
> op monitor interval="10s" role="Master"
> primitive p_fs ocf:heartbeat:Filesystem \
> params device="/dev/drbd/by-res/r1" directory="/drbd/r1" 
> fstype="ext4" options="noatime" \
> op start interval="0" timeout="60s" \
> op stop interval="0" timeout="180s" \
> op monitor interval="30s" timeout="40s"
> primitive p_mysql ocf:heartbeat:mysql \
> params binary="/usr/libexec/mysqld" config="/drbd/r1/mysql/my.cnf" 
> datadir="/drbd/r1/mysql" \
> op start interval="0" timeout="120s" \
> op stop interval="0" timeout="120s" \
> op monitor interval="30s" \
> meta target-role="Started"
> primitive p_ping ocf:pacemaker:ping \
> params host_list="192.168.5.1" dampen="30s" multiplier="1000" 
> debug="true" \
> op start interval="0" timeout="60s" \
> op stop interval="0" timeout="60s" \
> op monitor interval="5s" timeout="10s"
> group g_mysql_group p_fs p_mysql \
> meta target-role="Started"
> ms ms_drbd p_drbd \
> meta notify="true" master-max="1" clone-max="2" target-role="Started"
> clone cl_ping p_ping
> location l_connected g_mysql \
> rule $id="l_connected-rule" pingd: defined pingd
> colocation c_mysql_on_drbd inf: g_mysql ms_drbd:Master
> order o_drbd_before_mysql inf: ms_drbd:promote g_mysql:start
> property $id="cib-bootstrap-options" \
> dc-version="1.1.6-1.el6-8b6c6b9b6dc2627713f870850d20163fad4cc2a2" \
> cluster-infrastructure="Heartbeat" \

Hmm ... you compiled your own Pacemaker version that supports Heartbeat
on RHEL6?

Best regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

> no-quorum-policy="ignore" \
> stonith-enabled="false" \
> cluster-recheck-interval="5m" \
> last-lrm-refresh="1368632470"
> rsc_defaults $id="rsc-options" \
> migration-threshold="5" \
> resource-stickiness="200"   
> 







Re: [Pacemaker] pacemaker colocation after one node is down

2013-05-15 Thread Andreas Kurz
On 2013-05-15 21:30, Wolfgang Routschka wrote:
> Hi everybody,
>  
> one question today about colocation rule on a 2-node cluster on
> scientific linux 6.4 and pacemaker/cman.
>  
> 2-Node Cluster
>  
> first node haproxy load balancer proxy service - second node with
> postfix service.
>  
> colocation for running a group called g_ip-address (haproxy lsb-resouce
> and ipaddress resource) on the other node of the postfix server is
>  
> cl_g_ip-address_not_on_r_postfix -inf: g_ip-address r_postfix

-INF == never-ever ;-)

>  
> The problem is now that the node with haproxy is down pacemaker cannot
> move/migrate the services to the other node -ok second colocation with
> lower score but it doesn't work for me
>  
> colocation cl_g_ip-address_on_r_postfix -1: g_ip-address r_postfix
>  
> What's my fault in this section?

Hard to say without seeing the rest of your configuration, but you can
run "crm_simulate -s -L" to see all the scores taken into account.

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

>  
> How can I migrate my group to the other if the master node for it is dead?
>  
> Greetings Wolfgang
>  
>  
> 
> 








Re: [Pacemaker] Stonith: How to avoid deathmatch cluster partitioning

2013-05-15 Thread Andreas Kurz
On 2013-05-15 15:34, Klaus Darilion wrote:
> On 15.05.2013 14:51, Digimer wrote:
>> On 05/15/2013 08:37 AM, Klaus Darilion wrote:
>>> primitive st-pace1 stonith:external/xen0 \
>>>  params hostlist="pace1" dom0="xentest1" \
>>>  op start start-delay="15s" interval="0"
>>
>> Try;
>>
>> primitive st-pace1 stonith:external/xen0 \
>>  params hostlist="pace1" dom0="xentest1" delay="15" \
>>  op start start-delay="15s" interval="0"
>>
>> The idea here is that, when both nodes lose contact and initiate a
>> fence, 'st-pace1' will get a 15 second reprieve. That is, 'st-pace2'
>> will wait 15 seconds before trying to fence 'st-pace1'. If st-pace1 is
>> still alive, it will fence 'st-pace2' without delay, so pace2 will be
>> dead before it's timer expires, preventing a dual-fence. However, if
>> pace1 really is dead, pace2 will fence it and recovery, just with a 15
>> second delay.
> 
> Sounds good, but pacemaker does not accept the parameter:
> 
>ERROR: st-pace1: parameter delay does not exist

start-delay is an option of the monitor operation ... in fact it means
"don't trust that the start was successful, wait some more time before
running the initial monitor".

The problem is, this would only make sense for a single stonith
resource that can fence multiple nodes. In case of a split-brain it would
delay the start on the node where the stonith resource was not running
before, giving that node a "penalty".

In your example with two stonith resources running all the time,
Digimer's suggestion is a good idea: use one of the redhat fencing
agents, most of them have some sort of "stonith-delay" parameter that
you can use with one instance.
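
Just as an illustration (untested, and parameter names differ between
agents, so check the agent metadata first), with one of those agents it
could look roughly like:

primitive st-pace1 stonith:fence_xvm \
  params port="pace1" delay="15" \
  op monitor interval="60s"

... and leave the second stonith resource without a delay.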

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now


> 
> The syntax suggested by you assumes that "delay" is a parameter accepted
> by the stonith resource. But this is not the case. Also "grep delay
> /usr/lib/stonith/plugins/external/*" does not reveal a single stonith
> resource which accepts this parameter.
> 
> Further, it would make sense to have "delay" as Pacemaker parameter. I
> also tried
>   primitive st-pace1 stonith:external/xen0 delay="15" \
> params hostlist="pace1" dom0="xentest1" \
> op start start-delay="15s" interval="0"
> but this also gives syntax errors.
> 
> Any other hints?
> 
> thanks
> Klaus
> 
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org





signature.asc
Description: OpenPGP digital signature
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] help with DRBD/pacemaker

2013-05-13 Thread Andreas Kurz
On 2013-05-13 11:54, Michael Schwartzkopff wrote:
> Hi,
> 
>  
> 
> I experienced a strange phenomena when setting up DRBD 8.4.2 with
> pacemaker 1.1.9.
> 
>  
> 
> I can set up a dual primary DRBD manually without any problems. Now shut
> down one node, i.e. I demote the DRBD and the shut is "down".
> 
>  
> 
> When I start the cluster on the first node everything if fine. The CRM
> detects that the DRBD is promoted. When I start pacemaker in the second
> cluster node I see
> 
>  
> 
> - pacemaker starts DRBD nicely on the second node. It detects the
> correct generation identifiers and the operations start and notify (2x)
> show success.
> 
>  
> 
> Now the cluster wants to promote the second node and it generates a new
> generation identified:
> 
> 2013-05-13T10:41:44.862939+02:00 xray kernel: block drbd0: role(
> Secondary -> Primary )
> 
> 2013-05-13T10:41:44.862957+02:00 xray kernel: block drbd0: new current
> UUID 35F2E1EB875F145F:2897B49E931464BC:6FE576F8519973CA:6FE476F8519973CB
> 
>  
> 
> Then DRBD starts the connection and DRBD detects the different primary
> generation identifiers. The result is a split brain.
> 
>  
> 
> Any idea why DRBD 8.4.2 does not start the network connection when
> starting the DRBD on the second node? Why does the cluster promote the
> DRBD (and thus generating the new GI) WITHOUT the connection between the
> DRBD?

IIRC we already had that discussion here on the list ... Pacemaker is too
fast and is ready to promote before DRBD can connect (and nothing
"forbids" it from doing so). But with a correct resource-level fencing
setup in DRBD that should work.
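
For reference, a minimal sketch of such a resource-level fencing setup
in drbd.conf (paths may differ on your distribution):

resource r0 {
  disk {
    fencing resource-only;   # or resource-and-stonith if you also fence on node level
  }
  handlers {
    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
  ...
}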

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

> 
>  
> 
> Thanks for any help.
> 
>  
> 
> -- 
> 
> Dr. Michael Schwartzkopff
> 
> Guardinistr. 63
> 
> 81375 München
> 
>  
> 
> Tel: (0163) 172 50 98
> 
> 
> 
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 




signature.asc
Description: OpenPGP digital signature
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] [Problem][crmsh]The designation of the 'ordered' attribute becomes the error.

2013-04-01 Thread Andreas Kurz
Hi Dejan,

On 2013-03-06 11:59, Dejan Muhamedagic wrote:
> Hi Hideo-san,
> 
> On Wed, Mar 06, 2013 at 10:37:44AM +0900, renayama19661...@ybb.ne.jp wrote:
>> Hi Dejan,
>> Hi Andrew,
>>
>> As for the crm shell, the check of the meta attribute was revised with the 
>> next patch.
>>
>>  * http://hg.savannah.gnu.org/hgweb/crmsh/rev/d1174f42f4b3
>>
>> This patch was backported in Pacemaker1.0.13.
>>
>>  * 
>> https://github.com/ClusterLabs/pacemaker-1.0/commit/fa1a99ab36e0ed015f1bcbbb28f7db962a9d1abc#shell/modules/cibconfig.py
>>
>> However, the ordered,colocated attribute of the group resource is treated as 
>> an error when I use crm Shell which adopted this patch.
>>
>> --
>> (snip)
>> ### Group Configuration ###
>> group master-group \
>> vip-master \
>> vip-rep \
>> meta \
>> ordered="false"
>> (snip)
>>
>> [root@rh63-heartbeat1 ~]# crm configure load update test2339.crm 
>> INFO: building help index
>> crm_verify[20028]: 2013/03/06_17:57:18 WARN: unpack_nodes: Blind faith: not 
>> fencing unseen nodes
>> WARNING: vip-master: specified timeout 60s for start is smaller than the 
>> advised 90
>> WARNING: vip-master: specified timeout 60s for stop is smaller than the 
>> advised 100
>> WARNING: vip-rep: specified timeout 60s for start is smaller than the 
>> advised 90
>> WARNING: vip-rep: specified timeout 60s for stop is smaller than the advised 
>> 100
>> ERROR: master-group: attribute ordered does not exist  -> WHY?
>> Do you still want to commit? y
>> --
>>
>> If it chooses `yes` by a confirmation message, it is reflected, but it is a 
>> problem that error message is displayed.
>>  * The error occurs in the same way when I appoint colocated attribute.
>> AndI noticed that there was not explanation of ordered,colocated of the 
>> group resource in online help of Pacemaker.
>>
>> I think that the designation of the ordered,colocated attribute should not 
>> become the error in group resource.
>> In addition, I think that ordered,colocated should be added to online help.
> 
> These attributes are not listed in crmsh. Does the attached patch
> help?

Dejan, will this patch for the missing "ordered" and "collocated" group
meta-attribute be included in the next crmsh release? ... can't see the
patch in the current tip.

Thanks & Regards,
Andreas

> 
> Thanks,
> 
> Dejan
>>
>> Best Regards,
>> Hideo Yamauchi.
>>
>>
>> ___
>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>>
>>
>> ___
>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org





signature.asc
Description: OpenPGP digital signature
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] pacemaker node stuck offline

2013-03-25 Thread Andreas Kurz
On 2013-03-22 03:39, pacema...@feystorm.net wrote:
> 
> On 03/21/2013 11:15 AM, Andreas Kurz wrote:
>> On 2013-03-21 14:31, Patrick Hemmer wrote:
>>> I've got a 2-node cluster where it seems last night one of the nodes
>>> went offline, and I can't see any reason why.
>>>
>>> Attached are the logs from the 2 nodes (the relevant timeframe seems to
>>> be 2013-03-21 between 06:05 and 06:10).
>>> This is on ubuntu 12.04
> 
>> Looks like your non-redundant cluster-communication was interrupted at
>> around that time for whatever reason and your cluster split-brained.
> 
>> Does the drbd-replication use a different network-connection? If yes,
>> why not using it for a redundant ring setup ... and you should use
> STONITH.
> 
>> I also wonder why you have defined "expected_votes='1'" in your
>> cluster.conf.
> 
>> Regards,
>> Andreas
> But shouldn't it have recovered? The node shows as "OFFLINE", even
> though it's clearly communicating with the rest of the cluster. What is
> the procedure for getting the node back online. Anything other than
> bouncing pacemaker?

Looks like the cluster has some trouble trying to rejoin the two DCs
after the split-brain. Try to stop cman/Pacemaker on i-3307d96b and
clean out the /var/lib/heartbeat/crm directory there, so it starts with an
empty configuration and receives the latest updates from i-a706d8ff.
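
Roughly something like this (sketch, assuming the Ubuntu/cman init
scripts; double-check the path before deleting anything):

# on i-3307d96b
service pacemaker stop
service cman stop
rm -f /var/lib/heartbeat/crm/*
service cman start
service pacemaker start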

> 
> Unfortunately no to the different network connection for drbd. These are
> 2 EC2 instances, so redundant connections aren't available. Though since
> it is EC2, I could set up a STONITH to whack the other instance. The
> only problem here would be a race condition. The EC2 api for shutting
> down or rebooting an instance isn't instantaneous. Both nodes could end
> up sending the signal to reboot the other node.

Yeah, you would need to add a very generous start-timeout to the monitor
operation of the stonith primitive ... but it works ;-)

> 
> As for expected_votes=1, it's because it's a two-node cluster. Though I
> apparently forgot to set the `two_node` attribute :-(

Those two parameters should not be needed for a cman/pacemaker cluster,
you can tell pacemaker to ignore loss of quorum.
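
For example:

crm configure property no-quorum-policy=ignore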

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now




signature.asc
Description: OpenPGP digital signature
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] issues when installing on pxe booted environment

2013-03-25 Thread Andreas Kurz
On 2013-03-22 19:31, John White wrote:
> Hello Folks,
>   We're trying to get a corosync/pacemaker instance going on a 4 node 
> cluster that boots via pxe.  There have been a number of state/file system 
> issues, but those appear to be *mostly* taken care of thus far.  We're 
> running into an issue now where cib just isn't staying up with errors akin to 
> the following (sorry for the lengthy dump, note the attrd and cib connection 
> errors).  Any ideas would be greatly appreciated: 
> 
> Mar 22 11:25:18 n0014 cib: [25839]: info: validate_with_relaxng: Creating RNG 
> parser context
> Mar 22 11:25:18 n0014 attrd: [25841]: info: Invoked: 
> /usr/lib64/heartbeat/attrd 
> Mar 22 11:25:18 n0014 attrd: [25841]: info: crm_log_init_worker: Changed 
> active directory to /var/lib/heartbeat/cores/hacluster
> Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Starting up
> Mar 22 11:25:18 n0014 attrd: [25841]: info: get_cluster_type: Cluster type 
> is: 'corosync'
> Mar 22 11:25:18 n0014 attrd: [25841]: notice: crm_cluster_connect: Connecting 
> to cluster infrastructure: corosync
> Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: init_cpg_connection: Could not 
> connect to the Cluster Process Group API: 2
> Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: main: HA Signon failed
> Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Cluster connection active
> Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Accepting attribute updates
> Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: main: Aborting startup
> Mar 22 11:25:18 n0014 pengine: [25842]: info: Invoked: 
> /usr/lib64/heartbeat/pengine 
> Mar 22 11:25:18 n0014 pengine: [25842]: info: crm_log_init_worker: Changed 
> active directory to /var/lib/heartbeat/cores/hacluster
> Mar 22 11:25:18 n0014 pengine: [25842]: debug: main: Checking for old 
> instances of pengine
> Mar 22 11:25:18 n0014 pengine: [25842]: debug: 
> init_client_ipc_comms_nodispatch: Attempting to talk on: /var/run/crm/pengine

That "/var/run/crm" directory is available and owned by
hacluster.haclient ... and writable by at least the hacluster user?
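
A quick check/fix on the PXE-booted nodes could look like this (sketch;
ownership and mode here are just a guess, normally the packaging/init
scripts create that directory for you):

ls -ld /var/run/crm
install -d -o hacluster -g haclient -m 750 /var/run/crm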

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

> Mar 22 11:25:18 n0014 pacemakerd: [25834]: ERROR: pcmk_child_exit: Child 
> process attrd exited (pid=25841, rc=100)
> Mar 22 11:25:18 n0014 pacemakerd: [25834]: notice: pcmk_child_exit: Child 
> process attrd no longer wishes to be respawned
> Mar 22 11:25:18 n0014 pacemakerd: [25834]: info: update_node_processes: Node 
> n0014.lustre now has process list: 00110312 (was 
> 00111312)
> Mar 22 11:25:18 n0014 pengine: [25842]: debug: 
> init_client_ipc_comms_nodispatch: Could not init comms on: 
> /var/run/crm/pengine
> Mar 22 11:25:18 n0014 pengine: [25842]: debug: main: Init server comms
> Mar 22 11:25:18 n0014 pengine: [25842]: info: main: Starting pengine
> Mar 22 11:25:18 n0014 stonith-ng: [25838]: debug: init_cpg_connection: Adding 
> fd=4 to mainloop
> Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: init_ais_connection_once: 
> Connection to 'corosync': established
> Mar 22 11:25:18 n0014 stonith-ng: [25838]: debug: crm_new_peer: Creating 
> entry for node n0014.lustre/247988234
> Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: crm_new_peer: Node 
> n0014.lustre now has id: 247988234
> Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: crm_new_peer: Node 247988234 
> is now known as n0014.lustre
> Mar 22 11:25:18 n0014 stonith-ng: [25838]: debug: 
> init_client_ipc_comms_nodispatch: Attempting to talk on: /var/run/crm/pcmk
> Mar 22 11:25:18 n0014 crmd: [25843]: info: Invoked: /usr/lib64/heartbeat/crmd 
> Mar 22 11:25:18 n0014 pacemakerd: [25834]: debug: pcmk_client_connect: 
> Channel 0x995530 connected: 1 children
> Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: main: Starting stonith-ng 
> mainloop
> Mar 22 11:25:18 n0014 crmd: [25843]: info: crm_log_init_worker: Changed 
> active directory to /var/lib/heartbeat/cores/hacluster
> Mar 22 11:25:18 n0014 crmd: [25843]: info: main: CRM Hg Version: 
> a02c0f19a00c1eb2527ad38f146ebc0834814558
> Mar 22 11:25:18 n0014 crmd: [25843]: info: crmd_init: Starting crmd
> Mar 22 11:25:18 n0014 crmd: [25843]: debug: s_crmd_fsa: Processing I_STARTUP: 
> [ state=S_STARTING cause=C_STARTUP origin=crmd_init ]
> Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_fsa_action: actions:trace: 
> #011// A_LOG   
> Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_fsa_action: actions:trace: 
> #011// A_STARTUP
> Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_startup: Registering Signal 
> Handlers
> Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_startup: Creating CIB and LRM 
> objects
> Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: crm_update_peer: Node 
> n0014.lustre: id=247988234 state=unknown addr=(null) votes=0 born=0 seen=0 
> proc=00110312 (new)
> Mar 22 11:25:18 n0014 crmd: [25843]: info: G_main_add_SignalHandler: Added 
> signal handler for signal 17
> Mar 22 11:25:1

Re: [Pacemaker] Resource is Too Active (on both nodes)

2013-03-25 Thread Andreas Kurz
On 2013-03-22 21:35, Mohica Jasha wrote:
> Hey,
> 
> I have two cluster nodes.
> 
> I have a service process which is prone to crash and takes a very long
> time to start. 
> Since the service process takes a long time to start I have the service
> process running on both nodes, but only the active node with the virtual
> IP serves the incoming requests.
> 
> On both nodes, I have a cron job which periodically checks if the
> service process is up and if not it starts the service.
> 
> I want pacemaker to periodically check if the service is down on the
> active node and if so, it switches the virtual IP to the second node
> (without starting or stopping the my service)
> 
> I have the following configuration:
> 
> primitive clusterIP ocf:heartbeat:IPaddr2 \
> params ip="10.0.1.247" \
> op monitor interval="10s" timeout="20s"
> 
> primitive serviceMonitoring ocf:serviceMonitoring:serviceMonitoring 
> params op monitor interval="10s" timeout="20s"
> 
> colocation HACluster inf: serviceMonitoring clusterIP
> order serviceMonitoring-after-clusterIP inf: clusterIP serviceMonitoring
> 
> My serviceMonitoring resource doesn't do anything other than checking
> the state of the service process. I get the following in the log file:
> 
> Mar 05 15:07:59 [1543] ha1 pengine:   notice: unpack_rsc_op: Operation
> monitor found resource serviceMonitoring active on ha2
> Mar 05 15:07:59 [1543] ha1 pengine:   notice: unpack_rsc_op: Operation
> monitor found resource serviceMonitoring active on ha1
> Mar 05 15:07:59 [1543] ha1 pengine:error: native_create_actions:
> Resource serviceMonitoring (ocf:: serviceMonitoring) is active on 2
> nodes attempting recovery
> Mar 05 15:07:59 [1543] ha1 pengine:  warning: native_create_actions: See
> http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
> 
> So it seems that pacemaker calls the monitor method of the
> serviceMonitoring resource on both nodes.

Yes, it probes the resources on all nodes ... clone your
serviceMonitoring resource and set it to unmanaged mode, that should
give you the desired behavior ... or simply clone it, let Pacemaker
do the complete management and go without your cron-check-restart magic.
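
A rough sketch with your resource names (untested, and adjust or drop
your existing HACluster colocation accordingly):

clone cl_serviceMonitoring serviceMonitoring \
  meta is-managed="false"
colocation ip-with-service inf: clusterIP cl_serviceMonitoring

... so the IP only runs on a node where the (unmanaged) monitor still
reports the service as running.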

Regards,
Andreas

> 
> Any idea how I can fix this?
> 
> Thanks,
> Mohica
> 
> 
> 
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 


-- 
Need help with Pacemaker?
http://www.hastexo.com/now


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] OCF Resource agent promote question

2013-03-25 Thread Andreas Kurz
Hi Steve,

On 2013-03-25 18:44, Steven Bambling wrote:
> All,
> 
> I'm trying to work on a OCF resource agent that uses postgresql
> streaming replication.  I'm running into a few issues that I hope might
> be answered or at least some pointers given to steer me in the right
> direction.

Why are you not using the existing pgsql RA? It is capable of doing
synchronous and asynchronous replication and it is known to work fine.
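
Just to give an idea, a heavily simplified sketch of a streaming
replication setup with the existing pgsql RA (not a drop-in config,
all parameter values are made up; see the RA metadata for the full list
of replication options):

primitive p_pgsql ocf:heartbeat:pgsql \
  params pgctl="/usr/pgsql-9.2/bin/pg_ctl" psql="/usr/pgsql-9.2/bin/psql" \
    pgdata="/var/lib/pgsql/9.2/data" rep_mode="sync" \
    node_list="p1.example.net p2.example.net" master_ip="10.0.0.10" \
    restart_on_promote="true" \
  op monitor interval="10s" role="Master" \
  op monitor interval="11s" role="Slave"
ms ms_pgsql p_pgsql \
  meta master-max="1" clone-max="2" notify="true"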

Best regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now


> 
> 1.  A quick way of obtaining a list of "Online" nodes in the cluster
> that a resource will be able to migrate to.  I've accomplished it with
> some grep and see but its not pretty or fast.
> 
> # time pcs status | grep Online | sed -e "s/.*\[\(.*\)\]/\1/" | sed 's/ //'
> p1.example.net  p2.example.net
> 
> 
> real0m2.797s
> user0m0.084s
> sys0m0.024s
> 
> Once I get a list of active/online nodes in the cluster my thinking was
> to use PSQL to get the current xlog location and lag or each of the
> remaining nodes and compare them.  If the node has a greater log
> position and/or less lag it will be given a greater master preference.  
> 
> 2.  How to force a monitor/probe before a promote is run on ALL nodes to
> make sure that the master preference is up to date before
> migrating/failing over the resource.
> - I was thinking that maybe during the promote call it could get the log
> location and lag from each of the nodes via an psql call ( like above)
> and then force the resource to a specific node.  Is there a way to do
> this and does it sound like a sane idea ?
> 
> 
> The start of my RA is located here suggestions and comments 100%
> welcome https://github.com/smbambling/pgsqlsr/blob/master/pgsqlsr
> 
> v/r
> 
> STEVE
> 
> 
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 




___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] pacemaker node stuck offline

2013-03-21 Thread Andreas Kurz
On 2013-03-21 14:31, Patrick Hemmer wrote:
> I've got a 2-node cluster where it seems last night one of the nodes
> went offline, and I can't see any reason why.
> 
> Attached are the logs from the 2 nodes (the relevant timeframe seems to
> be 2013-03-21 between 06:05 and 06:10).
> This is on ubuntu 12.04

Looks like your non-redundant cluster-communication was interrupted at
around that time for whatever reason and your cluster split-brained.

Does the drbd-replication use a different network-connection? If yes,
why not using it for a redundant ring setup ... and you should use STONITH.
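
For the redundant ring idea, the plain corosync variant would look
roughly like this (sketch, the second network is made up; with cman the
equivalent is, I think, an <altname> per clusternode in cluster.conf):

totem {
  rrp_mode: passive
  interface {
    ringnumber: 0
    bindnetaddr: 10.209.45.0
    ...
  }
  interface {
    ringnumber: 1
    bindnetaddr: 192.168.10.0
    ...
  }
}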

I also wonder why you have defined "expected_votes='1'" in your
cluster.conf.

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now


> 
> # crm status
> 
> Last updated: Thu Mar 21 13:17:21 2013
> Last change: Thu Mar 14 14:42:18 2013 via crm_shadow on i-a706d8ff
> Stack: cman
> Current DC: i-a706d8ff - partition WITHOUT quorum
> Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
> 2 Nodes configured, unknown expected votes
> 5 Resources configured.
> 
> 
> Online: [ i-a706d8ff ]
> OFFLINE: [ i-3307d96b ]
> 
>  dns-postgresql(ocf::cloud:route53):Started i-a706d8ff
>  Master/Slave Set: ms-drbd-postgresql [drbd-postgresql]
>  Masters: [ i-a706d8ff ]
>  Stopped: [ drbd-postgresql:0 ]
>  fs-drbd-postgresql(ocf::heartbeat:Filesystem):Started i-a706d8ff
>  postgresql(ocf::heartbeat:pgsql):Started i-a706d8ff
> 
> 
> # cman_tool nodes
> Node  Sts   Inc   Joined   Name
> 181480898   M  4   2013-03-14 14:25:27  i-3307d96b
> 181481642   M   5132   2013-03-21 06:07:40  i-a706d8ff
> 
> 
> # cman_tool status
> Version: 6.2.0
> Config Version: 1
> Cluster Name: cloudapp-servic
> Cluster Id: 63629
> Cluster Member: Yes
> Cluster Generation: 5132
> Membership state: Cluster-Member
> Nodes: 2
> Expected votes: 1
> Total votes: 2
> Node votes: 1
> Quorum: 2 
> Active subsystems: 4
> Flags:
> Ports Bound: 0 
> Node name: i-3307d96b
> Node ID: 181480898
> Multicast addresses: 255.255.255.255
> Node addresses: 10.209.45.194
> 
> 
> 
> # cat /etc/cluster/cluster.conf
> 
> 
>  syslog_priority='debug' />
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 





signature.asc
Description: OpenPGP digital signature
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] ping resource polling skew

2013-03-20 Thread Andreas Kurz
On 2013-03-20 04:11, Quentin Smith wrote:
> On Wed, 20 Mar 2013, Andreas Kurz wrote:
> 
>> On 2013-03-19 17:02, Quentin Smith wrote:
>>> Hi-
>>>
>>> I have my cluster configured to use a cloned ping resource, such that I
>>> can write a constraint that I prefer resources to run on a node that has
>>> network connectivity. That works fine if a machine loses its network
>>> connection (the ping attribute goes to 0, resources migrate to another
>>> machine, etc.).
>>>
>>> However, if instead what happens is the ping /target/ goes offline, it
>>> seems that Pacemaker will bounce resources around the cluster, as each
>>> node notices that the ping target is unreachable at a slightly different
>>> time.
>>
>> using more than one targets is always a good idea and choose targets
>> that are also highly available
> 
> Sure, and we have. No choice of ping targets is going to help us if the
> servers are partitioned from the rest of the network, through.

In that case typically no-one can access the services anyway, so not
having resources bouncing around is more of a cosmetic issue ... assuming
it is not extremely expensive to start/stop them, e.g. because of cold
caches.

> 
>>> Is there any way to get Pacemaker to delay resource transitions until at
>>> least one full polling cycle has happened, so that in the event of an
>>> outage of the ping target, resources stay put where they are running?
>>
>> there is the "dampen" parameter  use a high value like 3 or more
>> times the monitor-interval to give all nodes the chance to detect the
>> dead target(s), that should help.
> 
> Does that actually help in this case? My understanding is that the
> dampen parameter will delay the attribute change for each host, but
> those delays will still tick down separately for each node, resulting in
> exactly the same behavior, just delayed by dampen seconds.

if the dampen time-out is reached and there was a permanent change of
that attribute on one node, all nodes flush their current value ... so
yes, that should actually help.

Regards,
Andreas

> 
> --Quentin
> 
>>
>> Regards,
>> Andreas
>>
>> -- 
>> Need help with Pacemaker?
>> http://www.hastexo.com/now
>>
>>>
>>> --Quentin
>>>
>>> ___
>>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org
>>
>>
>>
>>
>>
>> ___
>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>>
> 
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org



signature.asc
Description: OpenPGP digital signature
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] ping resource polling skew

2013-03-19 Thread Andreas Kurz
On 2013-03-19 17:02, Quentin Smith wrote:
> Hi-
> 
> I have my cluster configured to use a cloned ping resource, such that I
> can write a constraint that I prefer resources to run on a node that has
> network connectivity. That works fine if a machine loses its network
> connection (the ping attribute goes to 0, resources migrate to another
> machine, etc.).
> 
> However, if instead what happens is the ping /target/ goes offline, it
> seems that Pacemaker will bounce resources around the cluster, as each
> node notices that the ping target is unreachable at a slightly different
> time.

using more than one target is always a good idea, and choose targets
that are also highly available

> 
> Is there any way to get Pacemaker to delay resource transitions until at
> least one full polling cycle has happened, so that in the event of an
> outage of the ping target, resources stay put where they are running?

there is the "dampen" parameter  use a high value like 3 or more
times the monitor-interval to give all nodes the chance to detect the
dead target(s), that should help.
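
E.g. something like this (sketch, pick targets and values that fit your
network):

primitive p_ping ocf:pacemaker:ping \
  params host_list="10.0.0.1 10.0.0.254" multiplier="1000" dampen="60s" \
  op monitor interval="20s" timeout="60s"
clone cl_ping p_ping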

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

> 
> --Quentin
> 
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org





___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Trouble with DRBD mount

2013-02-28 Thread Andreas Kurz
On 2013-02-28 13:19, senrab...@aol.com wrote:
> Hi All:
> 
> We are stuck trying to get pacemaker to work with DRBD, and having tried
> various alternatives can't get our "drbd1" to mount and get some errors.
> 
> NOTE:  we are trying to get pacemaker to work with an existing Encrypted
> RAID1 LVM setup - is this impossible or a "just plain bad idea"?   We
> were thinking we'd like the potential advantages of local RAID on each
> box as well as the Internet RAID & failover provided by DRBD/pacemaker.
>  We're using Debian Squeeze.  Per various instructions, we've disabled
> the DRBD boot init (update-rc.d -f drbd remove) and set the LVM filter
> to filter = [ "a|drbd.*|", "r|.*|" ].

so you only allow scanning for LVM signatures on DRBD ... which would have
to be in Primary mode before LVM can see those signatures at all.

> 
> FYI - we've commented out the LVM mount "/dev/vg2/vserverLV" in our
> fstab, and consistently seem to need to do this to avoid a boot error.
> 
> We think DRBD works until we add in the pacemaker steps (i.e.,
> "dev/drbd1" mounts at boot; we can move related data from server1 to
> server2 back and forth, though need to use the command line to
> accomplish this).  We've seen various statements on the net that suggest
> it is viable to use a "mapper" disk choice in drbd.conf.  Also, if we
> start by configuring Pacemaker for a simple IP failover, that works
> (i.e., no errors, we can ping via the fail over address) but stops
> working when we add in the DRBD primatives and related statements.  Our
> suspicion (other than maybe "you can't do this with existing RAID") is
> that we're using the wrong "disk" statement in our drbd.conf or maybe in
> our "primitive fs_vservers" statement, though we've tried lots of
> alternatives and this is the same drbd.conf we use before adding in
> Pacemaker and it seems to work at that point.
> 
> Lastly, while various config statements refer to "vservers", we have not
> gotten to the point of trying to add any data to the DRBD devices other
> than a few text files that have disappeared since doing our "crm" work.
> 
> Any help appreciated!  Thanks, Ted
> 
> CONFIGS/LOGS
> 
> A) drbd.conf
> 
> global { usage-count no; }
> common { syncer { rate 100M; } }
> #original
> resource r1 {
> protocol C;
> startup {
> wfc-timeout  15;
> degr-wfc-timeout 60;
> }
> device /dev/drbd1 minor 1;
>   disk /dev/vg2/vserverLV;

so vg2/vserverLV is the lower-level device for DRBD, simply let vg2 be
automatically activated and forget that LVM filter thing you did, that
is only needed for vgs sitting _on_ DRBD, not below.

> meta-disk internal;
> 
> # following 2 definition are equivalent
> on server1 {
> address 192.168.1.129:7801;
>  disk /dev/vg2/vserverLV;
> }
> on server2 {
> address 192.168.1.128:7801;
>  disk /dev/vg2/vserverLV;
> #disk /dev/mapper/md2_crypt;
> }
> 
> #   floating 192.168.5.41:7801;
> #   floating 192.168.5.42:7801;
>  net {
> cram-hmac-alg sha1;
> shared-secret "secret";
>   after-sb-0pri discard-younger-primary;
> #discard-zero-changes;
>   after-sb-1pri discard-secondary;
>   after-sb-2pri call-pri-lost-after-sb;
> }
> }
> 
> 
> B) Pacemaker Config
> 
> crm configure show
> node server1
> node server2
> primitive app_ip ocf:heartbeat:IPaddr \
> params ip="192.168.1.152" \
> op monitor interval="30s"
> primitive drbd ocf:linbit:drbd \
> params drbd_resource="r1" \
> op start interval="0" timeout="240" \
> op stop interval="0" timeout="100" \
> op monitor interval="59s" role="Master" timeout="30s" \
> op monitor interval="60s" role="Slave" timeout="30s"
> primitive fs_vservers ocf:heartbeat:Filesystem \
> params device="/dev/drbd1" directory="/vservers" fstype="ext4" \
> op start interval="0" timeout="60" \
> op stop interval="0" timeout="120"
> primitive vg2 ocf:heartbeat:LVM \
> params volgrpname="vg2" exclusive="true" \

simply remove all that LVM stuff from your pacemaker configuration

> op start interval="0" timeout="30" \
> op stop interval="0" timeout="30"
> group lvm app_ip vg2 fs_vservers

ouch .. a group called "lvm", am I the only one who thinks this is
confusing?

> ms ms_drbd drbd \
> meta master-node-max="1" clone-max="2" clone-node-max="1"
> globally-unique="false" notify="true" target-role="Master"
> location drbd_on_node1 ms_drbd \
> rule $id="drbd_on_node1-rule" $role="master" 100: #uname eq server1
> colocation vserver-deps inf: ms_drbd:Master lvm

wrong direction ... you want the group to follow the DRBD master
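
... something like:

colocation vserver-deps inf: lvm ms_drbd:Master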

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now


> order app_on_drbd inf: ms_drbd:promote lvm:start
> property $id="cib-bootstrap-options" \
> dc-version="1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b" \
> cluster-infrastructure="openais" \
> expected-quorum-votes="2" \
> stonith

Re: [Pacemaker] crm in RHEL 6.4 ... where are you?

2013-02-22 Thread Andreas Kurz
On 2013-02-21 23:16, Bob Haxo wrote:
> Greetings,
> 
> Anyone know where "crm" is in RHEL 6.4, or in the most recent set of
> RHEL 6.3 updates?  crm is not included in the latest pacemaker-cli
> package: pacemaker-cli-1.1.8-7.el6.x86_64.rpm

It is here:

http://download.opensuse.org/repositories/network:/ha-clustering/RedHat_RHEL-6/x86_64/

crmsh is now its own project:
http://savannah.nongnu.org/projects/crmsh/

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now




signature.asc
Description: OpenPGP digital signature
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Monitor process, migrate only ip resources

2013-02-19 Thread Andreas Kurz
On 2013-02-19 13:54, Grant Bagdasarian wrote:
> Hello,
> 
>  
> 
> I wish to monitor a certain running process and migrate floating IP
> addresses when this process stops running.
> 
>  
> 
> My current configuration is as following:
> 
> crm(live)configure# show
> 
> node $id="8fe81814-6e85-454f-b77b-5783cc18f4c6" proxy1
> 
> node $id="ceb5c90f-ee6a-44b9-b722-78781f6a61ab" proxy2
> 
> primitive sip_ip ocf:heartbeat:IPaddr \
> 
> params ip="10.0.0.1" cidr_netmask="255.255.255.0" nic="eth1" \
> 
> op monitor interval="40s" timeout="20s"
> 
> primitive sip_ip_2 ocf:heartbeat:IPaddr \
> 
> params ip="10.0.0.2" cidr_netmask="255.255.255.0" nic="eth1" \
> 
> op monitor interval="40s" timeout="20s"
> 
> primitive sip_ip_3 ocf:heartbeat:IPaddr \
> 
> params ip="10.0.0.3" cidr_netmask="255.255.255.0" nic="eth1" \
> 
> op monitor interval="40s" timeout="20s"
> 
> location sip_ip_pref sip_ip 100: proxy1
> 
> location sip_ip_pref_2 sip_ip_2 101: proxy1
> 
> location sip_ip_pref_3 sip_ip_3 102: proxy1
> 
> property $id="cib-bootstrap-options" \
> 
> dc-version="1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd" \
> 
> cluster-infrastructure="Heartbeat" \
> 
> stonith-enabled="false"
> 
>  
> 
> Couple days ago our kamailio process stopped and the ip resources
> weren’t migrated to our secondary node.
> 
> The secondary node already has the kamailio process running.
> 
>  
> 
> How do I configure ha so that the kamailio process is monitored every x
> seconds and when it has stopped the three ip addresses are migrated to
> the secondary node?

Either let Pacemaker manage kamailio as a clone resource or add it as an
unmanaged clone resource only ... either way you then need to colocate
the IP addresses with that clone so they run only where a kamailio clone
instance is up.
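
Untested sketch, assuming an LSB init script called "kamailio":

primitive p_kamailio lsb:kamailio \
  op monitor interval="30s"
clone cl_kamailio p_kamailio
colocation sip_ip_with_kamailio inf: sip_ip cl_kamailio
colocation sip_ip_2_with_kamailio inf: sip_ip_2 cl_kamailio
colocation sip_ip_3_with_kamailio inf: sip_ip_3 cl_kamailio

... add meta is-managed="false" to the clone if Pacemaker should only
monitor kamailio but never start/stop it itself.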

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now


> 
>  
> 
> Grant
> 
> 
> 
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 




___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] return properties and rsc_defaults back to default values

2013-02-14 Thread Andreas Kurz
Hi Brian,

On 2013-02-14 16:48, Brian J. Murrell wrote:
> Is there a way to return an individual property (or all properties)
> and/or a rsc_default (or all) back to default values, using crm, or
> otherwise?

You mean besides deleting it?

Cheers,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] stonith on node-add

2013-01-30 Thread Andreas Kurz
On 2013-01-30 20:51, Matthew O'Connor wrote:
> Hi!  I must be doing something stupidly wrong...  every time I add a new
> node to my live cluster, the first thing the cluster decides to do is
> STONITH the node, and despite any precautions I take (other than
> flat-out disabling STONITH during the reconfiguration).  Is this
> normal?  I'm currently running (sadly) Pacemaker 1.1.5.  It's not a big
> deal, just inconvenient, though it disturbs me regarding the stability
> of the other cluster nodes - not that they go down, but I want to know
> that what I'm doing isn't putting them at risk, either.

this is expected, the default is to fence unseen nodes to "clear" their
state ... the cluster property to influence this behavior is:

startup-fencing (boolean, [true]): STONITH unseen nodes
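
If you really want to turn that off (with the obvious caveat that
unseen/unclean nodes are then not fenced at startup):

crm configure property startup-fencing=false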

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

> 
> Thanks!!
> 
> 
> 
> 
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 







signature.asc
Description: OpenPGP digital signature
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Best way to recover from failed STONITH?

2012-12-21 Thread Andreas Kurz
On 12/21/2012 07:47 PM, Andrew Martin wrote:
> Andreas,
> 
> Thanks for the help. Please see my replies inline below.
> 
> - Original Message -
>> From: "Andreas Kurz" 
>> To: pacemaker@oss.clusterlabs.org
>> Sent: Friday, December 21, 2012 10:11:08 AM
>> Subject: Re: [Pacemaker] Best way to recover from failed STONITH?
>>
>> On 12/21/2012 04:18 PM, Andrew Martin wrote:
>>> Hello,
>>>
>>> Yesterday a power failure took out one of the nodes and its STONITH
>>> device (they share an upstream power source) in a 3-node
>>> active/passive cluster (Corosync 2.1.0, Pacemaker 1.1.8). After
>>> logging into the cluster, I saw that the STONITH operation had
>>> given up in failure and that none of the resources were running on
>>> the other nodes:
>>> Dec 20 17:59:14 [18909] quorumnode   crmd:   notice:
>>> too_many_st_failures:   Too many failures to fence node0 (11),
>>> giving up
>>>
>>> I brought the failed node back online and it rejoined the cluster,
>>> but no more STONITH attempts were made and the resources remained
>>> stopped. Eventually I set stonith-enabled="false" ran killall on
>>> all pacemaker-related processes on the other (remaining) nodes,
>>> then restarted pacemaker, and the resources successfully migrated
>>> to one of the other nodes. This seems like a rather invasive
>>> technique. My questions about this type of situation are:
>>>  - is there a better way to tell the cluster "I have manually
>>>  confirmed this node is dead/safe"? I see there is the meatclient
>>>  command, but can that only be used with the meatware STONITH
>>>  plugin?
>>
>> crm node cleanup quorumnode
> 
> I'm using the latest version of crmsh (1.2.1) but it doesn't seem to support 
> this command:

ah ... sorry, true ... it's the "clearstate" command ... but it does a
"cleanup" ;-)

> root@node0:~# crm --version
> 1.2.1 (Build unknown)
> root@node0:~# crm node
> crm(live)node# help
> 
> Node management and status commands.
> 
> Available commands:
> 
>   status   show nodes' status as XML
>   show show node
>   standby  put node into standby
>   online   set node online
>   fencefence node
>   clearstate   Clear node state
>   delete   delete node
>   attributemanage attributes
>   utilization  manage utilization attributes
>   status-attr  manage status attributes
>   help show help (help topics for list of topics)
>   end  go back one level
>   quit exit the program
> Also, do I run cleanup on just the node that failed, or all of them?

You need to specify a node with this command and you only need/should do
this for the failed node.
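
... i.e. something like (with the name of the failed node):

crm node clearstate <failed-node>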

> 
> 
>>
>>>  - in general, is there a way to force the cluster to start
>>>  resources, if you just need to get them back online and as a
>>>  human have confirmed that things are okay? Something like crm
>>>  resource start rsc --force?
>>
>> ... see above ;-)
> 
> On a related note, is there a way to way to get better information
> about why the cluster is in its current state? For example, in this 
> situation it would be nice to be able to run a command and have the
> cluster print "resources stopped until node XXX can be fenced" to
> be able to quickly assess the problem with the cluster.

yeah ... not all cluster command outputs and logs are user-friendly ;-)
... sorry, I'm not aware of a direct way to get better information, maybe
someone else is?

> 
>>
>>>  - how can I completely clear out saved data for the cluster and
>>>  start over from scratch (last-resort option)? Stopping pacemaker
>>>  and removing everything from /var/lib/pacemaker/cib and
>>>  /var/lib/pacemaker/pengine cleans the CIB, but the nodes end up
>>>  sitting in the "pending" state for a very long time (30 minutes
>>>  or more). Am I missing another directory that needs to be
>>>  cleared?
>>
>> you started with an completely empty cib and the two (or three?)
>> nodes
>> needed 30min to form a cluster?
> Yes, in fact I cleared out both /var/lib/pacemaker/cib and 
> /var/lib/pacemaker/pengine
> several times and most of the times after starting pacemaker again
> one node would become "online" pretty quickly (less than 5 minutes), but the 
> other two 
> would remain "pending" for quite some 

Re: [Pacemaker] Best way to recover from failed STONITH?

2012-12-21 Thread Andreas Kurz
On 12/21/2012 04:18 PM, Andrew Martin wrote:
> Hello,
> 
> Yesterday a power failure took out one of the nodes and its STONITH device 
> (they share an upstream power source) in a 3-node active/passive cluster 
> (Corosync 2.1.0, Pacemaker 1.1.8). After logging into the cluster, I saw that 
> the STONITH operation had given up in failure and that none of the resources 
> were running on the other nodes:
> Dec 20 17:59:14 [18909] quorumnode   crmd:   notice: 
> too_many_st_failures:   Too many failures to fence node0 (11), giving up
> 
> I brought the failed node back online and it rejoined the cluster, but no 
> more STONITH attempts were made and the resources remained stopped. 
> Eventually I set stonith-enabled="false" ran killall on all pacemaker-related 
> processes on the other (remaining) nodes, then restarted pacemaker, and the 
> resources successfully migrated to one of the other nodes. This seems like a 
> rather invasive technique. My questions about this type of situation are:
>  - is there a better way to tell the cluster "I have manually confirmed this 
> node is dead/safe"? I see there is the meatclient command, but can that only 
> be used with the meatware STONITH plugin?

crm node cleanup quorumnode

>  - in general, is there a way to force the cluster to start resources, if you 
> just need to get them back online and as a human have confirmed that things 
> are okay? Something like crm resource start rsc --force?

... see above ;-)

>  - how can I completely clear out saved data for the cluster and start over 
> from scratch (last-resort option)? Stopping pacemaker and removing everything 
> from /var/lib/pacemaker/cib and /var/lib/pacemaker/pengine cleans the CIB, 
> but the nodes end up sitting in the "pending" state for a very long time (30 
> minutes or more). Am I missing another directory that needs to be cleared?

you started with a completely empty cib and the two (or three?) nodes
needed 30 minutes to form a cluster?

> 
> I am going to look into making the power source for the STONITH device 
> independent of the power source for the node itself, however even with that 
> setup there's still a chance that something could take out both power sources 
> at the same time, in which case manual intervention and confirmation that the 
> node is dead would be required.

Pacemaker 1.1.8 supports (again) stonith topologies ... so you can have more
than one fencing device per node and they can be "logically" combined.
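
With the crm shell a two-level setup (IPMI first, a PDU as fallback)
would look roughly like this (device names are made up):

fencing_topology \
  node0: st_ipmi_node0 st_pdu_node0 \
  node1: st_ipmi_node1 st_pdu_node1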

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

> 
> Thanks,
> 
> Andrew
> 
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 







signature.asc
Description: OpenPGP digital signature
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] reloading crm changes

2012-12-17 Thread Andreas Kurz
On 12/18/2012 12:58 AM, Paul Shannon - NOAA Federal wrote:
> Andreas,
> 
> I do have  no-quorum-policy="ignore"  set and  stonith-enabled="false".
> Also, I do have some resources running.  Its just when I tried to add
> another one that I cannot get it to take.

what does "crm_mon -1frA" show?  and of course logs should give all
information needed ...

Regards,
Andreas

> 
> Paul
> 
> -
> Speak the truth, but leave immediately after. - Slovenian proverb//
> /
> /Paul Shannon mailto:paul.shan...@noaa.gov>>
> ITO, WFO Juneau
> NOAA, National Weather Service
> 
> 
> 
> 
> On Mon, Dec 17, 2012 at 11:22 PM, Andreas Kurz  <mailto:andr...@hastexo.com>> wrote:
> 
> On 12/17/2012 11:29 PM, Paul Shannon - NOAA Federal wrote:
> > I'm just getting our cluster set up and seem to be missing something
> > about changes made using the crm program. I added some resources and
> > groups using crm => configure => edit.  After saving and committing my
> > changes I can see the new resources in resource => show but they are
> > stopped.  After running  start   they are still stopped.
> > Also, exiting and running crm_mon does *not* show the new
> resources.  I
> > tried a  clean   just in case, but that did not change
> > anything either.
> 
> By default stonith is enabled  you have configured a
> stonith-resource? If not, resource management is disabled until you do
> ... or disable stonith ... and you need quorum if you don't ignore
> it 
> 
> Regards,
> Andreas
> 
> --
> Need help with Pacemaker?
> http://www.hastexo.com/now
> 
> >
> > I thought the whole idea of the live resources was they took effect
> > immediately. Am I missing a step?
> >
> > Paul Shannon
> > -
> > Speak the truth, but leave immediately after. - Slovenian proverb//
> > /
> > /Paul Shannon  <mailto:paul.shan...@noaa.gov> <mailto:paul.shan...@noaa.gov
> <mailto:paul.shan...@noaa.gov>>>
> > ITO, WFO Juneau
> > NOAA, National Weather Service
> >
> >
> >
> >
> > ___
> > Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> <mailto:Pacemaker@oss.clusterlabs.org>
> > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> >
> > Project Home: http://www.clusterlabs.org
> > Getting started:
> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
> >
> 
> 
> 
> 
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> <mailto:Pacemaker@oss.clusterlabs.org>
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 
> 
> 
> 
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 


-- 
Need help with Pacemaker?
http://www.hastexo.com/now




signature.asc
Description: OpenPGP digital signature
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] reloading crm changes

2012-12-17 Thread Andreas Kurz
On 12/17/2012 11:29 PM, Paul Shannon - NOAA Federal wrote:
> I'm just getting our cluster set up and seem to be missing something
> about changes made using the crm program. I added some resources and
> groups using crm => configure => edit.  After saving and committing my
> changes I can see the new resources in resource => show but they are
> stopped.  After running  start   they are still stopped. 
> Also, exiting and running crm_mon does *not* show the new resources.  I
> tried a  clean   just in case, but that did not change
> anything either. 

By default stonith is enabled ... have you configured a
stonith resource? If not, resource management is disabled until you do
... or until you disable stonith ... and you need quorum if you don't ignore it ...
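
For a first test setup (not something for production) that typically
boils down to:

crm configure property stonith-enabled=false
crm configure property no-quorum-policy=ignore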

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

> 
> I thought the whole idea of the live resources was they took effect
> immediately. Am I missing a step?
> 
> Paul Shannon
> -
> Speak the truth, but leave immediately after. - Slovenian proverb//
> /
> /Paul Shannon mailto:paul.shan...@noaa.gov>>
> ITO, WFO Juneau
> NOAA, National Weather Service
> 
> 
> 
> 
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 





signature.asc
Description: OpenPGP digital signature
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Constraint on location of resource start

2012-11-09 Thread Andreas Kurz
On 11/09/2012 08:49 AM, Cserbák Márton wrote:
> Hi,
> 
> I have successfully set up a DRBD+Pacemaker+Xen cluster on two Debian
> servers. Unfortunately, I am facing the same issue as the one
> described in https://bugzilla.redhat.com/show_bug.cgi?id=694492,
> namely, that the CPU features differ on the two servers. When the domU
> resources, that were originally created on the newer server get
> migrated to the older server, they usually crash with libc trap
> messages on the console. Live migration of domUs originally created on
> the older server is without problem in both ways.
> 
> What I was trying to figure out, if there was any way to configure
> pacemaker to allow creation of domUs on the older server only, and
> allow migration of resources to the newer server.

a simple location constraint preferring the older server should be
sufficient ... for most cases ...
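
Something along the lines of (sketch, made-up resource/node names):

location prefer-old-dom0 vm_mydomU 100: oldserver

... the finite score still allows migration to the newer server when the
old one is not available.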

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

> 
> Any help is appreciated. Best regards,
> 
> Márton Cserbák
> 
> 
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org






signature.asc
Description: OpenPGP digital signature
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] MySQL and PostgreSQL on same node with DRBD and floating IPs: suggestions wanted

2012-10-27 Thread Andreas Kurz
On 10/26/2012 02:00 PM, Denny Schierz wrote:
> hi,
> 
> I'm playing with Pacemaker from Debian squeeze-backports to get failover 
> running, for PostgreSQL and MySQL(mariaDB) on the same node with two DRBD 
> resources and two floating VIPs. It seems to be working, the failover works, 
> but I want some suggestions, if my config is O.K, or what I should do better, 
> so any suggestions are welcome :-)
> 
> corosync and drbd communicates over a dedicated interface (eth1 / eth2) ...
> 
> here we are:
> 
> ==
> root@SQLNODE-02:~# crm configure show
> node SQLNODE-01
> node SQLNODE-02
> primitive drbd-mysql ocf:linbit:drbd \
>   params drbd_resource="mysql" \
>   op monitor interval="29s" role="Master" \
>   op monitor interval="31s" role="Slave"
> primitive drbd-postgres ocf:linbit:drbd \
>   params drbd_resource="postgres" \
>   op monitor interval="29s" role="Master" \
>   op monitor interval="31s" role="Slave"
> primitive mysql-fs ocf:heartbeat:Filesystem \
>   params device="/dev/drbd1" directory="/var/lib/mysql" fstype="ext4"
> primitive mysql-ip ocf:heartbeat:IPaddr2 \
>   params ip="192.168.1.10" nic="eth0" cidr_netmask="24" \
>   op monitor interval="10s"
> primitive mysqld lsb:mysql
> primitive postgres-fs ocf:heartbeat:Filesystem \
>   params device="/dev/drbd0" directory="/var/lib/postgresql" fstype="ext4"
> primitive postgres-ip ocf:heartbeat:IPaddr2 \
>   params ip="192.168.1.11" nic="eth0" cidr_netmask="24" \
>   op monitor interval="10s"
> primitive postgresql ocf:heartbeat:pgsql \
>   params pgctl="/usr/lib/postgresql/9.1/bin/pg_ctl" 
> psql="/usr/lib/postgresql/9.1/bin/psql" 
> pgdata="/var/lib/postgresql/9.1/main/" 
> logfile="/var/log/postgresql/postgresql-9.1-main.log" \
>   op monitor interval="30" timeout="30" depth="0"
> group mysql mysql-fs mysql-ip mysqld
> group postgres postgres-fs postgres-ip postgresql
> ms ms-drbd-mysql drbd-mysql \
>   meta master-max="1" master-node-max="1" clone-max="2" 
> clone-node-max="1" notify="true"
> ms ms-drbd-postgres drbd-postgres \
>   meta master-max="1" master-node-max="1" clone-max="2" 
> clone-node-max="1" notify="true"
> location master-prefer-node1 postgres 50: SQLNODE-01
> colocation all-db-ips inf: mysql postgres

this mandatory colocation prevents mysql from running if the postgres group
can't run ... ok, if this was your intention ... and you need to
colocate each group with the Master role of its drbd device
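
... i.e. roughly:

colocation mysql-on-drbd inf: mysql ms-drbd-mysql:Master
colocation postgres-on-drbd inf: postgres ms-drbd-postgres:Master
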

> order mysql-after-drbd inf: ms-drbd-mysql:promote mysql:start
> order postgres-after-drbd inf: ms-drbd-postgres:promote postgres:start
> property $id="cib-bootstrap-options" \
>   dc-version="1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff" \
>   cluster-infrastructure="openais" \
>   expected-quorum-votes="2" \
>   no-quorum-policy="ignore" \
>   stonith-enabled="false" \
>   default-resource-stickiness="1" \
>   last-lrm-refresh="1351243858"
> = 
> 
> stonith is missing .. I should definitely  enable it ...

yes, you should ;-)
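
Once a working fencing primitive exists for every node, enabling it is a
one-liner (sketch):

crm configure property stonith-enabled=true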

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

> 
> 
> So, what might be wrong?
> 
> cu denny
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 







signature.asc
Description: OpenPGP digital signature
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Behavior of Corosync+Pacemaker with DRBD primary power loss

2012-10-25 Thread Andreas Kurz
On 10/24/2012 04:03 PM, Andrew Martin wrote:
> Hi Andreas,
> 
> - Original Message -
>> From: "Andreas Kurz" 
>> To: pacemaker@oss.clusterlabs.org
>> Sent: Wednesday, October 24, 2012 4:13:03 AM
>> Subject: Re: [Pacemaker] Behavior of Corosync+Pacemaker with DRBD primary 
>> power  loss
>>
>> On 10/23/2012 05:04 PM, Andrew Martin wrote:
>>> Hello,
>>>
>>> Under the Clusters from Scratch documentation, allow-two-primaries
>>> is
>>> set in the DRBD configuration for an active/passive cluster:
>>> http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1-crmsh/html-single/Clusters_from_Scratch/index.html#_write_the_drbd_config
>>>
>>> "TODO: Explain the reason for the allow-two-primaries option"
>>>
>>> Is the reason for allow-two-primaries in this active/passive
>>> cluster
>>> (using ext4, a non-cluster filesystem) to allow for failover in the
>>> type
>>> of situation I have described (where the old primary/master is
>>> suddenly
>>> offline like with a power supply failure)? Are split-brains
>>> prevented
>>> because Pacemaker ensures that only one node is promoted to Primary
>>> at
>>> any time?
>>
>> no "allow-two-primaries" needed in an active/passive setup, the
>> fence-handler (executed on the Primary if connection to Secondary is
>> lost) inserts a location-constraint into the Pacemaker configuration
>> so
>> the cluster does not even "think about" promoting an outdated
>> Secondary
>>
>>>
>>> Is it possible to recover from such a failure without
>>> allow-two-primaries?
>>
>> Yes. If you only disconnect DRBD as in your test described below and
>> cluster communication over redundant network is still possible (and
>> Pacemaker is up and running), the Primary will insert that
>> location-constraint and prevent a Secondary from becoming Primary
>> because the constraint is already placed ... if Pacemaker is _not_
>> running during your disconnection test, you also receive an error
>> because obviously it is also impossible to place that constraint.
>>
> 
> What about the situation where the primary, node0, is running alone fine but 
> then its power supply fails (or a kernel panic, or some other critical 
> hardware failure), resulting in it instantly being shut off? The resources 
> should failover to the secondary node, node1, however node1's DRBD device 
> will have the following state:
> 
> Role:
> Secondary/Unknown
> 
> Disk State:
> UpToDate/DUnknown
> 
> Connection State:
> WFConnection
> 
> DRBD will refuse to allow this node to be promoted to primary:
> 0: State change failed: (-7) Refusing to be Primary while peer is not outdated
> Command 'drbdsetup 0 primary' terminated with exit code 11
> 
> Does Pacemaker have some mechanism for (on node1) being able to outdate 
> node0, the old master/primary, in order to promote the DRBD resource?
> 

You mean you are already running a disconnected Primary for a longer time
and then it dies? That cannot be handled automatically by Pacemaker if
you set up fencing in DRBD ... and I doubt you want that, because the
disconnected Secondary will most probably be at a data version
"somewhere in the past", as it received no updates since the
disconnect ... if this is your only valid data source and the lost
Primary is unable to recover, you could manually force DRBD to become
Primary ... but better to have your monitoring detect a disconnected
DRBD resource early and fix it ;-)
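
If you ever have to do that, the forced promotion looks roughly like this
(DRBD 8.3 syntax; "r0" is a placeholder for your resource name, and only
run it once you are sure the old Primary and its data are really gone):

drbdadm -- --overwrite-data-of-peer primary r0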

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

> Thanks,
> 
> Andrew
> 
> 
>> Regards,
>> Andreas
>>
>> --
>> Need help with Pacemaker?
>> http://www.hastexo.com/now
>>>
>>> Thanks,
>>>
>>> Andrew
>>>
>>> 
>>> *From: *"Andrew Martin" 
>>> *To: *"The Pacemaker cluster resource manager"
>>> 
>>> *Sent: *Friday, October 19, 2012 10:45:04 AM
>>> *Subject: *[Pacemaker] Behavior of Corosync+Pacemaker with DRBD
>>> primary
>>> powerloss
>>>
>>> Hello,
>>>
>>> I have a 3 node Pacemaker + Corosync cluster with 2 "real" nodes,
>>> node0
>>> and node1, running a DRBD resource (single-primary) and the 3rd
>>> node in
>>> standby acting as a quorum node. If node0 were running the DRBD
>>> resource, and thus is DRBD

Re: [Pacemaker] setup of a cluster - principal questions

2012-10-24 Thread Andreas Kurz
On 10/24/2012 11:47 AM, Lentes, Bernd wrote:
> Hi,
> 
> i'd like to establish a HA Cluster with two nodes. I will use SLES 11 SP2 + 
> HAE.
> I have a shared storage, it's a FC SAN. My services will run in vm's, one vm 
> for one service. The vm's will run using KVM.
> First i thought to install the VM's in plain partitions, without any 
> filesystem. I read that this will be faster.
> But i tried installing two vm's, each one in a raw file, and this is fast 
> enough for me.
> On the shared storage i will create OCFS as filesystem.

Your servers have a (mandatory) STONITH device like IPMI or iLO?

> 
> Now my questions:
> - is live migration possible with this setup ?

Yes

> - is it possible to run some vm's on node 1 and some others on node 2 ? As a 
> kind of load balancing ?
> - what do you think about my setup ?

Definitely possible, and a very common setup.
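
For spreading the VMs over both nodes, soft location preferences are
usually enough; a sketch with placeholder VM resource and node names:

location vm_web-prefers-node1 vm_web 100: node1
location vm_db-prefers-node2 vm_db 100: node2

The scores are low on purpose, so the VMs can still fail over to the other
node if their preferred one dies.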

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now
> 
> 
> 
> Thanks for amy answer.
> 
> 
> Bernd
> 
> 
> --
> Bernd Lentes
> 
> Systemadministration
> Institut für Entwicklungsgenetik
> Gebäude 35.34 - Raum 208
> HelmholtzZentrum münchen
> bernd.len...@helmholtz-muenchen.de
> phone: +49 89 3187 1241
> fax:   +49 89 3187 2294
> http://www.helmholtz-muenchen.de/idg
> 
> Wir sollten nicht den Tod fürchten, sondern
> das schlechte Leben
> 
> Helmholtz Zentrum München
> Deutsches Forschungszentrum für Gesundheit und Umwelt (GmbH)
> Ingolstädter Landstr. 1
> 85764 Neuherberg
> www.helmholtz-muenchen.de
> Aufsichtsratsvorsitzende: MinDir´in Bärbel Brumme-Bothe
> Geschäftsführer: Prof. Dr. Günther Wess und Dr. Nikolaus Blum
> Registergericht: Amtsgericht München HRB 6466
> USt-IdNr: DE 129521671
> 
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 







signature.asc
Description: OpenPGP digital signature
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Behavior of Corosync+Pacemaker with DRBD primary power loss

2012-10-24 Thread Andreas Kurz
On 10/23/2012 05:04 PM, Andrew Martin wrote:
> Hello,
> 
> Under the Clusters from Scratch documentation, allow-two-primaries is
> set in the DRBD configuration for an active/passive cluster:
> http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1-crmsh/html-single/Clusters_from_Scratch/index.html#_write_the_drbd_config
> 
> "TODO: Explain the reason for the allow-two-primaries option"
> 
> Is the reason for allow-two-primaries in this active/passive cluster
> (using ext4, a non-cluster filesystem) to allow for failover in the type
> of situation I have described (where the old primary/master is suddenly
> offline like with a power supply failure)? Are split-brains prevented
> because Pacemaker ensures that only one node is promoted to Primary at
> any time?

no "allow-two-primaries" needed in an active/passive setup, the
fence-handler (executed on the Primary if connection to Secondary is
lost) inserts a location-constraint into the Pacemaker configuration so
the cluster does not even "think about" to promote an outdated Secondary
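
For reference, the usual drbd.conf bits for that handler look roughly like
this (paths as shipped with the DRBD 8.3 packages; check the user guide of
your DRBD version for the exact section placement):

  disk {
    fencing resource-only;
  }
  handlers {
    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }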

> 
> Is it possible to recover from such a failure without allow-two-primaries?

Yes. If you only disconnect DRBD as in your test described below and
cluster communication over the redundant network is still possible (and
Pacemaker is up and running), the Primary will insert that location
constraint and prevent a Secondary from becoming Primary because the
constraint is already placed ... if Pacemaker is _not_ running during
your disconnection test, you also receive an error because obviously it
is then impossible to place that constraint.

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now
> 
> Thanks,
> 
> Andrew
> 
> 
> *From: *"Andrew Martin" 
> *To: *"The Pacemaker cluster resource manager"
> 
> *Sent: *Friday, October 19, 2012 10:45:04 AM
> *Subject: *[Pacemaker] Behavior of Corosync+Pacemaker with DRBD primary
> powerloss
> 
> Hello,
> 
> I have a 3 node Pacemaker + Corosync cluster with 2 "real" nodes, node0
> and node1, running a DRBD resource (single-primary) and the 3rd node in
> standby acting as a quorum node. If node0 were running the DRBD
> resource, and thus is DRBD primary, and its power supply fails, will the
> DRBD resource be promoted to primary on node1?
> 
> If I simply cut the DRBD replication link, node1 reports the following
> state:
> Role:
> Secondary/Unknown
> 
> Disk State:
> UpToDate/DUnknown
> 
> Connection State:
> WFConnection
> 
> 
> I cannot manually promote the DRBD resource because the peer is not
> outdated:
> 0: State change failed: (-7) Refusing to be Primary while peer is not
> outdated
> Command 'drbdsetup 0 primary' terminated with exit code 11
> 
> I have configured the CIB-based crm-fence-peer.sh utility in my drbd.conf
> fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
> but I do not believe it would be applicable in this scenario.
> 
> If node0 goes offline like this and doesn't come back (e.g. after a
> STONITH), does Pacemaker have a way to tell node1 that its peer is
> outdated and to proceed with promoting the resource to primary?
> 
> Thanks,
> 
> Andrew
> 
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 
> 
> 
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 






signature.asc
Description: OpenPGP digital signature
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] "Simple" LVM/drbd backed Primary/Secondary NFS cluster doesn't always failover cleanly

2012-10-20 Thread Andreas Kurz
On 10/18/2012 08:02 PM, Justin Pasher wrote:
> I have a pretty basic setup by most people's standards, but there must
> be something that is not quite right about it. Sometimes when I force a
> resource failover from one server to the other, the clients with the NFS
> mounts don't cleanly migrate to the new server. I configured this using
> a few different "Pacemaker-DRBD-NFS" guides out there for reference (I
> believe they were the Linbit guides).

Are you using the latest "exportfs" resource agent from the GitHub repo?
... there have been bugfixes/improvements ... and try to move the VIP for
each export to the end of its group, so the IP the clients connect to
is started last/stopped first.
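
A sketch of that member order for one of your exports (the group name is
only an example, adjust it to your existing group IDs):

group g_xen_data1 p_fs_xen_data1 p_exportfs_xen_data1 p_ip_xen_data1

Group members start in the listed order and stop in reverse, so the VIP
comes up last and goes away first.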

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now


> 
> Sorry in advance for the long email.
> 
> Here is the config:
> --
> --
> * Two identical servers
> * Four exported NFS shares total (so I can independently fail over
> individual shares and run half on one server and half on the other)
> * Bonded interface using LACP for "outgoing" client access
> * Direct ethernet connection between the two servers (for
> Pacemaker/Corosync and DRBD)
> 
> Package versions (installed from either Debian Squeeze or Backports)
> * lvm 2.02.66-5
> * drbd 8.3.7-2.1
> * nfs-kernel-server 1.2.2-4squeeze2
> * pacemaker 1.1.7-1~bpo60+1
> * corosync 1.4.2-1~bpo60+1
> 
> Each NFS share is created using the same component format and has its
> own virtual IP.
> 
> Hardware RAID -> /dev/sdb -> LVM -> DRBD single master (one resource for
> each share)
> 
> 
> Here is the pacemaker config (I really hope it doesn't get mangled):
> 
> node storage1 \
> attributes standby="off"
> node storage2 \
> attributes standby="off"
> primitive p_drbd_distribion_storage ocf:linbit:drbd \
> params drbd_resource="distribion-storage" \
> op monitor interval="15" role="Master" \
> op monitor interval="30" role="Slave"
> primitive p_drbd_vni_storage ocf:linbit:drbd \
> params drbd_resource="vni-storage" \
> op monitor interval="15" role="Master" \
> op monitor interval="30" role="Slave"
> primitive p_drbd_xen_data1 ocf:linbit:drbd \
> params drbd_resource="xen-data1" \
> op monitor interval="15" role="Master" \
> op monitor interval="30" role="Slave"
> primitive p_drbd_xen_data2 ocf:linbit:drbd \
> params drbd_resource="xen-data2" \
> op monitor interval="15" role="Master" \
> op monitor interval="30" role="Slave"
> primitive p_exportfs_distribion_storage ocf:heartbeat:exportfs \
> params fsid="1" directory="/data/distribion-storage"
> options="rw,async,no_root_squash,subtree_check"
> clientspec="10.205.152.0/21" wait_for_leasetime_on_stop="false" \
> op monitor interval="30s"
> primitive p_exportfs_vni_storage ocf:heartbeat:exportfs \
> params fsid="2" directory="/data/vni-storage"
> options="rw,async,no_root_squash,subtree_check"
> clientspec="10.205.152.0/21" wait_for_leasetime_on_stop="false" \
> op monitor interval="30s"
> primitive p_exportfs_xen_data1 ocf:heartbeat:exportfs \
> params fsid="3" directory="/data/xen-data1"
> options="rw,async,no_root_squash,subtree_check"
> clientspec="10.205.152.0/21" wait_for_leasetime_on_stop="false" \
> op monitor interval="30s"
> primitive p_exportfs_xen_data2 ocf:heartbeat:exportfs \
> params fsid="4" directory="/data/xen-data2"
> options="rw,async,no_root_squash,subtree_check"
> clientspec="10.205.152.0/21" wait_for_leasetime_on_stop="false" \
> op monitor interval="30s"
> primitive p_fs_distribion_storage ocf:heartbeat:Filesystem \
> params fstype="xfs" directory="/data/distribion-storage"
> device="/dev/drbd1" \
> meta target-role="Started"
> primitive p_fs_vni_storage ocf:heartbeat:Filesystem \
> params fstype="xfs" directory="/data/vni-storage" device="/dev/drbd2"
> primitive p_fs_xen_data1 ocf:heartbeat:Filesystem \
> params fstype="xfs" directory="/data/xen-data1" device="/dev/drbd3" \
> meta target-role="Started"
> primitive p_fs_xen_data2 ocf:heartbeat:Filesystem \
> params fstype="xfs" directory="/data/xen-data2" device="/dev/drbd4" \
> meta target-role="Started"
> primitive p_ip_distribion_storage ocf:heartbeat:IPaddr2 \
> params ip="10.205.154.137" cidr_netmask="21" \
> op monitor interval="20s"
> primitive p_ip_vni_storage ocf:heartbeat:IPaddr2 \
> params ip="10.205.154.138" cidr_netmask="21" \
> op monitor interval="20s"
> primitive p_ip_xen_data1 ocf:heartbeat:IPaddr2 \
> params ip="10.205.154.139" cidr_netmask="21" \
> op monitor interval="20s"
> primitive p_ip_xen_data2 ocf:heartbeat:IPaddr2 \
> params ip="10.205.154.140" cidr_netmask="21" \
> op monitor interval="20s"
> primitive p_lsb_nfsserver lsb:nfs-kernel-server \
> op monitor interval="30s"
> primitive p_ping ocf:pacemaker:ping \
> params host_list="10.205.154.66" multip

Re: [Pacemaker] resource doesn't migrate after failcount is reached

2012-10-20 Thread Andreas Kurz
On 10/20/2012 12:53 PM, emmanuel segura wrote:
> Hello List
> 
> I have a stand alone resource and one group, i would like that when the
> stand alone resource reaches the failcount, the group doesn't migrate
> and the stand alone stays on the node where the group is situated

Then don't set a migration-threshold for that resource ... or maybe I
misunderstand what you want to achieve
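
Since the threshold in your config comes from rsc_defaults, a sketch of
exempting just the stand-alone resource (dovecot) would be:

primitive dovecot lsb:dovecot \
  op monitor interval="20s" \
  meta migration-threshold="INFINITY"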

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now


> 
> This is my test conf
> ~~~
> node suse01
> node suse02
> primitive apache lsb:apache2 \
> op monitor interval="60s"
> primitive dovecot lsb:dovecot \
> op monitor interval="20s"
> primitive dummy ocf:heartbeat:Dummy
> group apachegroup dummy apache
> colocation mycolo inf: dovecot apachegroup
> property $id="cib-bootstrap-options" \
> dc-version="1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c" \
> cluster-infrastructure="openais" \
> expected-quorum-votes="2" \
> stonith-enabled="false" \
> no-quorum-policy="ignore" \
> last-lrm-refresh="1350729134"
> rsc_defaults $id="rsc-options" \
> migration-threshold="4"
> ~
> 
> Thanks
> 
> 
> -- 
> esta es mi vida e me la vivo hasta que dios quiera
> 
> 
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 





signature.asc
Description: OpenPGP digital signature
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] high cib load on config change

2012-10-09 Thread Andreas Kurz
On 10/09/2012 01:42 PM, James Harper wrote:
> As per previous post, I'm seeing very high cib load whenever I make a 
> configuration change, enough load that things timeout seemingly instantly. I 
> thought this was happening well before the configured timeout but now I'm not 
> so sure, maybe the timeouts are actually working okay and it just seems 
> instant. If the timeouts are in fact working correctly then it's keeping the 
> CPU at 100% for over 30 seconds to the exclusion of any monitoring checks (or 
> maybe locking the cib so the checks can't run?)
> 
> When I make a change I see the likes of this sort of thing in the logs (see 
> data below email), which I thought might be solved by this 
> https://github.com/ClusterLabs/pacemaker/commit/10e9e579ab032bde3938d7f3e13c414e297ba3e9
>  but i just checked the 1.1.7 source that the Debian packages are built from 
> and it turns out that that patch already exists in 1.1.7.
> 
> Are the messages below actually an indication of a problem? If I understand 
> it correctly it's failing to apply the configuration diff and is instead 
> forcing a full resync of the configuration across some or all nodes, which is 
> causing the high load.
> 
> I ran the crm_report but it includes a lot of information I really need to 
> remove so I'm reluctant to submit it in full unless it really all is required 
> to resolve the problem.
> 

Have you already done some tuning, like increasing the batch-limit in your
cluster properties and the corosync timings? Hard to say more without
more information ... if your configuration details are too sensitive to
post on a public mailing list you can of course hire someone and share
that information under NDA.
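
For reference, such a property is set like this (the value is only an
example, not a recommendation for your cluster):

crm configure property batch-limit=30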

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

> Thanks
> 
> James
> 
> Oct  9 21:35:30 bitvs2 cib: [6185]: info: apply_xml_diff: Digest mis-match: 
> expected e7f7aaa1eb10c7a633e94da57dfda2ac, calculated 
> 445109490690d53e024c333fac6ab4c9
> Oct  9 21:35:30 bitvs2 cib: [6185]: notice: cib_process_diff: Diff 0.1354.85 
> -> 0.1354.86 not applied to 0.1354.85: Failed application of an update diff
> Oct  9 21:35:30 bitvs2 cib: [6185]: info: cib_server_process_diff: Requesting 
> re-sync from peer
> Oct  9 21:35:30 bitvs2 cib: [6185]: notice: cib_server_process_diff: Not 
> applying diff 0.1354.85 -> 0.1354.86 (sync in progress)
> Oct  9 21:35:30 bitvs2 cib: [6185]: notice: cib_server_process_diff: Not 
> applying diff 0.1354.86 -> 0.1354.87 (sync in progress)
> Oct  9 21:35:30 bitvs2 cib: [6185]: notice: cib_server_process_diff: Not 
> applying diff 0.1354.86 -> 0.1354.87 (sync in progress)
> Oct  9 21:35:30 bitvs2 cib: [6185]: notice: cib_server_process_diff: Not 
> applying diff 0.1354.86 -> 0.1354.87 (sync in progress)
> Oct  9 21:35:30 bitvs2 cib: [6185]: notice: cib_server_process_diff: Not 
> applying diff 0.1354.87 -> 0.1355.1 (sync in progress)
> Oct  9 21:35:30 bitvs2 cib: [6185]: info: cib_process_diff: Diff 0.1355.1 -> 
> 0.1355.2 not applied to 0.1354.85: current "epoch" is less than required
> Oct  9 21:35:30 bitvs2 cib: [6185]: info: cib_server_process_diff: Requesting 
> re-sync from peer
> Oct  9 21:35:33 bitvs2 cib: [6185]: info: apply_xml_diff: Digest mis-match: 
> expected b77fae3dc1e835e0d6a3d1a305d262cb, calculated 
> 120fcac6996ff9f5148f69712fc54689
> Oct  9 21:35:33 bitvs2 cib: [6185]: notice: cib_process_diff: Diff 0.1357.7 
> -> 0.1357.8 not applied to 0.1357.7: Failed application of an update diff
> Oct  9 21:35:33 bitvs2 cib: [6185]: info: cib_server_process_diff: Requesting 
> re-sync from peer
> Oct  9 21:35:33 bitvs2 cib: [6185]: notice: cib_server_process_diff: Not 
> applying diff 0.1357.7 -> 0.1357.8 (sync in progress)
> Oct  9 21:35:33 bitvs2 cib: [6185]: notice: cib_server_process_diff: Not 
> applying diff 0.1357.8 -> 0.1358.1 (sync in progress)
> Oct  9 21:35:33 bitvs2 cib: [6185]: notice: cib_server_process_diff: Not 
> applying diff 0.1358.1 -> 0.1358.2 (sync in progress)
> Oct  9 21:35:33 bitvs2 cib: [6185]: notice: cib_server_process_diff: Not 
> applying diff 0.1358.2 -> 0.1358.3 (sync in progress)
> Oct  9 21:35:33 bitvs2 cib: [6185]: notice: cib_server_process_diff: Not 
> applying diff 0.1358.3 -> 0.1359.1 (sync in progress)
> Oct  9 21:35:33 bitvs2 cib: [6185]: info: cib_process_diff: Diff 0.1359.1 -> 
> 0.1359.2 not applied to 0.1357.7: current "epoch" is less than required
> Oct  9 21:35:33 bitvs2 cib: [6185]: info: cib_server_process_diff: Requesting 
> re-sync from peer
> Oct  9 21:35:33 bitvs2 cib: [6185]: notice: cib_server_process_diff: Not 
> applying diff 0.1359.2 -> 0.1359.3 (sync in progress)
> Oct  9 21:35:33 bitvs2 cib: [6185]: notice: cib_server_process_diff: Not 
> applying diff 0.1359.3 -> 0.1359.4 (sync in progress)
> Oct  9 21:35:33 bitvs2 cib: [6185]: notice: cib_server_process_diff: Not 
> applying diff 0.1359.4 -> 0.1359.5 (sync in progress)
> Oct  9 21:35:33 bitvs2 cib: [6185]: notice: cib_server_process_diff: Not 
> applying di

Re: [Pacemaker] Resource agent IPaddr2 failed to start

2012-10-09 Thread Andreas Kurz
On 10/09/2012 11:17 AM, Soni Maula Harriz wrote:
> 
> 
> On Tue, Oct 9, 2012 at 4:01 PM, Andreas Kurz  <mailto:andr...@hastexo.com>> wrote:
> 
> On 10/09/2012 10:39 AM, Soni Maula Harriz wrote:
> > Dear all,
> >
> > I'm a newbie in clustering. I have been following the 'Cluster from
> > scratch' tutorial.
> > I use Centos 6.3 and install pacemaker and corosync from : yum install
> > pacemaker corosync
> >
> > This is the version i got
> > Pacemaker 1.1.7-6.el6
> > Corosync Cluster Engine, version '1.4.1'
> >
> > This time i have been this far : adding IPaddr2 resource
> >
> 
> (http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/_adding_a_resource.html)
> > everything goes well before adding IPaddr2 resource.
> > when i run 'crm status', it print out
> >
> > 
> > Last updated: Tue Oct  9 14:58:30 2012
> > Last change: Tue Oct  9 13:53:41 2012 via cibadmin on cluster1
> > Stack: openais
> > Current DC: cluster1 - partition with quorum
> > Version: 1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14
> > 2 Nodes configured, 2 expected votes
> > 1 Resources configured.
> > 
> >
> > Online: [ cluster1 cluster2 ]
> >
> >
> > Failed actions:
> > ClusterIP_start_0 (node=cluster1, call=3, rc=6, status=complete):
> > not configured
> >
> >
> > This is the error i got from /var/log/message
> > Oct  8 17:07:16 cluster2 IPaddr2(ClusterIP)[15969]: ERROR:
> > [/usr/lib64/heartbeat/findif -C] failed
> > Oct  8 17:07:16 cluster2 crmd[15937]:  warning: status_from_rc:
> Action 4
> > (ClusterIP_start_0) on cluster2 failed (target: 0 vs. rc: 6): Error
> > Oct  8 17:07:16 cluster2 pengine[15936]:error: unpack_rsc_op:
> > Preventing ClusterIP from re-starting anywhere in the cluster :
> > operation start failed 'not configured' (rc=6)
> 
> Not enough information ... please share your configuration (crm
> configure show) 
> 
>  
> This is the configuration :
> crm configure show
> node cluster1
> node cluster2
> primitive ClusterIP ocf:heartbeat:IPaddr2 \
> params ip="xxx.xxx.xxx.289" cidr_netmask="32" \

289??? ... hopefully this is just a typo?

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

> op monitor interval="30s"
> property $id="cib-bootstrap-options" \
> dc-version="1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14" \
> cluster-infrastructure="openais" \
> expected-quorum-votes="2" \
> stonith-enabled="false"
>  
> 
> and the result of "ip a sh". 
> 
> 
> # ip a sh
> 1: lo:  mtu 16436 qdisc noqueue state UNKNOWN
> link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
> inet 127.0.0.1/8 <http://127.0.0.1/8> scope host lo
> inet6 ::1/128 scope host
>valid_lft forever preferred_lft forever
> 2: eth0:  mtu 1500 qdisc pfifo_fast
> state UP qlen 1000
> link/ether 08:00:27:07:fd:fd brd ff:ff:ff:ff:ff:ff
> inet xxx.xxx.xxx.235/24 brd xxx.xxx.xxx.255 scope global eth0
> inet6 fe80::a00:27ff:fe07:fdfd/64 scope link
>valid_lft forever preferred_lft forever
>  
> 
> You need at least an
> interface that is up to bind that IP, typically it is already configured
> with a static IP in the same network as the cluster ip.
> 
> Regards,
> Andreas
> 
> --
> Need help with Pacemaker?
> http://www.hastexo.com/now
> >
> >
> > I have been searching through the google, but can't find the right
> > solution for my problem.
> > I have stopped the firewall and disabled the SElinux.
> > Any help would be appreciated.
> >
> > --
> > Best Regards,
> >
> > Soni Maula Harriz
> > Database Administrator
> > PT. Data Aksara Sangkuriang
> >
> >
> >
> > ___
> > Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> <mailto:Pacemaker@oss.clusterlabs.org>
> > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> >
> > Project Home: http://www.clusterlabs.org
> > Getting started:
> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pd

Re: [Pacemaker] Resource agent IPaddr2 failed to start

2012-10-09 Thread Andreas Kurz
On 10/09/2012 10:39 AM, Soni Maula Harriz wrote:
> Dear all,
> 
> I'm a newbie in clustering. I have been following the 'Cluster from
> scratch' tutorial.
> I use Centos 6.3 and install pacemaker and corosync from : yum install
> pacemaker corosync
> 
> This is the version i got
> Pacemaker 1.1.7-6.el6
> Corosync Cluster Engine, version '1.4.1'
> 
> This time i have been this far : adding IPaddr2 resource
> (http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/_adding_a_resource.html)
> everything goes well before adding IPaddr2 resource.
> when i run 'crm status', it print out
> 
> 
> Last updated: Tue Oct  9 14:58:30 2012
> Last change: Tue Oct  9 13:53:41 2012 via cibadmin on cluster1
> Stack: openais
> Current DC: cluster1 - partition with quorum
> Version: 1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14
> 2 Nodes configured, 2 expected votes
> 1 Resources configured.
> 
> 
> Online: [ cluster1 cluster2 ]
> 
> 
> Failed actions:
> ClusterIP_start_0 (node=cluster1, call=3, rc=6, status=complete):
> not configured
> 
> 
> This is the error i got from /var/log/message
> Oct  8 17:07:16 cluster2 IPaddr2(ClusterIP)[15969]: ERROR:
> [/usr/lib64/heartbeat/findif -C] failed
> Oct  8 17:07:16 cluster2 crmd[15937]:  warning: status_from_rc: Action 4
> (ClusterIP_start_0) on cluster2 failed (target: 0 vs. rc: 6): Error
> Oct  8 17:07:16 cluster2 pengine[15936]:error: unpack_rsc_op:
> Preventing ClusterIP from re-starting anywhere in the cluster :
> operation start failed 'not configured' (rc=6)

Not enough information ... please share your configuration (crm
configure show) and the output of "ip a sh". You need at least an
interface that is up to bind that IP to; typically it is already
configured with a static IP in the same network as the cluster IP.
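
For comparison, a minimal working sketch (address, netmask and nic below
are placeholders; the IP must be valid and lie in the subnet of an
interface that is up):

primitive ClusterIP ocf:heartbeat:IPaddr2 \
  params ip="192.168.122.100" cidr_netmask="24" nic="eth0" \
  op monitor interval="30s"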

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now
> 
> 
> I have been searching through the google, but can't find the right
> solution for my problem.
> I have stopped the firewall and disabled the SElinux.
> Any help would be appreciated.
> 
> -- 
> Best Regards,
> 
> Soni Maula Harriz
> Database Administrator
> PT. Data Aksara Sangkuriang
> 
> 
> 
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 






signature.asc
Description: OpenPGP digital signature
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Active/Active Clustering using GFS2 in Centos-6.2

2012-10-04 Thread Andreas Kurz
On 09/19/2012 10:51 AM, ecfgijn wrote:
> Hi All ,
> 
> I have configured active/active using pacemaker in centos-6.2 along with
> gfs2. Below are configuration.
> 
> crm configure primitive ClusterIP ocf:heartbeat:IPaddr2 params
> ip="169.144.106.121" cidr_netmask="26" op monitor interval=30s
> crm configure primitive WebSite ocf:heartbeat:apache params
> configfile=/etc/httpd/conf/httpd.conf op monitor interval=1min
> crm configure op_defaults timeout=240s
> crm configure colocation website-with-ip INFINITY: WebSite ClusterIP
> 
> crm
> cib new GFS2
> configure primitive WebFS ocf:heartbeat:Filesystem params
> device="/dev/mydisk" directory="/var/www/html" fstype="gfs2"
> configure colocation WebSite-with-WebFS inf: WebSite WebFS
> configure colocation fs_on_drbd inf: WebFS WebDataClone:Master
> configure order WebFS-after-WebData inf: WebDataClone:promote WebFS:start
> configure order WebSite-after-WebFS inf: WebFS WebSite
> configure show
> cib commit GFS2
> 
> 
> 
> crm
> cib new active
> configure clone WebIP ClusterIP  meta globally-unique="true"
> clone-max="2" clone-node-max="2"
> configure edit  ClusterIP
> 
> i have add the following to the params line
> 
> clusterip_hash="sourceip"
> 
> crm
> cib new active
> configure clone WebIP ClusterIP  meta globally-unique="true"
> clone-max="2" clone-node-max="2"
> configure show
> 
> configure clone WebFSClone WebFS
> configure clone WebSiteClone WebSite
> *configure edit WebDataClone(but this file is empty!!!)

You have never defined a resource called WebDataClone ... at least not
in the configuration you showed us.

If you want to use DRBD you must set up and configure DRBD (including
fencing), create a drbd primitive and then create a master-slave
resource from this primitive with master-max=2 ... If you have shared
storage you would typically create a clustered volume group in
combination with clvmd and integrate that into your Pacemaker
configuration ... without that master-slave magic, which is only needed
by DRBD.

For dual-primary DRBD follow:

http://clusterlabs.org/doc/en-US/Pacemaker/1.1-plugin/html-single/Clusters_from_Scratch/index.html
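
A rough sketch of only the master-slave part (the drbd resource name is a
placeholder and DRBD itself must already be configured for dual-primary
with fencing):

primitive p_drbd_web ocf:linbit:drbd \
  params drbd_resource="web" \
  op monitor interval="29s" role="Master" \
  op monitor interval="31s" role="Slave"
ms WebDataClone p_drbd_web \
  meta master-max="2" master-node-max="1" clone-max="2" \
  clone-node-max="1" notify="true" interleave="true"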

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

> *
> We need to Change master-max to 2
> 
> cib commit active
> quit
> 
> 
> But i am facing issues ,
> 
> 1. Not able to mount gfs2 file system simultaneously on both the nodes.
> 
> 2. When i run "configure colocation fs_on_drbd inf: WebFS
> WebDataClone:Master" & "configure order WebFS-after-WebData inf:
> WebDataClone:promote WebFS:start" , it's gives me error , it doesn't 
> find the "fs_on_drbd" & it asks still want to continue(y/n).
> 
> 2. When i try to configure webdataclone , i don't find any entry like
> "master-max" , file is empty.
> 
> 3. As i have done lot of googling and try to find documentation which
> covers  pacemaker+cman+gfs2 for centos-6, but still no success . every
> documentation link includes drbd first and then gfs.
> 
> 
> Also find the attached node xml file
> 
> 
> Regards
> Pradeep Kumar
> 
> 
> 
> 
> 
> 
> 
> 
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 






signature.asc
Description: OpenPGP digital signature
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] bug in monitor timeout?

2012-10-04 Thread Andreas Kurz
On 10/04/2012 12:18 PM, James Harper wrote:
>> Hi,
>>
>> On Wed, Oct 03, 2012 at 10:07:06PM +, James Harper wrote:
>>> It seems like everytime I modify a resource, things start timing out. Just
>>> now I changed the location of where a ping resource could run and this
>>> happened:
>>> Oct  4 07:07:07 bitvs5 lrmd: [3681]: WARN: perform_ra_op: the
>>> operation monitor[52] on p_lvm_iscsi:0 for client 3686 stayed in
>>> operation list for 22000 ms (longer than 1 ms)
>>
>> That's interesting. Normally such a change should result in just a few
>> operations. Did you take a look at the transition which resulted from this
>> change?
> 
> I don't think I know how to do that. All I changed though was a ping resource 
> that wasn't (yet) in use so I can't see that too much could have changed.
> 
>>> Another oddity is that the resource for p_lvm_iscsi is defined as:
>>>
>>> primitive p_lvm_iscsi ocf:heartbeat:LVM \
>>> params volgrpname="vg-drbd" \
>>> op start interval="0" timeout="30s" \
>>> op stop interval="0" timeout="30s" \
>>> op monitor interval="10s" timeout="30s"
>>>
>>> so I don't know where the timeout of 1ms is coming from??
>>>
>>> When I change something with crm configure the cib process shoots up to
>>> 100% CPU and stays there for a while, and the node becomes more-or-less
>>> unresponsive, which may go some way to explaining why things time out. Is
>>> this normal? It doesn't explain why lrmd complains that something took
>>> longer than 10s when I set the timeout to 30s though, unless the interval
>>> somehow interacts with that?
>>
>> Ten seconds is an ad-hoc time and has nothing to do with specific timeouts.
>> lrmd logs a warning if an operation stays in the queue for longer than that.
> 
> Ah. I leaped to the wrong conclusion there then.
> 
>> How many resources do you have? You can also increase max-children (a
>> lrmd parameter), which is a number of operations that lrmd is allowed to run
>> concurrently (lrmadmin -p max-children n, by default it's set to 4).
> 
> crm status says:
> 
> Last updated: Thu Oct  4 19:42:30 2012
> Last change: Thu Oct  4 08:21:47 2012
> Stack: openais
> Current DC:  - partition with quorum
> Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
> 5 Nodes configured, 5 expected votes
> 81 Resources configured.
> 
> I'm not sure how many is considered a lot, but I can't think that 81 would 
> rate very highly.

Yes, that is a lot for Pacemaker ... 5 nodes, and each node has a status
section including operation results for every resource ... the CIB can
get large quite fast.

You tried setting a higher "batch-limit" in your properties? Do you see
any corosync messages when applying such changes? There is a good chance
you also need to tune corosync timings.
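
For the corosync side, the usual candidates are the totem timings in
/etc/corosync/corosync.conf; a sketch with example values only, to be
merged into your existing totem section and tuned for your environment:

totem {
        token: 10000
        consensus: 12000
}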

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now
> 
>>> Versions of software are all from Debian Wheezy:
>>> corosync 1.4.2-3
>>> pacemaker 1.1.7-1
>>
>> I'd suggest to open a bugzilla and include hb_report (or crm_report,
>> whatever your distribution ships).
>>
> 
> I plan on doing a bit more testing on the weekend to see if I can find a bit 
> more information about exactly what is going on, otherwise any bug report is 
> going to be a bit vague.
> 
> Thanks
> 
> James
> 
> 
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 






signature.asc
Description: OpenPGP digital signature
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] recovering from split brain

2012-10-04 Thread Andreas Kurz
On 10/04/2012 12:03 AM, Jane Du (jadu) wrote:
> Re-send. Will be appreciate if someone could shed some light on this.
> 
> -Original Message-
> From: Jane Du (jadu) 
> Sent: Friday, September 28, 2012 4:53 PM
> To: The Pacemaker cluster resource manager
> Subject: [Pacemaker] recovering from split brain
> 
> Hi,
> Use heartbeat and pacemaker for two node HA. We don't configure the fencing 
> or quorum because we let 
> one of our module handle the split-brain recovering based on some important 
> factors. However,
> when recovering from two-node split brain situation, CRM still tries to elect 
> a node to be active to run the resource and sometimes
> the decision is conflict with what our master module makes and triggers an 
> unnecessary switch over.
> Is there a parameter that I can program to delay CRM to make the decision or 
> not to make decision when recovering from split brain?
> 

Hard to say without logs/configuration ... You mean the resources are
_not_ running on both nodes while the cluster is in split-brain? ... Have
you set resource-stickiness?
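
If not, a sketch of a default stickiness so resources stay where they are
once the split-brain heals (the value is just an example):

rsc_defaults resource-stickiness=100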

Regards,
Andreas
-- 
Need help with Pacemaker?
http://www.hastexo.com/now

> 
> Thanks for your help,
> 
> Jane
> 
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 






signature.asc
Description: OpenPGP digital signature
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] apache on too many nodes?

2012-10-04 Thread Andreas Kurz
On 10/03/2012 10:47 PM, Jake Smith wrote:
> 
> 
> 
> - Original Message -
>> From: mar...@nic.fi
>> To: pacemaker@oss.clusterlabs.org
>> Sent: Wednesday, October 3, 2012 4:36:10 PM
>> Subject: [Pacemaker] apache on too many nodes?
>>
>> Hello,
>>
>> I'm currently testing out a 2 node system with the simple goal of
>> providing failover so that the services are run on one node at a
>> time.
>> I have some of the mysql and apache configuration files in the drbd
>> partition.
>>
>> Without apache everything works as expected drbd and mysql start and
>> stop properly on the nodes, but as soon as i try to put apache into
>> to mix nothing works anymore.
>>
>> Using the following configuration:
>>
>> primitive drbd ocf:linbit:drbd params drbd_resource="drbd" op start
>> interval="0" timeout="240s" op stop interval="0" timeout="100s"
>> ms drbd_ms drbd meta master-max="1" master-node-max="1" clone-max="2"
>> clone-node-max="1" notify="true"
>> primitive drbd_fs ocf:heartbeat:Filesystem params device="/dev/drbd0"
>> directory="/mnt/drbd" fstype="ext3" op start interval="0"
>> timeout="60" op stop interval="0" timeout="120"
>> primitive service_mysqld lsb:mysql op monitor interval="15s"
>> primitive service_apache lsb:apache2 op monitor interval="15s"
>> group services_group service_mysqld service_apache
>> colocation lamp_services inf: drbd_fs drbd_ms:Master services_group
> 
> Pretty sure your colocation is the problem - services_group should be first 
> not last...
> 
> colocation lamp_services inf: services_group drbd_fs drbd_ms:Master

It has to be:

colocation lamp_services inf: drbd_fs services_group drbd_ms:Master

... if you really want to use resource-sets, I'd recommend putting the
fs into the services_group in the first position to avoid confusion and use:

colocation lamp_services inf: services_group drbd_ms:Master
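
Put together, that alternative would look roughly like this, based on the
resources you already have:

group services_group drbd_fs service_mysqld service_apache
colocation lamp_services inf: services_group drbd_ms:Master
order lamp_order inf: drbd_ms:promote services_group:start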

You disabled automatic start via init for apache & mysql? You checked
the apache LSB script for LSB compliance?

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

> 
> because you want the services group to run on the same node as the drbd 
> filesystem which has to run on the same node as the drbd Master.
> 
>> order lamp_order inf: drbd_ms:promote drbd_fs:start
>> services_group:start
>>
>> I think it wants to start apache on both nodes for some reason (and
>> fails ofcourse) and promptly stops everything.
>>
>> service_apache (lsb:apache2) Started (unmanaged) FAILED[ node1 node2
>> ]
>>
>> I'm probably doing something stupid here. Using debian squeeze.
>>
>> :Mrv
>>
>>
>> ___
>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started:
>> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>>
>>
> 
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 






signature.asc
Description: OpenPGP digital signature
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] staggered startup

2012-09-06 Thread Andreas Kurz
On 09/05/2012 04:08 PM, James Harper wrote:
> A power failure tonight indicated that my clustered resources (xen vm's) have 
> a dependency requirement like "make sure at least one domain controller VM is 
> fully up and running before starting any other windows servers". Determining 
> a status of "fully up and running" is probably complex so as a minimum I need 
> to say "make sure resource1 has been started for 60 seconds before starting 
> resource2".
> 
> Is this possible?

The VirtualDomain RA lets you define additional monitor scripts. You can
use any script you like to check whether services inside a VM are up,
e.g. trying to connect to an SMTP service inside the VM ... just be
sure to use generous start timeout values.
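
A sketch of what that could look like (config path, script path, timeouts
and the second VM name are assumptions, not tested values):

primitive vm_dc1 ocf:heartbeat:VirtualDomain \
  params config="/etc/libvirt/qemu/dc1.xml" \
    monitor_scripts="/usr/local/bin/check_dc1_services.sh" \
  op start interval="0" timeout="180s" \
  op stop interval="0" timeout="180s" \
  op monitor interval="30s" timeout="60s"
order o_dc1_before_others inf: vm_dc1 vm_member1

If the RA behaves as documented, the start of vm_dc1 only completes once
its monitor (including the extra script) succeeds, so the ordered VM is
held back until then.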

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

> 
> Thanks
> 
> James
> 
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 





signature.asc
Description: OpenPGP digital signature
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Did I miss a dependency or is this a 1.1.7 bug?

2012-08-01 Thread Andreas Kurz
On 08/01/2012 06:51 AM, mark - pacemaker list wrote:
> Hello,
> 
> I suspect I've missed a dependency somewhere in the build process, and
> I'm hoping someone recognizes this as an easy fix.  I've basically
> followed the build guide on ClusterLabs in 'Clusters from Scratch v2',
> the build from source section.  The hosts are CentOS 6.3 x86_64.  My

Pacemaker 1.1.7 is included in CentOS 6.3, no need to build it
yourself ... or are you trying to build the latest git version?

Regards,
Andreas

> only changes to the configure lines where adding '--sysconfdir=/etc
> --localstatedir=/var' to them, everything else is pretty much as the
> guide shows.
> 
> With a few commands, such as 'crm configure verify', I get this error
> message:
> 
> # crm configure verify
> crm_verify[1448]: 2012/07/31_23:37:41 ERROR: crm_abort:
> gregorian_to_ordinal: Triggered assert at iso8601.c:635 : a_date->days > 0
> crm_verify[1448]: 2012/07/31_23:37:41 ERROR: crm_abort:
> convert_from_gregorian: Triggered assert at iso8601.c:622 :
> gregorian_to_ordinal(a_date)
> 
> These are accompanied by a corresponding entry in
> /var/log/cluster/corosync.log:
> Jul 31 23:35:25 XenC crmd: [1063]: ERROR: crm_abort:
> gregorian_to_ordinal: Triggered assert at iso8601.c:635 : a_date->days > 0
> Jul 31 23:35:25 XenC crmd: [1063]: ERROR: crm_abort:
> convert_from_gregorian: Triggered assert at iso8601.c:622 :
> gregorian_to_ordinal(a_date)
> 
> So far the cluster seems functional and logs appear clean otherwise, so
> I don't know if those are cosmetic or if I have a whole field of
> landmines due to some missed dependency.  I've pastebin'ed my notes from
> building these if that helps at all:  pastebin.com/c5BTMA0q
>    
> 
> Thanks,
> Mark
> 
> 
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 


-- 
Need help with Pacemaker?
http://www.hastexo.com/now




signature.asc
Description: OpenPGP digital signature
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] /var/lib/pengine folder consuming space over time

2012-07-26 Thread Andreas Kurz
On 07/20/2012 04:05 PM, ihjaz Mohamed wrote:
> Am using  pacemaker-1.1.5-5 and I see about 21000 input files in the
> folder. Does that mean this version by default has no such limits?

yes ... IIRC limits are set since 1.1.6
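
On 1.1.5 you can still cap them by hand via cluster properties, e.g. (the
values are only examples):

crm configure property pe-input-series-max=2000 pe-warn-series-max=2000 pe-error-series-max=2000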

Regards,
Andreas

> 
> ----
> *From:* Andreas Kurz 
> *To:* pacemaker@oss.clusterlabs.org
> *Sent:* Friday, 20 July 2012 5:57 PM
> *Subject:* Re: [Pacemaker] /var/lib/pengine folder consuming space over time
> 
> On 07/20/2012 02:18 PM, ihjaz Mohamed wrote:
>> Hi All,
>>
>> I see that the folder /var/lib/pengine is consuming space over time.
>>
>> Is there a configuration to limit the size of the data logged by pengine
>> so that once it reaches this limit the older ones get removed.
> 
> the cluster properties to limit them (more) are:
> 
> pe-error-series-max
> pe-warn-series-max
> pe-input-series-max
> 
> ... newer Pacemaker has limits on warn (5000) and input (4000) pengine
> files.
> 
> Regards,
> Andreas
> 
> -- 
> Need help with Pacemaker?
> http://www.hastexo.com/now
> 
>>
>>
>> ___
>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> <mailto:Pacemaker@oss.clusterlabs.org>
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org <http://www.clusterlabs.org/>
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org <http://bugs.clusterlabs.org/>
>>
> 
> 
> 
> 
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> <mailto:Pacemaker@oss.clusterlabs.org>
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org <http://www.clusterlabs.org/>
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org <http://bugs.clusterlabs.org/>
> 
> 
> 
> 
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 

-- 
Need help with Pacemaker?
http://www.hastexo.com/now




signature.asc
Description: OpenPGP digital signature
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Specifying a location rule with a resource:role

2012-07-25 Thread Andreas Kurz
On 07/24/2012 09:03 PM, Jay Janssen wrote:
> Hi all,
>   Simple crm configure syntax question (I think).  I'm trying to make a
> given role (a 'Master' in a master/slave set) from being location on a
> given node.  However, the "obvious" syntax (at least in my mind) doesn't
> work:
> 
> location avoid_being_the_master ms_MySQL:Master -1000: my_node
> location never_be_the_master ms_MySQL:Master -inf: my_node

This notation is a special shortcut and does not allow setting the role
... try:

location avoid_being_the_master ms_MySQL \
  rule $role=Master -1000: #uname eq my_node
location never_be_the_master ms_MySQL \
  rule $role=Master -inf: #uname eq my_node

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

> 
>   This man page tells me I'm
> close: 
> http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/ch06s07.html
>  ,
> I suspect it's the :Master rule that is screwing me up.  How can I
> specify the Master role?  The following colocation rule works fine:
> 
> colocation writer_vip_on_master inf: writer_vip ms_MySQL:Master
> 
> Jay Janssen, MySQL Consultant, Percona Inc.
> Telephone: +1 888 401 3401 ext 563 (US/Eastern)
> Emergency:+1 888 401 3401 ext 911
> Skype:percona.jayj  
> GTalk/MSN: jay.jans...@percona.com 
> YahooIM:perconajayj
> Calendar: http://tungle.me/jayjanssen
> 
> Percona Live in NYC Oct 1-2nd: http://www.percona.com/live/nyc-2012/
> 
> 
> 
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 





signature.asc
Description: OpenPGP digital signature
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] delaying stonith

2012-07-24 Thread Andreas Kurz
On 07/24/2012 11:02 AM, Frank Van Damme wrote:
> 2012/7/23 Andreas Kurz :
>> You are using stonith-action="poweroff" with external/ipmi? That would
>> explain these 4 seconds, during which the system does a powerdown caught
>> by acpid. Use stonith-action="reset" when using external/ipmi ... this
>> should bring down the STONITH target nearly immediately, without acpid
>> taking notice.
> 
> there are only two values, reboot and poweroff - with reboot being the 
> default.

true ... I meant "reboot". So you use the default and still have this problem?

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now




signature.asc
Description: OpenPGP digital signature
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] delaying stonith

2012-07-23 Thread Andreas Kurz
On 07/23/2012 10:24 AM, Frank Van Damme wrote:
> Hello,
> 
> I had an interesting conversation on irc about clustering and fencing,
> and I was told it is possible to delay the triggering of a stonith
> action by a number of seconds. I searched, but I can't really find how
> to configure it.
> 
> (The problem I am trying to solve is that the nodes in a two-node
> cluster run the risk of killing each other if connectivity is lost.
> Stonith takes a bit of time to execute - I'm using external/ipmi to
> reset nodes and it seems that his action is the equivalent of holding
> down the power button, which takes four seconds, during which the
> other machine has time to start fencing the first node.)

You are using stonith-action="poweroff" with external/ipmi? That would
explain these 4 seconds, during which the system does a powerdown caught
by acpid. Use stonith-action="reset" when using external/ipmi ... this
should bring down the STONITH target nearly immediately, without acpid
taking notice.

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now




signature.asc
Description: OpenPGP digital signature
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] None of the standard agents in ocf:heartbeat are working in centos 6

2012-07-23 Thread Andreas Kurz
On 07/23/2012 07:06 AM, David Barchas wrote:
> Hello.
> 
> I have been working on this for 3 days now, and must be so stressed out
> that I am being blinded to what is probably an obvious cause of this. In
> a word, HELP.
> 
> I am trying specifically to utilize ocf:heartbeat:IPaddr2, but this
> issue seems to occur with any of the ocf:heartbeat agents. I will just
> focus on IPaddr2 for purposes of figuring this out, but it happens
> exactly the same with any of the default agents. However, I can
> successfully use ocf:linbit:drbd for example. it seems to be limited to
> the RAs that are installed along with coro/pace in the resource-agents
> package.

What are the exact package versions you have installed?

pacemaker*
resource-agents
cluster-glue*


> 
> I am using CentOS 6.3, fully updated (though this happens in 6.2 with no
> updates as well). Install pacemaker/coro from default repo. I have
> stripped everything down to figure this out in vmware and just install
> centos, update it, install pace/coro (no drbd for this discussion),
> configure coro, and then start it. pacemaker starts up fine (or at least
> I think its fine). I can set quorum ignore for example from crm. (crm
> configure property no-quorum-policy="ignore")
> 
> here is the process list
> root  1447  0.3  0.6 556080  6636 ?Ssl  21:09   0:00 corosync
> 499   1453  0.0  0.5  88720  5556 ?S21:09   0:00  \_
> /usr/libexec/pacemaker/cib
> root  1454  0.0  0.3  86968  3488 ?S21:09   0:00  \_
> /usr/libexec/pacemaker/stonithd
> root  1455  0.0  0.2  76188  2492 ?S21:09   0:00  \_
> /usr/lib64/heartbeat/lrmd
> 499   1456  0.0  0.3  91160  3432 ?S21:09   0:00  \_
> /usr/libexec/pacemaker/attrd
> 499   1457  0.0  0.3  87440  3824 ?S21:09   0:00  \_
> /usr/libexec/pacemaker/pengine
> 499   1458  0.0  0.3  91312  3884 ?S21:09   0:00  \_
> /usr/libexec/pacemaker/crmd

So you are using plugin version 0 to start Pacemaker. That would
explain why /etc/init.d/pacemaker is unable to start ... it is already
started by Corosync.

> 
> 499 is hacluster btw.
> 
> ***BUT***
> 
> When I run as root the following:
> # crm ra meta ocf:heartbeat:IPaddr2
> 
> I get this response:
> lrmadmin[1484]: 2012/07/22_13:28:23 ERROR:
> lrm_get_rsc_type_metadata(578): got a return code HA_FAIL from a reply
> message of rmetadata with function get_ret_from_msg.
> ERROR: ocf:heartbeat:IPaddr2: could not parse meta-data: 
> 
> And this is in /var/log/messages:
> Jul 22 16:35:14 MST lrmd: [48093]: ERROR: get_resource_meta: pclose
> failed: Resource temporarily unavailable
> Jul 22 16:35:14 MST lrmd: [48093]: WARN: on_msg_get_metadata: empty
> metadata for ocf::heartbeat::IPaddr2.
> Jul 22 16:35:14 MST lrmd: [48093]: WARN: G_SIG_dispatch: Dispatch
> function for SIGCHLD was delayed 200 ms (> 100 ms) before being called
> (GSource: 0x187df10)
> Jul 22 16:35:14 MST lrmd: [48093]: info: G_SIG_dispatch: started at
> 429616889 should have started at 429616869
> Jul 22 16:35:14 MST lrmadmin: [48254]: ERROR:
> lrm_get_rsc_type_metadata(578): got a return code HA_FAIL from a reply
> message of rmetadata with function get_ret_from_msg.
> 
> I am using crm ra meta as a way to test, but crm will not accept my
> trying to add the resource as a primitive either.
> 
> In my research, I have found that often it's permissions. So just to
> rule that out i set my entire system to 777 permissions. no joy.
> 
> Another suggestion i find often has been to set OCF_ROOT (export
> OCF_ROOT=/usr/lib/ocf) and then do
> /usr/lib/ocf/resource.d/heartbeat/IPaddr2 meta-data.
> That produces the desired output. But does not work before i export. 
> And CRM still does not accept my meta request 
> 
> Another suggestion i find is to make sure that shellfuncs exists in the
> agents folder. the soft links exist
> lrwxrwxrwx. 1 root root32 Jul 22 04:08 .ocf-binaries ->
> ../../lib/heartbeat/ocf-binaries
> lrwxrwxrwx. 1 root root35 Jul 22 04:08 .ocf-directories ->
> ../../lib/heartbeat/ocf-directories
> lrwxrwxrwx. 1 root root35 Jul 22 04:08 .ocf-returncodes ->
> ../../lib/heartbeat/ocf-returncodes
> lrwxrwxrwx. 1 root root34 Jul 22 04:08 .ocf-shellfuncs ->
> ../../lib/heartbeat/ocf-shellfuncs
> 
> And just to make sure I did un-hidden soft links as well with no joy.

Strange; those errors are typically related to wrong paths for the
initialization of the environment and helper functions:

# Initialization:

: ${OCF_FUNCTIONS_DIR=${OCF_ROOT}/lib/heartbeat}
. ${OCF_FUNCTIONS_DIR}/ocf-shellfuncs
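
A quick way to check that the helper file those two lines source actually
resolves (paths assume the stock CentOS 6 resource-agents layout):

# both should end up at the same readable file
ls -lL /usr/lib/ocf/resource.d/heartbeat/.ocf-shellfuncs
ls -l /usr/lib/ocf/lib/heartbeat/ocf-shellfuncs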

The DRBD agent has an extra fallback check; that may be the reason it
still works ...

# Resource-agents have moved their ocf-shellfuncs file around.
# There are supposed to be symlinks or wrapper files in the old location,
# pointing to the new one, but people seem to get it wrong all the time.
# Try several locations.

if test -n "${OCF_FUNCTIONS_DIR}" ; then
if test -e "${OCF_FUNCTIONS_DIR}/ocf-s

Re: [Pacemaker] /var/lib/pengine folder consuming space over time

2012-07-20 Thread Andreas Kurz
On 07/20/2012 02:18 PM, ihjaz Mohamed wrote:
> Hi All,
> 
> I see that the folder /var/lib/pengine is consuming space over time.
> 
> Is there a configuration to limit the size of the data logged by pengine
> so that once it reaches this limit the older ones get removed.

The cluster properties to limit them (further) are:

pe-error-series-max
pe-warn-series-max
pe-input-series-max

... newer Pacemaker releases already ship with default limits for the warn
(5000) and input (4000) pengine files.
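
A minimal sketch of setting those limits with the crm shell; the values are
only an illustration, not a recommendation:

crm configure property pe-error-series-max="1000"
crm configure property pe-warn-series-max="1000"
crm configure property pe-input-series-max="1000"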

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

> 
> 
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 





signature.asc
Description: OpenPGP digital signature
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] mysql resource agent

2012-07-19 Thread Andreas Kurz
On 07/18/2012 03:24 PM, DENNY, MICHAEL wrote:
> Our current monitor action tests the availability of the mysql database.  
> However, the monitor fails if mysql is doing recovery processing.   And the 
> recovery processing can take a long time.   Do you know if there is  a way to 
> programmatically determine if mysql is in recovery mode (and is processing 
> the log entries)?  or some existing utility prog that can report that mysql 
> is started but in recovery mode?  I want to update the monitor action to 
> succeed if in recovery mode.
> 
> Backgrounder:
> - We are using mysql on Ubuntu servers using heartbeat/pacemaker in an 
> Active/Passive redundant pair configuration.
> - When simulating failover, we have observed that the start action on the new 
> server will succeed, exit almost immediately, although mysql has actually 
> begun recovery processing and subsequent "service mysql status" reports that 
> it is not available for use.
> - Our current monitor action fails if mysql is not available for use.
> 

You are using the LSB init script for mysql? The mysql OCF resource agent
returns from start only after the recovery has finished ... so be sure to
define a generous start timeout.
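
A minimal sketch of such a resource, assuming the stock ocf:heartbeat:mysql
agent; paths and credentials are placeholders, the generous start timeout is
the point being illustrated:

primitive p_mysql ocf:heartbeat:mysql \
        params binary="/usr/bin/mysqld_safe" config="/etc/mysql/my.cnf" \
        op start interval="0" timeout="600s" \
        op stop interval="0" timeout="120s" \
        op monitor interval="30s" timeout="30s"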

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

> Thanks for the help,
> 
> mike
> 
> 
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 







signature.asc
Description: OpenPGP digital signature
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Pacemaker 1.1.7 order constraint syntax

2012-07-19 Thread Andreas Kurz
On 07/19/2012 02:57 PM, Rasto Levrinc wrote:
> On Thu, Jul 19, 2012 at 2:38 PM, Andreas Kurz  wrote:
>> On 07/19/2012 11:47 AM, Vadym Chepkov wrote:
>>> Hi,
>>>
>>> When Pacemaker 1.1.7 was announced, a new feature was mentioned:
>>>
>>> The ability to specify that A starts after ( B or C or D )
>>>
>>> I wasn't able to find an example how to express it crm shell in neither man 
>>> crm nor in Pacemaker Explained.
>>> In fact, 
>>> http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-resource-ordering.html
>>>  doesn't have new attribute listed either.
>>> Is it supported in crm ?
>>
>> I don't think it is supported in crm or any other configuration tool
>> ... the syntax for the above example in XML looks like:
> 
> Well, LCMC supports this, btw last time I checked this feature
> is still not enabled in constrains rng in 1.1.7 by default, so you
> have to wait at least for 1.1.8, or enable it yourself.
> It also doesn't work if combined with colocation.

cool Rasto :-) ... does LCMC also already support the new multiple
stonith-device configuration syntax ... fencing-topology ...?

Cheers,
Andreas

> 
> Rasto
> 
>>
>> ... can be found in the pengine regression tests directory in Pacemaker
>> source ...
> 


-- 
Need help with Pacemaker?
http://www.hastexo.com/now




signature.asc
Description: OpenPGP digital signature
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Pacemaker 1.1.7 order constraint syntax

2012-07-19 Thread Andreas Kurz
On 07/19/2012 11:47 AM, Vadym Chepkov wrote:
> Hi,
> 
> When Pacemaker 1.1.7 was announced, a new feature was mentioned:
> 
> The ability to specify that A starts after ( B or C or D )
> 
> I wasn't able to find an example how to express it crm shell in neither man 
> crm nor in Pacemaker Explained.
> In fact, 
> http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-resource-ordering.html
>  doesn't have new attribute listed either.
> Is it supported in crm ?

I don't think it is supported in crm or any other configuration tool
... the syntax for the above example in XML looks like:

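(The XML markup did not survive the archive; below is a sketch that follows
the Pacemaker 1.1 resource-set schema. Treat the attribute names, in
particular require-all, as an assumption to verify against your schema
version.)

<rsc_order id="order-A-after-B-or-C-or-D">
  <resource_set id="set-first" sequential="false" require-all="false">
    <resource_ref id="B"/>
    <resource_ref id="C"/>
    <resource_ref id="D"/>
  </resource_set>
  <resource_set id="set-then">
    <resource_ref id="A"/>
  </resource_set>
</rsc_order>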

... can be found in the pengine regression tests directory in Pacemaker
source ...

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

> 
> Thanks,
> Vadym
> 
> 
> 
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 







signature.asc
Description: OpenPGP digital signature
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] drbd under pacemaker - always get split brain

2012-07-11 Thread Andreas Kurz
On 07/11/2012 09:23 AM, Nikola Ciprich wrote:
>>> It really really looks like Pacemaker is too fast when promoting to
>>> primary ... before the connection to the already up second node can be
>>> established.
>>
>> Do you mean we're violating a constraint?
>> Or is it a problem of the RA returning too soon?
> dunno, I tried older drbd userspaces to check if it's not problem
> of newer RA, to no avail...
> 
>>
>>> I see in your logs you have DRBD 8.3.13 userland  but
>>> 8.3.11 DRBD module installed ... can you test with 8.3.13 kernel module
>>> ... there have been fixes that look like addressing this problem.
> tried 8.3.13 userspace + 8.3.13 module (on top of 3.0.36 kernel), 
> unfortunately same result..
> 
>>>
>>> Another quick-fix, that should also do: add a start-delay of some
>>> seconds to the start operation of DRBD
>>>
>>> ... or fix your after-split-brain policies to automatically solve this
>>> special type of split-brain (with 0 blocks to sync).
> I'll try that, although I'd not like to use this for production :)

Well, I'd expect that to be safer than your current configuration ...
discard-zero-changes will never overwrite data automatically. Have you
tried adding the start-delay to the DRBD start operation? I'm curious
whether that is already sufficient for your problem.
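
A minimal sketch of that quick fix, reusing the drbd-sas0 primitive from this
thread; the delay value is only an illustration:

primitive drbd-sas0 ocf:linbit:drbd \
        params drbd_resource="drbd-sas0" \
        op start interval="0" timeout="240s" start-delay="15s" \
        op stop interval="0" timeout="200s" \
        op promote interval="0" timeout="200s" \
        op demote interval="0" timeout="200s" \
        op monitor interval="179s" role="Master" timeout="150s" \
        op monitor interval="180s" role="Slave" timeout="150s"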

Regards,
Andreas

> 
>>>
>>> Best Regards,
>>> Andreas
>>>
>>> --
>>> Need help with Pacemaker?
>>> http://www.hastexo.com/now
>>>
>>>>
>>>> thanks for Your time.
>>>> n.
>>>>
>>>>
>>>>>
>>>>> Regards,
>>>>> Andreas
>>>>>
>>>>> --
>>>>> Need help with Pacemaker?
>>>>> http://www.hastexo.com/now
>>>>>
>>>>>>
>>>>>> thanks a lot in advance
>>>>>>
>>>>>> nik
>>>>>>
>>>>>>
>>>>>> On Sun, Jul 08, 2012 at 12:47:16AM +0200, Andreas Kurz wrote:
>>>>>>> On 07/02/2012 11:49 PM, Nikola Ciprich wrote:
>>>>>>>> hello,
>>>>>>>>
>>>>>>>> I'm trying to solve quite mysterious problem here..
>>>>>>>> I've got new cluster with bunch of SAS disks for testing purposes.
>>>>>>>> I've configured DRBDs (in primary/primary configuration)
>>>>>>>>
>>>>>>>> when I start drbd using drbdadm, it get's up nicely (both nodes
>>>>>>>> are Primary, connected).
>>>>>>>> however when I start it using corosync, I always get split-brain, 
>>>>>>>> although
>>>>>>>> there are no data written, no network disconnection, anything..
>>>>>>>
>>>>>>> your full drbd and Pacemaker configuration please ... some snippets from
>>>>>>> something are very seldom helpful ...
>>>>>>>
>>>>>>> Regards,
>>>>>>> Andreas
>>>>>>>
>>>>>>> --
>>>>>>> Need help with Pacemaker?
>>>>>>> http://www.hastexo.com/now
>>>>>>>
>>>>>>>>
>>>>>>>> here's drbd resource config:
>>>>>>>> primitive drbd-sas0 ocf:linbit:drbd \
>>>>>>>> params drbd_resource="drbd-sas0" \
>>>>>>>> operations $id="drbd-sas0-operations" \
>>>>>>>> op start interval="0" timeout="240s" \
>>>>>>>> op stop interval="0" timeout="200s" \
>>>>>>>> op promote interval="0" timeout="200s" \
>>>>>>>> op demote interval="0" timeout="200s" \
>>>>>>>> op monitor interval="179s" role="Master" timeout="150s" \
>>>>>>>> op monitor interval="180s" role="Slave" timeout="150s"
>>>>>>>>
>>>>>>>> ms ms-drbd-sas0 drbd-sas0 \
>>>>>>>>meta clone-max="2" clone-node-max="1" master-max="2" 
>>>>>>>> master-node-max="1" notify="true" globally-unique="

Re: [Pacemaker] drbd under pacemaker - always get split brain

2012-07-11 Thread Andreas Kurz
On 07/11/2012 04:50 AM, Andrew Beekhof wrote:
> On Wed, Jul 11, 2012 at 8:06 AM, Andreas Kurz  wrote:
>> On Tue, Jul 10, 2012 at 8:12 AM, Nikola Ciprich
>>  wrote:
>>> Hello Andreas,
>>>> Why not use the LVM RA that comes with the resource-agents package?
>>> well, I've historically used my scripts, haven't even noticed when LVM
>>> resource appeared.. I switched to it now.., thanks for the hint..
>>>> this "become-primary-on" was never activated?
>>> nope.
>>>
>>>
>>>> Is the drbd init script deactivated on system boot? Cluster logs should
>>>> give more insights 
>>> yes, it's deactivated. I tried resyncinc drbd by hand, deleted logs,
>>> rebooted both nodes, checked drbd ain't started and started corosync.
>>> result is here:
>>> http://nelide.cz/nik/logs.tar.gz
>>
>> It really really looks like Pacemaker is too fast when promoting to
>> primary ... before the connection to the already up second node can be
>> established.
> 
> Do you mean we're violating a constraint?
> Or is it a problem of the RA returning too soon?

It looks like a RA problem ... notifications after the start of the
resource and the following promote are very fast and DRBD is still not
finished with establishing the connection to the peer. I can't remember
seeing this before.

Regards,
Andreas

> 
>> I see in your logs you have DRBD 8.3.13 userland  but
>> 8.3.11 DRBD module installed ... can you test with 8.3.13 kernel module
>> ... there have been fixes that look like addressing this problem.
>>
>> Another quick-fix, that should also do: add a start-delay of some
>> seconds to the start operation of DRBD
>>
>> ... or fix your after-split-brain policies to automatically solve this
>> special type of split-brain (with 0 blocks to sync).
>>
>> Best Regards,
>> Andreas
>>
>> --
>> Need help with Pacemaker?
>> http://www.hastexo.com/now
>>
>>>
>>> thanks for Your time.
>>> n.
>>>
>>>
>>>>
>>>> Regards,
>>>> Andreas
>>>>
>>>> --
>>>> Need help with Pacemaker?
>>>> http://www.hastexo.com/now
>>>>
>>>>>
>>>>> thanks a lot in advance
>>>>>
>>>>> nik
>>>>>
>>>>>
>>>>> On Sun, Jul 08, 2012 at 12:47:16AM +0200, Andreas Kurz wrote:
>>>>>> On 07/02/2012 11:49 PM, Nikola Ciprich wrote:
>>>>>>> hello,
>>>>>>>
>>>>>>> I'm trying to solve quite mysterious problem here..
>>>>>>> I've got new cluster with bunch of SAS disks for testing purposes.
>>>>>>> I've configured DRBDs (in primary/primary configuration)
>>>>>>>
>>>>>>> when I start drbd using drbdadm, it get's up nicely (both nodes
>>>>>>> are Primary, connected).
>>>>>>> however when I start it using corosync, I always get split-brain, 
>>>>>>> although
>>>>>>> there are no data written, no network disconnection, anything..
>>>>>>
>>>>>> your full drbd and Pacemaker configuration please ... some snippets from
>>>>>> something are very seldom helpful ...
>>>>>>
>>>>>> Regards,
>>>>>> Andreas
>>>>>>
>>>>>> --
>>>>>> Need help with Pacemaker?
>>>>>> http://www.hastexo.com/now
>>>>>>
>>>>>>>
>>>>>>> here's drbd resource config:
>>>>>>> primitive drbd-sas0 ocf:linbit:drbd \
>>>>>>> params drbd_resource="drbd-sas0" \
>>>>>>> operations $id="drbd-sas0-operations" \
>>>>>>> op start interval="0" timeout="240s" \
>>>>>>> op stop interval="0" timeout="200s" \
>>>>>>> op promote interval="0" timeout="200s" \
>>>>>>> op demote interval="0" timeout="200s" \
>>>>>>> op monitor interval="179s" role="Master" timeout="150s" \
>>>>>>> op monitor interval="180s" role="Slave" timeout="150s"
>>>>>>>
>>>>>>> ms ms-drbd-sas0 drb

Re: [Pacemaker] drbd under pacemaker - always get split brain

2012-07-10 Thread Andreas Kurz
On Tue, Jul 10, 2012 at 8:12 AM, Nikola Ciprich
 wrote:
> Hello Andreas,
>> Why not use the LVM RA that comes with the resource-agents package?
> well, I've historically used my scripts, haven't even noticed when LVM
> resource appeared.. I switched to it now.., thanks for the hint..
>> this "become-primary-on" was never activated?
> nope.
>
>
>> Is the drbd init script deactivated on system boot? Cluster logs should
>> give more insights 
> yes, it's deactivated. I tried resyncinc drbd by hand, deleted logs,
> rebooted both nodes, checked drbd ain't started and started corosync.
> result is here:
> http://nelide.cz/nik/logs.tar.gz

It really really looks like Pacemaker is too fast when promoting to
primary ... before the connection to the already up second node can be
established.  I see in your logs you have DRBD 8.3.13 userland  but
8.3.11 DRBD module installed ... can you test with 8.3.13 kernel module
... there have been fixes that look like addressing this problem.

Another quick-fix, that should also do: add a start-delay of some
seconds to the start operation of DRBD

... or fix your after-split-brain policies to automatically solve this
special type of split-brain (with 0 blocks to sync).

Best Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

>
> thanks for Your time.
> n.
>
>
>>
>> Regards,
>> Andreas
>>
>> --
>> Need help with Pacemaker?
>> http://www.hastexo.com/now
>>
>> >
>> > thanks a lot in advance
>> >
>> > nik
>> >
>> >
>> > On Sun, Jul 08, 2012 at 12:47:16AM +0200, Andreas Kurz wrote:
>> >> On 07/02/2012 11:49 PM, Nikola Ciprich wrote:
>> >>> hello,
>> >>>
>> >>> I'm trying to solve quite mysterious problem here..
>> >>> I've got new cluster with bunch of SAS disks for testing purposes.
>> >>> I've configured DRBDs (in primary/primary configuration)
>> >>>
>> >>> when I start drbd using drbdadm, it get's up nicely (both nodes
>> >>> are Primary, connected).
>> >>> however when I start it using corosync, I always get split-brain, 
>> >>> although
>> >>> there are no data written, no network disconnection, anything..
>> >>
>> >> your full drbd and Pacemaker configuration please ... some snippets from
>> >> something are very seldom helpful ...
>> >>
>> >> Regards,
>> >> Andreas
>> >>
>> >> --
>> >> Need help with Pacemaker?
>> >> http://www.hastexo.com/now
>> >>
>> >>>
>> >>> here's drbd resource config:
>> >>> primitive drbd-sas0 ocf:linbit:drbd \
>> >>> params drbd_resource="drbd-sas0" \
>> >>> operations $id="drbd-sas0-operations" \
>> >>> op start interval="0" timeout="240s" \
>> >>> op stop interval="0" timeout="200s" \
>> >>> op promote interval="0" timeout="200s" \
>> >>> op demote interval="0" timeout="200s" \
>> >>> op monitor interval="179s" role="Master" timeout="150s" \
>> >>> op monitor interval="180s" role="Slave" timeout="150s"
>> >>>
>> >>> ms ms-drbd-sas0 drbd-sas0 \
>> >>>meta clone-max="2" clone-node-max="1" master-max="2" 
>> >>> master-node-max="1" notify="true" globally-unique="false" 
>> >>> interleave="true" target-role="Started"
>> >>>
>> >>>
>> >>> here's the dmesg output when pacemaker tries to promote drbd, causing 
>> >>> the splitbrain:
>> >>> [  157.646292] block drbd2: Starting worker thread (from drbdsetup 
>> >>> [6892])
>> >>> [  157.646539] block drbd2: disk( Diskless -> Attaching )
>> >>> [  157.650364] block drbd2: Found 1 transactions (1 active extents) in 
>> >>> activity log.
>> >>> [  157.650560] block drbd2: Method to ensure write ordering: drain
>> >>> [  157.650688] block drbd2: drbd_bm_resize called with capacity == 
>> >>> 584667688
>> >>> [  157.653442] block drbd2: resync bitmap: bits=73083461 words=1141930 
>> >>> pages=2231
>> >>> [  157.65

Re: [Pacemaker] Pacemaker cannot start the failed master as a new slave?

2012-07-09 Thread Andreas Kurz
On 07/09/2012 06:11 AM, quanta wrote:
> Related thread:
> http://oss.clusterlabs.org/pipermail/pacemaker/2011-December/012499.html
> 
> I'm going to setup failover for MySQL replication (1 master and 1 slave)
> follow this guide:
> https://github.com/jayjanssen/Percona-Pacemaker-Resource-Agents/blob/master/doc/PRM-setup-guide.rst

And do you also use the latest mysql RA from the resource-agents GitHub repository?

> 
> Here're the output of `crm configure show`:
> 
> node serving-6192 \
> attributes p_mysql_mysql_master_IP="192.168.6.192"
> node svr184R-638.localdomain \
> attributes p_mysql_mysql_master_IP="192.168.6.38"
> primitive p_mysql ocf:percona:mysql \
> params config="/etc/my.cnf" pid="/var/run/mysqld/mysqld.pid"
> socket="/var/lib/mysql/mysql.sock" replication_user="repl"
> replication_passwd="x" test_user="test_user" test_passwd="x" \
> op monitor interval="5s" role="Master" OCF_CHECK_LEVEL="1" \
> op monitor interval="2s" role="Slave" timeout="30s"
> OCF_CHECK_LEVEL="1" \
> op start interval="0" timeout="120s" \
> op stop interval="0" timeout="120s"
> primitive writer_vip ocf:heartbeat:IPaddr2 \
> params ip="192.168.6.8" cidr_netmask="32" \
> op monitor interval="10s" \
> meta is-managed="true"
> ms ms_MySQL p_mysql \
> meta master-max="1" master-node-max="1" clone-max="2"
> clone-node-max="1" notify="true" globally-unique="false"
> target-role="Master" is-managed="true"
> colocation writer_vip_on_master inf: writer_vip ms_MySQL:Master
> order ms_MySQL_promote_before_vip inf: ms_MySQL:promote writer_vip:start
> property $id="cib-bootstrap-options" \
> dc-version="1.0.12-unknown" \
> cluster-infrastructure="openais" \
> expected-quorum-votes="2" \
> no-quorum-policy="ignore" \
> stonith-enabled="false" \
> last-lrm-refresh="1341801689"
> property $id="mysql_replication" \
> p_mysql_REPL_INFO="192.168.6.192|mysql-bin.06|338"
> 
> `crm_mon`:
> 
> Last updated: Mon Jul  9 10:30:01 2012
> Stack: openais
> Current DC: serving-6192 - partition with quorum
> Version: 1.0.12-unknown
> 2 Nodes configured, 2 expected votes
> 2 Resources configured.
> 
> 
> Online: [ serving-6192 svr184R-638.localdomain ]
> 
>  Master/Slave Set: ms_MySQL
>  Masters: [ serving-6192 ]
>  Slaves: [ svr184R-638.localdomain ]
> writer_vip(ocf::heartbeat:IPaddr2):Started serving-6192
> Editing `/etc/my.cnf` on the serving-6192 of wrong syntax to test
> failover and it's working fine:
> - svr184R-638.localdomain being promoted to become the master
> - writer_vip switch to svr184R-638.localdomain
> 
> Last updated: Mon Jul  9 10:35:57 2012
> Stack: openais
> Current DC: serving-6192 - partition with quorum
> Version: 1.0.12-unknown
> 2 Nodes configured, 2 expected votes
> 2 Resources configured.
> 
> 
> Online: [ serving-6192 svr184R-638.localdomain ]
> 
>  Master/Slave Set: ms_MySQL
>  Masters: [ svr184R-638.localdomain ]
>  Stopped: [ p_mysql:0 ]
> writer_vip(ocf::heartbeat:IPaddr2):Started svr184R-638.localdomain
> 
> Failed actions:
> p_mysql:0_monitor_5000 (node=serving-6192, call=15, rc=7,
> status=complete): not running
> p_mysql:0_demote_0 (node=serving-6192, call=22, rc=7,
> status=complete): not running
> p_mysql:0_start_0 (node=serving-6192, call=26, rc=-2, status=Timed
> Out): unknown exec error
> 
> Remove the wrong syntax from `/etc/my.cnf` on serving-6192, and restart
> corosync, what I would like to see is serving-6192 was started as a new
> slave but it doesn't:
> 
> Failed actions:
> p_mysql:0_start_0 (node=serving-6192, call=4, rc=1,
> status=complete): unknown error
> 
> Here're snippet of the logs which I'm suspecting:
> 
> Jul 09 10:46:32 serving-6192 lrmd: [7321]: info: rsc:p_mysql:0:4: start
> Jul 09 10:46:32 serving-6192 lrmd: [7321]: info: RA output:
> (p_mysql:0:start:stderr) Error performing operation: The
> object/attribute does not exist
> 
> Jul 09 10:46:32 serving-6192 crm_attribute: [7420]: info: Invoked:
> /usr/sbin/crm_attribute -N serving-6192 -l reboot --name readable -v 0

Not enough logs ... at least for me ... to give more hints.

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now


> 
> The strange thing is I can starting it manually:
> 
> export OCF_ROOT=/usr/lib/ocf
> export OCF_RESKEY_config="/etc/my.cnf"
> export OCF_RESKEY_pid="/var/run/mysqld/mysqld.pid"
> export OCF_RESKEY_socket="/var/lib/mysql/mysql.sock"
> export OCF_RESKEY_replication_user="repl"
> export OCF_RESKEY_replication_passwd="x"
> export OCF_RESKEY_max_slave_lag="60"
> export OCF_RESKEY_evict_outdated_slaves="false"
> export OCF_RESKEY_test_user="test_user"
> export OCF_RESKEY_test_passwd="x"
> 
> `sh -x /usr/lib/ocf/resource.d/percona/mysql start`: http://fpaste.org/RVGh/
> 
> Did I make something wrong?
> 
> 
> 
> 
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/p

Re: [Pacemaker] drbd under pacemaker - always get split brain

2012-07-09 Thread Andreas Kurz
On 07/09/2012 12:58 PM, Nikola Ciprich wrote:
> Hello Andreas,
> 
> yes, You're right. I should have sent those in the initial post. Sorry about 
> that.
> I've created very simple test configuration on which I'm able to simulate the 
> problem.
> there's no stonith etc, since it's just two virtual machines for the test.
> 
> crm conf:
> 
> primitive drbd-sas0 ocf:linbit:drbd \
> params drbd_resource="drbd-sas0" \
> operations $id="drbd-sas0-operations" \
> op start interval="0" timeout="240s" \
> op stop interval="0" timeout="200s" \
> op promote interval="0" timeout="200s" \
> op demote interval="0" timeout="200s" \
> op monitor interval="179s" role="Master" timeout="150s" \
> op monitor interval="180s" role="Slave" timeout="150s"
> 
> primitive lvm ocf:lbox:lvm.ocf \

Why not use the LVM RA that comes with the resource-agents package?

> op start interval="0" timeout="180" \
> op stop interval="0" timeout="180"
> 
> ms ms-drbd-sas0 drbd-sas0 \
>meta clone-max="2" clone-node-max="1" master-max="2" master-node-max="1" 
> notify="true" globally-unique="false" interleave="true" target-role="Started"
> 
> clone cl-lvm lvm \
>   meta globally-unique="false" ordered="false" interleave="true" 
> notify="false" target-role="Started" \
>   params lvm-clone-max="2" lvm-clone-node-max="1"
> 
> colocation col-lvm-drbd-sas0 inf: cl-lvm ms-drbd-sas0:Master
> 
> order ord-drbd-sas0-lvm inf: ms-drbd-sas0:promote cl-lvm:start
> 
> property $id="cib-bootstrap-options" \
>dc-version="1.1.6-3.el6-a02c0f19a00c1eb2527ad38f146ebc0834814558" \
>cluster-infrastructure="openais" \
>expected-quorum-votes="2" \
>no-quorum-policy="ignore" \
>stonith-enabled="false"
> 
> lvm resource starts vgshared volume group on top of drbd (LVM filters are set 
> to
> use /dev/drbd* devices only)
> 
> drbd configuration:
> 
> global {
>usage-count no;
> }
> 
> common {
>protocol C;
> 
> handlers {
> pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; 
> /usr/lib/drbd/notify-emergency-reboot.sh; ";
> pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; 
> /usr/lib/drbd/notify-emergency-reboot.sh; ";
> local-io-error "/usr/lib/drbd/notify-io-error.sh; 
> /usr/lib/drbd/notify-emergency-shutdown.sh; ";
> 
> #pri-on-incon-degr 
> "/usr/lib/drbd/notify-pri-on-incon-degr.sh; 
> /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; 
> reboot -f";
> #pri-lost-after-sb 
> "/usr/lib/drbd/notify-pri-lost-after-sb.sh; 
> /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; 
> reboot -f";
> #local-io-error "/usr/lib/drbd/notify-io-error.sh; 
> /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; 
> halt -f";
> # fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
> # split-brain "/usr/lib/drbd/notify-split-brain.sh root";
> # out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";
> # before-resync-target 
> "/usr/lib/drbd/snapshot-resync-target-lvm.sh -p 15 -- -c 16k";
> # after-resync-target 
> /usr/lib/drbd/unsnapshot-resync-target-lvm.sh;
> }
> 
> net {
> allow-two-primaries;
> after-sb-0pri discard-zero-changes;
> after-sb-1pri discard-secondary;
> after-sb-2pri call-pri-lost-after-sb;
> #rr-conflict disconnect;
> max-buffers 8000;
> max-epoch-size 8000;
> sndbuf-size 0;
> ping-timeout 50;
> }
> 
> syncer {
> rate 100M;
> al-extents 3833;
> #   al-extents 257;
> #   verify-alg sha1;
> }
> 
> disk {
> on-io-error   detach;
> no-disk-barrier;
> no-disk-flushes;
> no-md-flushes;
> }
> 
&

Re: [Pacemaker] drbd under pacemaker - always get split brain

2012-07-07 Thread Andreas Kurz
On 07/02/2012 11:49 PM, Nikola Ciprich wrote:
> hello,
> 
> I'm trying to solve quite mysterious problem here..
> I've got new cluster with bunch of SAS disks for testing purposes.
> I've configured DRBDs (in primary/primary configuration)
> 
> when I start drbd using drbdadm, it get's up nicely (both nodes
> are Primary, connected).
> however when I start it using corosync, I always get split-brain, although
> there are no data written, no network disconnection, anything..

your full drbd and Pacemaker configuration please ... some snippets from
something are very seldom helpful ...

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

> 
> here's drbd resource config:
> primitive drbd-sas0 ocf:linbit:drbd \
> params drbd_resource="drbd-sas0" \
> operations $id="drbd-sas0-operations" \
> op start interval="0" timeout="240s" \
> op stop interval="0" timeout="200s" \
> op promote interval="0" timeout="200s" \
> op demote interval="0" timeout="200s" \
> op monitor interval="179s" role="Master" timeout="150s" \
> op monitor interval="180s" role="Slave" timeout="150s"
> 
> ms ms-drbd-sas0 drbd-sas0 \
>meta clone-max="2" clone-node-max="1" master-max="2" master-node-max="1" 
> notify="true" globally-unique="false" interleave="true" target-role="Started"
> 
> 
> here's the dmesg output when pacemaker tries to promote drbd, causing the 
> splitbrain:
> [  157.646292] block drbd2: Starting worker thread (from drbdsetup [6892])
> [  157.646539] block drbd2: disk( Diskless -> Attaching ) 
> [  157.650364] block drbd2: Found 1 transactions (1 active extents) in 
> activity log.
> [  157.650560] block drbd2: Method to ensure write ordering: drain
> [  157.650688] block drbd2: drbd_bm_resize called with capacity == 584667688
> [  157.653442] block drbd2: resync bitmap: bits=73083461 words=1141930 
> pages=2231
> [  157.653760] block drbd2: size = 279 GB (292333844 KB)
> [  157.671626] block drbd2: bitmap READ of 2231 pages took 18 jiffies
> [  157.673722] block drbd2: recounting of set bits took additional 2 jiffies
> [  157.673846] block drbd2: 0 KB (0 bits) marked out-of-sync by on disk 
> bit-map.
> [  157.673972] block drbd2: disk( Attaching -> UpToDate ) 
> [  157.674100] block drbd2: attached to UUIDs 
> 0150944D23F16BAE::8C175205284E3262:8C165205284E3263
> [  157.685539] block drbd2: conn( StandAlone -> Unconnected ) 
> [  157.685704] block drbd2: Starting receiver thread (from drbd2_worker 
> [6893])
> [  157.685928] block drbd2: receiver (re)started
> [  157.686071] block drbd2: conn( Unconnected -> WFConnection ) 
> [  158.960577] block drbd2: role( Secondary -> Primary ) 
> [  158.960815] block drbd2: new current UUID 
> 015E111F18D08945:0150944D23F16BAE:8C175205284E3262:8C165205284E3263
> [  162.686990] block drbd2: Handshake successful: Agreed network protocol 
> version 96
> [  162.687183] block drbd2: conn( WFConnection -> WFReportParams ) 
> [  162.687404] block drbd2: Starting asender thread (from drbd2_receiver 
> [6927])
> [  162.687741] block drbd2: data-integrity-alg: 
> [  162.687930] block drbd2: drbd_sync_handshake:
> [  162.688057] block drbd2: self 
> 015E111F18D08945:0150944D23F16BAE:8C175205284E3262:8C165205284E3263 bits:0 
> flags:0
> [  162.688244] block drbd2: peer 
> 7EC38CBFC3D28FFF:0150944D23F16BAF:8C175205284E3263:8C165205284E3263 bits:0 
> flags:0
> [  162.688428] block drbd2: uuid_compare()=100 by rule 90
> [  162.688544] block drbd2: helper command: /sbin/drbdadm initial-split-brain 
> minor-2
> [  162.691332] block drbd2: helper command: /sbin/drbdadm initial-split-brain 
> minor-2 exit code 0 (0x0)
> 
> to me it seems to be that it's promoting it too early, and I also wonder why 
> there is the 
> "new current UUID" stuff?
> 
> I'm using centos6, kernel 3.0.36, drbd-8.3.13, pacemaker-1.1.6
> 
> could anybody please try to advice me? I'm sure I'm doing something stupid, 
> but can't figure out what...
> 
> thanks a lot in advance
> 
> with best regards
> 
> nik
> 
> 
> 
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 







signature.asc
Description: OpenPGP digital signature
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Confusing semantics of colocation sets

2012-07-07 Thread Andreas Kurz
On 07/02/2012 08:28 PM, Phil Frost wrote:
> On 07/02/2012 12:50 PM, Dejan Muhamedagic wrote:
>> What is being mangled actually? The crm shell does what is
>> possible given the pacemaker RNG schema. It is unfortunate that
>> the design is slightly off, but that cannot be fixed in the crm
>> syntax.
> 
> I will demonstrate my point by offering a quiz to the list. Tell me,
> without running these examples, what effect they will have:
> 
> [1] colocation foo inf: a b ( c:Master d )

Whenever you change the role, an extra set is created ... no different
roles within a resource set ... and between colocation sets the direction of
dependencies is like in a simple 2-resource colocation.

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now


> [2] colocation foo inf: a b
> [3] colocation foo inf: a b c
> [4] colocation foo inf: a b c:Master
> 
> Hints:
> 
> - there are three resource sets in [1]
> - [2] is not a subset of [3]
> - [4] is not ordered a, b, c
> 
> 
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org






signature.asc
Description: OpenPGP digital signature
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Pacemaker hang with hardware reset

2012-07-07 Thread Andreas Kurz
On 07/04/2012 01:20 PM, Damiano Scaramuzza wrote:
> Hi Emmanuel,
> yes I use drbd level fence as in linbit user guide
> 
> disk {
> fencing resource-only;
> ...

In a dual-primary setup, use "resource-and-stonith"

>   }
>   handlers {
> fence-peer "/usr/lib/drbd/crm-fence-peer.sh";

Use a fencing method that uses the same STONITH mechanism as Pacemaker
... there is the "stonith_admin-fence-peer.sh" script available.
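
A minimal sketch of that combination for the DRBD resource; the handler path
is an assumption and depends on where your distribution installs the script:

disk {
    fencing resource-and-stonith;
}
handlers {
    fence-peer "/usr/lib/drbd/stonith_admin-fence-peer.sh";
}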

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now




signature.asc
Description: OpenPGP digital signature
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] 2-node cluster doesn't move resources away from a failed node

2012-07-07 Thread Andreas Kurz
On 07/05/2012 04:12 PM, David Guyot wrote:
> Hello, everybody.
> 
> As the title suggests, I'm configuring a 2-node cluster but I've got a
> strange issue here : when I put a node in standby mode, using "crm node
> standby", its resources are correctly moved to the second node, and stay
> there even if the first is back on-line, which I assume is the preferred
> behavior (preferred by the designers of such systems) to avoid having
> resources on a potentially unstable node. Nevertheless, when I simulate
> failure of the node which run resources by "/etc/init.d/corosync stop",
> the other node correctly fences the failed node by electrically resetting
> it, but that doesn't mean it will mount the resources on itself; rather,
> it waits the failed node to be back on-line, and then re-negotiates
> resource placement, which inevitably leads to the failed node restarting
> the resources, which I suppose is a consequence of the resource
> stickiness still recorded by the intact node : because this node still
> assume that resources are running on the failed node, it assumes that
> resources prefer to stay on the first node, even if it has failed.
> 
> When the first node, Vindemiatrix, has shuts down Corosync, the second,
> Malastare, reports this :
> 
> root@Malastare:/home/david# crm_mon --one-shot -VrA
> 
> Last updated: Thu Jul  5 15:27:01 2012
> Last change: Thu Jul  5 15:26:37 2012 via cibadmin on Malastare
> Stack: openais
> Current DC: Malastare - partition WITHOUT quorum
> Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
> 2 Nodes configured, 2 expected votes
> 17 Resources configured.
> 
> 
> Node Vindemiatrix: UNCLEAN (offline)

Pacemaker thinks fencing was not successful and will not recover
resources until STONITH was successful ... or the node returns and it is
possible to probe resource states.

> Online: [ Malastare ]
> 
> Full list of resources:
> 
>  soapi-fencing-malastare(stonith:external/ovh):Started Vindemiatrix
>  soapi-fencing-vindemiatrix(stonith:external/ovh):Started Malastare
>  Master/Slave Set: ms_drbd_svn [drbd_svn]
>  Masters: [ Vindemiatrix ]
>  Slaves: [ Malastare ]
>  Master/Slave Set: ms_drbd_pgsql [drbd_pgsql]
>  Masters: [ Vindemiatrix ]
>  Slaves: [ Malastare ]
>  Master/Slave Set: ms_drbd_backupvi [drbd_backupvi]
>  Masters: [ Vindemiatrix ]
>  Slaves: [ Malastare ]
>  Master/Slave Set: ms_drbd_www [drbd_www]
>  Masters: [ Vindemiatrix ]
>  Slaves: [ Malastare ]
>  fs_www(ocf::heartbeat:Filesystem):Started Vindemiatrix
>  fs_pgsql(ocf::heartbeat:Filesystem):Started Vindemiatrix
>  fs_svn(ocf::heartbeat:Filesystem):Started Vindemiatrix
>  fs_backupvi(ocf::heartbeat:Filesystem):Started Vindemiatrix
>  VirtualIP(ocf::heartbeat:IPaddr2):Started Vindemiatrix
>  OVHvIP(ocf::pacemaker:OVHvIP):Started Vindemiatrix
>  ProFTPd(ocf::heartbeat:proftpd):Started Vindemiatrix
> 
> Node Attributes:
> * Node Malastare:
> + master-drbd_backupvi:0  : 1
> + master-drbd_pgsql:0 : 1
> + master-drbd_svn:0   : 1
> + master-drbd_www:0   : 1
> 
> As you can see, the node failure is detected. This state leads to
> attached log file.
> 
> Note that both ocf::pacemaker:OVHvIP and stonith:external/ovh are custom
> resources which uses my server provider's SOAP API to provide intended
> services. The STONITH agent does nothing but returning exit status 0
> when start, stop, on or off actions are required, but returns the 2
> nodes names when hostlist or gethosts actions are required and, when
> reset action is required, effectively resets faulting node using the
> provider API. As this API doesn't provide reliable mean to know the
> exact moment of resetting, the STONITH agent pings the faulting node
> every 5 seconds until ping fails, then forks a process which pings the
> faulting node every 5 seconds until it answers, then, due to external
> VPN being not yet installed by the provider, I'm forced to emulate it
> with OpenVPN (which seems to be unable to re-establish a connection lost
> minutes ago, leading to a dual brain situation), the STONITH agent
> restarts OpenVPN to re-establish the connection, then restarts Corosync
> and Pacemaker.
> 
> Aside from the VPN issue, of which I'm fully aware of performance and
> stability issues, I thought that Pacemaker would, as soon as the STONITH
> agent returns exit status 0, start the resources on the remaining node,
> but it doesn't. Instead, it seems that the STONITH reset action waits
> too long to report a successful reset, delay which reaches some internal
> timeout, which in turn leads Pacemaker to assume that STONITH agent
> failed, therefore, while eternally trying to reset the node (which only
> leads to the API issuing an error because the last reset request was
> less than 5 minutes ago, something forbidden) sto

Re: [Pacemaker] Problem setting-up DRBD v8,4 with Pacemaker v1.1.6

2012-07-04 Thread Andreas Kurz
On 07/04/2012 09:16 PM, Irfan Ali wrote:
> Hi all,
> 
> We are trying to set-up an HA pair  on RHEL 6.2 using DRBD (v
> 8.4.1-2), Pacemaker (v 1.1.6-3) and Corosync (v 1.4.1-4). We could
> make DRBD work independently syncing the two machines of the pair. But
> our problem begins when we try to connect DRBD with Pacemaker. Even
> though Pacemaker was able to detect the resources corresponding to
> DRBD running on both the machines, it does not promoted any one of
> them as master. I have pasted the output from 'crm status' below, it
> shows that the resource 'ms-drbd' is running as slaves on both the
> machines of our HA pair.

Please, don't paste snippets ... only full configurations are useful ...
more inline

> 
> We fiddled a lot with the configuration for both Pacemaker and DRBD,
> but couldn't find anything to fix this problem. Any help / suggestions
> from you guys will be highly appreciated.
> 
> ===
> 
> [root@c713 linbit]# crm status
> 
> 
> 
> Last updated: Wed Jul  4 08:42:37 2012
> 
> Last change: Wed Jul  4 07:34:53 2012 via crm_resource on c710.siemlab.com
> 
> Stack: openais
> 
> Current DC: c710.siemlab.com - partition with quorum
> 
> Version: 1.1.6-3.el6-a02c0f19a00c1eb2527ad38f146ebc0834814558
> 
> 2 Nodes configured, 2 expected votes
> 
> 14 Resources configured.
> 
> 
> 
> 
> 
> Online: [ c713.siemlab.com c710.siemlab.com ]
> 
> 
> 
> Clone Set: connectivity [ping]
> 
>  Started: [ c710.siemlab.com c713.siemlab.com ]

What is the DRBD state in /proc/drbd? Everything OK?

> 
> Clone Set: powerstatus [cpps0]
> 
>  Started: [ c710.siemlab.com c713.siemlab.com ]
> 
> Master/Slave Set: ms-drbd [drbd0]
> 
>  Slaves: [ c710.siemlab.com c713.siemlab.com ]
> 
> c713.siemlab.com-stonith   (stonith:fence_ipmilan):Started
> c710.siemlab.com
> 
> c710.siemlab.com-stonith   (stonith:fence_ipmilan):Started
> c713.siemlab.com
> 
> 
> 
> ===
> 
> We are using the following CIB configuration related to DRBD:
> 
> 
> 
> 
> 
>   
> 
>   
> 
>   
> 
>   
> 
>   
> 
>   
> 
>   
> 
>   

remove that "ordered" attribute

> 
>name="crm-feature-set" value="3.0.5"/>
> 
>   
> 
> 
> 
> 
> 
>   
> 
> 
> 
>   
> 
>   
> 
>  role="Master" timeout="5min"/>
> 
>  timeout="4min"/>
> 
>  timeout="4min"/>

you don't need roles for start/stop operations

> 
>role="Slave" timeout="6min"/>
> 
>  timeout="4min"/>
> 
>  timeout="4min"/>
> 
>   
> 
> 
> 
>  
> 
> ===
> 
> The following is the content of drbd.conf:
> 
> global {
> 
> usage-count no;
> 
> }
> 
> common {
> 
> net {
> 
> protocol C;
> 
> }
> 
>   startup {
> 
> wfc-timeout 120;
> 
> degr-wfc-timeout 120;
> 
>   }
> 
> }
> 
> resource var_nsm {
> 
> disk {
> 
> on-io-error detach;
> 
> no-disk-barrier;
> 
> no-disk-flushes;
> 
> fencing resource-only;
> 
> }
> 
> handlers {
> 
> fence-peer
> "/usr/lib/drbd/crm-fence-peer.sh";
> 
> after-resync-target
> "/usr/lib/drbd/crm-unfence-peer.sh";
> 
> }
> 
> net {
> 
> #rate 333M;
> 
> after-sb-1pri 
> discard-secondary;

... you know that this can discard valid data?


Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now


> 
> max-buffers 8000;
> 
> max-epoch-size 8000;
> 
> sndbuf-size 0;
> 
> }
> 
> on c713.siemlab.com {
> 
> device /dev/drbd1;
> 
>  disk /dev/sdb3;
> 
>  address 192.168.2.2:7791;
> 
>  meta-disk internal;
> 
>   }
> 
>   on c710.siemlab.com {
> 
> device /dev/drbd1;
> 
>  disk /dev/sdb3;
> 
>  address 192.168.2.4:7791;
> 
>  meta-disk internal;
> 
>

Re: [Pacemaker] Call cib_query failed (-41): Remote node did not respond

2012-07-04 Thread Andreas Kurz
On 07/04/2012 12:36 AM, Brian J. Murrell wrote:
> On 12-07-03 06:17 PM, Andrew Beekhof wrote:
>>
>> Even adding passive nodes multiplies the number of probe operations
>> that need to be performed and loaded into the cib.
> 
> So it seems.  I just would have not thought they be such a load since
> from a simplistic perspective, since they are not trying to update the
> CIB, it seems they just need an update of it when the rest of the nodes
> doing the updating are done.  But I do admit that could be a simplistic
> view.
> 
>> Did you try any of the settings I suggested?

Besides increasing the batch limit to a higher value ... did you also
tune the corosync totem timings? There are parameters like token,
send_join, join and token_retransmits_before_loss_const that typically
need to be increased when a lot of cluster communication traffic has to
be handled.
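
A minimal sketch of the relevant corosync.conf section; the values only
illustrate the direction of the tuning and are not a recommendation:

totem {
        version: 2
        token: 10000
        token_retransmits_before_loss_const: 10
        join: 1000
        send_join: 80
        # keep your existing interface/ring configuration here
}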

Best Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now


> 
> The only setting I saw you suggest was "batch-limit" and at first glance
> it did not seem clear to me which way to adjust this (up or down) and I
> was running out of time for experimentation and just needed to get to
> something that works.
> 
> So now that pressure is off and I have a bit of time to experiment, what
> value would you suggest for that parameter given the 32 resource and
> constraints I want to add on a 16 node cluster?
> 
> Cheers,
> b.
> 
> 
> 
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 






signature.asc
Description: OpenPGP digital signature
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Centos 6.2 corosync errors after reboot prevent joining

2012-07-03 Thread Andreas Kurz
On 07/02/2012 06:47 PM, Martin de Koning wrote:
> Hi all,
> 
> Reasonably new to pacemaker and having some issues with corosync loading
> the pacemaker plugin after a reboot of the node. It looks like similar
> issues have been posted before but I haven't found a relavent fix.
> 
> The Centos 6.2 node was online before the reboot and restarting the
> corosync and pacemaker services caused no issues. Since the reboot and
> subsequent reboots, I am unable to get pacemaker to join the cluster.
> 
> After the reboot corosync now reports the following:
> Jul  2 17:56:22 sessredis-03 corosync[1644]:   [pcmk  ] WARN:
> route_ais_message: Sending message to local.cib failed: ipc delivery
> failed (rc=-2)
> Jul  2 17:56:22 sessredis-03 corosync[1644]:   [pcmk  ] WARN:
> route_ais_message: Sending message to local.cib failed: ipc delivery
> failed (rc=-2)
> Jul  2 17:56:22 sessredis-03 corosync[1644]:   [pcmk  ] WARN:
> route_ais_message: Sending message to local.cib failed: ipc delivery
> failed (rc=-2)
> Jul  2 17:56:22 sessredis-03 corosync[1644]:   [pcmk  ] WARN:
> route_ais_message: Sending message to local.cib failed: ipc delivery
> failed (rc=-2)
> Jul  2 17:56:22 sessredis-03 corosync[1644]:   [pcmk  ] WARN:
> route_ais_message: Sending message to local.cib failed: ipc delivery
> failed (rc=-2)
> Jul  2 17:56:22 sessredis-03 corosync[1644]:   [pcmk  ] WARN:
> route_ais_message: Sending message to local.crmd failed: ipc delivery
> failed (rc=-2)

These messages typically occur if you don't start the pacemaker service
after corosync.

Jul  2 17:56:21 sessredis-03 corosync[1644]:   [pcmk  ] info:
get_config_opt: Found '1' for option: ver
Jul  2 17:56:21 sessredis-03 corosync[1644]:   [pcmk  ] info:
process_ais_conf: Enabling MCP mode: Use the Pacemaker init script to
complete Pacemaker startup
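
A minimal sketch of the expected startup sequence on CentOS 6 with "ver: 1"
(MCP mode) set in corosync.conf:

# enable both services at boot, corosync first
chkconfig corosync on
chkconfig pacemaker on

# manual start order
service corosync start
service pacemaker start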

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

> 
> The full syslog is here:
> http://pastebin.com/raw.php?i=f9eBuqUh
> 
> corosync-1.4.1-4.el6_2.3.x86_64
> pacemaker-1.1.6-3.el6.x86_64
> 
> I have checked the the obvious such as inter-cluster communication and
> firewall rules. It appears to me that there may be an issue with the
> with Pacemaker cluster information base and not corosync. Any ideas? Can
> I clear the CIB manually somehow to resolve this?
> 
> Cheers
> Martin
> 
> 
> 
> 
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 






signature.asc
Description: OpenPGP digital signature
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Multiple split-brain problem

2012-06-26 Thread Andreas Kurz
On 06/26/2012 03:49 PM, coma wrote:
> Hello,
> 
> i running on a 2 node cluster with corosync & drbd in active/passive
> mode for mysql hight availablity.
> 
> The cluster working fine (failover/failback & replication ok), i have no
> network outage (network is monitored and i've not seen any failure) but
> split-brain occurs very often and i don't anderstand why, maybe you can
> help me?

Are the nodes virtual machines, or do they have a high load from time to time?

> 
> I'm new pacemaker/corosync/DRBD user, so my cluster and drbd
> configuration are probably not optimal, so if you have any comments,
> tips or examples I would be very grateful!
> 
> Here is an exemple of corosync log when a split-brain occurs (1 hour log
> to see before/after split-brain):
> 
> http://pastebin.com/3DprkcTA

Increase the token value in corosync.conf to a higher value ... like
10s, configure resource-level fencing in DRBD, set up STONITH for your
cluster and use redundant corosync rings.
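
A minimal sketch of the corosync and DRBD pieces of that advice; the token
value and the fencing mode (resource-only fits a single-primary setup) are
assumptions to adapt:

# corosync.conf
totem {
        token: 10000
}

# drbd resource, complementing the crm-fence-peer.sh handler already in place
disk {
        on-io-error detach;
        fencing resource-only;
}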

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

> 
> Thank you in advance for any help!
> 
> 
> More details about my configuration:
> 
> I have:
> One prefered "master" node (node1) on a virtual server, and one "slave"
> node on a physical server.
> On each server,
> eth0 is connected on my main LAN for client/server communication (with
> cluster VIP)
> Eth1 is connected on a dedicated Vlan for corosync communication
> (network: 192.168.3.0 /30)
> Eth2 is connected on a dedicated Vlan for drbd replication (network:
> 192.168.2.0/30 )
> 
> Here is my drbd configuration:
> 
> 
> resource drbd-mysql {
> protocol C;
> disk {
> on-io-error detach;
> }
> handlers {
> fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
> after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
> split-brain "/usr/lib/drbd/notify-split-brain.sh root";
> }
> net {
> cram-hmac-alg sha1;
> shared-secret "secret";
> after-sb-0pri discard-younger-primary;
> after-sb-1pri discard-secondary;
> after-sb-2pri call-pri-lost-after-sb;
> }
> startup {
> wfc-timeout  1;
> degr-wfc-timeout 1;
> }
> on node1{
> device /dev/drbd1;
> address 192.168.2.1:7801 ;
> disk /dev/sdb;
> meta-disk internal;
> }
> on node2 {
> device /dev/drbd1;
> address 192.168.2.2:7801 ;
> disk /dev/sdb;
> meta-disk internal;
> }
> }
> 
> 
> Here my cluster config:
> 
> node node1 \
> attributes standby="off"
> node node2 \
> attributes standby="off"
> primitive Cluster-VIP ocf:heartbeat:IPaddr2 \
> params ip="10.1.0.130" broadcast="10.1.7.255" nic="eth0"
> cidr_netmask="21" iflabel="VIP1" \
> op monitor interval="10s" timeout="20s" \
> meta is-managed="true"
> primitive cluster_status_page ocf:heartbeat:ClusterMon \
> params pidfile="/var/run/crm_mon.pid"
> htmlfile="/var/www/html/cluster_status.html" \
> op monitor interval="4s" timeout="20s"
> primitive datavg ocf:heartbeat:LVM \
> params volgrpname="datavg" exclusive="true" \
> op start interval="0" timeout="30" \
> op stop interval="0" timeout="30"
> primitive drbd_mysql ocf:linbit:drbd \
> params drbd_resource="drbd-mysql" \
> op monitor interval="15s"
> primitive fs_mysql ocf:heartbeat:Filesystem \
> params device="/dev/datavg/data" directory="/data" fstype="ext4"
> primitive mail_alert ocf:heartbeat:MailTo \
> params email="myem...@test.com " \
> op monitor interval="10" timeout="10" depth="0"
> primitive mysqld ocf:heartbeat:mysql \
> params binary="/usr/bin/mysqld_safe" config="/etc/my.cnf"
> datadir="/data/mysql/databases" user="mysql"
> pid="/var/run/mysqld/mysqld.pid" socket="/var/lib/mysql/mysql.sock"
> test_passwd="cluster_test" test_table="Cluster_Test.dbcheck"
> test_user="cluster_test" \
> op start interval="0" timeout="120" \
> op stop interval="0" timeout="120" \
> op monitor interval="30s" timeout="30s" OCF_CHECK_LEVEL="1"
> target-role="Started"
> group mysql datavg fs_mysql Cluster-VIP mysqld cluster_status_page
> mail_alert
> ms ms_drbd_mysql drbd_mysql \
> meta master-max="1" master-node-max="1" clone-max="2"
> clone-node-max="1" notify="true"
> location mysql-preferred-node mysql inf: node1
> colocation mysql_on_drbd inf: mysql ms_drbd_mysql:Master
> order mysql_after_drbd inf: ms_drbd_mysql:promote mysql:start
> property $id="cib-bootstrap-options" \
> dc-version="1.1.6-3.el6-a02c0f19a00c1eb2527ad38f146ebc0834814558" \
> cluster-infrastructure="openais" \
> expected-quorum-votes="2" \
> stonith-enabled="false" \
> no-quorum-policy="ignore" \
> last-lrm-refresh="1340701656"
> rsc_defaults $id="rsc-options" \
> resource-sti

Re: [Pacemaker] "Grouping" of master/slave resources

2012-06-25 Thread Andreas Kurz
On 06/25/2012 06:14 PM, Stallmann, Andreas wrote:
> Hi!
> 
>  
> 
> We are currently switching from mysql-on-drbd with tomcat and a shared
> IP to mysql-master-slave with tomcat-master-slave and shared IP and an
> additonal cloned service (activemq).
> 
>  
> 
> Up until now we had quite an easy setup – especially for the admins who
> did the daily maintenance. All services (exept drbd) where grouped like
> this:
> 
>  
> 
> Resource Group: cluster_grp
> 
>  fs_r0  (ocf::heartbeat:Filesystem):Started qs-cms-appl01
> 
>  sharedIP   (ocf::heartbeat:IPaddr2):   Started qs-cms-appl01
> 
>  database_res   (ocf::heartbeat:mysql): Started qs-cms-appl01
> 
>  tomcat_res (ocf::ucrs:tomcat): Started qs-cms-appl01
> 
> Master/Slave Set: ms_drbd_r0
> 
>  Masters: [ qs-cms-appl01 ]
> 
>  Slaves: [ qs-cms-appl02 ]
> 
>  
> 
> There was only one further colocation-constrain, which let fs_r0 only
> run where ms_drbd was running in Master mode. Easy.
> 
>  
> 
> Now, with our new scenario, the situation is a little more complex.
> First, the start order:
> 
>  
> 
> sharedIP
> 
> ms_MySQL:Master
> 
> activemq_clone
> 
> ms_tomcat:Master
> 
>  
> 
> then the constrains:
> 
>  
> 
> ms_MySQL:Master only where shared_IP
> 
> ms_tomcat:Master only where database
> 
> activemq_clone only where database
> 
>  
> 
> I thought this might be quite easy to achieve with grouping the
> services, but crm won’t let me put Master/Slave resource into a group.
> Why is that?

Groups can only contain primitive resources, though you could clone a
group or have a master-slave group.

You can use resource sets to create constraints containing primitives
and master/slave resources, quite similarly to the way groups work.
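
A minimal sketch with the resource names from this thread, using crm shell
resource sets; set semantics are subtle (see the colocation-sets thread
above), so verify the resulting placement with ptest/crm_simulate:

order ord_stack inf: p_writer_vip ms_MySQL:promote activemq_clone ms_tomcat:promote
colocation col_stack inf: ms_tomcat:Master activemq_clone ms_MySQL:Master p_writer_vip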

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now


> 
>  
> 
> If I try...
> 
>  
> 
> group cluster-grp p_writer_vip ms_MySQL:Master activemq_clone
> ms_tomcat:Master
> 
>  
> 
> I get:
> 
>  
> 
> ERROR: object ms_MySQL:Master does not exist
> 
> ERROR: object ms_tomcat:Master does not exist
> 
>  
> 
> Any ideas welcome.
> 
>  
> 
> Cheers and thanks,
> 
>  
> 
> Andreas
> 
>  
> 
> --
> CONET Solutions GmbH
> 
>  
> 
> 
> CONET Solutions GmbH, Theodor-Heuss-Allee 19, 53773 Hennef.
> Registergericht/Registration Court: Amtsgericht Siegburg (HRB Nr. 9136)
> Geschäftsführer/Managing Director: Anke Höfer 
> 
>  
> 
>  
> 
> 
> 
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 









Re: [Pacemaker] Two slave nodes, neither will promote to Master

2012-06-25 Thread Andreas Kurz
On 06/25/2012 05:48 PM, Regendoerp, Achim wrote:
> Hi,
> 
> I'm currently looking at two VMs which are supposed to mount a drive in
> a given directory, depending on who's the master. This was decided above
> me, therefore no DRBD stuff (which would've made things easier), but
> still using corosync/pacemaker to do the cluster work.
> 
> As it is currently, both nodes are online and configured, but none are
> switching to Master. In lack of a DRBD resource, I tried using the Dummy
> Pacemaker. If that's not the correct RA, please enlighten me on this too.

Well, if you don't need DRBD anymore remove its configuration including
order/colocation constraints from the cluster configuration.

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

> 
> Below's the current config:
> 
> node NODE01 \
> attributes standby="off"
> node NODE02 \
> attributes standby="off"
> primitive clusterIP ocf:heartbeat:IPaddr2 \
> params ip="10.64.96.31" nic="eth1:1" \
> op monitor on-fail="restart" interval="5s"
> primitive clusterIParp ocf:heartbeat:SendArp \
> params ip="10.64.96.31" nic="eth1:1"
> primitive fs_nfs ocf:heartbeat:Filesystem \
> params device="/dev/vg_shared/lv_nfs_01" directory="/shared"
> fstype="ext4" \
> op start interval="0" timeout="240" \
> op stop interval="0" timeout="240" on-fail="restart"
> primitive ms_dummy ocf:pacemaker:Dummy \
> op start interval="0" timeout="240" \
> op stop interval="0" timeout="240" \
> op monitor interval="15" role="Master" timeout="240" \
> op monitor interval="30" role="Slave" on-fail="restart" timeout-240
> primitive nfs_share ocf:heartbeat:nfsserver \
> params nfs_ip="10.64.96.31" nfs_init_script="/etc/init.d/nfs"
> nfs_shared_infodir="/shared/nfs" nfs_notify_cmd="/sbin/rpc.statd" \
> op start interval="0" timeout="240" \
> op stop interval="0" timeout="240" on-fail="restart"
> group Services clusterIP clusterIParp fs_nfs nfs_share \
> meta target-role="Started" is-managed="true"
> multiple-active="stop_start"
> ms ms_nfs ms_dummy \
> meta target-role="Master" master-max="1" master-node="1"
> clone-max="2" clone-node-max="1" notify="true"
> colocation services_on_master inf: Services ms_nfs:Master
> order fs_before_services inf: ms_nfs:promote Services:start
> property $id="cib-bootstrap-options" \
> dc-version="1.1.6-3.el6-a02c0f19a00c1eb2527ad38f146ebc0834814558" \
> cluster-infrastructure="openais" \
> expected-quorum-votes="2" \
> no-quorum-policy="ignore" \
> stonith-enabled="false"
> rsc_defaults $id="rsc-options" \
> resource-stickiness="200"
> 
> These are the installed packages
> 
> cluster-glue-1.0.5-2.el6.x86_64
> cluster-glue-libs-1.0.5-2.el6.x86_64
> clusterlib-3.0.12.1-23.el6_2.1.x86_64
> corosync-1.4.1-4.el6_2.2.x86_64
> corosynclib-1.4.1-4.el6_2.2.x86_64
> pacemaker-1.1.6-3.el6.x86_64
> pacemaker-cli-1.1.6-3.el6.x86_64
> pacemaker-cluster-libs-1.1.6-3.el6.x86_64
> pacemaker-libs-1.1.6-3.el6.x86_64
> resource-agents-3.9.2-7.el6.x86_64
> 
> running on CentOS 6.
> 
> If anyone has any idea/suggestion/input/etc. I'd be grateful.
> 
> Cheers,
> 
> Achim
> 
> 
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 








Re: [Pacemaker] Collocating resource with a started clone instance

2012-06-22 Thread Andreas Kurz
On 06/22/2012 12:40 PM, Sergey Tachenov wrote:
>>> group postgres pgdrive_fs DBIP postgresql
>>> colocation postgres_on_drbd inf: postgres ms_drbd_pgdrive:Master
>>> order postgres_after_drbd inf: ms_drbd_pgdrive:promote postgres:start
>>> ...
>>> location DBIPcheck DBIP \
>>>rule $id="DBIPcheck-rule" 1: defined pingd and pingd gt 0
>>> location master-prefer-node1 DBIP 50: srvplan1
>>> colocation DBIP-on-web 1000: DBIP tomcats
>>
>> try inf: ... 1000: will be not enough  because DBIP is also part of
>> postgres group and that group must follow the DRBD Master
> 
> I understand, but why doesn't it move the DRBD Master (and the whole
> group) then? If I set the score INF, then the DBIP gets stopped
> completely, and ptest shows -INF for both nodes. And why the ping
> location constraint works then? If the ping is not answered, the whole
> group moves perfectly fine.

hmm ... does it work better with a high value like 10: ?

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

> 
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 





Re: [Pacemaker] Collocating resource with a started clone instance

2012-06-22 Thread Andreas Kurz
On 06/22/2012 11:58 AM, Sergey Tachenov wrote:
> Hi!
> 
> I'm trying to set up a 2-node cluster. I'm new to pacemaker, but
> things are getting better and better. However, I am completely at a
> loss here.
> 
> I have a cloned tomcat resource, which runs on both nodes and doesn't
> really depend on anything (it doesn't use DRBD or anything else of
> that sort). But I'm trying to get pacemaker move the cluster IP to
> another node in case tomcat fails. Here's the relevant parts of my
> config:
> 
> node srvplan1
> node srvplan2
> primitive DBIP ocf:heartbeat:IPaddr2 \
>params ip="1.2.3.4" cidr_netmask="24" \
>op monitor interval="10s"
> primitive drbd_pgdrive ocf:linbit:drbd \
>params drbd_resource="pgdrive" \
>op start interval="0" timeout="240" \
>op stop interval="0" timeout="100"
> primitive pgdrive_fs ocf:heartbeat:Filesystem \
>params device="/dev/drbd0" directory="/hd2" fstype="ext4"
> primitive ping ocf:pacemaker:ping \
>params host_list="193.233.59.2" multiplier="1000" \
>op monitor interval="10"
> primitive postgresql ocf:heartbeat:pgsql \
>params pgdata="/hd2/pgsql" \
>op monitor interval="30" timeout="30" depth="0" \
>op start interval="0" timeout="60" \
>op stop interval="0" timeout="60" \
>meta target-role="Started"
> primitive tomcat ocf:heartbeat:tomcat \
>params java_home="/usr/lib/jvm/jre"
> catalina_home="/usr/share/tomcat" tomcat_user="tomcat"
> script_log="/home/tmo/log/tomcat.log"
> statusurl="http://127.0.0.1:8080/status/"; \
>op start interval="0" timeout="60" \
>op stop interval="0" timeout="120" \
>op monitor interval="30" timeout="30"
> group postgres pgdrive_fs DBIP postgresql
> ms ms_drbd_pgdrive drbd_pgdrive \
>meta master-max="1" master-node-max="1" clone-max="2"
> clone-node-max="1" notify="true"
> clone pings ping \
>meta interleave="true"
> clone tomcats tomcat \
>meta interleave="true" target-role="Started"
> location DBIPcheck DBIP \
>rule $id="DBIPcheck-rule" 1: defined pingd and pingd gt 0
> location master-prefer-node1 DBIP 50: srvplan1
> colocation DBIP-on-web 1000: DBIP tomcats

try inf: ... 1000: will not be enough, because DBIP is also part of the
postgres group and that group must follow the DRBD Master

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

> colocation postgres_on_drbd inf: postgres ms_drbd_pgdrive:Master
> order postgres_after_drbd inf: ms_drbd_pgdrive:promote postgres:start
> 
> As you can see, there are three explicit constraints for the DBIP
> resource: preferred node (srvplan1, score 50), successful ping (score
> 1) and running tomcat (score 1000). There's also the resource
> stickiness set to 100. Implicit constraints include collocation of the
> postgres group with the DRBD master instance.
> 
> The ping check works fine: if I unplug the external LAN cable or use
> iptables to block pings, everything gets moved to another node.
> 
> Check for tomcat isn't working for some reason, though:
> 
> [root@srvplan1 bin]# crm_mon -1
> 
> Last updated: Fri Jun 22 10:06:59 2012
> Last change: Fri Jun 22 09:43:16 2012 via cibadmin on srvplan1
> Stack: openais
> Current DC: srvplan1 - partition with quorum
> Version: 1.1.7-2.fc16-ee0730e13d124c3d58f00016c3376a1de5323cff
> 2 Nodes configured, 2 expected votes
> 17 Resources configured.
> 
> 
> Online: [ srvplan1 srvplan2 ]
> 
>  Master/Slave Set: ms_drbd_pgdrive [drbd_pgdrive]
> Masters: [ srvplan1 ]
> Slaves: [ srvplan2 ]
>  Resource Group: postgres
> pgdrive_fs (ocf::heartbeat:Filesystem):Started srvplan1
> DBIP   (ocf::heartbeat:IPaddr2):   Started srvplan1
> postgresql (ocf::heartbeat:pgsql): Started srvplan1
>  Clone Set: pings [ping]
> Started: [ srvplan1 srvplan2 ]
>  Clone Set: tomcats [tomcat]
> Started: [ srvplan2 ]
> Stopped: [ tomcat:0 ]
> 
> Failed actions:
>tomcat:0_start_0 (node=srvplan1, call=37, rc=-2, status=Timed
> Out): unknown exec error
> 
> As you can see, tomcat is stopped on srvplan1 (I have deliberately
> messed up the startup scripts), but everything else still runs there.
> ptest -L -s shows:
> 
> clone_color: ms_drbd_pgdrive allocation score on srvplan1: 10350
> clone_color: ms_drbd_pgdrive allocation score on srvplan2: 1
> clone_color: drbd_pgdrive:0 allocation score on srvplan1: 10100
> clone_color: drbd_pgdrive:0 allocation score on srvplan2: 0
> clone_color: drbd_pgdrive:1 allocation score on srvplan1: 0
> clone_color: drbd_pgdrive:1 allocation score on srvplan2: 10100
> native_color: drbd_pgdrive:0 allocation score on srvplan1: 10100
> native_color: drbd_pgdrive:0 allocation score on srvplan2: 0
> native_color: drbd_pgdrive:1 allocation score on srvplan1: -INFINITY
> native_color: drbd_pgdrive:1 allocation score on srvplan2: 10100
> drbd_pgdrive:0 promotion score on srvplan1: 30700
> drbd_pgdrive:1 pr

Re: [Pacemaker] "ERROR: Wrong stack o2cb" when trying to start o2cb service in Pacemaker cluster

2012-06-22 Thread Andreas Kurz
igration-threshold=100
> + (8) probe: rc=8 (master)
>p_o2cb:1: migration-threshold=100
> + (10) probe: rc=5 (not installed)
> * Node Malastare:
>soapi-fencing-vindemiatrix: migration-threshold=100
> + (4) start: rc=0 (ok)
>p_drbd_pgsql:0: migration-threshold=100
> + (5) probe: rc=8 (master)
>p_drbd_svn:0: migration-threshold=100
> + (6) probe: rc=8 (master)
>p_drbd_www:0: migration-threshold=100
> + (7) probe: rc=8 (master)
>p_drbd_backupvi:0: migration-threshold=100
> + (8) probe: rc=8 (master)
>p_o2cb:0: migration-threshold=100
> + (10) probe: rc=5 (not installed)
> 
> Failed actions:
> p_o2cb:1_monitor_0 (node=Vindemiatrix, call=10, rc=5,
> status=complete): not installed
> p_o2cb:0_monitor_0 (node=Malastare, call=10, rc=5, status=complete):
> not installed
> 
> Nevertheless, I noticed a strange error message in Corosync/Pacemaker logs :
> Jun 22 10:54:25 Vindemiatrix lrmd: [24580]: info: RA output:
> (p_controld:1:probe:stderr) dlm_controld.pcmk: no process found

this looks like the initial probing, so it is expected that there is no
running controld

> 
> This message was immediately followed by "Wrong stack" errors, and

check the content of /sys/fs/ocfs2/loaded_cluster_plugins ... if that
file exists and contains the value "user" this is a good sign you have
started ocfs2/o2cb via init ;-)
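
For example, a quick manual check (the files are only there once the ocfs2
modules are loaded):

cat /sys/fs/ocfs2/loaded_cluster_plugins
cat /sys/fs/ocfs2/cluster_stack

If I remember correctly, with the Pacemaker stack in charge the
cluster_stack file should contain "pcmk", not "o2cb".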

Regards,
Andreas

> because dlm_controld.pcmk seems to be Pacemaker DLM dæmon, I strongly
> thinks these messages are related. Strangely, even if I have this dæmon
> executable in /usr/sbin, it's not loaded by Pacemaker :
> root@Vindemiatrix:/home/david# ls /usr/sbin/dlm_controld.pcmk
> /usr/sbin/dlm_controld.pcmk
> root@Vindemiatrix:/home/david# ps fax | grep pcmk
> 26360 pts/1S+ 0:00  \_ grep pcmk
> 
> But, if I understood correctly, such process should be launched by DLM
> resource, and as I have no error messages concerning launching such a
> process whereas its executable is present, do you know where this
> problem could come from?
> 
> Thank you in advance.
> 
> Kind regards.
> 
> PS: I'll have the next week off, so I won't be able to answer you
> between this evening and the 2th of July.
> 
> Le 20/06/2012 17:39, Andreas Kurz a écrit :
>> On 06/20/2012 03:49 PM, David Guyot wrote:
>>> Actually, yes, I start DRBD manually, because this is currently a test
>>> configuration which relies on OpenVPN for the communications between
>>> these 2 nodes. I have no order and collocation constraints because I'm
>>> discovering these software and trying to configure them step by step and
>>> make resources work before ordering them (nevertheless, I just tried to
>>> configure DLM/O2CB constraints, but they fail, apparently because they
>>> are relying on O2CB, which causes the problem I wrote you about.) And I
>>> have no OCFS2 mounts because I was on the assumption that OCFS2 wouldn't
>>> mount partitions without O2CB and DLM, which seems to be right :
>> In fact it won't work without constraints, even if you are only testing
>> e.g. controld and o2cb must run on the same node (in fact on both nodes
>> of course) and controld must run before o2cb.
>>
>> And the error message you showed in a previous mail:
>>
>> 2012/06/20_09:04:35 ERROR: Wrong stack o2cb
>>
>> ... implies, that you are already running the native ocfs2 cluster stack
>> outside of pacemaker. You did a "/etc/init.d/ocfs2" stop before starting
>> your cluster tests and it is still stopped? And if it is stopped, a
>> cleanup of cl_ocfs2mgmt resource should start that resource ... if there
>> are no more other errors.
>>
>> You installed dlm-pcmk and ocfs2-tools-pacemaker packages from backports?
>>
>>> root@Malastare:/home/david# crm_mon --one-shot -VroA
>>> 
>>> Last updated: Wed Jun 20 15:32:50 2012
>>> Last change: Wed Jun 20 15:28:34 2012 via crm_shadow on Malastare
>>> Stack: openais
>>> Current DC: Vindemiatrix - partition with quorum
>>> Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
>>> 2 Nodes configured, 2 expected votes
>>> 14 Resources configured.
>>> 
>>>
>>> Online: [ Vindemiatrix Malastare ]
>>>
>>> Full list of resources:
>>>
>>>  soapi-fencing-malastare(stonith:external/ovh):Started Vindemiatrix
>>>  soapi-fencing-vindemiatrix(stonith:external/ovh):Started Malastare
>>>  Master/Slave Set: ms_drbd_ocfs2

Re: [Pacemaker] resources not migrating when some are not runnable on one node, maybe because of groups or master/slave clones?

2012-06-22 Thread Andreas Kurz
On 06/21/2012 11:30 PM, David Vossel wrote:
> - Original Message -
>> From: "Phil Frost" 
>> To: pacemaker@oss.clusterlabs.org
>> Sent: Tuesday, June 19, 2012 4:25:53 PM
>> Subject: Re: [Pacemaker] resources not migrating when some are not runnable 
>> on one node, maybe because of groups or
>> master/slave clones?
>>
>> On 06/19/2012 04:31 PM, David Vossel wrote:
>>> Can you attach a crm_report of what happens when you put the two
>>> nodes in standby please?  Being able to see the xml and how the
>>> policy engine evaluates the transitions is helpful.
>>
>> The resulting reports were a bit big for the list, so I put them in a
>> bug report:
>>
>> https://developerbugs.linuxfoundation.org/show_bug.cgi?id=2652
> 
> We're reporting pacemaker issues to bugs.clusterlabs.org now.
> 
> I took a look at the cib in case2 and saw this in the status for storage02.
> 
>   <transient_attributes ...>
>     <instance_attributes ...>
>       <nvpair ... value="true"/>
>       <nvpair ... name="master-drbd_nfsexports:1" value="10"/>
>     </instance_attributes>
>   </transient_attributes>
> 
> storage02 will not give up the drbd master since it has a higher score than 
> storage01.  This, coupled with the colocation rule between test and the drbd 
> master and the location rule to never run "test" on storage02, causes the 
> "test" resource to never run: "test" has to run with the drbd master, and 
> the drbd master is stuck because of the transient attributes on a node "test" 
> can't run on, so "test" can't start.
> 
> I don't understand why the transient attribute is there, or where it came 
> from yet.

This is added by the RA with the crm_master command. For example, the
drbd RA chooses this value based on the current state of DRBD to let
Pacemaker promote the best candidate.
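
A rough illustration (the score, node name and attribute name are just the
example values from the status snippet above):

  # inside the RA, with OCF_RESOURCE_INSTANCE set by the cluster:
  crm_master -l reboot -v 10

  # to query the stored score from the command line:
  crm_attribute -N storage02 -n master-drbd_nfsexports:1 -l reboot -G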

Regards,
Andreas

> 
> 
> -- Vossel
> 
> 
>> I've also found a similar discussion in the archives, though I didn't
>> find much help in it:
>>
>> http://oss.clusterlabs.org/pipermail/pacemaker/2010-November/008189.html
>>
>>
>> ___
>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started:
>> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>>
> 
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 



-- 
Need help with Pacemaker?
http://www.hastexo.com/now






Re: [Pacemaker] "ERROR: Wrong stack o2cb" when trying to start o2cb service in Pacemaker cluster

2012-06-20 Thread Andreas Kurz
er.html), and I don't
> know what it does, so, by default, I stupidly followed the official
> guide. What does this meta-attribute sets? If you know a better guide,
> could you please tell me about, so I can check my config based on this
> other guide?

Well, then this is a documentation bug ... you will find the correct
configuration in the same guide, where pacemaker integration is
described ... "notify" sends out notification messages before and after
an instance of the DRBD OCF RA executes an action (like start, stop,
promote, demote) ... that allows the other instances to react.
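
A minimal sketch of what that looks like in the crm configuration (resource
names taken from the config quoted below; only the notify part is the
point here):

ms ms_drbd_ocfs2_www p_drbd_ocfs2_www \
    meta master-max="2" clone-max="2" notify="true"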

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now


> 
> And, last but not least, I run Debian Squeeze 3.2.13-grsec--grs-ipv6-64.
> 
> Thank you in advance.
> 
> Kind regards.
> 
> PS: if you find me a bit rude, please accept my apologies; I'm working
> on it for weeks following the official DRBD guide and it's frustrating
> to ask help as a last resort and to be answered with something which
> sounds like "What's this bloody mess !?!" to my tired nerve cells. Once
> again, please accept my apologies.
> 
> Le 20/06/2012 15:09, Andreas Kurz a écrit :
>> On 06/20/2012 02:22 PM, David Guyot wrote:
>>> Hello.
>>>
>>> Oops, an omission.
>>>
>>> Here comes my Pacemaker config :
>>> root@Malastare:/home/david# crm configure show
>>> node Malastare
>>> node Vindemiatrix
>>> primitive p_controld ocf:pacemaker:controld
>>> primitive p_drbd_ocfs2_backupvi ocf:linbit:drbd \
>>> params drbd_resource="backupvi"
>>> primitive p_drbd_ocfs2_pgsql ocf:linbit:drbd \
>>> params drbd_resource="postgresql"
>>> primitive p_drbd_ocfs2_svn ocf:linbit:drbd \
>>> params drbd_resource="svn"
>>> primitive p_drbd_ocfs2_www ocf:linbit:drbd \
>>> params drbd_resource="www"
>>> primitive p_o2cb ocf:pacemaker:o2cb \
>>> meta target-role="Started"
>>> primitive soapi-fencing-malastare stonith:external/ovh \
>>> params reversedns="ns208812.ovh.net"
>>> primitive soapi-fencing-vindemiatrix stonith:external/ovh \
>>> params reversedns="ns235795.ovh.net"
>>> ms ms_drbd_ocfs2_backupvi p_drbd_ocfs2_backupvi \
>>> meta master-max="2" clone-max="2"
>>> ms ms_drbd_ocfs2_pgsql p_drbd_ocfs2_pgsql \
>>> meta master-max="2" clone-max="2"
>>> ms ms_drbd_ocfs2_svn p_drbd_ocfs2_svn \
>>> meta master-max="2" clone-max="2"
>>> ms ms_drbd_ocfs2_www p_drbd_ocfs2_www \
>>> meta master-max="2" clone-max="2"
>>> location stonith-malastare soapi-fencing-malastare -inf: Malastare
>>> location stonith-vindemiatrix soapi-fencing-vindemiatrix -inf: Vindemiatrix
>>> property $id="cib-bootstrap-options" \
>>> dc-version="1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff" \
>>> cluster-infrastructure="openais" \
>>> expected-quorum-votes="2"
>>>
>> I have absolutely no idea why your configuration can run at all without
>> more errors ... do you start the drbd resources manually before the cluster?
>>
>> You are missing the notify meta-attribute for all your DRBD ms
>> resources, you have no order and colocation constraints or groups at all
>> and you don't clone controld and o2cb ... and there are no ocfs2 mounts?
>>
>> Also quite important: what distribution are you using?
>>
>>> The STONITH resources are custom ones which use my provider SOAP API to
>>> electrically reboot fenced nodes.
>>>
>>> Concerning the web page you talked me about, I tried to insert the
>>> referred environment variable, but it did not solved the problem :
>> Really have a look at the crm configuration snippet on that page and
>> read manuals about setting up DRBD in Pacemaker.
>>
>> Regards,
>> Andreas
>>
>>> root@Malastare:/home/david# crm_mon --one-shot -VroA
>>> 
>>> Last updated: Wed Jun 20 14:14:41 2012
>>> Last change: Wed Jun 20 09:22:39 2012 via cibadmin on Malastare
>>> Stack: openais
>>> Current DC: Vindemiatrix - partition with quorum
>>> Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
>>> 2 Nodes configured, 2 expected votes
>>> 12 Resources configured.
>>> 
>>>
>>> Online: [ Vindemiatrix Malastare ]
>>>

Re: [Pacemaker] "ERROR: Wrong stack o2cb" when trying to start o2cb service in Pacemaker cluster

2012-06-20 Thread Andreas Kurz
>p_controld: migration-threshold=100
> + (10) start: rc=0 (ok)
>p_o2cb: migration-threshold=100
> + (4) probe: rc=5 (not installed)
>p_drbd_ocfs2_pgsql:1: migration-threshold=100
> + (6) probe: rc=8 (master)
>p_drbd_ocfs2_backupvi:1: migration-threshold=100
> + (7) probe: rc=8 (master)
>p_drbd_ocfs2_svn:1: migration-threshold=100
> + (8) probe: rc=8 (master)
>p_drbd_ocfs2_www:1: migration-threshold=100
> + (9) probe: rc=8 (master)
> 
> Failed actions:
> p_o2cb_start_0 (node=Vindemiatrix, call=11, rc=5, status=complete):
> not installed
> p_o2cb_monitor_0 (node=Malastare, call=4, rc=5, status=complete):
> not installed
> 
> Thank you in advance for your help!
> 
> Kind regards.
> 
> Le 20/06/2012 14:02, Andreas Kurz a écrit :
>> On 06/20/2012 01:43 PM, David Guyot wrote:
>>> Hello, everybody.
>>>
>>> I'm trying to configure Pacemaker for using DRBD + OCFS2 storage, but
>>> I'm stuck with DRBD and controld up and o2cb doggedly displaying "not
>>> installed" errors. To do this, I followed the DRBD guide (
>>> http://www.drbd.org/users-guide-8.3/ch-ocfs2.html), with the difference
>>> that I was forced to disable DRBD fencing because it was interfering
>>> with Pacemaker fencing and stopping each nodes as often as it could.
>> Unfortunately you didn't share your Pacemaker configuration but you
>> definitely must not start any ocfs2 init script but let all be managed
>> by the cluster-manager.
>>
>> Here is a brief setup description, also mentioning the tune.ocfs2 when
>> the Pacemaker stack is running:
>>
>> http://www.hastexo.com/resources/hints-and-kinks/ocfs2-pacemaker-debianubuntu
>>
>> And once this is running as expected you really want to reactivate the
>> DRBD fencing configuration.
>>
>> Regards,
>> Andreas
>>
>>
>>
>> ___
>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
> 
> 
> 
> 
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 



-- 
Need help with Pacemaker?
http://www.hastexo.com/now






Re: [Pacemaker] "ERROR: Wrong stack o2cb" when trying to start o2cb service in Pacemaker cluster

2012-06-20 Thread Andreas Kurz
On 06/20/2012 01:43 PM, David Guyot wrote:
> Hello, everybody.
> 
> I'm trying to configure Pacemaker for using DRBD + OCFS2 storage, but
> I'm stuck with DRBD and controld up and o2cb doggedly displaying "not
> installed" errors. To do this, I followed the DRBD guide (
> http://www.drbd.org/users-guide-8.3/ch-ocfs2.html), with the difference
> that I was forced to disable DRBD fencing because it was interfering
> with Pacemaker fencing and stopping each nodes as often as it could.

Unfortunately you didn't share your Pacemaker configuration but you
definitely must not start any ocfs2 init script but let all be managed
by the cluster-manager.

Here is a brief setup description, also mentioning the tune.ocfs2 when
the Pacemaker stack is running:

http://www.hastexo.com/resources/hints-and-kinks/ocfs2-pacemaker-debianubuntu

And once this is running as expected you really want to reactivate the
DRBD fencing configuration.

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

> 
> Now, as I said, I'm stuck with these errors (Malastare and Vindemiatrix
> being the 2 nodes of my cluster) :
> Failed actions:
> p_o2cb_start_0 (node=Vindemiatrix, call=11, rc=5, status=complete):
> not installed
> p_o2cb_monitor_0 (node=Malastare, call=4, rc=5, status=complete):
> not installed
> 
> Looking into logs, I find these messages :
> o2cb[19904]:2012/06/20_09:04:35 ERROR: Wrong stack o2cb
> o2cb[19904]:2012/06/20_09:04:35 ERROR: Wrong stack o2cb
> 
> I tried to manually test ocf:pacemaker:o2cb, but I got this result :
> root@Malastare:/home/david# export OCF_ROOT="/usr/lib/ocf"
> root@Malastare:/home/david# /usr/lib/ocf/resource.d/pacemaker/o2cb monitor
> o2cb[22387]: ERROR: Wrong stack o2cb
> root@Malastare:/home/david# echo $?
> 5
> 
> I tried the solution described on this message (
> http://oss.clusterlabs.org/pipermail/pacemaker/2009-December/004112.html),
> but tunefs.ocfs2 failed :
> root@Malastare:/home/david# cat /etc/ocfs2/cluster.conf
> node:
> name = Malastare
> cluster = ocfs2
> number = 0
> ip_address = 10.88.0.1
> ip_port = 
> node:
> name = Vindemiatrix
> cluster = ocfs2
> number = 1
> ip_address = 10.88.0.2
> ip_port = 
> cluster:
> name = ocfs2
> node_count = 2
> root@Malastare:/home/david# /etc/init.d/ocfs2 start
> root@Malastare:/home/david# tunefs.ocfs2 --update-cluster-stack /dev/drbd1
> Updating on-disk cluster information to match the running cluster.
> DANGER: YOU MUST BE ABSOLUTELY SURE THAT NO OTHER NODE IS USING THIS
> FILESYSTEM BEFORE MODIFYING ITS CLUSTER CONFIGURATION.
> Update the on-disk cluster information? yes
> tunefs.ocfs2: Unable to access cluster service - unable to update the
> cluster stack information on device "/dev/drbd1"
> root@Malastare:/home/david# /etc/init.d/o2cb start
> root@Malastare:/home/david# tunefs.ocfs2 --update-cluster-stack /dev/drbd1
> Updating on-disk cluster information to match the running cluster.
> DANGER: YOU MUST BE ABSOLUTELY SURE THAT NO OTHER NODE IS USING THIS
> FILESYSTEM BEFORE MODIFYING ITS CLUSTER CONFIGURATION.
> Update the on-disk cluster information? yes
> tunefs.ocfs2: Unable to access cluster service - unable to update the
> cluster stack information on device "/dev/drbd1"
> 
> I also tried to manually set the cluster stack using "echo pcmk >
> /sys/fs/ocfs2/cluster_stack" on both nodes, then restarting Corosync and
> Pacemaker, but error messages stayed. Same thing with resetting OCFS2
> FSs while cluster is on-line. Now I'm stuck, desperately waiting for
> help  Seriously, if some do-gooder would like to help me, I would
> greatly appreciate his or her help.
> 
> Thank you in advance for your answers.
> 
> Kind regards.
> 
> 
> 
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 







Re: [Pacemaker] [solved] stopping resource stops others in colocation / order sets

2012-06-18 Thread Andreas Kurz
On 06/15/2012 06:19 PM, Phil Frost wrote:
> On 06/15/2012 11:55 AM, David Vossel wrote:
>>> If resC is stopped
>>> resource stop resC
>>>
>>> then drbd_nfsexports is demoted, and resB and resC will stop. Why is
>>> that? I'd expect that resC, being listed last in both the colocation
>>> and
>> It is the order constraint.
>>
>> Order constraints are symmetrical. If you say to do these things in
>> this order
>>
>> 1. promote drbd
>> 2. start resB
>> 3. start rscC
>>
>> Then the opposite is also true.  If you want to demote drbd it the
>> following has to happen first.
>>
>> 1. stop rscC
>> 2. stop resB
>> 3. demote drbd
>>
>> You can get around this by using the symmetrical option for your order
>> constraints.
> 
> True, but I wasn't demoting DRBD; I was stopping resource C (mostly to
> see what would happen if for whatever reason, it couldn't run anywhere).
> The order constraint alone doesn't explain the behavior I saw. However,
> I think I've pieced together what was actually happening. This constraint:
> 
> colocation colo inf: drbd_nfsexports_ms:Master resB resC

Yes, unfortunately this is quite confusing in the crm shell syntax.
Because a resource-set can only have one specific role, the crm shell
creates two resource-sets in this example: one for the drbd resource and
one for the two simple resources.

So you have two sets. Within the "simple" resource-set resB is more
significant than resC (like in a group).

Between the two resource-sets, drbd depends on the "simple" resource-set
(it needs to be Master on the same host) and will be demoted as soon as
(at least) resC is stopped or not runnable.
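
If the intended direction is the opposite one (resB and resC follow the
DRBD Master, and stopping resC alone should not demote it), an untested
sketch would be:

colocation colo inf: resB resC drbd_nfsexports_ms:Master

Here the :Master element ends up in its own set as the most significant
part of the constraint.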

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

> 
> 
> means, very confusingly, that to be a master, resB and resC must be
> active. Also, for resC to be active, resB must be active. In other words
> (with some transitive reduction applied):
> 
> drbd -> resC -> resB
> 
> It doesn't make any sense, and the documentation is wrong (or at least
> self-conflicting). But that's what it really does mean. What was really
> confusing is that were it not for the :Master modifier, then crm would
> make only one resource set, and it would mean something else. So when I
> was testing with dummy resources, I was only more confused. I made
> another post explaining it more.
> 
> So, DRBD can't be master unless resC and resB are active. And as you
> explained, the order constraint will stop resC and resB if DRBD is
> demoted. Combined, that means either all these things run, or none:
> 
> - resC is stopped, or can't run anywhere
> - (by colocation constraint) DRBD may not be master. Demote it.
> - (by order constraint) stop resC (actually a no-op: it's already stopped)
> - (by order constraint) stop resB
> 
> 
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org









Re: [Pacemaker] resources not migrating when some are not runnable on one node, maybe because of groups or master/slave clones?

2012-06-18 Thread Andreas Kurz
On 06/18/2012 04:14 PM, Vladislav Bogdanov wrote:
> 18.06.2012 16:39, Phil Frost wrote:
>> I'm attempting to configure an NFS cluster, and I've observed that under
>> some failure conditions, resources that depend on a failed resource
>> simply stop, and no migration to another node is attempted, even though
>> a manual migration demonstrates the other node can run all resources,
>> and the resources will remain on the good node even after the migration
>> constraint is removed.
>>
>> I was able to reduce the configuration to this:
>>
>> node storage01
>> node storage02
>> primitive drbd_nfsexports ocf:pacemaker:Stateful
>> primitive fs_test ocf:pacemaker:Dummy
>> primitive vg_nfsexports ocf:pacemaker:Dummy
>> group test fs_test
>> ms drbd_nfsexports_ms drbd_nfsexports \
>> meta master-max="1" master-node-max="1" \
>> clone-max="2" clone-node-max="1" \
>> notify="true" target-role="Started"
>> location l fs_test -inf: storage02
>> colocation colo_drbd_master inf: ( test ) ( vg_nfsexports ) (
>> drbd_nfsexports_ms:Master )
> 
> Sets (constraints with more then two members) are evaluated in the
> different order.

_Between_ several resource-sets (and in this example three sets are
created) the order of evaluation is like for simple colocation
constraints ... so the last/rightmost one is the most significant.

_Within_ one single default colocation resource-set the resources are
evaluated like in a group, so the most significant resource is the
first/leftmost one.

Try (for a real drbd scenario, where the order is also important):

colocation colo_drbd_master inf: vg_nfsexports test
drbd_nfsexports_ms:Master

order order_drbd_promote_first inf: drbd_nfsexports_ms:promote
vg_nfsexports:start test:start

These examples will automatically create two sets each, because of the
different Roles/actions. I prefer having a look at the resulting xml
syntax to be sure the shell created what I planned ;-)
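
For example, with the constraint id used above:

crm configure show xml colo_drbd_master

(or just "crm configure show xml" to dump the whole configuration as XML).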

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

> Try
> colocation colo_drbd_master inf: ( drbd_nfsexports_ms:Master ) (
> vg_nfsexports ) ( test )
> 
> 
>> property $id="cib-bootstrap-options" \
>> no-quorum-policy="ignore" \
>> stonith-enabled="false" \
>> dc-version="1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff" \
>> cluster-infrastructure="openais" \
>> expected-quorum-votes="2" \
>> last-lrm-refresh="1339793579"
>>
>> The location constraint "l" exists only to demonstrate the problem; I
>> added it to simulate the NFS server being unrunnable on one node.
>>
>> To see the issue I'm experiencing, put storage01 in standby to force
>> everything on storage02. fs_test will not be able to run. Now bring
>> storage01, which can satisfy all the constraints, and see that no
>> migration takes place. Put storage02 in standby, and everything will
>> migrate to storage01 and start successfully. Take storage02 out of
>> standby, and the services remain on storage01. This demonstrates that
>> even though there is a clear "best" solution where all resources can
>> run, Pacemaker isn't finding it.
>>
>> So far, I've noticed any of the following changes will "fix" the problem:
>>
>> - removing colo_drbd_master
>> - removing any one resource from colo_drbd_master
>> - eliminating the group "test" and referencing fs_test directly in
>> constraints
>> - using a simple clone instead of a master/slave pair for
>> drbd_nfsexports_ms
>>
>> My current understanding is that if there exists a way to run all
>> resources, Pacemaker should find it and prefer it. Is that not the case?
>> Maybe I need to restructure my colocation constraint somehow? Obviously
>> this is a much reduced version of a more complex practical
>> configuration, so I'm trying to understand the underlying mechanisms
>> more than just the solution to this particular scenario.
>>
>> In particular, I'm not really sure how I inspect what Pacemaker is
>> thinking when it places resources. I've tried running crm_simulate -LRs,
>> but I'm a little bit unclear on how to interpret the results. In the
>> output, I do see this:
>>
>> drbd_nfsexports:1 promotion score on storage02: 10
>> drbd_nfsexports:0 promotion score on storage01: 5
>>
>> those numbers seem to account for the default stickiness of 1 for
>> master/slave resources, but don't seem to incorporate at all the
>> colocation constraints. Is that expected?
>>
>>
>> ___
>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
> 
> 
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.c

Re: [Pacemaker] pacemaker gfs

2012-06-13 Thread Andreas Kurz
On 06/13/2012 05:38 AM, hcyy wrote:
> thank you for your reply!

don't forget to post to the list

>> Failed actions:
>> 
>> dlm:1_monitor_0 (node=pcmk-2, call=4, rc=5, status=complete): not installed
>> dlm:0_monitor_0 (node=pcmk-1, call=4, rc=5, status=complete): not installed
> you suggest  dlm-pcmk package
> but i use ubuntu and can not find it
> root@pcmk-1:~# apt-cache search dlm

I see ... have a look at https://wiki.ubuntu.com/ClusterStack/Precise

Regards,
Andreas

> libdlm-dev - Red Hat cluster suite - distributed lock manager
> development files
> libdlm3 - Red Hat cluster suite - distributed lock manager library
> libdlmcontrol-dev - Red Hat cluster suite - distributed lock manager
> development files
> libdlmcontrol3 - Red Hat cluster suite - distributed lock manager library
> mame - The Multiple Arcade Machine Emulator - MAME
> mame-tools - MAME tools
> sdlmame - Dummy package to ease transition to mame
> sdlmame-tools - Transitional dummy package to mame-tools
> octave-io - input/output data functions for Octave
> 







Re: [Pacemaker] Pacemaker-GFS2-DLM

2012-06-12 Thread Andreas Kurz
On 06/12/2012 04:34 AM, 燕阳 蔡 wrote:
> Hello,
> 
>  
> 
> I'm trying to install gfs2:apt-get install gfs2-utils gfs-pcmk,when i
> add dlm,it show:

you installed the dlm-pcmk package?

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

> 
> Failed actions:
> 
> dlm:1_monitor_0 (node=pcmk-2, call=4, rc=5, status=complete): not installed
> dlm:0_monitor_0 (node=pcmk-1, call=4, rc=5, status=complete): not installed
> 
> some people advise:
> 
> adding something like this to /etc/fstab:
> 
> debugfs /sys/kernel/debug  debugfs  defaults0 0
> 
>  
> 
> and doing mount -a
> 
> but it does not work.
> 
> Does anyone know how to solve?
> 
>  
> 
> Thanks in advance.
> 
> 
> 
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 









Re: [Pacemaker] Design question. Service running everywhere

2012-06-12 Thread Andreas Kurz
On 06/12/2012 10:17 PM, Arturo Borrero Gonzalez wrote:
> Hi there!
> 
> I've some questions for you.
> 
> I'm deploying a new cluster with a service inside that doesn't matter.
> The important thing about the service is that it runs everywhere by its
> own internal functionality.
> So starting/stopping the resource on a given node is the same as doing it
> on the rest. All nodes will see that start/stop effect.
> 
> This is causing me some issues with Pacemaker.
> 
> As far as I know, pacemaker monitors the resource on each node (in this
> case via /etc/init.d/service status, because it's an LSB resource).
> But /etc/init.d/service status will return the same result as long as the
> service is the same on all nodes.
> 
> I don't know if working with clones will give some advantages. I need
> pacemaker to manage IPVs and other stuff.

Yes, that LSB service running on all nodes is a typical scenario for an
anonymous cloned resource.
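
A minimal sketch (the service name is only a placeholder):

primitive p_myservice lsb:myservice \
    op monitor interval="30s"
clone cl_myservice p_myservice \
    meta interleave="true"

The IPVs and the other resources can then be ordered and colocated against
the clone as usual.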

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

> 
> Shall I not use pacemaker with this service? Or, do I need some
> special configuration or design options?
> 
> Best regards.
> 









Re: [Pacemaker] How to change Stack

2012-06-01 Thread Andreas Kurz
On 06/01/2012 01:05 PM, Mars gu wrote:
> hi,
>My cluster:
>   corosync-1.4.1-4.el6_2.2.x86_64
>pacemaker-1.1.6-3.el6.x86_64

use corosync-2 and pacemaker 1.1.7

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

>  
>    I want to use votequorum as a quorum provider for corosync. Yes, I
> did it,
> but pacemaker did not use the corosync iface, so I have two problems.
>  
>   Q1.
>Here,~~~  /lib/cluster/cluster.c  gboolean
> /crm_connect_corosync(void)
>I add a log
> ***
>  if (is_openais_cluster()) {
> +   crm_info("I use openais iface!");
>  crm_set_status_callback(&ais_status_callback);
>  rc = crm_cluster_connect(&fsa_our_uname, &fsa_our_uuid,
> crmd_ais_dispatch, crmd_ais_destroy,
>   NULL);
>  }
>  if (rc && is_corosync_cluster()) {
> +   crm_info("I use corosync iface!");
>  init_quorum_connection(crmd_cman_dispatch, crmd_quorum_destroy);
>  }
>  if (rc && is_cman_cluster()) {
> +   crm_info("I use cman iface!");
>  init_cman_connection(crmd_cman_dispatch, crmd_cman_destroy);
>  set_bit_inplace(fsa_input_register, R_CCM_DATA);
>  }
> diff -Nuar ClusterLabs-pacemaker-a02c0f1/lib/cluster/cluster.c
> ClusterLabs-pacemaker-a02c0f1-new/lib/cluster/cluster.c
> **
>It believes that I am an openais cluster.
> *
> Jun  1 08:45:05 h10_147 crmd: [2014]: info: get_cluster_type: Cluster
> type is: 'openais'
> Jun  1 08:45:05 h10_147 crmd: [2014]: info: crm_connect_corosync: I use
> openais iface!
> *
>  when I execute this cli, you can find the Stack
> [root@h10_150 SOURCES]# crm status
> 
> Last updated: Fri Jun  1 19:00:02 2012
> Last change: Thu May 31 14:05:04 2012 via crmd on 10_146
> Stack: openais
> Current DC: h10_150 - partition WITHOUT quorum
> Version: 1.1.6-3.el6-a02c0f19a00c1eb2527ad38f146ebc0834814558
> 3 Nodes configured, 3 expected votes
> 12 Resources configured.
> 
> How to change Stack ???
>  
>  
> Q2
> I find something confusing  in  pacemaker.spec
> ***
> # Supported cluster stacks, must support at least one
> %bcond_without cman
> %bcond_without doc
> %bcond_without corosync
> %bcond_with heartbeat
> *
> Must I make a change here?
> 
> 
> 
> 
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org







Re: [Pacemaker] Tomcat and "unmanaged failed"

2012-05-30 Thread Andreas Kurz
On 05/30/2012 03:41 PM, Stallmann, Andreas wrote:
> Hi!
> 
>> Yes, that is default behaviour ... Pacemaker tries to stop, that fails so it 
>> must assume (worst case) it is still running, now STONITH would trigger to 
>> make sure the node including the resource is definitely down ... without 
>> STONITH it stays unmanaged until cleared.
> 
> STONITH is not possible in our scenario, unfortunately. I'd rather have the 
> tomcat script try to "stop" the daemon twice and if this does not succeed get 
> the PID and kill it hard (and then report "0" on the stop action to crm). 
> Should be possible by changing the script, shouldn't it?

Of course, this is possible ... please share your changes with the
community ;-)
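
A very rough, untested sketch of that idea for the stop path ("$pid" is
assumed to already hold the tomcat PID, and the snippet would live inside
the RA's stop function):

kill -TERM "$pid" 2>/dev/null
sleep 5
if kill -0 "$pid" 2>/dev/null; then
    kill -KILL "$pid"          # hard kill as a last resort
    sleep 2
fi
if kill -0 "$pid" 2>/dev/null; then
    return $OCF_ERR_GENERIC    # still alive: report the stop failure
fi
return $OCF_SUCCESS            # really gone: safe to report success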

> 
>> First attempt should be to fix your application. There is also the 
>> "failure-timeout" resource meta-attribute ... in combination with the 
>> cluster-recheck-interval cluster property, this clears resource failures on 
>> a regular base.
> 
> Our developers are currently trying to fix the bug, but it might remain for 
> some time. Thus we need an interim solution. I'll try the property and 
> attribut you named. Still, this won't help if the resource remains 
> "un-stop-able", right?
> 

It will also "forget" stop-failures.

Regards,
Andreas
-- 
Need help with Pacemaker?
http://www.hastexo.com/now

> Cheers,
> 
> Andreas
> --
> Need help with Pacemaker?
> http://www.hastexo.com/now
> 
> 
> 
> ___
>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
> 
> 
> 
> 
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org








Re: [Pacemaker] pacemaker with corosync on Fedora

2012-05-30 Thread Andreas Kurz
On 05/30/2012 07:07 PM, Lutz Griesbach wrote:
> Hi there,
> 
> 
> im trying to setup a cluster on  fedora17
> 
> corosync-2.0.0-1.fc17.i686
> pacemaker-1.1.7-2.fc17.i686
> 
> as i understand pacemaker packages are built without heartbeat
> 
> [root@lgr-fed17-1 ~]# pacemakerd --features
> Pacemaker 1.1.7-2.fc17 (Build: ee0730e13d124c3d58f00016c3376a1de5323cff)
>  Supporting:  generated-manpages agent-manpages ascii-docs
> publican-docs ncurses libqb-logging  corosync-native
> 
> 
> however, when i start pacemaker i get an error:
> 
> May 25 12:14:08 [22708] lgr-fed17-1   crmd:   notice:
> init_quorum_connection:   Quorum acquired
> May 25 12:14:08 [22708] lgr-fed17-1   crmd: info:
> do_ha_control:Connected to the cluster
> May 25 12:14:08 [22708] lgr-fed17-1   crmd: info: do_started:
>  Delaying start, no membership data (0010)
> May 25 12:14:08 [22708] lgr-fed17-1   crmd:   notice:
> crmd_peer_update: Status update: Client lgr-fed17-1/crmd now has
> status [online] (DC=)
> May 25 12:14:08 [22708] lgr-fed17-1   crmd: info:
> pcmk_quorum_notification: Membership 0: quorum retained (0)
> May 25 12:14:08 [22708] lgr-fed17-1   crmd:error:
> check_dead_member:We're not part of the cluster anymore
> May 25 12:14:08 [22708] lgr-fed17-1   crmd:error: do_log:
>  FSA: Input I_ERROR from check_dead_member() received in state
> S_STARTING
> May 25 12:14:08 [22708] lgr-fed17-1   crmd:   notice:
> do_state_transition:  State transition S_STARTING -> S_RECOVERY [
> input=I_ERROR cause=C_FSA_INTERNAL origin=check_dead_member ]
> May 25 12:14:08 [22708] lgr-fed17-1   crmd:error: do_recover:
>  Action A_RECOVER (0100) not supported
> May 25 12:14:08 [22708] lgr-fed17-1   crmd:error: do_started:
>  Start cancelled... S_RECOVERY
> 
> 
> i found in the list-archives, that "Action A_RECOVER
> (0100) not supported"  is related to heartbeat
> 
> http://oss.clusterlabs.org/pipermail/pacemaker/2011-August/011289.html
> 
> but how can i configure pacemaker to work with corosync and not with 
> heartbeat?
> 
> 
> Or am i missing something completely?

yes ... a quorum provider; there are no plugins anymore in Corosync
2.0 ... for two nodes you can use:

quorum {
   provider: corosync_votequorum
   expected_votes: 2
}
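
To verify, something like this should then report the votequorum provider
together with the expected and total votes:

corosync-quorumtool -s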

> 
> 
> corosync.conf:
> 
> totem {
> version: 2
> 
> crypto_cipher: none
> crypto_hash: none
> 
> interface {
> ringnumber: 0
> bindnetaddr: 10.1.1.23
> mcastaddr: 226.94.1.1
> mcastport: 4000
> ttl: 1
> }
> }
> 
> 
> service.d/pacemaker
> 
> {
> # Load the Pacemaker Cluster Resource Manager
> name: pacemaker
> ver:  1
> }

not needed anymore ... Pacemaker uses CPG with Corosync 2

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

> 
> 
> kind regards,
> Lutz
> 
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org








Re: [Pacemaker] Tomcat and "unmanaged failed"

2012-05-30 Thread Andreas Kurz
Hi Andreas,

On 05/29/2012 04:14 PM, Stallmann, Andreas wrote:
> Hi there,
> 
>  
> 
> we have here a corosync/pacemaker cluster running tomcat. Sometimes our
> application running inside tomcat fails and tomcat dies.
> 
>  
> 
> This – for some reason I don’t understand – leads to an “unmanaged
> failed” state for tomcat displayed in crm_mon. This would not be too
> bad, but at this point the cluster “decides” not to fail over the
> resource to the second node.
> 
>  
> 
> My questions:
> 
>  
> 
> 1.   Is this a standard behaviour? Should a failover stop (or not
> take place at all), if a resource runs into an unmanaged failed state?

Yes, that is the default behaviour ... Pacemaker tries to stop the
resource, that fails, so it must assume (worst case) it is still running.
Now STONITH would trigger to make sure the node, including the resource,
is definitely down ... without STONITH it stays unmanaged until cleared.

> 
> 2.   What conditions have to apply, before a resource is called
> “unmanaged failed”?

e.g. stop failures ;-)

> 
> 3.   Is there any way of an “automatic recover” of a resource that
> ran into an “unmanaged failed” state?

The first attempt should be to fix your application. There is also the
"failure-timeout" resource meta-attribute ... in combination with the
cluster-recheck-interval cluster property, this clears resource failures
on a regular basis.
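
An untested sketch of how that could look in the crm shell (the values are
arbitrary placeholders):

property cluster-recheck-interval="5min"
rsc_defaults failure-timeout="10min"

failure-timeout can of course also be set per resource as a meta attribute
instead of via rsc_defaults.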

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now


> 
>  
> 
> Cheers,
> 
>  
> 
> Andreas
> 
> 
>  
> 
>  
> 
> 
> 
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org







Re: [Pacemaker] CentOS pacemaker heartbeat

2012-04-30 Thread Andreas Kurz
On 04/30/2012 05:37 PM, fatcha...@gmx.de wrote:
> Hi,
> 
> I´ve just installed a CentOS 6.2 and also installed via epel-repo
> heartbeat-3.0.4-1.el6.x86_64 and 
> pacemaker-1.1.6-3.el6.x86_64. 
> I try to start heartbeat (crm respawn in ha.cf) and I get this error:
> 
> crmd: [2462]: CRIT: get_cluster_type: This installation of Pacemaker does not 
> support the 'heartbeat' cluster infrastructure.  Terminating. 
> 
> Did I missed something ? I thought heart 3x and pacemaker 1x won´t be a 
> problem ?
> 

The pacemaker version shipped with CentOS 6.2 is compiled without
heartbeat support ... you don't need the EPEL repo to get Pacemaker, it's
already included in RHEL (as a technology preview).

Use corosync instead of Heartbeat.
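
On CentOS 6 that roughly means a corosync.conf plus a small plugin snippet
so corosync loads Pacemaker, for example in /etc/corosync/service.d/pcmk
(untested, just the usual skeleton):

service {
    name: pacemaker
    ver: 1
}

With ver: 1 the pacemaker init script is started separately after corosync.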

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

> Any suggestions are welcome.
> 
> Kind regards
> 
> fatcharly
> 








Re: [Pacemaker] LVM restarts after SLES upgrade

2012-04-26 Thread Andreas Kurz
On 04/25/2012 11:00 AM, Frank Meier wrote:
> Am 24.04.2012 17:53, schrieb pacemaker-requ...@oss.clusterlabs.org:
> 
>> Message: 2
>> Date: Tue, 24 Apr 2012 15:58:53 +
>> From: "Daugherity, Andrew W" 
>> To: "" 
>> Subject: Re: [Pacemaker] LVM restarts after SLES upgrade
>> Message-ID: <114ad516-3da6-43e1-8d15-f5d9d3eaa...@tamu.edu>
>> Content-Type: text/plain; charset="us-ascii"
>>
>> On Apr 24, 2012, at 4:28 AM, 
>>   wrote:
>>
>>> Date: Tue, 24 Apr 2012 09:34:12 +
>>> From: emmanuel segura 
>>> Message-ID:
>>>   
>>>
>>> Hello Frank
>>>
>>> Maybe this it's not the probelem, but i see this constrain wrong from
>>> my point of view
>>> =
>>> order o-Testclustervm inf: c-xen-vg-fs vm-clusterTest
>>> order o-clvmglue-xenvgfs inf: c-clvm-glue c-xen-vg-fs
>>> =
>>> to be
>>> =
>>> order o-clvmglue-xenvgfs inf: c-clvm-glue c-xen-vg-fs
>>> order o-Testclustervm inf: c-xen-vg-fs vm-clusterTest
>>> =
>>
>> How is that any different?  Both sets of order constraints are identical, 
>> and look correct.  Changing the order you add them in makes no difference, 
>> as the rules are evaluated as a set, and the crm shell will reorder them in 
>> alphabetical (ASCIIbetical, actually) order anyway.
>>
>>
>>> 2012/4/24, Frank Meier :
 Every time the vgdisplay -v TestXenVG is hanging(ca.2min)

 I see two of this peocesses:
 /bin/sh /usr/lib/ocf/resource.d//heartbeat/LVM monitor
 /bin/sh /usr/lib/ocf/resource.d//heartbeat/LVM monitor
 is this OK, or have we a race condition?
>>
>> Frank, I see you have multipath in your LVM config.  Have you tried it with 
>> multipath disabled?  I wonder if this isn't a pacemaker/corosync problem but 
>> rather a lower-level storage problem.  Still, whatever the cause, it doesn't 
>> fill me with confidence about upgrading to SLES 11 SP2... I guess it's time 
>> to bring up that test cluster I've been meaning to build.
>>
>> -Andrew
>>
> 
> Hi,
> 
> yes, I've tested now without multipathd, but the problem exist furthermore.

You already found this thread?

http://lists.linux-ha.org/pipermail/linux-ha/2011-November/044267.html

There was also another discussion I can't find at the moment regarding
possible tunings like the I/O scheduler and lvm filter changes.
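
Just to illustrate the kind of lvm filter change meant here -- the actual
device patterns depend entirely on your storage layout:

# /etc/lvm/lvm.conf (example only)
devices {
  # scan only the devices carrying the clustered VG, reject everything else
  filter = [ "a|^/dev/mapper/.*|", "r|.*|" ]
  write_cache_state = 0
}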

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now






Re: [Pacemaker] Order constrain with resouce group

2012-04-23 Thread Andreas Kurz
On 04/23/2012 01:12 PM, emmanuel segura wrote:
> Hello List
> 
> I would like to know if it's possible make one order constrain with a
> resource group

yes, you can also reference a group in a constraint.
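
Your example below would be valid as-is; a colocation referencing the
group works the same way -- the group id is used like any other resource
id (only a sketch):

order o_group_after_clone inf: active_active_clone resourcegroup_not_active_failover_mode
colocation c_group_with_clone inf: resourcegroup_not_active_failover_mode active_active_clone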

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

> 
> For esample: order myorder inf: active_active_clone
> resourcegroup_not_active_failover_mode
> 
> thanks








Re: [Pacemaker] Corosync / Pacemaker Cluster crashing

2012-04-20 Thread Andreas Kurz
On 04/20/2012 12:08 PM, Bensch, Kobus wrote:
> Hi
> 
> I have the following cluster setup:
> 
> 2 physical Dell servers with RHEL6.2 with all the latest patches.
> 
> Each server has 3 network connections that looks like this:
> 
> BOND02 NIC's
> 
> ETH4 for Corosync
> ETH6 for corosync
> 
> This is the corosync config:
> Cocorsync.conf
> aisexec {
> group:root
> user:root
> }
> 
> compatibility: whitetank
> service {
> use_mgmtd:yes
> use_logd:yes
> ver:0
> name:pacemaker
> }

You also specified that service in /etc/corosync/service.d/pcmk ...
remove one of the two definitions ... even better: remove the definition
above and install the Pacemaker 1.1.6 and Corosync 1.4.x packages that
are available as a technology preview in RHEL 6.2.
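
The plugin file from below would then be the single place where the
service is defined, and with "ver: 1" pacemaker is started as its own
init service -- roughly:

# /etc/corosync/service.d/pcmk stays as it is:
service {
  name: pacemaker
  ver:  1
}

service corosync start
service pacemaker start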

> totem {
> rrp_mode:active
> join:180
> max_messages:20
> vsftype:none
> token:5000
> consensus:6000
> secauth:on
> token_retransmits_before_loss_const:10
> threads:0
> #threads:16
> version:2
> interface {
> bindnetaddr:10.255.1.0
> mcastaddr:232.10.1.1
> mcastport:5405
> ringnumber:0
> ttl:1
> }
> interface {
> bindnetaddr:10.255.2.0
> mcastaddr:232.10.2.1
> mcastport:5405
> ringnumber:1
> ttl:1
> }
> clear_node_high_bit:yes
> }
> logging {
> to_logfile:yes
> to_syslog:yes
> debug:off
> timestamp:on
> logfile: /var/log/cluster/corosync.log
> to_stderr:no
> fileline:off
> syslog_facility:daemon
> }
> amf {
> mode:disabled
> }
> 
> The pacemaker plugin:
> /etc/corosync/service.d/pcmk
> service {
> # Load the Pacemaker Cluster Resource Manager
> name: pacemaker
> ver:  1
> }
> 
> Corosync keeps crashing when I try to do anything in the crm cli.
> Whether it is moving resources, creating resources, it does not matter.
> 
> The corosync config for now is very simple and looks like this:
> node lxdcv01nd01
> node lxdcv01nd02
> primitive lcdcv01 ocf:heartbeat:IPaddr2 \
> params ip="10.1.0.95" cidr_netmask="32" \
> op monitor interval="30s"
> primitive local-manage ocf:heartbeat:IPaddr2 \
> params ip="127.0.2.1" cidr_netmask="32" \
> op monitor interval="30s"
> location cli-prefer-lcdcv01 lcdcv01 \
> rule $id="cli-prefer-rule-lcdcv01" inf: #uname eq lxdcv01nd02
> location cli-prefer-local-manage local-manage \
> rule $id="cli-prefer-rule-local-manage" inf: #uname eq lxdcv01nd02
> property $id="cib-bootstrap-options" \
> dc-version="1.0.12-unknown" \
> cluster-infrastructure="openais" \
> expected-quorum-votes="2" \
> stonith-enabled="false" \
> no-quorum-policy="ignore"
> 
> I tried to disable various config lines but still no joy. Any help would
> be appreciated.
> 
> When the server crashes I get this in the log:
> Apr 20 10:54:17 corosync [TOTEM ] FAILED TO RECEIVE
> Apr 20 10:54:17 corosync [TOTEM ] FAILED TO RECEIVE
> Apr 20 10:54:17 corosync [TOTEM ] FAILED TO RECEIVE

There have been problems with delayed mcast messages that could lead to
such errors, but only in older corosync versions ... this should not
happen with recent ones. See
http://answerpot.com/showthread.php?1361794-corosync+crashes

Another reason to upgrade to recent versions ;-)
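
A quick way to check what is actually installed, e.g.:

rpm -q corosync pacemaker
corosync -v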

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

> Apr 20 10:54:18 corosync [TOTEM ] FAILED TO RECEIVE
> Apr 20 10:54:18 corosync [TOTEM ] FAILED TO RECEIVE
> Apr 20 10:54:18 corosync [TOTEM ] FAILED TO RECEIVE
> Apr 20 10:54:18 corosync [TOTEM ] FAILED TO RECEIVE
> Apr 20 10:54:19 corosync [TOTEM ] FAILED TO RECEIVE
> Apr 20 10:54:19 corosync [TOTEM ] FAILED TO RECEIVE
> Apr 20 10:54:19 corosync [TOTEM ] FAILED TO RECEIVE
> Apr 20 10:54:19 corosync [TOTEM ] FAILED TO RECEIVE
> Apr 20 10:54:20 corosync [TOTEM ] FAILED TO RECEIVE
> Apr 20 10:54:20 corosync [TOTEM ] FAILED TO RECEIVE
> Apr 20 10:54:20 corosync [TOTEM ] FAILED TO RECEIVE
> Apr 20 10:54:20 corosync [TOTEM ] FAILED TO RECEIVE
> Apr 20 10:54:21 corosync [TOTEM ] FAILED TO RECEIVE
> Apr 20 10:54:21 corosync [TOTEM ] FAILED TO RECEIVE
> Apr 20 10:54:21 corosync [TOTEM ] FAILED TO RECEIVE
> Apr 20 10:54:21 corosync [TOTEM ] FAILED TO RECEIVE
> Apr 20 10:54:22 corosync [TOTEM ] FAILED TO RECEIVE
> Apr 20 10:54:22 corosync [TOTEM ] FAILED TO RECEIVE
> Apr 20 10:54:22 lxdcv01nd01.bauer-uk.bauermedia.group crmd: [21259]:
> ERROR: ais_dispatch: Receiving message body failed: (2) Library error:
> Resource temporarily unavailable (11)
> Apr 20 10:54:22 lxdcv01nd01.bauer-uk.bauermedia.group stonithd: [21254]:
> ERROR: ais_dispatch: Receiving message body failed: (2) Library error:
> Invalid argument (22)
> Apr 20 10:54:22 lxdcv01nd01.bauer-uk.bauermedia.group attrd: [21257]:
> ERROR: ais_dispatch: Receiving message body failed: (2) Library error:
> Resource temporarily unavailable (11)
> Apr 20 10:54:22 lxdcv01nd01.bauer-uk.bauermedia.group cib: [21255]:
> ERROR: ais_dispatch: Receiving message body failed: (2) Library error:
> Resource temporarily unavailable (11)
> Apr 20 10:54:22 lxdcv01nd01.bauer-uk.bauermedia.group crmd: [21259]:
> ERROR: ais_dispatch: AIS connection failed
> Apr 20 10:54:22 lxdcv01nd01.bauer-uk.bauermedia.grou

Re: [Pacemaker] OCF Resource agent monitor activity failed due to temporary error

2012-04-19 Thread Andreas Kurz
On 04/19/2012 01:59 PM, Kulovits Christian - OS ITSC wrote:
> Hi Andreas,
> Exactly this is what i want pacemaker to do when my RA is not able to 
> determine the resource´s state. But without running into timeout and restart.
> It's the method to display the resource´s state that is unavailable not the 
> resource itself. This typically approach must be coded in every RA instead of 
> once in pacemaker.

You want pacemaker to ignore monitor errors on all unknown return values
and go on with monitoring until a resource "heals" itself?

Please rethink ... it is the resource agent's job to reliably tell
pacemaker the definite resource state -- and "uhm, hm, don't know, please
try later" could mean anything -- and how to find out the real state is
very specific to the resource. IMHO it makes no sense at all to let the
cluster manager do this work.

There may be cases where a "degraded" resource state would be a nice
feature; it comes up as a topic here on the list from time to time.
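
To illustrate the retry approach from my previous mail -- only a sketch:
"query_app_status" is a made-up placeholder for whatever command your RA
uses to get the state, and the OCF_* return codes come from the usual
ocf-shellfuncs include:

app_monitor() {
  local try state
  for try in 1 2 3 4 5; do
    state=$(query_app_status)   # placeholder status command
    case "$state" in
      running) return $OCF_SUCCESS ;;
      stopped) return $OCF_NOT_RUNNING ;;
    esac
    sleep 2                     # "display currently unavailable" -- wait and retry
  done
  # still no definite answer: report a failure and let pacemaker recover
  return $OCF_ERR_GENERIC
}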

Regards,
Andreas

> Christian
> 
> -Original Message-
> From: Andreas Kurz [mailto:andr...@hastexo.com] 
> Sent: Donnerstag, 19. April 2012 13:51
> To: pacemaker@oss.clusterlabs.org
> Subject: Re: [Pacemaker] OCF Resource agent monitor activity failed due to 
> temporary error
> 
> Hi Christian,
> 
> On 04/19/2012 01:38 PM, Kulovits Christian - OS ITSC wrote:
>> Hi, Andreas
>>
>> What if the RA gets a response from an external command in the form: 
>> "display currently unavailable, try later". The RA has 3 possibly states 
>> available, "Running", "Not Running", "Failed". But in this situation he 
>> would say "don't know". When I set "on-fail=ignore" this error will be 
>> ignored the same way as when response is "not running" and the resource will 
>> never be restarted.
>> Christian
> 
> A typically approach is to wait a little bit and retry the monitor
> command until it succeeds to deliver a valid status (running/not
> running) or the RA monitor operation timeouts and the script is killed
> including resource recovery.
> 
> Regards,
> Andreas
> 

-- 
Need help with Pacemaker?
http://www.hastexo.com/now





