[ClusterLabs] Azure Resource Agent

2017-09-15 Thread Eric Robinson
Greetings, all --

If anyone's interested, I wrote a resource agent that works with Microsoft 
Azure. I'm no expert at shell scripting, so I'm certain it needs a great deal 
of improvement, but I've done some testing and it works with a 2-node cluster 
in my Azure environment. Offhand, I don't know of any reason why it wouldn't work 
with larger clusters, too.

My colocation stack looks like this:

mysql -> azure_ip -> cluster_ip -> filesystem -> drbd
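
In pcs terms, that kind of chain is just pairwise ordering and colocation 
constraints, roughly like the following (made-up resource names, and the DRBD 
promote ordering is left out for brevity; the real config differs in the details):

pcs constraint order start p_fs then start p_cluster_ip
pcs constraint order start p_cluster_ip then start p_azure_ip
pcs constraint order start p_azure_ip then start p_mysql
pcs constraint colocation add p_cluster_ip with p_fs
pcs constraint colocation add p_azure_ip with p_cluster_ip
pcs constraint colocation add p_mysql with p_azure_ip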

Failover takes up to 4 minutes, because that is how long the Azure IP address 
disassociation and re-association take to complete. None of the delay is the 
fault of the cluster itself.

Right now the script writes a fair amount of debug output to syslog. That is 
handy when a failover seems to be taking forever: a look at /var/log/messages 
will show that you are still waiting for the Azure cloud to finish something. 
To silence the debug messages, set DEBUG_LEVEL to 0.
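
The logIt helper isn't in the excerpt below; something along these lines would 
do the job (a rough sketch, not the exact code from the agent):

logIt() {
    msg="$1"
    # pull the N out of a leading "debugN:" tag; default to 1
    level=$(echo "$msg" | sed -n 's/^debug\([0-9]\).*/\1/p')
    [ -z "$level" ] && level=1
    # only emit messages at or below the configured DEBUG_LEVEL
    if [ "$level" -le "$DEBUG_LEVEL" ]; then
        logger -t "$SCRIPT_NAME" "$msg"
    fi
}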

The agent requires the Azure CLI to be installed and the nodes to be logged in 
to the cloud. It currently supports only one NIC per VM and two ipconfigs per 
NIC (one of which is the floating cluster IP).
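
For unattended nodes, a service-principal login is one way to handle the 
"logged in" part (the placeholders are whatever your Azure AD application uses):

az login --service-principal -u <app-id> -p <client-secret> --tenant <tenant-id>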

This is obviously beta, as it currently handles only manual failover. I still 
need to add code to handle an actual node crash or a power-plug test.

Feedback, suggestions, and improvements are welcome. If someone who knows awk 
wants to clean up my Azure CLI calls, that would be a good place to start.
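
The az CLI also supports --query (JMESPath) and -o tsv output, which might make 
the grep/awk parsing unnecessary. Untested, but something like:

az network nic ip-config show --name $AZ_IPCONFIG_NAME --nic-name $AZ_NIC_NAME -g $AZ_RG_NAME --query provisioningState -o tsv

should print just "Succeeded" when the ipconfig exists.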

--

#!/bin/sh
#
# OCF parameters are as below
# OCF_RESKEY_ip

###
# Initialization:

: ${OCF_FUNCTIONS_DIR=${OCF_ROOT}/lib/heartbeat}
. ${OCF_FUNCTIONS_DIR}/ocf-shellfuncs
DEBUG_LEVEL=2
MY_HOSTNAME=$(hostname -s)
SCRIPT_NAME=$(basename "$0")

###

meta_data() {
    logIt "debug1: entered: meta_data()"
    cat <<END
<?xml version="1.0"?>
<!DOCTYPE resource-agent SYSTEM "ra-api-1.dtd">
<!-- NOTE: the XML tags below are a reconstructed skeleton; the original markup
     was stripped by the list archive. Only the descriptions and the "ip"
     parameter come from the original posting. -->
<resource-agent name="azip" version="1.0">
<version>1.0</version>

<longdesc lang="en">
Resource agent for managing IP configs in Azure.
</longdesc>
<shortdesc lang="en">Short description</shortdesc>

<parameters>
<parameter name="ip" unique="1" required="1">
<longdesc lang="en">
The IPv4 address (dotted quad notation),
example IPv4 "192.168.1.1".
</longdesc>
<shortdesc lang="en">IPv4 address</shortdesc>
<content type="string" default=""/>
</parameter>
</parameters>

<actions>
<action name="start"        timeout="20s" />
<action name="stop"         timeout="20s" />
<action name="monitor"      timeout="20s" interval="10s" depth="0" />
<action name="meta-data"    timeout="5s" />
<action name="validate-all" timeout="20s" />
</actions>
</resource-agent>
END
    logIt "debug1: exiting: meta_data()"
    return $OCF_SUCCESS
}

azip_query() {
    logIt "debug1: entered: azip_query()"
    logIt "debug1: checking whether an Azure ipconfig named '$AZ_IPCONFIG_NAME' exists for the interface"
    logIt "debug1: executing: az network nic ip-config show --name $AZ_IPCONFIG_NAME --nic-name $AZ_NIC_NAME -g $AZ_RG_NAME 2>&1"
    R=$(az network nic ip-config show --name $AZ_IPCONFIG_NAME --nic-name $AZ_NIC_NAME -g $AZ_RG_NAME 2>&1)
    logIt "debug2: $R"
    if echo "$R" | grep -q "does not exist"; then
        logIt "debug1: ipconfig named '$AZ_IPCONFIG_NAME' does not exist"
        rc=$OCF_NOT_RUNNING
    elif echo "$R" | grep -q "Succeeded"; then
        logIt "debug1: ipconfig '$AZ_IPCONFIG_NAME' exists"
        rc=$OCF_SUCCESS
    else
        logIt "debug1: not sure how this happens"
        rc=$OCF_ERR_GENERIC
    fi
    # common exit path
    logIt "debug1: exiting: azip_query()"
    return $rc
}

azip_usage() {
cat <
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] IP clone issue

2017-09-15 Thread Ken Gaillot
On Tue, 2017-09-05 at 21:28 +0300, Vladislav Bogdanov wrote:
> 05.09.2017 17:15, Octavian Ciobanu wrote:
> > Based on the ocf:heartbeat:IPaddr2 man page, it can be used without a
> > static IP address if the kernel has
> > net.ipv4.conf.all.promote_secondaries=1:
> >
> > "There must be at least one static IP address, which is not managed by
> > the cluster, assigned to the network interface. If you can not assign
> > any static IP address on the interface, modify this kernel parameter:
> > sysctl -w net.ipv4.conf.all.promote_secondaries=1 (or per device)"
> >
> > This kernel parameter is set by default in CentOS 7.3.
> >
> > With clone-node-max="1" it works as it should, but with
> > clone-node-max="2" both instances of the VIP are started on the same
> > node even if the other node is online.
> 
> That actually is not a new issue.
> 
> Try raising the resource priority
> (http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-resource-options.html#_resource_meta_attributes).
> That _may_ help.
> IIRC, it is currently the only method to spread globally-unique clones
> across all the nodes, at least at start-up (with a higher priority they
> are allocated first, so they land on nodes which have fewer resources).
> 
> But after a cluster state change (a rebooted/fenced node comes back
> online), pacemaker tries to preserve resource placement if several nodes
> have the same score for the given resource. That applies to
> globally-unique clones as well. Changing placement-strategy to
> utilization or balanced does not help either.
> 
> The only (IMHO) bullet-proof way to make them spread across the cluster
> after a node reboot is a 'synthetic' full-mesh anti-colocation between
> globally-unique clone instances. Unfortunately, that can probably only
> be done in the pacemaker source code. A possible hack would be to
> anti-colocate the clone with itself, but I didn't try that (although it
> is on my todo list) and honestly do not expect it to work. I will need
> the same functionality for an upcoming project (a many-node
> active-active cluster with clusterip), so I hope to find a way to
> achieve that goal in the next several months.
> 
> (I'm cc'ing Ken directly to draw his attention to this topic.)

Yes, unfortunately there is no reliable way at the moment. The priority
suggestion is a good one, though as you mentioned, if failover causes
the instances to land on the same node, they'll stay there even if the
other node comes back up.
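
With pcs, raising the priority would be roughly (using the resource from the
command quoted below):

pcs resource meta ClusterIP priority=10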

There is already a bug report for allowing placement strategy to handle
this:

  https://bugs.clusterlabs.org/show_bug.cgi?id=5220

Unfortunately developer time is extremely limited, so there is no time
frame for dealing with it.

> > Pacemaker 1.1 Clusters from Scratch says that
> > "clone-node-max=2 says that one node can run up to 2 instances of the
> > clone. This should also equal the number of nodes that can host the IP,
> > so that if any node goes down, another node can take over the failed
> > node’s "request bucket". Otherwise, requests intended for the failed
> > node would be discarded."
> >
> > To have this functionality, must I have a static IP set on the
> > interfaces?
> >
> >
> > On Tue, Sep 5, 2017 at 4:54 PM, emmanuel segura wrote:
> >
> > I never tried to set a virtual ip on an interface without an ip,
> > because the vip is a secondary ip that switches between nodes,

To clarify, cloning an IP does not switch it between nodes (a regular,
non-cloned IP resource would do that). Cloning an IP load-balances
requests across the clone instances (which may be spread out across one
or more nodes). Cloning an IP requires multicast Ethernet MAC
addresses, which not all switches support or have enabled.

> > not a primary ip
> >
> > 2017-09-05 15:41 GMT+02:00 Octavian Ciobanu:
> >
> > Hello all,
> >
> > I've encountered an issue with IP cloning.
> >
> > Based on the "Pacemaker 1.1 Clusters from Scratch" guide I've
> > configured a test setup with 2 nodes running CentOS 7.3. The nodes
> > have 2 Ethernet cards, one for cluster communication on a private
> > IP network and a second for public access to services. The public
> > Ethernet has no IP assigned at boot.
> >
> > I've created an IP resource with clone using the following command
> >
> > pcs resource create ClusterIP ocf:heartbeat:IPaddr2 params
> > nic="ens192" ip="xxx.yyy.zzz.www" cidr_netmask="24"
> > clusterip_hash="sourceip" op start interval="0" timeout="20" op
> > stop interval="0" timeout="20" op monitor interval="10"
> > timeout="20" meta resource-stickiness=0 clone meta clone-max="2"
> > clone-node-max="2" interleave="true" globally-unique="true"
> >
> > The 

Re: [ClusterLabs] Cannot stop cluster due to order constraint

2017-09-15 Thread Ken Gaillot
On Fri, 2017-09-08 at 15:31 +1000, Leon Steffens wrote:
> Hi all,
> 
> We are running Pacemaker 1.1.15 under Centos 6.9, and have a simple
> 3-node cluster with 6 sets of "main" and "backup" resources (just
> Dummy ones):
> 
> main1
> backup1
> main2
> backup2
> etc.
> 
> We have the following co-location constraint between main1 and
> backup1 (-200 because we don't want them to be on the same node, but
> under some circumstances they can end up on the same node)
> 
> pcs constraint colocation add backup1 with main1 -200
> 
> We also have the following order constraint between main1 and
> backup1.  This caters for the scenario where they end up on the same
> node - we want to make sure that "main" gets started before "backup"
> gets stopped, and started somewhere else (because of co-location
> score):
> 
> pcs constraint order start main1 then stop backup1 kind=Serialize

I think you want kind=Optional here. "Optional" means that if both
actions are needed in the same transition, perform them in this order,
otherwise it doesn't limit anything. "Serialize" means the start and
stop can happen in either order, but not simultaneously, and backup1
can't stop unless main1 is starting.
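
In other words, roughly:

pcs constraint order start main1 then stop backup1 kind=Optional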

> When the cluster is started, everything works fine:
> 
> main1   (ocf::heartbeat:Dummy): Started straddie1
> main2   (ocf::heartbeat:Dummy): Started straddie2
> main3   (ocf::heartbeat:Dummy): Started straddie3
> main4   (ocf::heartbeat:Dummy): Started straddie1
> main5   (ocf::heartbeat:Dummy): Started straddie2
> main6   (ocf::heartbeat:Dummy): Started straddie3
> backup1 (ocf::heartbeat:Dummy): Started straddie2
> backup2 (ocf::heartbeat:Dummy): Started straddie1
> backup3 (ocf::heartbeat:Dummy): Started straddie1
> backup4 (ocf::heartbeat:Dummy): Started straddie2
> backup5 (ocf::heartbeat:Dummy): Started straddie1
> backup6 (ocf::heartbeat:Dummy): Started straddie2
> 
> When we do a "pcs cluster stop --all", things do not go so well.  pcs
> cluster stop hangs and the cluster state is as follows:
> 
> main1   (ocf::heartbeat:Dummy): Stopped
> main2   (ocf::heartbeat:Dummy): Stopped
> main3   (ocf::heartbeat:Dummy): Stopped
> main4   (ocf::heartbeat:Dummy): Stopped
> main5   (ocf::heartbeat:Dummy): Stopped
> main6   (ocf::heartbeat:Dummy): Stopped
> backup1 (ocf::heartbeat:Dummy): Started straddie2
> backup2 (ocf::heartbeat:Dummy): Started straddie1
> backup3 (ocf::heartbeat:Dummy): Started straddie1
> backup4 (ocf::heartbeat:Dummy): Started straddie2
> backup5 (ocf::heartbeat:Dummy): Started straddie1
> backup6 (ocf::heartbeat:Dummy): Started straddie2
> 
> The corosync.log clearly shows why this is happening.  It looks like
> Pacemaker wants to stop the backup resources, but the order
> constraint states that the "main" resources should be started first. 
> At this stage the "main" resources have already been stopped, and
> because the cluster is shutting down, the "main" resources cannot be
> started, and we are stuck:
> 
> 
> Sep 08 15:15:07 [23862] straddie3       crmd:     info: match_graph_event:      Action main1_stop_0 (14) confirmed on straddie1 (rc=0)
> Sep 08 15:15:07 [23862] straddie3       crmd:  warning: run_graph:      Transition 48 (Complete=6, Pending=0, Fired=0, Skipped=0, Incomplete=10, Source=/var/lib/pacemaker/pengine/pe-input-496.bz2): Terminated
> Sep 08 15:15:07 [23862] straddie3       crmd:  warning: te_graph_trigger:       Transition failed: terminated
> Sep 08 15:15:07 [23862] straddie3       crmd:   notice: print_graph:    Graph 48 with 16 actions: batch-limit=0 jobs, network-delay=6ms
> Sep 08 15:15:07 [23862] straddie3       crmd:   notice: print_synapse:  [Action   14]: Completed rsc op main1_stop_0 on straddie1 (priority: 0, waiting: none)
> Sep 08 15:15:07 [23862] straddie3       crmd:   notice: print_synapse:  [Action   15]: Completed rsc op main4_stop_0 on straddie1 (priority: 0, waiting: none)
> Sep 08 15:15:07 [23862] straddie3       crmd:   notice: print_synapse:  [Action   16]: Pending rsc op backup2_stop_0 on straddie1 (priority: 0, waiting: none)
> Sep 08 15:15:07 [23862] straddie3       crmd:   notice: print_synapse:   * [Input 31]: Unresolved dependency rsc op main2_start_0
> Sep 08 15:15:07 [23862] straddie3       crmd:   notice: print_synapse:  [Action   17]: Pending rsc op backup3_stop_0 on straddie1 (priority: 0, waiting: none)
> Sep 08 15:15:07 [23862] straddie3       crmd:   notice: print_synapse:   * [Input 32]: Unresolved dependency rsc op main3_start_0
> Sep 08 15:15:07 [23862] straddie3       crmd:   notice: print_synapse:  [Action   18]: Pending rsc op backup5_stop_0 on straddie1 (priority: 0, waiting: none)
> Sep 08 15:15:07 [23862] straddie3       crmd:   notice: print_synapse:   * [Input 34]: Unresolved dependency rsc op main5_start_0
> Sep 08 15:15:07 [23862] straddie3       crmd:   notice: print_synapse:  

Re: [ClusterLabs] High CPU during CIB sync

2017-09-15 Thread Ken Gaillot
On Mon, 2017-09-11 at 16:02 +0530, Anu Pillai wrote:
> Hi,
> 
> We are using 3 node cluster (2 active and 1 standby). 
> When failover happens, CPU utilization going high in newly active
> node as well as other active node. It is remaining in high CPU state
> for nearly 20 seconds.
> 
> We have 122 resource attributes under the resource(res1) which is
> failing over. Failover triggered at 14:49:05 
> 
> Cluster Information:
> Pacemaker 1.1.14
> Corosync Cluster Engine, version '2.3.5'
> pcs version 0.9.150
> dc-version: 1.1.14-5a6cdd1
> no-quorum-policy: ignore
> notification-agent: /etc/sysconfig/notify.sh
> notification-recipient: /var/log/notify.log
> placement-strategy: balanced
> startup-fencing: true
> stonith-enabled: false
> 
> Our device is having 8 cores. Pacemaker and related application
> running on Core 6
> 
> top command output:
> CPU0:  4.4% usr 17.3% sys  0.0% nic 75.7% idle  0.0% io  1.9% irq  0.4% sirq
> CPU1:  9.5% usr  2.5% sys  0.0% nic 88.0% idle  0.0% io  0.0% irq  0.0% sirq
> CPU2:  1.4% usr  1.4% sys  0.0% nic 96.5% idle  0.0% io  0.4% irq  0.0% sirq
> CPU3:  3.4% usr  0.4% sys  0.0% nic 95.5% idle  0.0% io  0.4% irq  0.0% sirq
> CPU4:  7.9% usr  2.4% sys  0.0% nic 88.5% idle  0.0% io  0.9% irq  0.0% sirq
> CPU5:  0.5% usr  0.5% sys  0.0% nic 98.5% idle  0.0% io  0.5% irq  0.0% sirq
> CPU6: 60.3% usr 38.6% sys  0.0% nic  0.0% idle  0.0% io  0.4% irq  0.4% sirq
> CPU7:  2.9% usr 10.3% sys  0.0% nic 83.6% idle  0.0% io  2.9% irq  0.0% sirq
> Load average: 3.47 1.82 1.63 7/314 11444
>  
>   PID  PPID USER     STAT   VSZ %VSZ CPU %CPU COMMAND
>  4921  4839 hacluste R <  78492  2.8   6  2.0 /usr/libexec/pacemaker/cib
> 11240 11239 root     RW<      0  0.0   6  1.9 [python]
>  4925  4839 hacluste R <  52804  1.9   6  1.1 /usr/libexec/pacemaker/pengine
>  4637     1 root     R <  97620  3.5   6  0.4 corosync -p -f
>  4926  4839 hacluste S <   131m  4.8   6  0.3 /usr/libexec/pacemaker/crmd
>  4839     1 root     S <  33448  1.2   6  0.1 pacemakerd
> 
> 
> 
> I am attaching the log for your reference.
> 
> 
> 
> Regards,
> Aswathi

Is there a reason all the cluster services are pegged to one core?
Pacemaker can take advantage of multiple cores both by spreading out
the daemons and by running multiple resource actions at once.

I see you're using the original "notifications" implementation. This
has been superseded by "alerts" in Pacemaker 1.1.15 and later. I
recommend upgrading if you can, which will also get you bugfixes in
pacemaker and corosync that could help. In any case, your notify script
/etc/sysconfig/notify.sh is generating errors. If you don't really need
the notify logging, I'd disable that and see if that helps.
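
For reference, configuring the same script as an alert agent (once on 1.1.15 or
later) is roughly the following; the exact pcs syntax depends on your pcs
version:

pcs alert create path=/etc/sysconfig/notify.sh id=notify
pcs alert recipient add notify value=/var/log/notify.log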

It looks to me that, after failover, the resource agent is setting a
lot of node attributes and possibly its own resource attributes. Each
of those changes requires the cluster to recalculate resource
placement, and that's probably where most of the CPU usage is coming
from. (BTW, setting node attributes is fine, but a resource agent
generally shouldn't change its own configuration.)

You should be able to reduce the CPU usage by setting "dampening" on
the node attributes. This will make the cluster wait a bit of time
before writing node attribute changes to the CIB, so the recalculation
doesn't have to occur immediately after each change. See the "--delay"
option to attrd_updater (which can be used when creating the attribute
initially).
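
For example, an agent could set an attribute with a 5-second dampening delay
like this (attribute name and value are placeholders):

attrd_updater -n my_attr -U some_value -d 5s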

-- 
Ken Gaillot 




___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Force stopping the resources from a resource group in parallel

2017-09-15 Thread Ken Gaillot
On Tue, 2017-09-12 at 10:49 +0200, John Gogu wrote:
> Hello,
> I have created a resource group from 2 resources: pcs resource group
> add Group1 IPaddr Email. From the documentation is clear that
> resources are stopped in the reverse order in which are specified
> (Email first, then IPaddr).
> 
> There is a way to force stopping of the resources from a resource
> group (Group1) in parallel?

A group is essentially a shorthand for ordering+colocation, so the
ordering is always enforced.

Instead of a group, you could create just a colocation constraint,
which allows the resources to start and stop in any order (or
simultaneously), but always on the same node.

If you need them to always start in order, but stopping can be done in
any order (or simultaneously), then use a colocation constraint plus an
ordering constraint with symmetrical=false.
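
With pcs, that would look roughly like this for your two resources:

pcs constraint colocation add Email with IPaddr
pcs constraint order start IPaddr then start Email symmetrical=false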

> 
> 
> Mit freundlichen Grüßen/Kind regards,
> 
> John Gogu
> Skype: ionut.gogu
> 
> ___
> Users mailing list: Users@clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.
> pdf
> Bugs: http://bugs.clusterlabs.org
-- 
Ken Gaillot 




___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] How to avoid stopping ordered resources on cleanup?

2017-09-15 Thread Ken Gaillot
On Fri, 2017-09-15 at 15:23 +, CART Andreas wrote:
> Hello
>  
> I think there is a more general misunderstanding on my side.
> I reproduced the “problem” with a very simple test-cluster containing
> just 2 dummy resources:
>  
> Pacemaker Nodes:
> deneb682 deneb683
>  
> Resources:
> Resource: Res1 (class=ocf provider=pacemaker type=Dummy)
>   Operations: start interval=0s timeout=20 (Res1-start-interval-0s)
>               stop interval=0s timeout=20 (Res1-stop-interval-0s)
>               monitor interval=10 timeout=20 (Res1-monitor-interval-10)
> Resource: Res2 (class=ocf provider=pacemaker type=Dummy)
>   Operations: start interval=0s timeout=20 (Res2-start-interval-0s)
>               stop interval=0s timeout=20 (Res2-stop-interval-0s)
>               monitor interval=10 timeout=20 (Res2-monitor-interval-10)
>  
> Ordering Constraints:
>   start Res1 then start Res2 (kind:Mandatory) (id:order-Res1-Res2-mandatory)
>  
> Cluster Properties:
> cluster-infrastructure: cman
> default-resource-stickiness: 100
> no-quorum-policy: ignore
> symmetric-cluster: true
>  
> When I call “pcs resource cleanup Res1” this will result in an
> interruption of service at the side of Res2 (i.e. stop Res2 …)
> My – unconfirmed – assumption was, that pacemaker would first detect
> the current state of the resource(s) by calling monitor and then
> decide if there are any actions to be performed.
> But from reading the logfiles I would interpret that Res1 is
> temporarily removed from the cib and re-inserted again. And this
> results in stopping Res2 until Res1 has confirmed state “started”.

Correct, removing the resource's operation history is how pacemaker
triggers a re-probe of the current status.
 
> As I interpret the documentation it would be possible to avoid this
> behaviour by configuring the order constraint with kind=Optional.
> But I am not sure if this would result in any other undeserved side
> effects. (e.g on reverse order when stopping)

kind=Optional constraints only apply when both actions need to be done
in the same transition. I.e. if a single cluster check finds that both
Res1 and Res2 need to be started, Res1 will be started before Res2. But
it is entirely possible that Res2 can be started in an earlier
transition, with Res1 still stopped, and a later transition starts
Res1. Similarly when stopping, Res2 will be stopped first, if both need
to be stopped.

In your original scenario, if your master/slave resource will only bind
to the IP after it is up, kind=Optional won't be reliable. But if the
master/slave resource binds to the wildcard IP, then the order really
doesn't matter -- you could keep the colocation constraint and drop the
ordering.
 
> Another work-around seems to be setting the dependent resource to
> unmanaged, perform the cleanup and then set it back to managed.

This is what I would recommend if you have to keep the mandatory
ordering.
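
With pcs, the sequence is roughly:

pcs resource unmanage Res2
pcs resource cleanup Res1
pcs resource manage Res2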

> And I wonder if “pcs resource failcount reset” would do the trick
> WITHOUT any actions being performed if no change in state is
> necessary.
> But I think to remember that we already tried this now and then and
> sometimes such a failed resource was not started after the failcount
> reset.  (But I am not sure and had not yet time to try to reproduce.)

No, in newer pacemaker versions, crm_failcount --delete is equivalent
to a crm_resource --cleanup. (pcs calls these to actually perform the
work)
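
So under the hood, "pcs resource cleanup Res1" ends up running roughly:

crm_resource --cleanup --resource Res1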

> Is there any deeper insight which might help with a sound
> understanding of this issue?

It's a side effect of the current CIB implementation. Pacemaker's
policy engine determines the current state of a resource by checking
its operation history in the CIB. Cleanups remove the operation
history, thus making the current state unknown, forcing a re-probe. As
a side effect, any dependencies no longer have their constraints
satisfied until the re-probe completes.

It would be theoretically possible to implement a "cleanup old
failures" option that would clear a resource's fail count and remove
only its operation history entries for failed operations, as long as
doing so does not change the current state determination. But that
would be quite complicated, and setting the resource unmanaged is an
easy workaround.

> Kind regards
> Andreas Cart
> From: Klaus Wenninger [mailto:kwenn...@redhat.com] 
> Sent: Mittwoch, 13. September 2017 13:33
> To: Cluster Labs - All topics related to open-source clustering
> welcomed; CART Andreas
> Subject: Re: [ClusterLabs] How to avoid stopping ordered resources on
> cleanup?
>  
> On 09/13/2017 10:26 AM, CART Andreas wrote:
> Hello
>  
> We have a basic 2 node active/passive cluster with Pacemaker (1.1.16
> , pcs: 0.9.148) / CMAN (3.0.12.1) / Corosync (1.4.7) on RHEL 6.8.
>  
> On the occasion of testing the cluster we noticed that dependent
> resources are stopped when calling cleanup for a resource lower down
> the order chain.
> But for the production