[ClusterLabs] Q: late stop of dependency?

2016-11-17 Thread Ulrich Windl
Hi!

I have a question:
When I have dependencies like "A has to start before B" and "A has to start
before C", then when shutting down, B and C are stopped before A, as requested.
But when B takes a long time to stop, C is stopped early. I wonder whether I
can tell Pacemaker to stop C late, but I don't want to add a dependency like
"C has to stop before B" (I don't want a functional dependency between B and C,
because there is none).
If that's too abstract: think of A as a resource B needs, and C as a
performance monitor for A.

Regards,
Ulrich




___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Q: late stop of dependency?

2016-11-17 Thread Ken Gaillot
On 11/17/2016 02:46 AM, Ulrich Windl wrote:
> Hi!
> 
> I have a question:
> When I have dependencies like "A has to start before B" and "A has to
> start before C", then when shutting down, B and C are stopped before A,
> as requested. But when B takes a long time to stop, C is stopped early.
> I wonder whether I can tell Pacemaker to stop C late, but I don't want
> to add a dependency like "C has to stop before B" (I don't want a
> functional dependency between B and C, because there is none).
> If that's too abstract: think of A as a resource B needs, and C as a
> performance monitor for A.
> 
> Regards,
> Ulrich

I'd make "stop B then stop C" an asymmetrical, optional constraint.

asymmetrical = it only applies when stopping, not starting

optional = the order is applied only if both resources happen to be
stopping in the same transition
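In CIB XML, such a constraint might look like this (a sketch using the
resource names from the question; kind="Optional" makes the order apply only
when both stops are in the same transition, and symmetrical="false" keeps it
from implying a reverse order on start):

```xml
<rsc_order id="stop-B-then-stop-C"
           first="B" first-action="stop"
           then="C"  then-action="stop"
           kind="Optional" symmetrical="false"/>
```

With pcs, the equivalent should be something like
`pcs constraint order stop B then stop C kind=Optional symmetrical=false`.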



[ClusterLabs] Locate resource with functioning member of clone set?

2016-11-17 Thread Israel Brewster
I have a resource that is set up as a clone set across my cluster, partly for
pseudo-load balancing (if someone wants to perform an action that will take a
lot of resources, I can have them do it on a different node than the primary
one), but also simply because the resource can take several seconds to start.
By having it already running as a clone set, I can fail over in the time it
takes to move an IP resource, essentially zero downtime.

This is all well and good, but I ran into a problem the other day where the
process on one of the nodes stopped working properly. Pacemaker caught the
issue and tried to fix it by restarting the resource, but was unable to,
because the old instance hadn't actually exited completely and was still
tying up the TCP port, thereby preventing the new instance that Pacemaker
launched from starting.

So this leaves me with two questions:

1) Is there a way to set up a "kill script", such that before trying to
launch a new copy of a process, Pacemaker will run this script, which would
be responsible for making sure that there are no other instances of the
process running?

2) Even in the above situation, where Pacemaker couldn't launch a good copy
of the resource on the one node, the situation could have been easily
"resolved" by Pacemaker moving the virtual IP resource to another node where
the cloned resource was running correctly, and notifying me of the problem.
I know how to make colocation constraints in general, but how do I do a
colocation constraint with a cloned resource where I just need the virtual
IP running on *any* node where the clone is working properly? Or is it the
same as any other colocation constraint, and Pacemaker is simply smart
enough to both try to restart the failed resource and move the virtual IP
resource at the same time?

As an addendum to question 2, I'd be interested in any methods there may be
to be notified of changes in the cluster state, specifically things like
when a resource fails on a node. My current nagios/icinga setup doesn't
catch that when Pacemaker properly moves the resource to a different node,
because the resource remains up (which, of course, is the whole point), but
it would still be good to know something happened so I could look into it
and see if something needs to be fixed on the failed node to allow the
resource to run there properly.

Thanks!
---
Israel Brewster
Systems Analyst II
Ravn Alaska
5245 Airport Industrial Rd
Fairbanks, AK 99709
(907) 450-7293
---




[ClusterLabs] Bug in ocf-shellfuncs, ocf_local_nodename function?

2016-11-17 Thread Israel Brewster
This refers specifically to build version
5434e9646462d2c3c8f7aad2609d0ef1875839c7 of the ocf-shellfuncs file, on
CentOS 6.8, so it might not be an issue on later builds (if any) or
different operating systems, but it would appear that the
ocf_local_nodename function can have issues with certain configurations.
Specifically, I was debugging an issue with a resource agent that I traced
down to that function returning the FQDN of the machine rather than the
actual node name, which in my case was a short name.

Looking at the code, I see that the function checks for a Pacemaker version
greater than 1.1.8, in which case it uses crm_node (which works); otherwise
it just uses "uname -n", which returns the FQDN (at least in my
configuration). To get the current version, it runs the command:

local version=$(pacemakerd -$ | grep "Pacemaker .*" | awk '{ print $2 }')

Which on CentOS 6.8 returns (as of today, at least):

1.1.14-8.el6_8.1

Unfortunately, when that string is passed to the ocf_version_cmp function
to compare against 1.1.8, it returns 3, for "bad format", and so the code
falls back to "uname -n", even though the version *is* greater than 1.1.8
and crm_node would return the proper value.

Of course, if you always set up your cluster to use the FQDN of the servers
as the node name, or more specifically always set them up such that the
output of uname -n is the node name, then there isn't an issue, other than
perhaps an undetectably slight loss of efficiency. However, as I
accidentally proved by doing otherwise, there is no actual requirement that
the node names match uname -n (although perhaps it is considered "best
practice"?), as long as they resolve to an IP.

I've worked around this in my installation by simply modifying the resource
agent to call crm_node directly (since I know I am running a version
greater than 1.1.8), but I figured I should mention it, since I don't get
any results when trying to google the issue.
---
Israel Brewster
Systems Analyst II
Ravn Alaska
5245 Airport Industrial Rd
Fairbanks, AK 99709
(907) 450-7293
---




[ClusterLabs] Query about resource stickiness

2016-11-17 Thread phanidhar prattipati
Good Morning All,

I have configured HA on 3 nodes, and in order to disable automatic failover
I need to set a resource-stickiness value, but I'm not sure how to calculate
it. Currently I set it to INFINITY, which I believe is not the right way of
doing it. Any pointers on how to calculate it based on the environment setup
would be a great help.


-- 
Thanks,
Phanidhar


Re: [ClusterLabs] Locate resource with functioning member of clone set?

2016-11-17 Thread Ken Gaillot
On 11/17/2016 11:37 AM, Israel Brewster wrote:
> I have a resource that is set up as a clone set across my cluster,
> partly for pseudo-load balancing (If someone wants to perform an action
> that will take a lot of resources, I can have them do it on a different
> node than the primary one), but also simply because the resource can
> take several seconds to start, and by having it already running as a
> clone set, I can failover in the time it takes to move an IP resource -
> essentially zero down time.
> 
> This is all well and good, but I ran into a problem the other day where
> the process on one of the nodes stopped working properly. Pacemaker
> caught the issue, and tried to fix it by restarting the resource, but
> was unable to because the old instance hadn't actually exited completely
> and was still tying up the TCP port, thereby preventing the new instance
> that pacemaker launched from being able to start.
> 
> So this leaves me with two questions: 
> 
> 1) is there a way to set up a "kill script", such that before trying to
> launch a new copy of a process, pacemaker will run this script, which
> would be responsible for making sure that there are no other instances
> of the process running?

Sure, it's called a resource agent :)

When recovering a failed resource, Pacemaker will call the resource
agent's stop action first, then start. The stop should make sure the
service has exited completely. If it doesn't, the agent should be fixed
to do so.
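As a sketch of what that usually means (simplified; a real agent would read
the daemon's pid from its pid file and return proper OCF codes, and the
names here are illustrative):

```shell
# stop_service: ask the process to exit, wait, then escalate to
# SIGKILL so the stop action never reports success while the old
# instance is still holding the port.
stop_service() {
    local pid=$1
    kill -TERM "$pid" 2>/dev/null
    for i in 1 2 3 4 5; do
        kill -0 "$pid" 2>/dev/null || return 0   # fully gone
        sleep 1
    done
    kill -KILL "$pid" 2>/dev/null                # escalate
    sleep 1
    kill -0 "$pid" 2>/dev/null && return 1       # still there: stop failed
    return 0
}
```

If stop cannot guarantee the process is gone, it should return failure so
the cluster can fence the node rather than start a second instance.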

> 2) Even in the above situation, where pacemaker couldn't launch a good
> copy of the resource on the one node, the situation could have been
> easily "resolved" by pacemaker moving the virtual IP resource to another
> node where the cloned resource was running correctly, and notifying me
> of the problem. I know how to make colocation constraints in general,
> but how do I do a colocation constraint with a cloned resource where I
> just need the virtual IP running on *any* node where the clone is
> working properly? Or is it the same as any other colocation resource,
> and pacemaker is simply smart enough to both try to restart the failed
> resource and move the virtual IP resource at the same time?

Correct, a simple colocation constraint of "resource R with clone C"
will make sure R runs with a working instance of C.
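Expressed in CIB XML, that constraint could look like this (a sketch with
hypothetical resource names, "vip" for the IP resource and "service-clone"
for the clone):

```xml
<rsc_colocation id="vip-with-service"
                rsc="vip" with-rsc="service-clone"
                score="INFINITY"/>
```

or, with pcs: `pcs constraint colocation add vip with service-clone INFINITY`.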

There is a catch: if *any* instance of C restarts, R will also restart
(even if it stays in the same place), because it depends on the clone as
a whole. Also, in the case you described, pacemaker would first try to
restart both C and R on the same node, rather than move R to another
node (although you could set on-fail=stop on C to force R to move).

If that's not sufficient, you could try some magic with node attributes
and rules. The new ocf:pacemaker:attribute resource in 1.1.16 could help
there.

> As an addendum to question 2, I'd be interested in any methods there may
> be to be notified of changes in the cluster state, specifically things
> like when a resource fails on a node - my current nagios/icinga setup
> doesn't catch that when pacemaker properly moves the resource to a
> different node, because the resource remains up (which, of course, is
> the whole point), but it would still be good to know something happened
> so I could look into it and see if something needs to be fixed on the failed
> node to allow the resource to run there properly.

Since 1.1.15, Pacemaker has alerts:

http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#idm47975782138832

Before 1.1.15, you can use the ocf:pacemaker:ClusterMon resource to do
something similar.
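A hypothetical minimal alert agent (not one of the shipped samples) just
reads the CRM_alert_* environment variables that Pacemaker sets for each
event:

```shell
#!/bin/sh
# Sketch of an alert agent: Pacemaker runs the configured script once
# per event, passing details in CRM_alert_* environment variables.
format_alert() {
    case "$CRM_alert_kind" in
        resource)
            echo "resource $CRM_alert_rsc: $CRM_alert_task on $CRM_alert_node (rc=$CRM_alert_rc)"
            ;;
        node)
            echo "node $CRM_alert_node is now $CRM_alert_desc"
            ;;
        *)
            echo "event: $CRM_alert_kind"
            ;;
    esac
}

# A real agent would mail or log this instead of printing it:
format_alert
```

The script path is then registered with the cluster, e.g.
`pcs alert create path=/usr/local/bin/my_alert.sh` (pcs 0.9.151+).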

> 
> Thanks!
> ---
> Israel Brewster
> Systems Analyst II
> Ravn Alaska
> 5245 Airport Industrial Rd
> Fairbanks, AK 99709
> (907) 450-7293
> ---



Re: [ClusterLabs] Bug in ocf-shellfuncs, ocf_local_nodename function?

2016-11-17 Thread Ken Gaillot
On 11/17/2016 11:59 AM, Israel Brewster wrote:
> This refers specifically to build version
> 5434e9646462d2c3c8f7aad2609d0ef1875839c7 of the ocf-shellfuncs file, on
> CentOS 6.8, so it might not be an issue on later builds (if any) or
> different operating systems, but it would appear that the
> ocf_local_nodename function can have issues with certain configurations.
> Specifically, I was debugging an issue I was having with a resource agent
> that I traced down to that function returning the FQDN of the machine
> rather than the actual node name, which in my case was a short name.
> 
> In looking at the code, I see that the function is looking for a
> pacemaker version greater than 1.1.8, in which case it uses crm_node
> (which works), otherwise it just uses "uname -n", which returns the FQDN
> (at least in my configuration). To get the current version, it runs the
> command:
> 
> local version=$(pacemakerd -$ | grep "Pacemaker .*" | awk '{ print $2 }')
> 
> Which on CentOS 6.8 returns (as of today, at least):
> 
> 1.1.14-8.el6_8.1
> 
> Unfortunately, when that string is passed to the ocf_version_cmp
> function to compare against 1.1.8, it returns 3, for "bad format", and
> so falls back to using "uname -n", even though the version *is* greater
> than 1.1.8, and crm_node would return the proper value.
> 
> Of course, if you always set up your cluster to use the FQDN of the
> servers as the node name, or more specifically always set them up such
> that the output of uname -n is the node name, then there isn't an issue
> other than perhaps an undetectably slight loss of efficiency. However, as
> I accidentally proved by doing otherwise, there is no actual requirement
> when setting up a cluster that the node names match uname -n (although
> perhaps it is considered "best practice"?), as long as they resolve to
> an IP.

Yes, it is considered a "best practice" (or at least a "safer
practice"), because issues like this tend to pop up periodically. :(

I'd recommend filing a bug against the resource-agents package, so the
version comparison can be made more intelligent.
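In the meantime, a workaround sketch (assuming ocf_version_cmp accepts
plain dotted versions): strip the packaging release suffix before comparing.

```shell
# Hypothetical helper: reduce a packaged version string like
# "1.1.14-8.el6_8.1" to the upstream part ("1.1.14") that
# ocf_version_cmp can parse.
normalize_pcmk_version() {
    printf '%s\n' "$1" | sed 's/-.*$//'
}
```

The shellfuncs code would then call
`ocf_version_cmp "$(normalize_pcmk_version "$version")" 1.1.8` instead of
passing the raw string.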

> 
> I've worked around this in my installation by simply modifying the
> resource agent to call crm_node directly (since I know I am running on a
> version greater than 1.1.8), but I figured I might mention it, since I
> don't get any results when trying to google the issue.
> ---
> Israel Brewster
> Systems Analyst II
> Ravn Alaska
> 5245 Airport Industrial Rd
> Fairbanks, AK 99709
> (907) 450-7293
> ---



Re: [ClusterLabs] Query about resource stickiness

2016-11-17 Thread Ken Gaillot
On 11/17/2016 06:41 PM, phanidhar prattipati wrote:
> Good Morning All,
> 
> I have configured HA on 3 nodes, and in order to disable automatic
> failover I need to set a resource-stickiness value, but I'm not sure
> how to calculate it. Currently I set it to INFINITY, which I believe
> is not the right way of doing it. Any pointers on how to calculate it
> based on the environment setup would be a great help.
> 
> 
> -- 
> Thanks,
> Phanidhar

INFINITY is fine -- many people use that for stickiness.

It's simply a matter of preference. What matters is how it weighs
against the other scores in your configuration.

For example, let's say you have a resource R with a location constraint
preferring node N, and you have resource stickiness.

If N crashes or is shut down, R will move to another node. When N comes
back, R will stay on the other node if the resource stickiness is higher
than the location constraint's score; it will move back to N if the
location constraint's score is higher.

A score of INFINITY means never move back, as long as the new node stays up.
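For example (hypothetical pcs commands, with resource R and node N as
above):

```shell
# Stickiness 200 outweighs the location score of 100, so after a
# failover R stays where it is; make the location score larger than
# the stickiness and R moves back to N instead.
pcs resource defaults resource-stickiness=200
pcs constraint location R prefers N=100
```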



[ClusterLabs] Antw: Locate resource with functioning member of clone set?

2016-11-17 Thread Ulrich Windl
>>> Israel Brewster wrote on 17.11.2016 at 18:37 in
message <751f1bd6-8434-4ad9-b77f-10eddfe28...@ravnalaska.net>:
> I have a resource that is set up as a clone set across my cluster, partly for 
> pseudo-load balancing (If someone wants to perform an action that will take a 
> lot of resources, I can have them do it on a different node than the primary 
> one), but also simply because the resource can take several seconds to start, 
> and by having it already running as a clone set, I can failover in the time 
> it takes to move an IP resource - essentially zero down time.
> 
> This is all well and good, but I ran into a problem the other day where the 
> process on one of the nodes stopped working properly. Pacemaker caught the 
> issue, and tried to fix it by restarting the resource, but was unable to 
> because the old instance hadn't actually exited completely and was still 
> tying up the TCP port, thereby preventing the new instance that pacemaker 
> launched from being able to start.
> 
> So this leaves me with two questions: 
> 
> 1) is there a way to set up a "kill script", such that before trying to 
> launch a new copy of a process, pacemaker will run this script, which would 
> be responsible for making sure that there are no other instances of the 
> process running?
> 2) Even in the above situation, where pacemaker couldn't launch a good copy 
> of the resource on the one node, the situation could have been easily 
> "resolved" by pacemaker moving the virtual IP resource to another node where 
> the cloned resource was running correctly, and notifying me of the problem. I 
> know how to make colocation constraints in general, but how do I do a 
> colocation constraint with a cloned resource where I just need the virtual IP 
> running on *any* node where the clone is working properly? Or is it the 
> same as any other colocation resource, and pacemaker is simply smart enough 
> to both try to restart the failed resource and move the virtual IP resource 
> at the same time?

I wonder: Wouldn't a monitor operation that reports the resource as running as 
long as the port is occupied resolve both issues?
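For example, the monitor action could include a TCP check along these
lines (a sketch; the port check and the numeric return codes, 0 for
OCF_SUCCESS and 7 for OCF_NOT_RUNNING, are illustrative):

```shell
# Report the resource as still "running" while anything is listening
# on its TCP port, so Pacemaker keeps waiting instead of trying to
# start a new instance on top of the old one.
port_in_use() {
    # succeeds if a connection to 127.0.0.1:$1 is accepted
    # (uses bash's /dev/tcp and coreutils timeout)
    timeout 1 bash -c "exec 3<>/dev/tcp/127.0.0.1/$1" 2>/dev/null
}

monitor_port() {
    if port_in_use "$1"; then
        return 0    # OCF_SUCCESS: port still occupied
    else
        return 7    # OCF_NOT_RUNNING: safe to start again
    fi
}
```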

> 
> As an addendum to question 2, I'd be interested in any methods there may be 
> to be notified of changes in the cluster state, specifically things like when 
> a resource fails on a node - my current nagios/icinga setup doesn't catch 
> that 
> when pacemaker properly moves the resource to a different node, because the 
> resource remains up (which, of course, is the whole point), but it would 
> still be good to know something happened so I could look into it and see if 
> something needs to be fixed on the failed node to allow the resource to run 
> there properly.
> properly.
> 
> Thanks!
> ---
> Israel Brewster
> Systems Analyst II
> Ravn Alaska
> 5245 Airport Industrial Rd
> Fairbanks, AK 99709
> (907) 450-7293
> ---




