Re: [ClusterLabs] Passing and binding to virtual IP in my service

2016-01-08 Thread Solutions Solutions
hi Nikhil,
 can you send me the N+1 redundancy configuration file, which you posted
earlier.

On Thu, Jan 7, 2016 at 2:58 PM, Nikhil Utane 
wrote:

> Hi,
>
> I have my cluster up and running just fine. I have a dummy service that
> sends UDP packets out to another host.
>
>  Resource Group: MyGroup
>  ClusterIP  (ocf::heartbeat:IPaddr2):   Started node1
>  UDPSend    (ocf::nikhil:UDPSend):  Started node1
>
> If I ping to the virtual IP from outside, the response goes via virtual IP.
> But if I initiate ping from node1, then it takes the actual (non-virtual
> IP). This is expected since I am not binding to the vip. (ping -I vip works
> fine).
> So my question is: how do I pass the virtual IP to my UDPSend OCF agent so
> that it can then bind to the vip? This will ensure that all messages
> initiated by my UDPSend go out from the vip.
>
> Out of curiosity, where is this virtual IP stored in the kernel?
> I expected to see a secondary interface (e.g. eth0:1) with the vip
> but it isn't there.
>
> -Thanks
> Nikhil
>
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] [Q] Pacemaker: Kamailio resource agent

2016-01-08 Thread Ken Gaillot
On 12/26/2015 05:27 AM, Sebish wrote:
> Hello to all ha users,
> 
> first of all, thanks for your work on the mailing list, pacemaker and the RAs!
> 
> I have an issue with the kamailio resource agent
> 
> (RA) and it would be great if you could help me a little.

I'm not familiar with kamailio, but I can make some general comments ...

> -- 
> Status:
> 
> Debian 7.9
> Kamailio - running
> Heartbeat & Pacemaker - running (incl. running virtual IP and apache ra)
> and more
> 
> What I did:
> 
>  * Create /usr/lib/ocf/resource.d/heartbeat/kamailio and chmod 755'd it
>  * Then I inserted the code of the ra and changed the following:
>  o RESKEY_kamuser_default="*myuser*"

It's not necessary to change the defaults in the code; when you create
the resource configuration in the cluster, you can specify options (such
as "kamuser=*myuser*") to override the defaults.

>  o Line 52 to:
>    RESKEY_pidfile_default="/var/run/kamailio/kamailio.pid" (This is
>    in my kamctlrc file too, exists and works)
>  o Line 53 to: RESKEY_monitoring_ip_default=*IPOFMYKAMAILIOSERVER*
>  o Changed : ${OCF_FUNCTIONS_DIR=${OCF_ROOT}/lib/heartbeat} ->
>    /usr/lib/ocf/lib/heartbeat, because it could not find it otherwise

This shouldn't be necessary; pacemaker should set the OCF_ROOT
environment variable before calling the agent. If you were having
trouble testing it from the command line, simply set
OCF_ROOT=/usr/lib/ocf before calling it.
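
For example (a rough sketch; the OCF_RESKEY_* variables correspond to whatever
parameters you would configure for the resource):

  export OCF_ROOT=/usr/lib/ocf
  export OCF_RESKEY_kamuser="myuser"
  # Run a single action by hand and inspect the exit code
  /usr/lib/ocf/resource.d/heartbeat/kamailio monitor; echo "rc=$?"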

>  o Changed HTML snippet &amp;&amp; to &&

I'm not sure what you mean here. The example given in the agent's XML
metadata should stay as &amp;&amp; since it's XML and may not parse
correctly otherwise. If you're talking about your kamailio.cfg, then
yes, you should use && there.

>  o listen_address:*virtualipofkamailioserver*
>  o (For more see attachment)
> 
>  * Installed sipsak
> 
> What I get:
> 
> crm status gives me: STOPPED - Kamailio_start_0 (node=node1, call=22,
> rc=-2, status=Timed Out): unknown exec error (on all nodes)

This means that pacemaker tried to call the "start" action of the
resource agent, but it timed out on every node. It's possible the start
action isn't working, or that the timeout is too short. You can set the
timeout by defining a start operation for the resource in the cluster
configuration, with a timeout= option.
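
For example, something like this (crm-shell sketch only; the resource name
"Kamailio" is taken from your status output, the other values are guesses):

  crm configure primitive Kamailio ocf:heartbeat:kamailio \
      params listen_address="your.virtual.ip" \
      op start timeout="120s" \
      op monitor interval="30s" timeout="60s"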

> What I need:
> 
>  * In the ra at line 155 it says to insert a code snippet to the
>kamailio.cfg, but not where exactly.
>  o Please tell me, at which spot exactly I have to insert it. (I
>pasted it at line ~582, # Handle requests within SIP dialogs)
> 
>  * Is there a way to debug the kamailio ra, if inserting the code
>    snippet using your help is not enough?

Any output from resource agents should be in the system log and/or
pacemaker.log. That's a good place to start.

There are also tools such as ocf-tester and ocft to test resource agents
from the command line (though they're not always made available in
packages).
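
For example (sketch; check the option names against your installed ocf-tester):

  # -n gives the test a resource name, -o passes agent parameters
  ocf-tester -n test_kamailio -o kamuser="myuser" \
      /usr/lib/ocf/resource.d/heartbeat/kamailio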

> 
> Thank you very much for your time and interest!
> 
> Sebastian


___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Help required for N+1 redundancy setup

2016-01-08 Thread Ken Gaillot
On 01/08/2016 11:13 AM, Nikhil Utane wrote:
>> I think stickiness will do what you want here. Set a stickiness higher
>> than the original node's preference, and the resource will want to stay
>> where it is.
> 
> Not sure I understand this. Stickiness will ensure that resources don't
> move back when the original node comes back up, right?
> But in my case, I want the node that has just become standby to be the backup
> node for all other nodes, i.e. it should now be able to run all my resource
> groups, albeit with a lower score. How do I achieve that?

Oh right. I forgot to ask whether you had an opt-out
(symmetric-cluster=true, the default) or opt-in
(symmetric-cluster=false) cluster. If you're opt-out, every node can run
every resource unless you give it a negative preference.

Partly it depends on whether there is a good reason to give each
instance a "home" node. Often, there's not. If you just want to balance
resources across nodes, the cluster will do that by default.

If you prefer to put certain resources on certain nodes because the
resources require more physical resources (RAM/CPU/whatever), you can
set node attributes for that and use rules to set node preferences.

Either way, you can decide whether you want stickiness with it.
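
As a rough illustration of the node-attribute approach (crm shell; the
attribute name "capacity", the resource name "big-rsc" and the score are all
made up):

  # Mark a node and prefer it for a particular resource via a rule
  crm node attribute node1 set capacity high
  crm configure location loc-big-rsc-on-capable-nodes big-rsc \
      rule 500: capacity eq high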

> Also, can you tell me how to get the names of the node that goes active and
> the node that goes down inside the OCF agent? Do I need to use notifications,
> or is a simpler alternative available?
> Thanks.
> 
> 
> On Fri, Jan 8, 2016 at 9:30 PM, Ken Gaillot  wrote:
> 
>> On 01/08/2016 06:55 AM, Nikhil Utane wrote:
>>> Would like to validate my final config.
>>>
>>> As I mentioned earlier, I will have (up to) 5 active servers and 1
>>> standby server.
>>> The standby server should take up the role of active that went down. Each
>>> active has some unique configuration that needs to be preserved.
>>>
>>> 1) So I will create 5 groups in total. Each group has a "heartbeat::IPaddr2"
>>> resource (for the virtual IP) and my custom resource.
>>> 2) The virtual IP needs to be read inside my custom OCF agent, so I will
>>> make use of attribute reference and point to the value of IPaddr2 inside
>> my
>>> custom resource to avoid duplication.
>>> 3) I will then configure location constraint to run the group resource
>> on 5
>>> active nodes with higher score and lesser score on standby.
>>> For e.g.
>>> Group     Node    Score
>>> -------------------------
>>> MyGroup1  node1   500
>>> MyGroup1  node6   0
>>>
>>> MyGroup2  node2   500
>>> MyGroup2  node6   0
>>> ..
>>> MyGroup5  node5   500
>>> MyGroup5  node6   0
>>>
>>> 4) Now if say node1 were to go down, then stop action on node1 will first
>>> get called. Haven't decided if I need to do anything specific here.
>>
>> To clarify, if node1 goes down intentionally (e.g. standby or stop),
>> then all resources on it will be stopped first. But if node1 becomes
>> unavailable (e.g. crash or communication outage), it will get fenced.
>>
>>> 5) But when the start action of node 6 gets called, then using crm
>> command
>>> line interface, I will modify the above config to swap node 1 and node 6.
>>> i.e.
>>> MyGroup1  node6   500
>>> MyGroup1  node1   0
>>>
>>> MyGroup2  node2   500
>>> MyGroup2  node1   0
>>>
>>> 6) To do the above, I need the newly active and newly standby node names
>> to
>>> be passed to my start action. What's the best way to get this information
>>> inside my OCF agent?
>>
>> Modifying the configuration from within an agent is dangerous -- too
>> much potential for feedback loops between pacemaker and the agent.
>>
>> I think stickiness will do what you want here. Set a stickiness higher
>> than the original node's preference, and the resource will want to stay
>> where it is.
>>
>>> 7) Apart from node name, there will be other information which I plan to
>>> pass by making use of node attributes. What's the best way to get this
>>> information inside my OCF agent? Use crm command to query?
>>
>> Any of the command-line interfaces for doing so should be fine, but I'd
>> recommend using one of the lower-level tools (crm_attribute or
>> attrd_updater) so you don't have a dependency on a higher-level tool
>> that may not always be installed.
>>
>>> Thank You.
>>>
>>> On Tue, Dec 22, 2015 at 9:44 PM, Nikhil Utane <
>> nikhil.subscri...@gmail.com>
>>> wrote:
>>>
 Thanks to you Ken for giving all the pointers.
 Yes, I can use service start/stop which should be a lot simpler. Thanks
 again. :)

 On Tue, Dec 22, 2015 at 9:29 PM, Ken Gaillot 
>> wrote:

> On 12/22/2015 12:17 AM, Nikhil Utane wrote:
>> I have prepared a write-up explaining my requirements and current
> solution
>> that I am proposing based on my understanding so far.
>> Kindly let me know if what I am 

Re: [ClusterLabs] Help required for N+1 redundancy setup

2016-01-08 Thread Nikhil Utane
Would like to validate my final config.

As I mentioned earlier, I will have (up to) 5 active servers and 1
standby server.
The standby server should take up the role of active that went down. Each
active has some unique configuration that needs to be preserved.

1) So I will create 5 groups in total. Each group has a "heartbeat::IPaddr2"
resource (for the virtual IP) and my custom resource.
2) The virtual IP needs to be read inside my custom OCF agent, so I will
make use of attribute reference and point to the value of IPaddr2 inside my
custom resource to avoid duplication.
3) I will then configure location constraint to run the group resource on 5
active nodes with higher score and lesser score on standby.
For e.g.
Group     Node    Score
-------------------------
MyGroup1  node1   500
MyGroup1  node6   0

MyGroup2  node2   500
MyGroup2  node6   0
..
MyGroup5  node5   500
MyGroup5  node6   0
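
For reference, a crm-shell rendering of the table above might look like this
(just a sketch; the constraint IDs are arbitrary):

  crm configure location loc-MyGroup1-node1 MyGroup1 500: node1
  crm configure location loc-MyGroup1-node6 MyGroup1 0: node6
  # ...and likewise for MyGroup2 through MyGroup5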

4) Now if, say, node1 were to go down, then the stop action on node1 will be
called first. I haven't decided if I need to do anything specific here.
5) But when the start action on node 6 gets called, I will then use the crm
command-line interface to modify the above config to swap node 1 and node 6,
i.e.
MyGroup1  node6   500
MyGroup1  node1   0

MyGroup2  node2   500
MyGroup2  node1   0

6) To do the above, I need the newly active and newly standby node names to
be passed to my start action. What's the best way to get this information
inside my OCF agent?
7) Apart from node name, there will be other information which I plan to
pass by making use of node attributes. What's the best way to get this
information inside my OCF agent? Use crm command to query?

Thank You.

On Tue, Dec 22, 2015 at 9:44 PM, Nikhil Utane 
wrote:

> Thanks to you Ken for giving all the pointers.
> Yes, I can use service start/stop which should be a lot simpler. Thanks
> again. :)
>
> On Tue, Dec 22, 2015 at 9:29 PM, Ken Gaillot  wrote:
>
>> On 12/22/2015 12:17 AM, Nikhil Utane wrote:
>> > I have prepared a write-up explaining my requirements and current
>> solution
>> > that I am proposing based on my understanding so far.
>> > Kindly let me know if what I am proposing is good or there is a better
>> way
>> > to achieve the same.
>> >
>> >
>> https://drive.google.com/file/d/0B0zPvL-Tp-JSTEJpcUFTanhsNzQ/view?usp=sharing
>> >
>> > Let me know if you face any issue in accessing the above link. Thanks.
>>
>> This looks great. Very well thought-out.
>>
>> One comment:
>>
>> "8. In the event of any failover, the standby node will get notified
>> through an event and it will execute a script that will read the
>> configuration specific to the node that went down (again using
>> crm_attribute) and become active."
>>
>> It may not be necessary to use the notifications for this. Pacemaker
>> will call your resource agent with the "start" action on the standby
>> node, after ensuring it is stopped on the previous node. Hopefully the
>> resource agent's start action has (or can have, with configuration
>> options) all the information you need.
>>
>> If you do end up needing notifications, be aware that the feature will
>> be disabled by default in the 1.1.14 release, because changes in syntax
>> are expected in further development. You can define a compile-time
>> constant to enable them.
>>
>>
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Passing and binding to virtual IP in my service

2016-01-08 Thread Nikhil Utane
What I had posted earlier was an approach to N+1 redundancy for my
use-case (which could be different from yours).
I am attaching it along with the CIB XML to this thread (I don't know if
attachments are allowed.)
There are some follow-up questions that I am posting on my other thread.
Please check that.

On Fri, Jan 8, 2016 at 1:41 PM, Solutions Solutions 
wrote:

> hi Nikhil,
>  can you send me the N+1 redundancy configuration file, which you posted
> earlier.
>
> On Thu, Jan 7, 2016 at 2:58 PM, Nikhil Utane 
> wrote:
>
>> Hi,
>>
>> I have my cluster up and running just fine. I have a dummy service that
>> sends UDP packets out to another host.
>>
>>  Resource Group: MyGroup
>>  ClusterIP  (ocf::heartbeat:IPaddr2):   Started node1
>>  UDPSend    (ocf::nikhil:UDPSend):  Started node1
>>
>> If I ping to the virtual IP from outside, the response goes via virtual
>> IP.
>> But if I initiate ping from node1, then it takes the actual (non-virtual
>> IP). This is expected since I am not binding to the vip. (ping -I vip works
>> fine).
>> So my question is: how do I pass the virtual IP to my UDPSend OCF agent so
>> that it can then bind to the vip? This will ensure that all messages
>> initiated by my UDPSend go out from the vip.
>>
>> Out of curiosity, where is this virtual IP stored in the kernel?
>> I expected to see a secondary interface (e.g. eth0:1) with the vip
>> but it isn't there.
>>
>> -Thanks
>> Nikhil
>>


Redundancy using Pacemaker & Corosync-External.docx
Description: MS-Word 2007 document

[Attachment: CIB XML — the configuration markup was not preserved by the list
archive; only a fragment of an apache monitor parameter
(http://localhost/server-status) survives.]

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: Re: Antw: Re: Early VM resource migration

2016-01-08 Thread Klechomir

Here is what pacemaker says right after node1 comes back after standby:

Dec 16 16:11:41 [4512] CLUSTER-2    pengine:    debug: native_assign_node: All nodes for resource VM_VM1 are unavailable, unclean or shutting down (CLUSTER-1: 1, -100)

Dec 16 16:11:41 [4512] CLUSTER-2    pengine:    debug: native_assign_node: Could not allocate a node for VM_VM1

Dec 16 16:11:41 [4512] CLUSTER-2    pengine:    debug: native_assign_node: Processing VM_VM1_monitor_1

Dec 16 16:11:41 [4512] CLUSTER-2    pengine:     info: native_color: Resource VM_VM1 cannot run anywhere




VM_VM1 gets immediately stopped as soon as node1 re-appears and stays 
down until its "order/colocation AA resource" comes up on node1.


The curious part is that in the opposite case (node2 comes from 
standby), the failback is ok.


Any ideas?

Regards,

On 17.12.2015 14:51:21 Ulrich Windl wrote:

>>> Klechomir  wrote on 17.12.2015 at 14:16 in message

<2102747.TPh6pTdk8c@bobo>:
> Hi Ulrich,
> This is only a part of the config, which concerns the problem.
> Even with dummy resources, the behaviour will be identical, so I don't think
> that the dlm/clvmd resource config will help solve the problem.

You could send logs with the actual startup sequence then.

> Regards,
> KIecho
>
> On 17.12.2015 08:19:43 Ulrich Windl wrote:
>> >>> Klechomir  wrote on 16.12.2015 at 17:30 in
>> >>> message
>>
>> <5671918e.40...@gmail.com>:
>> > On 16.12.2015 17:52, Ken Gaillot wrote:
>> >> On 12/16/2015 02:09 AM, Klechomir wrote:
>> >>> Hi list,
>> >>> I have a cluster with VM resources on a cloned active-active storage.
>> >>>
>> >>> VirtualDomain resource migrates properly during failover (node
>> >>> standby),
>> >>> but tries to migrate back too early, during failback, ignoring the
>> >>> "order" constraint, telling it to start when the cloned storage is
>> >>> up.
>> >>> This causes unnecessary VM restart.
>> >>>
>> >>> Is there any way to make it wait, until its storage resource is up?
>> >>
>> >> Hi Klecho,
>> >>
>> >> If you have an order constraint, the cluster will not try to start the
>> >> VM until the storage resource agent returns success for its start. If
>> >> the storage isn't fully up at that point, then the agent is faulty,
>> >> and
>> >> should be modified to wait until the storage is truly available before
>> >> returning success.
>> >>
>> >> If you post all your constraints, I can look for anything that might
>> >> affect the behavior.
>> >
>> > Thanks for the reply, Ken
>> >
>> > Seems to me that the constraints for cloned resources act a bit
>> > differently.
>> >
>> > Here is my config:
>> >
>> > primitive p_AA_Filesystem_CDrive1 ocf:heartbeat:Filesystem \
>> >
>> >  params device="/dev/CSD_CDrive1/AA_CDrive1"
>> >
>> > directory="/volumes/AA_CDrive1" fstype="ocfs2" options="rw,noatime"
>> > primitive VM_VM1 ocf:heartbeat:VirtualDomain \
>> >
>> >  params config="/volumes/AA_CDrive1/VM_VM1/VM1.xml"
>> >
>> > hypervisor="qemu:///system" migration_transport="tcp" \
>> >
>> >  meta allow-migrate="true" target-role="Started"
>> >
>> > clone AA_Filesystem_CDrive1 p_AA_Filesystem_CDrive1 \
>> >
>> >  meta interleave="true" resource-stickiness="0"
>> >
>> > target-role="Started"
>> > order VM_VM1_after_AA_Filesystem_CDrive1 inf: AA_Filesystem_CDrive1
>> > VM_VM1
>> >
>> > Every time when a node comes back from standby, the VM tries to live
>> > migrate to it long before the filesystem is up.
>>
>> Hi!
>>
>> To me your config looks rather incomplete: What about DLM, O2CB, cLVM,
>> etc.?

___
Users mailing list: Users@clusterlabs.org

Re: [ClusterLabs] Help required for N+1 redundancy setup

2016-01-08 Thread Ken Gaillot
On 01/08/2016 06:55 AM, Nikhil Utane wrote:
> Would like to validate my final config.
> 
> As I mentioned earlier, I will have (up to) 5 active servers and 1
> standby server.
> The standby server should take up the role of active that went down. Each
> active has some unique configuration that needs to be preserved.
> 
> 1) So I will create 5 groups in total. Each group has a "heartbeat::IPaddr2"
> resource (for the virtual IP) and my custom resource.
> 2) The virtual IP needs to be read inside my custom OCF agent, so I will
> make use of attribute reference and point to the value of IPaddr2 inside my
> custom resource to avoid duplication.
> 3) I will then configure location constraint to run the group resource on 5
> active nodes with higher score and lesser score on standby.
> For e.g.
> Group     Node    Score
> -------------------------
> MyGroup1  node1   500
> MyGroup1  node6   0
> 
> MyGroup2  node2   500
> MyGroup2  node6   0
> ..
> MyGroup5  node5   500
> MyGroup5  node6   0
> 
> 4) Now if, say, node1 were to go down, then the stop action on node1 will be
> called first. I haven't decided if I need to do anything specific here.

To clarify, if node1 goes down intentionally (e.g. standby or stop),
then all resources on it will be stopped first. But if node1 becomes
unavailable (e.g. crash or communication outage), it will get fenced.

> 5) But when the start action on node 6 gets called, I will then use the crm
> command-line interface to modify the above config to swap node 1 and node 6,
> i.e.
> MyGroup1  node6   500
> MyGroup1  node1   0
> 
> MyGroup2  node2   500
> MyGroup2  node1   0
> 
> 6) To do the above, I need the newly active and newly standby node names to
> be passed to my start action. What's the best way to get this information
> inside my OCF agent?

Modifying the configuration from within an agent is dangerous -- too
much potential for feedback loops between pacemaker and the agent.

I think stickiness will do what you want here. Set a stickiness higher
than the original node's preference, and the resource will want to stay
where it is.
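
For example (sketch only; the value just needs to exceed the 500-point node
preference you configured):

  # Cluster-wide default stickiness
  crm configure rsc_defaults resource-stickiness=1000
  # ...or per resource group
  crm resource meta MyGroup1 set resource-stickiness 1000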

> 7) Apart from node name, there will be other information which I plan to
> pass by making use of node attributes. What's the best way to get this
> information inside my OCF agent? Use crm command to query?

Any of the command-line interfaces for doing so should be fine, but I'd
recommend using one of the lower-level tools (crm_attribute or
attrd_updater) so you don't have a dependency on a higher-level tool
that may not always be installed.
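
For example, from within the agent ("my_attr" is just a placeholder name):

  # Node attribute stored in the CIB
  crm_attribute --node node1 --name my_attr --query --quiet
  # Transient attribute managed by attrd
  attrd_updater --node node1 --name my_attr --query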

> Thank You.
> 
> On Tue, Dec 22, 2015 at 9:44 PM, Nikhil Utane 
> wrote:
> 
>> Thanks to you Ken for giving all the pointers.
>> Yes, I can use service start/stop which should be a lot simpler. Thanks
>> again. :)
>>
>> On Tue, Dec 22, 2015 at 9:29 PM, Ken Gaillot  wrote:
>>
>>> On 12/22/2015 12:17 AM, Nikhil Utane wrote:
 I have prepared a write-up explaining my requirements and current
>>> solution
 that I am proposing based on my understanding so far.
 Kindly let me know if what I am proposing is good or there is a better
>>> way
 to achieve the same.


>>> https://drive.google.com/file/d/0B0zPvL-Tp-JSTEJpcUFTanhsNzQ/view?usp=sharing

 Let me know if you face any issue in accessing the above link. Thanks.
>>>
>>> This looks great. Very well thought-out.
>>>
>>> One comment:
>>>
>>> "8. In the event of any failover, the standby node will get notified
>>> through an event and it will execute a script that will read the
>>> configuration specific to the node that went down (again using
>>> crm_attribute) and become active."
>>>
>>> It may not be necessary to use the notifications for this. Pacemaker
>>> will call your resource agent with the "start" action on the standby
>>> node, after ensuring it is stopped on the previous node. Hopefully the
>>> resource agent's start action has (or can have, with configuration
>>> options) all the information you need.
>>>
>>> If you do end up needing notifications, be aware that the feature will
>>> be disabled by default in the 1.1.14 release, because changes in syntax
>>> are expected in further development. You can define a compile-time
>>> constant to enable them.
>>>
>>>
> 


___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Parallel adding of resources

2016-01-08 Thread Ken Gaillot
On 01/08/2016 12:34 AM, Arjun Pandey wrote:
> Hi
> 
> I am running a 2-node cluster with this config on CentOS 6.6
> 
> Master/Slave Set: foo-master [foo]
> Masters: [ messi ]
> Stopped: [ ronaldo ]
>  eth1-CP    (ocf::pw:IPaddr):   Started messi
>  eth2-UP    (ocf::pw:IPaddr):   Started messi
>  eth3-UPCP  (ocf::pw:IPaddr):   Started messi
> 
> where I have a multi-state resource foo being run in master/slave mode,
> and the IPaddr RA is just a modified IPaddr2 RA. Additionally I have a
> colocation constraint for the IP addresses to be colocated with the master.
> 
> Now there are cases where I have multiple virtual IPs (around 20),
> and the failover time gets substantially increased in these cases.
> Based on the logs, what I have observed is that the IPaddr resources are
> moved sequentially. Is this really the case? Also, is it possible to
> specify that they can be added simultaneously, since none of them have
> any sort of correlation with the others?

While pacemaker will of course initiate the moves one by one, it
shouldn't wait for one to be completed before initiating the next one,
unless you have ordering constraints between them or they are in a group
together.
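
If you want to double-check, you can dump just the constraints section of the
CIB, and look at the batch-limit cluster property, which caps how many actions
the cluster will execute in parallel (sketch):

  cibadmin -Q -o constraints
  crm configure show | grep batch-limit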

> If it's sequential, what is the reason behind it?
> 
> 
> Thanks in advance.
> 
> Regards
> Arjun


___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Automatic Recover for stonith:external/libvirt

2016-01-08 Thread Ken Gaillot
On 01/08/2016 08:56 AM, m...@inwx.de wrote:
> Hello List,
> 
> I have a test environment here for checking pacemaker. Sometimes our
> kvm hosts with libvirt have trouble responding to the stonith/libvirt
> resource, so I would like to configure the resource to be treated as failed
> only after three failed monitoring attempts. I was searching for a
> configuration here:
> 
> 
> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Pacemaker_Explained/index.html
> 
> 
> But I failed after hours.
> 
> That's the configuration line for stonith/libvirt:
> 
> crm configure primitive p_fence_ha3 stonith:external/libvirt  params
> hostlist="ha3" hypervisor_uri="qemu+tls://debian1/system" op monitor
> interval="60"
> 
> Every 60 seconds pacemaker makes something like this:
> 
>  stonith -t external/libvirt hostlist="ha3"
> hypervisor_uri="qemu+tls://debian1/system" -S
>  ok
> 
> To simulate the unavailability of the kvm host I remove the certificate
> in /etc/libvirt/libvirtd.conf and restart libvirtd. After 60 seconds or
> less I can see the error with "crm status". On the kvm host I then add the
> certificate back to /etc/libvirt/libvirtd.conf and restart libvirt
> again. Although libvirt is available again, the stonith resource did not
> start again.
> 
> I altered the configuration line for stonith/libvirt with the following parts:
> 
>  op monitor interval="60" pcmk_status_retries="3"
>  op monitor interval="60" pcmk_monitor_retries="3"
>  op monitor interval="60" start-delay=180
>  meta migration-threshold="200" failure-timeout="120"
> 
> But in every case, after the first failed monitor check (within 60 seconds or
> less), pacemaker did not resume the stonith/libvirt resource once libvirt was
> available again.

Is there enough time left in the timeout for the cluster to retry? (The
interval is not the same as the timeout.) Check your pacemaker.log for
messages like "Attempted to execute agent ... the maximum number of
times (...) allowed". That will tell you whether it is retrying.

You definitely don't want start-delay, and migration-threshold doesn't
really mean much for fence devices.
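
For example, a variant of your primitive with an explicit monitor timeout and
a failure-timeout, so the failure can expire and the device be retried (sketch
only; the values are arbitrary, and failures only expire when the cluster
rechecks, see the cluster-recheck-interval property):

  crm configure primitive p_fence_ha3 stonith:external/libvirt \
      params hostlist="ha3" hypervisor_uri="qemu+tls://debian1/system" \
      op monitor interval="60s" timeout="120s" \
      meta failure-timeout="120s"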

Of course, you also want to fix the underlying problem of libvirt not
being responsive. That doesn't sound like something that should
routinely happen.

BTW I haven't used stonith/external agents (which rely on the
cluster-glue package) myself. I use the fence_virtd daemon on the host
with fence_xvm as the configured fence agent.
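
If you want to try that route, a minimal sketch might look like the following;
check the parameter names against "stonith_admin --metadata -a fence_xvm", and
fence_virtd must be running on the KVM host with a shared key:

  crm configure primitive p_fence_ha3_xvm stonith:fence_xvm \
      params pcmk_host_list="ha3" \
      op monitor interval="60s"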

> Here is the "crm status"-output on debian 8 (Jessie):
> 
>  root@ha4:~# crm status
>  Last updated: Tue Jan  5 10:04:18 2016
>  Last change: Mon Jan  4 18:18:12 2016
>  Stack: corosync
>  Current DC: ha3 (167772400) - partition with quorum
>  Version: 1.1.12-561c4cf
>  2 Nodes configured
>  2 Resources configured
>  Online: [ ha3 ha4 ]
>  Service-IP (ocf::heartbeat:IPaddr2):   Started ha3
>  haproxy        (lsb:haproxy):  Started ha3
>  p_fence_ha3    (stonith:external/libvirt): Started ha4
> 
> Kind regards
> 
> Michael R.


___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: Re: Antw: Re: Early VM resource migration

2016-01-08 Thread Ken Gaillot
On 01/08/2016 07:03 AM, Klechomir wrote:
> Here is what pacemaker says right after node1 comes back after standby:
> 
> Dec 16 16:11:41 [4512] CLUSTER-2    pengine:    debug:
> native_assign_node: All nodes for resource VM_VM1 are unavailable,
> unclean or shutting down (CLUSTER-1: 1, -100)
> 
> Dec 16 16:11:41 [4512] CLUSTER-2    pengine:    debug:
> native_assign_node: Could not allocate a node for VM_VM1
> 
> Dec 16 16:11:41 [4512] CLUSTER-2    pengine:    debug:
> native_assign_node: Processing VM_VM1_monitor_1
> 
> Dec 16 16:11:41 [4512] CLUSTER-2    pengine:     info: native_color:
> Resource VM_VM1 cannot run anywhere
> 
> 
> 
> VM_VM1 gets immediately stopped as soon as node1 re-appears and stays
> down until its "order/colocation AA resource" comes up on node1.
> 
> The curious part is that in the opposite case (node2 comes from
> standby), the failback is ok.
> 
> Any ideas?

This might be a bug. Can you open a report at
http://bugs.clusterlabs.org/ and attach your full CIB and logs from all
nodes both when the issue occurs and when node2 handles it correctly?
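
crm_report can collect the CIB and logs from all nodes into a single archive
for the report, e.g. (adjust the time window to cover the incident):

  crm_report -f "2015-12-16 16:00" -t "2015-12-16 16:30" /tmp/vm1-failback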

> Regards,
> 
> On 17.12.2015 14:51:21 Ulrich Windl wrote:
>> >>> Klechomir  wrote on 17.12.2015 at 14:16 in
>> message
>>
>> <2102747.TPh6pTdk8c@bobo>:
>> > Hi Ulrich,
>> > This is only a part of the config, which concerns the problem.
>> > Even with dummy resources, the behaviour will be identical, so I don't
>> > think
>> > that the dlm/clvmd resource config will help solve the problem.
>>
>> You could send logs with the actual startup sequence then.
>>
>> > Regards,
>> > KIecho
>> >
>> > On 17.12.2015 08:19:43 Ulrich Windl wrote:
>> >> >>> Klechomir  wrote on 16.12.2015 at 17:30 in
>> >> >>> message
>> >>
>> >> <5671918e.40...@gmail.com>:
>> >> > On 16.12.2015 17:52, Ken Gaillot wrote:
>> >> >> On 12/16/2015 02:09 AM, Klechomir wrote:
>> >> >>> Hi list,
>> >> >>> I have a cluster with VM resources on a cloned active-active
>> storage.
>> >> >>>
>> >> >>> VirtualDomain resource migrates properly during failover (node
>> >> >>> standby),
>> >> >>> but tries to migrate back too early, during failback, ignoring the
>> >> >>> "order" constraint, telling it to start when the cloned storage is
>> >> >>> up.
>> >> >>> This causes unnecessary VM restart.
>> >> >>>
>> >> >>> Is there any way to make it wait, until its storage resource is
>> up?
>> >> >>
>> >> >> Hi Klecho,
>> >> >>
>> >> >> If you have an order constraint, the cluster will not try to
>> start the
>> >> >> VM until the storage resource agent returns success for its
>> start. If
>> >> >> the storage isn't fully up at that point, then the agent is faulty,
>> >> >> and
>> >> >> should be modified to wait until the storage is truly available
>> before
>> >> >> returning success.
>> >> >>
>> >> >> If you post all your constraints, I can look for anything that
>> might
>> >> >> affect the behavior.
>> >> >
>> >> > Thanks for the reply, Ken
>> >> >
>> >> > Seems to me that the constraints for cloned resources act a bit
>> >> > differently.
>> >> >
>> >> > Here is my config:
>> >> >
>> >> > primitive p_AA_Filesystem_CDrive1 ocf:heartbeat:Filesystem \
>> >> >
>> >> >  params device="/dev/CSD_CDrive1/AA_CDrive1"
>> >> >
>> >> > directory="/volumes/AA_CDrive1" fstype="ocfs2" options="rw,noatime"
>> >> > primitive VM_VM1 ocf:heartbeat:VirtualDomain \
>> >> >
>> >> >  params config="/volumes/AA_CDrive1/VM_VM1/VM1.xml"
>> >> >
>> >> > hypervisor="qemu:///system" migration_transport="tcp" \
>> >> >
>> >> >  meta allow-migrate="true" target-role="Started"
>> >> >
>> >> > clone AA_Filesystem_CDrive1 p_AA_Filesystem_CDrive1 \
>> >> >
>> >> >  meta interleave="true" resource-stickiness="0"
>> >> >
>> >> > target-role="Started"
>> >> > order VM_VM1_after_AA_Filesystem_CDrive1 inf: AA_Filesystem_CDrive1
>> >> > VM_VM1
>> >> >
>> >> > Every time when a node comes back from standby, the VM tries to live
>> >> > migrate to it long before the filesystem is up.
>> >>
>> >> Hi!
>> >>
>> >> To me your config looks rather incomplete: What about DLM, O2CB, cLVM,
>> >> etc.?


___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org