Re: [Pacemaker] Starting of multiple booth in the same node

2012-06-19 Thread Yuichi SEINO
Yes, it does. I verified the expected behavior after applying the patch.

Sincerely,
Yuichi

2012/6/20 Jiaju Zhang :
> On Wed, 2012-06-20 at 11:43 +0900, Yuichi SEINO wrote:
>> Hi Jiaju,
>>
>> OK, I get the expected behavior.
>>
>
> So, does that patch work for you? If yes, I'll merge that patch
> upstream.
>
> Thank you for reporting and testing ;)
>
> Thanks,
> Jiaju
>



-- 
Yuichi SEINO
METROSYSTEMS CORPORATION
E-mail:seino.clust...@gmail.com

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Starting of multiple booth in the same node

2012-06-19 Thread Yuichi SEINO
Hi Jiaju,

OK, I get the expected behavior.

Sincerely,
Yuichi

2012/6/19 Jiaju Zhang :
> On Tue, 2012-06-19 at 18:29 +0900, Yuichi SEINO wrote:
>> Hi Jiaju,
>>
>> I updated the pull request after I read your comment.
>
> Yes, I have seen it. However, it seems slightly different from what I
> meant in that comment ;) So I just sent you a patch; could you check
> whether it works? ;)
>
> Thanks,
> Jiaju
>
>



-- 
Yuichi SEINO
METROSYSTEMS CORPORATION
E-mail:seino.clust...@gmail.com



Re: [Pacemaker] Why Did Pacemaker Restart this VirtualDomain Resource?

2012-06-19 Thread Lars Ellenberg
On Tue, Jun 19, 2012 at 09:38:50AM -0500, Andrew Martin wrote:
> Hello, 
> 
> 
> I have a 3 node Pacemaker+Heartbeat cluster (two real nodes and one "standby" 
> quorum node) with Ubuntu 10.04 LTS on the nodes and using the 
> Pacemaker+Heartbeat packages from the Ubuntu HA Team PPA ( 
> https://launchpad.net/~ubuntu-ha-maintainers/+archive/ppa ). I have 
> configured 3 DRBD resources, a filesystem mount, and a KVM-based virtual 
> machine (using the VirtualDomain resource). I have constraints in place so 
> that the DRBD devices must become primary and the filesystem must be mounted 
> before the VM can start: 

> location loc_run_on_most_connected g_vm \ 
> rule $id="loc_run_on_most_connected-rule" p_ping: defined p_ping 

This is the rule that matters here.

> This has been working well, however last week Pacemaker all of a
> sudden stopped the p_vm_myvm resource and then started it up again. I
> have attached the relevant section of /var/log/daemon.log - I am
> unable to determine what caused Pacemaker to restart this resource.
> Based on the log, could you tell me what event triggered this? 
> 
> 
> Thanks, 
> 
> 
> Andrew 

> Jun 14 15:25:00 vmhost1 lrmd: [3853]: info: rsc:p_sysadmin_notify:0 
> monitor[18] (pid 3661)
> Jun 14 15:25:00 vmhost1 lrmd: [3853]: info: operation monitor[18] on 
> p_sysadmin_notify:0 for client 3856: pid 3661 exited with return code 0
> Jun 14 15:26:42 vmhost1 cib: [3852]: info: cib_stats: Processed 219 
> operations (182.00us average, 0% utilization) in the last 10min
> Jun 14 15:32:43 vmhost1 lrmd: [3853]: info: operation monitor[22] on p_ping:0 
> for client 3856: pid 10059 exited with return code 0
> Jun 14 15:35:27 vmhost1 lrmd: [3853]: info: rsc:p_drbd_vmstore:0 monitor[55] 
> (pid 12323)
> Jun 14 15:35:27 vmhost1 lrmd: [3853]: info: rsc:p_drbd_mount2:0 monitor[53] 
> (pid 12324)
> Jun 14 15:35:27 vmhost1 lrmd: [3853]: info: operation monitor[55] on 
> p_drbd_vmstore:0 for client 3856: pid 12323 exited with return code 8
> Jun 14 15:35:27 vmhost1 lrmd: [3853]: info: operation monitor[53] on 
> p_drbd_mount2:0 for client 3856: pid 12324 exited with return code 8
> Jun 14 15:35:31 vmhost1 lrmd: [3853]: info: rsc:p_drbd_mount1:0 monitor[54] 
> (pid 12396)
> Jun 14 15:35:31 vmhost1 lrmd: [3853]: info: operation monitor[54] on 
> p_drbd_mount1:0 for client 3856: pid 12396 exited with return code 8
> Jun 14 15:36:42 vmhost1 cib: [3852]: info: cib_stats: Processed 220 
> operations (272.00us average, 0% utilization) in the last 10min
> Jun 14 15:37:34 vmhost1 lrmd: [3853]: info: rsc:p_vm_myvm monitor[57] (pid 
> 14061)
> Jun 14 15:37:34 vmhost1 lrmd: [3853]: info: operation monitor[57] on 
> p_vm_myvm for client 3856: pid 14061 exited with return code 0

> Jun 14 15:42:35 vmhost1 attrd: [3855]: notice: attrd_trigger_update: Sending 
> flush op to all hosts for: p_ping (1000)
> Jun 14 15:42:35 vmhost1 attrd: [3855]: notice: attrd_perform_update: Sent 
> update 163: p_ping=1000

And here the score on the location constraint changes for this node.

You asked for "run on most connected", and your pingd resource
determined that "the other" one was "better" connected.

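If the intent is only to keep the VM off a node that has lost ping
connectivity entirely, rather than to chase the best-connected node, a less
sensitive rule is possible. The following is only a sketch reusing the p_ping
attribute and the g_vm group from this thread (verify it against your own
configuration first); the ocf:pacemaker:ping agent also has a "dampen"
parameter that delays attribute changes so short blips don't trigger a move:

location loc_avoid_disconnected g_vm \
        rule $id="loc_avoid_disconnected-rule" -inf: not_defined p_ping or p_ping lte 0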

> Jun 14 15:42:36 vmhost1 crmd: [3856]: info: do_lrm_rsc_op: Performing 
> key=136:2351:0:7f6d66f7-cfe5-4820-8289-0e47d8c9102b op=p_vm_myvm_stop_0 )
> Jun 14 15:42:36 vmhost1 lrmd: [3853]: info: rsc:p_vm_myvm stop[58] (pid 18174)

...

> Jun 14 15:43:32 vmhost1 attrd: [3855]: notice: attrd_trigger_update: Sending 
> flush op to all hosts for: p_ping (2000)
> Jun 14 15:43:32 vmhost1 attrd: [3855]: notice: attrd_perform_update: Sent 
> update 165: p_ping=2000

And there it is back on 2000 again ...

Lars

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.



Re: [Pacemaker] Why Did Pacemaker Restart this VirtualDomain Resource?

2012-06-19 Thread Lars Ellenberg
On Tue, Jun 19, 2012 at 11:12:46AM -0500, Andrew Martin wrote:
> Hi Emmanuel, 
> 
> 
> Thanks for the idea. I looked through the rest of the log and these
> "return code 8" errors on the ocf:linbit:drbd resources are occurring
> at other intervals (e.g. today) when the VirtualDomain resource is
> unaffected. This seems to indicate that these soft errors do not

No soft error here: monitor exit code 8 is OCF_RUNNING_MASTER, which is
expected and healthy for a promoted DRBD resource.
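
For reference, here is a minimal shell sketch of how a master/slave agent's
monitor action maps its state onto these return codes (illustrative only;
is_running and is_master are placeholders for the agent's real checks):

# OCF return codes as defined by the resource agent API
OCF_SUCCESS=0         # rc=0: running as a healthy slave
OCF_NOT_RUNNING=7     # rc=7: cleanly stopped
OCF_RUNNING_MASTER=8  # rc=8: running as a healthy master

agent_monitor() {
    is_running || return $OCF_NOT_RUNNING    # placeholder: is the service up?
    if is_master; then                       # placeholder: is it promoted?
        return $OCF_RUNNING_MASTER
    fi
    return $OCF_SUCCESS
}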

Lars

> trigger a restart of the VirtualDomain resource. Is there anything
> else in the log that could indicate what caused this, or is there
> somewhere else I can look? 
> 
> 
> Thanks, 
> 
> 
> Andrew 
> 
> - Original Message -
> 
> From: "emmanuel segura" < emi2f...@gmail.com > 
> To: "The Pacemaker cluster resource manager" < pacemaker@oss.clusterlabs.org 
> > 
> Sent: Tuesday, June 19, 2012 9:57:19 AM 
> Subject: Re: [Pacemaker] Why Did Pacemaker Restart this VirtualDomain 
> Resource? 
> 
> I didn't see any error in your config; the only thing I noticed is this 
> == 
> Jun 14 15:35:27 vmhost1 lrmd: [3853]: info: rsc:p_drbd_vmstore:0 
> monitor[55] (pid 12323) 
> Jun 14 15:35:27 vmhost1 lrmd: [3853]: info: rsc:p_drbd_mount2:0 monitor[53] 
> (pid 12324) 
> Jun 14 15:35:27 vmhost1 lrmd: [3853]: info: operation monitor[55] on 
> p_drbd_vmstore:0 for client 3856: pid 12323 exited with return code 8 
> Jun 14 15:35:27 vmhost1 lrmd: [3853]: info: operation monitor[53] on 
> p_drbd_mount2:0 for client 3856: pid 12324 exited with return code 8 
> Jun 14 15:35:31 vmhost1 lrmd: [3853]: info: rsc:p_drbd_mount1:0 monitor[54] 
> (pid 12396) 
> = 
> it could be a DRBD problem, but to tell you the truth, I'm not sure 



Re: [Pacemaker] resources not migrating when some are not runnable on one node, maybe because of groups or master/slave clones?

2012-06-19 Thread Phil Frost

On 06/19/2012 04:31 PM, David Vossel wrote:

Can you attach a crm_report of what happens when you put the two nodes in 
standby please?  Being able to see the xml and how the policy engine evaluates 
the transitions is helpful.


The resulting reports were a bit big for the list, so I put them in a 
bug report:


https://developerbugs.linuxfoundation.org/show_bug.cgi?id=2652

I've also found a similar discussion in the archives, though I didn't 
find much help in it:


http://oss.clusterlabs.org/pipermail/pacemaker/2010-November/008189.html




Re: [Pacemaker] resources not migrating when some are not runnable on one node, maybe because of groups or master/slave clones?

2012-06-19 Thread David Vossel
- Original Message -
> From: "Phil Frost" 
> To: "The Pacemaker cluster resource manager" 
> Sent: Monday, June 18, 2012 8:39:48 AM
> Subject: [Pacemaker] resources not migrating when some are not runnable on 
> one node, maybe because of groups or
> master/slave clones?
> 
> I'm attempting to configure an NFS cluster, and I've observed that
> under
> some failure conditions, resources that depend on a failed resource
> simply stop, and no migration to another node is attempted, even
> though
> a manual migration demonstrates the other node can run all resources,
> and the resources will remain on the good node even after the
> migration
> constraint is removed.
> 
> I was able to reduce the configuration to this:
> 
> node storage01
> node storage02
> primitive drbd_nfsexports ocf:pacemaker:Stateful
> primitive fs_test ocf:pacemaker:Dummy
> primitive vg_nfsexports ocf:pacemaker:Dummy
> group test fs_test
> ms drbd_nfsexports_ms drbd_nfsexports \
>  meta master-max="1" master-node-max="1" \
>  clone-max="2" clone-node-max="1" \
>  notify="true" target-role="Started"
> location l fs_test -inf: storage02
> colocation colo_drbd_master inf: ( test ) ( vg_nfsexports ) (
> drbd_nfsexports_ms:Master )
> property $id="cib-bootstrap-options" \
>  no-quorum-policy="ignore" \
>  stonith-enabled="false" \
>  dc-version="1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff"
>  \
>  cluster-infrastructure="openais" \
>  expected-quorum-votes="2" \
>  last-lrm-refresh="1339793579"
> 
> The location constraint "l" exists only to demonstrate the problem; I
> added it to simulate the NFS server being unrunnable on one node.
> 
> To see the issue I'm experiencing, put storage01 in standby to force
> everything on storage02. fs_test will not be able to run. Now bring
> storage01 back out of standby; it can satisfy all the constraints, yet no
> migration takes place. Put storage02 in standby, and everything will
> migrate to storage01 and start successfully. Take storage02 out of
> standby, and the services remain on storage01. This demonstrates that
> even though there is a clear "best" solution where all resources can
> run, Pacemaker isn't finding it.

Can you attach a crm_report of what happens when you put the two nodes in 
standby please?  Being able to see the xml and how the policy engine evaluates 
the transitions is helpful.
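
For reference, something along these lines should generate the report and also
show the allocation scores the policy engine computes; this is only a sketch,
so adjust the time window to when the nodes were put in standby:

# collect logs, the CIB and the PE inputs from all nodes into one archive
crm_report -f "2012-06-18 00:00" standby-test

# show allocation scores and the resulting transition for the live CIB
crm_simulate -sL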

-- Vossel



Re: [Pacemaker] Why Did Pacemaker Restart this VirtualDomain Resource?

2012-06-19 Thread Andrew Martin
Hi Emmanuel, 


Here is the output from crm_mon -of1 : 

Operations:
* Node quorumnode:
   p_drbd_mount2:0: migration-threshold=100
    + (4) probe: rc=5 (not installed)
   p_drbd_mount1:0: migration-threshold=100
    + (5) probe: rc=5 (not installed)
   p_drbd_vmstore:0: migration-threshold=100
    + (6) probe: rc=5 (not installed)
   p_vm_myvm: migration-threshold=100
    + (12) probe: rc=5 (not installed)
* Node vmhost1:
   p_drbd_mount2:0: migration-threshold=100
    + (34) promote: rc=0 (ok)
    + (62) monitor: interval=1ms rc=8 (master)
   p_drbd_vmstore:0: migration-threshold=100
    + (26) promote: rc=0 (ok)
    + (64) monitor: interval=1ms rc=8 (master)
   p_fs_vmstore: migration-threshold=100
    + (36) start: rc=0 (ok)
    + (38) monitor: interval=2ms rc=0 (ok)
   p_ping:0: migration-threshold=100
    + (12) start: rc=0 (ok)
    + (22) monitor: interval=2ms rc=0 (ok)
   p_vm_myvm: migration-threshold=100
    + (65) start: rc=0 (ok)
    + (66) monitor: interval=1ms rc=0 (ok)
   stonithvmhost2: migration-threshold=100
    + (17) start: rc=0 (ok)
   p_drbd_mount1:0: migration-threshold=100
    + (31) promote: rc=0 (ok)
    + (63) monitor: interval=1ms rc=8 (master)
   p_sysadmin_notify:0: migration-threshold=100
    + (13) start: rc=0 (ok)
    + (18) monitor: interval=1ms rc=0 (ok)
* Node vmhost2:
   p_drbd_mount2:1: migration-threshold=100
    + (14) start: rc=0 (ok)
    + (36) monitor: interval=2ms rc=0 (ok)
   p_drbd_vmstore:1: migration-threshold=100
    + (16) start: rc=0 (ok)
    + (38) monitor: interval=2ms rc=0 (ok)
   p_ping:1: migration-threshold=100
    + (12) start: rc=0 (ok)
    + (20) monitor: interval=2ms rc=0 (ok)
   stonithquorumnode: migration-threshold=100
    + (18) start: rc=0 (ok)
   stonithvmhost1: migration-threshold=100
    + (17) start: rc=0 (ok)
   p_sysadmin_notify:1: migration-threshold=100
    + (13) start: rc=0 (ok)
    + (19) monitor: interval=1ms rc=0 (ok)
   p_drbd_mount1:1: migration-threshold=100
    + (15) start: rc=0 (ok)
    + (37) monitor: interval=2ms rc=0 (ok)


Failed actions:
   p_drbd_mount2:0_monitor_0 (node=quorumnode, call=4, rc=5, status=complete): not installed
   p_drbd_mount1:0_monitor_0 (node=quorumnode, call=5, rc=5, status=complete): not installed
   p_drbd_vmstore:0_monitor_0 (node=quorumnode, call=6, rc=5, status=complete): not installed
   p_vm_myvm_monitor_0 (node=quorumnode, call=12, rc=5, status=complete): not installed


What is the number in parentheses before "start" or "monitor"? Is it the number 
of times this operation has occurred? Does this give any additional clues to 
what happened? What should I look for specifically in this output? 


Thanks, 


Andrew 
- Original Message -

From: "emmanuel segura"  
To: "The Pacemaker cluster resource manager"  
Sent: Tuesday, June 19, 2012 12:12:34 PM 
Subject: Re: [Pacemaker] Why Did Pacemaker Restart this VirtualDomain Resource? 

Hello Andrew 

use crm_mon -of when your VirtualDomain resource fails, to see which resource 
operation reports the problem 


2012/6/19 Andrew Martin < amar...@xes-inc.com > 




Hi Emmanuel, 


Thanks for the idea. I looked through the rest of the log and these "return 
code 8" errors on the ocf:linbit:drbd resources are occurring at other 
intervals (e.g. today) when the VirtualDomain resource is unaffected. This 
seems to indicate that these soft errors do not trigger a restart of the 
VirtualDomain resource. Is there anything else in the log that could indicate 
what caused this, or is there somewhere else I can look? 


Thanks, 


Andrew 



From: "emmanuel segura" < emi2f...@gmail.com > 
To: "The Pacemaker cluster resource manager" < pacemaker@oss.clusterlabs.org > 
Sent: Tuesday, June 19, 2012 9:57:19 AM 
Subject: Re: [Pacemaker] Why Did Pacemaker Restart this VirtualDomain Resource? 


I didn't see any error in your config; the only thing I noticed is this 
== 
Jun 14 15:35:27 vmhost1 lrmd: [3853]: info: rsc:p_drbd_vmstore:0 
monitor[55] (pid 12323) 
Jun 14 15:35:27 vmhost1 lrmd: [3853]: info: rsc:p_drbd_mount2:0 monitor[53] 
(pid 12324) 
Jun 14 15:35:27 vmhost1 lrmd: [3853]: info: operation monitor[55] on 
p_drbd_vmstore:0 for client 3856: pid 12323 exited with return code 8 
Jun 14 15:35:27 vmhost1 lrmd: [3853]: info: operation monitor[53] on 
p_drbd_mount2:0 for client 3856: pid 12324 exited with return code 8 
Jun 14 15:35:31 vmhost1 lrmd: [3853]: info: rsc:p_drbd_mount1:0 monitor[54] 
(pid 12396) 
= 
it could be a DRBD problem, but to tell you the truth, I'm not sure 

== 
http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-ocf-return-codes.html
 
= 

2012/6/19 Andrew Martin < amar...@xes-inc.com > 

> Hello, 
> 
> I have a 3 node Pacemaker+Heartbeat cluster (two real nodes and one 
> "standby" quorum node) with Ub

Re: [Pacemaker] Why Did Pacemaker Restart this VirtualDomain Resource?

2012-06-19 Thread emmanuel segura
Hello Andrew

use crm_mon -of when your VirtualDomain resource fails, to see which
resource operation reports the problem

2012/6/19 Andrew Martin 

> Hi Emmanuel,
>
> Thanks for the idea. I looked through the rest of the log and these
> "return code 8" errors on the ocf:linbit:drbd resources are occurring at
> other intervals (e.g. today) when the VirtualDomain resource is unaffected.
> This seems to indicate that these soft errors do not trigger a restart of
> the VirtualDomain resource. Is there anything else in the log that could
> indicate what caused this, or is there somewhere else I can look?
>
> Thanks,
>
> Andrew
>
> --
> *From: *"emmanuel segura" 
> *To: *"The Pacemaker cluster resource manager" <
> pacemaker@oss.clusterlabs.org>
> *Sent: *Tuesday, June 19, 2012 9:57:19 AM
> *Subject: *Re: [Pacemaker] Why Did Pacemaker Restart this
> VirtualDomainResource?
>
>
> I didn't see any error in your config; the only thing I noticed is this
> ==
> Jun 14 15:35:27 vmhost1 lrmd: [3853]: info: rsc:p_drbd_vmstore:0
> monitor[55] (pid 12323)
> Jun 14 15:35:27 vmhost1 lrmd: [3853]: info: rsc:p_drbd_mount2:0 monitor[53]
> (pid 12324)
> Jun 14 15:35:27 vmhost1 lrmd: [3853]: info: operation monitor[55] on
> p_drbd_vmstore:0 for client 3856: pid 12323 exited with return code 8
> Jun 14 15:35:27 vmhost1 lrmd: [3853]: info: operation monitor[53] on
> p_drbd_mount2:0 for client 3856: pid 12324 exited with return code 8
> Jun 14 15:35:31 vmhost1 lrmd: [3853]: info: rsc:p_drbd_mount1:0 monitor[54]
> (pid 12396)
> =
> it could be a DRBD problem, but to tell you the truth, I'm not sure
>
> ==
>
> http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-ocf-return-codes.html
> =
>
> 2012/6/19 Andrew Martin 
>
> > Hello,
> >
> > I have a 3 node Pacemaker+Heartbeat cluster (two real nodes and one
> > "standby" quorum node) with Ubuntu 10.04 LTS on the nodes and using the
> > Pacemaker+Heartbeat packages from the Ubuntu HA Team PPA (
> > https://launchpad.net/~ubuntu-ha-maintainers/+archive/ppa
> ).
>
> > I have configured 3 DRBD resources, a filesystem mount, and a KVM-based
> > virtual machine (using the VirtualDomain resource). I have constraints in
> > place so that the DRBD devices must become primary and the filesystem
> must
> > be mounted before the VM can start:
> > node $id="1ab0690c-5aa0-4d9c-ae4e-b662e0ca54e5" vmhost1
> > node $id="219e9bf6-ea99-41f4-895f-4c2c5c78484a" quorumnode \
> > attributes standby="on"
> > node $id="645e09b4-aee5-4cec-a241-8bd4e03a78c3" vmhost2
> > primitive p_drbd_mount2 ocf:linbit:drbd \
> > params drbd_resource="mount2" \
> > op start interval="0" timeout="240" \
> > op stop interval="0" timeout="100" \
> > op monitor interval="10" role="Master" timeout="30" \
> > op monitor interval="20" role="Slave" timeout="30"
> > primitive p_drbd_mount1 ocf:linbit:drbd \
> > params drbd_resource="mount1" \
> > op start interval="0" timeout="240" \
> > op stop interval="0" timeout="100" \
> > op monitor interval="10" role="Master" timeout="30" \
> > op monitor interval="20" role="Slave" timeout="30"
> > primitive p_drbd_vmstore ocf:linbit:drbd \
> > params drbd_resource="vmstore" \
> > op start interval="0" timeout="240" \
> > op stop interval="0" timeout="100" \
> > op monitor interval="10" role="Master" timeout="30" \
> > op monitor interval="20" role="Slave" timeout="30"
> > primitive p_fs_vmstore ocf:heartbeat:Filesystem \
> > params device="/dev/drbd0" directory="/mnt/storage/vmstore"
> > fstype="ext4" \
> > op start interval="0" timeout="60" \
> > op stop interval="0" timeout="60" \
> > op monitor interval="20" timeout="40"
> > primitive p_ping ocf:pacemaker:ping \
> > params name="p_ping" host_list="192.168.1.25 192.168.1.26"
> > multiplier="1000" \
> > op start interval="0" timeout="60" \
> > op monitor interval="20s" timeout="60"
> > primitive p_sysadmin_notify ocf:heartbeat:MailTo \
> > params email="al...@example.com" \
> > params subject="Pacemaker Change" \
> > op start interval="0" timeout="10" \
> > op stop interval="0" timeout="10" \
> > op monitor interval="10" timeout="10"
> > primitive p_vm_myvm ocf:heartbeat:VirtualDomain \
> > params config="/mnt/storage/vmstore/config/myvm.xml" \
> > meta allow-migrate="false" target-role="Started"
> is-managed="true"
> > \
> > op start interval="0" timeout="180" \
> > op stop interval="0" timeout="180" \
> > 

Re: [Pacemaker] Why Did Pacemaker Restart this VirtualDomain Resource?

2012-06-19 Thread Andrew Martin
Hi Emmanuel, 


Thanks for the idea. I looked through the rest of the log and these "return 
code 8" errors on the ocf:linbit:drbd resources are occurring at other 
intervals (e.g. today) when the VirtualDomain resource is unaffected. This 
seems to indicate that these soft errors do not trigger a restart of the 
VirtualDomain resource. Is there anything else in the log that could indicate 
what caused this, or is there somewhere else I can look? 


Thanks, 


Andrew 

- Original Message -

From: "emmanuel segura" < emi2f...@gmail.com > 
To: "The Pacemaker cluster resource manager" < pacemaker@oss.clusterlabs.org > 
Sent: Tuesday, June 19, 2012 9:57:19 AM 
Subject: Re: [Pacemaker] Why Did Pacemaker Restart this VirtualDomain Resource? 

I didn't see any error in your config; the only thing I noticed is this 
== 
Jun 14 15:35:27 vmhost1 lrmd: [3853]: info: rsc:p_drbd_vmstore:0 
monitor[55] (pid 12323) 
Jun 14 15:35:27 vmhost1 lrmd: [3853]: info: rsc:p_drbd_mount2:0 monitor[53] 
(pid 12324) 
Jun 14 15:35:27 vmhost1 lrmd: [3853]: info: operation monitor[55] on 
p_drbd_vmstore:0 for client 3856: pid 12323 exited with return code 8 
Jun 14 15:35:27 vmhost1 lrmd: [3853]: info: operation monitor[53] on 
p_drbd_mount2:0 for client 3856: pid 12324 exited with return code 8 
Jun 14 15:35:31 vmhost1 lrmd: [3853]: info: rsc:p_drbd_mount1:0 monitor[54] 
(pid 12396) 
= 
it could be a DRBD problem, but to tell you the truth, I'm not sure 

== 
http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-ocf-return-codes.html
 
= 

2012/6/19 Andrew Martin < amar...@xes-inc.com > 

> Hello, 
> 
> I have a 3 node Pacemaker+Heartbeat cluster (two real nodes and one 
> "standby" quorum node) with Ubuntu 10.04 LTS on the nodes and using the 
> Pacemaker+Heartbeat packages from the Ubuntu HA Team PPA ( 
> https://launchpad.net/~ubuntu-ha-maintainers/+archive/ppa < 
> https://launchpad.net/%7Eubuntu-ha-maintainers/+archive/ppa >). 
> I have configured 3 DRBD resources, a filesystem mount, and a KVM-based 
> virtual machine (using the VirtualDomain resource). I have constraints in 
> place so that the DRBD devices must become primary and the filesystem must 
> be mounted before the VM can start: 
> node $id="1ab0690c-5aa0-4d9c-ae4e-b662e0ca54e5" vmhost1 
> node $id="219e9bf6-ea99-41f4-895f-4c2c5c78484a" quorumnode \ 
> attributes standby="on" 
> node $id="645e09b4-aee5-4cec-a241-8bd4e03a78c3" vmhost2 
> primitive p_drbd_mount2 ocf:linbit:drbd \ 
> params drbd_resource="mount2" \ 
> op start interval="0" timeout="240" \ 
> op stop interval="0" timeout="100" \ 
> op monitor interval="10" role="Master" timeout="30" \ 
> op monitor interval="20" role="Slave" timeout="30" 
> primitive p_drbd_mount1 ocf:linbit:drbd \ 
> params drbd_resource="mount1" \ 
> op start interval="0" timeout="240" \ 
> op stop interval="0" timeout="100" \ 
> op monitor interval="10" role="Master" timeout="30" \ 
> op monitor interval="20" role="Slave" timeout="30" 
> primitive p_drbd_vmstore ocf:linbit:drbd \ 
> params drbd_resource="vmstore" \ 
> op start interval="0" timeout="240" \ 
> op stop interval="0" timeout="100" \ 
> op monitor interval="10" role="Master" timeout="30" \ 
> op monitor interval="20" role="Slave" timeout="30" 
> primitive p_fs_vmstore ocf:heartbeat:Filesystem \ 
> params device="/dev/drbd0" directory="/mnt/storage/vmstore" 
> fstype="ext4" \ 
> op start interval="0" timeout="60" \ 
> op stop interval="0" timeout="60" \ 
> op monitor interval="20" timeout="40" 
> primitive p_ping ocf:pacemaker:ping \ 
> params name="p_ping" host_list="192.168.1.25 192.168.1.26" 
> multiplier="1000" \ 
> op start interval="0" timeout="60" \ 
> op monitor interval="20s" timeout="60" 
> primitive p_sysadmin_notify ocf:heartbeat:MailTo \ 
> params email=" al...@example.com " \ 
> params subject="Pacemaker Change" \ 
> op start interval="0" timeout="10" \ 
> op stop interval="0" timeout="10" \ 
> op monitor interval="10" timeout="10" 
> primitive p_vm_myvm ocf:heartbeat:VirtualDomain \ 
> params config="/mnt/storage/vmstore/config/myvm.xml" \ 
> meta allow-migrate="false" target-role="Started" is-managed="true" 
> \ 
> op start interval="0" timeout="180" \ 
> op stop interval="0" timeout="180" \ 
> op monitor interval="10" timeout="30" 
> primitive stonithquorumnode stonith:external/webpowerswitch \ 
> params wps_ipaddr="192.168.3.100" wps_port="x" wps_username="xxx" 
> wps_password="xxx" hostname_to_stonith="quorumnode" 
> primitive stonithvmhost1 stonith:external/webpowerswitch \ 
> params wps_ipaddr="192.168.3.100" wps_port="x" wps_username="xxx" 
> wps_password="xxx" hostname_to_stonith="vmhost1" 
> primitive stonithvmhost2 stonith:external/webpowerswitch \ 
> params wps_ipaddr="192.168.3.100" wps_port="x" wps_username="

Re: [Pacemaker] Why Did Pacemaker Restart this VirtualDomain Resource?

2012-06-19 Thread emmanuel segura
I didn't see any error in your config; the only thing I noticed is this
==
Jun 14 15:35:27 vmhost1 lrmd: [3853]: info: rsc:p_drbd_vmstore:0
monitor[55] (pid 12323)
Jun 14 15:35:27 vmhost1 lrmd: [3853]: info: rsc:p_drbd_mount2:0 monitor[53]
(pid 12324)
Jun 14 15:35:27 vmhost1 lrmd: [3853]: info: operation monitor[55] on
p_drbd_vmstore:0 for client 3856: pid 12323 exited with return code 8
Jun 14 15:35:27 vmhost1 lrmd: [3853]: info: operation monitor[53] on
p_drbd_mount2:0 for client 3856: pid 12324 exited with return code 8
Jun 14 15:35:31 vmhost1 lrmd: [3853]: info: rsc:p_drbd_mount1:0 monitor[54]
(pid 12396)
=
it could be a DRBD problem, but to tell you the truth, I'm not sure

==
http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-ocf-return-codes.html
=

2012/6/19 Andrew Martin 

> Hello,
>
> I have a 3 node Pacemaker+Heartbeat cluster (two real nodes and one
> "standby" quorum node) with Ubuntu 10.04 LTS on the nodes and using the
> Pacemaker+Heartbeat packages from the Ubuntu HA Team PPA (
> https://launchpad.net/~ubuntu-ha-maintainers/+archive/ppa).
> I have configured 3 DRBD resources, a filesystem mount, and a KVM-based
> virtual machine (using the VirtualDomain resource). I have constraints in
> place so that the DRBD devices must become primary and the filesystem must
> be mounted before the VM can start:
> node $id="1ab0690c-5aa0-4d9c-ae4e-b662e0ca54e5" vmhost1
> node $id="219e9bf6-ea99-41f4-895f-4c2c5c78484a" quorumnode \
> attributes standby="on"
> node $id="645e09b4-aee5-4cec-a241-8bd4e03a78c3" vmhost2
> primitive p_drbd_mount2 ocf:linbit:drbd \
> params drbd_resource="mount2" \
> op start interval="0" timeout="240" \
> op stop interval="0" timeout="100" \
> op monitor interval="10" role="Master" timeout="30" \
> op monitor interval="20" role="Slave" timeout="30"
> primitive p_drbd_mount1 ocf:linbit:drbd \
> params drbd_resource="mount1" \
> op start interval="0" timeout="240" \
> op stop interval="0" timeout="100" \
> op monitor interval="10" role="Master" timeout="30" \
> op monitor interval="20" role="Slave" timeout="30"
> primitive p_drbd_vmstore ocf:linbit:drbd \
> params drbd_resource="vmstore" \
> op start interval="0" timeout="240" \
> op stop interval="0" timeout="100" \
> op monitor interval="10" role="Master" timeout="30" \
> op monitor interval="20" role="Slave" timeout="30"
> primitive p_fs_vmstore ocf:heartbeat:Filesystem \
> params device="/dev/drbd0" directory="/mnt/storage/vmstore"
> fstype="ext4" \
> op start interval="0" timeout="60" \
> op stop interval="0" timeout="60" \
> op monitor interval="20" timeout="40"
> primitive p_ping ocf:pacemaker:ping \
> params name="p_ping" host_list="192.168.1.25 192.168.1.26"
> multiplier="1000" \
> op start interval="0" timeout="60" \
> op monitor interval="20s" timeout="60"
> primitive p_sysadmin_notify ocf:heartbeat:MailTo \
> params email="al...@example.com" \
> params subject="Pacemaker Change" \
> op start interval="0" timeout="10" \
> op stop interval="0" timeout="10" \
> op monitor interval="10" timeout="10"
> primitive p_vm_myvm ocf:heartbeat:VirtualDomain \
> params config="/mnt/storage/vmstore/config/myvm.xml" \
> meta allow-migrate="false" target-role="Started" is-managed="true"
> \
> op start interval="0" timeout="180" \
> op stop interval="0" timeout="180" \
> op monitor interval="10" timeout="30"
> primitive stonithquorumnode stonith:external/webpowerswitch \
> params wps_ipaddr="192.168.3.100" wps_port="x" wps_username="xxx"
> wps_password="xxx" hostname_to_stonith="quorumnode"
> primitive stonithvmhost1 stonith:external/webpowerswitch \
> params wps_ipaddr="192.168.3.100" wps_port="x" wps_username="xxx"
> wps_password="xxx" hostname_to_stonith="vmhost1"
> primitive stonithvmhost2 stonith:external/webpowerswitch \
> params wps_ipaddr="192.168.3.100" wps_port="x" wps_username="xxx"
> wps_password="xxx" hostname_to_stonith="vmhost2"
> group g_vm p_fs_vmstore p_vm_myvm
> ms ms_drbd_mount2 p_drbd_mount2 \
> meta master-max="1" master-node-max="1" clone-max="2"
> clone-node-max="1" notify="true"
> ms ms_drbd_mount1 p_drbd_mount1 \
> meta master-max="1" master-node-max="1" clone-max="2"
> clone-node-max="1" notify="true"
> ms ms_drbd_vmstore p_drbd_vmstore \
> meta master-max="1" master-node-max="1" clone-max="2"
> clone-node-max="1" notify="true"
> clone cl_ping p_ping \
> meta interleave="true"
> clone cl_sysadmin_not

[Pacemaker] Why Did Pacemaker Restart this VirtualDomain Resource?

2012-06-19 Thread Andrew Martin
Hello, 


I have a 3 node Pacemaker+Heartbeat cluster (two real nodes and one "standby" 
quorum node) with Ubuntu 10.04 LTS on the nodes and using the 
Pacemaker+Heartbeat packages from the Ubuntu HA Team PPA ( 
https://launchpad.net/~ubuntu-ha-maintainers/+archive/ppa ). I have configured 
3 DRBD resources, a filesystem mount, and a KVM-based virtual machine (using 
the VirtualDomain resource). I have constraints in place so that the DRBD 
devices must become primary and the filesystem must be mounted before the VM 
can start: 

node $id="1ab0690c-5aa0-4d9c-ae4e-b662e0ca54e5" vmhost1 
node $id="219e9bf6-ea99-41f4-895f-4c2c5c78484a" quorumnode \ 
attributes standby="on" 
node $id="645e09b4-aee5-4cec-a241-8bd4e03a78c3" vmhost2 
primitive p_drbd_mount2 ocf:linbit:drbd \ 
params drbd_resource="mount2" \ 
op start interval="0" timeout="240" \ 
op stop interval="0" timeout="100" \ 
op monitor interval="10" role="Master" timeout="30" \ 
op monitor interval="20" role="Slave" timeout="30" 
primitive p_drbd_mount1 ocf:linbit:drbd \ 
params drbd_resource="mount1" \ 
op start interval="0" timeout="240" \ 
op stop interval="0" timeout="100" \ 
op monitor interval="10" role="Master" timeout="30" \ 
op monitor interval="20" role="Slave" timeout="30" 
primitive p_drbd_vmstore ocf:linbit:drbd \ 
params drbd_resource="vmstore" \ 
op start interval="0" timeout="240" \ 
op stop interval="0" timeout="100" \ 
op monitor interval="10" role="Master" timeout="30" \ 
op monitor interval="20" role="Slave" timeout="30" 
primitive p_fs_vmstore ocf:heartbeat:Filesystem \ 
params device="/dev/drbd0" directory="/mnt/storage/vmstore" fstype="ext4" \ 
op start interval="0" timeout="60" \ 
op stop interval="0" timeout="60" \ 
op monitor interval="20" timeout="40" 
primitive p_ping ocf:pacemaker:ping \ 
params name="p_ping" host_list="192.168.1.25 192.168.1.26" multiplier="1000" \ 
op start interval="0" timeout="60" \ 
op monitor interval="20s" timeout="60" 
primitive p_sysadmin_notify ocf:heartbeat:MailTo \ 
params email="al...@example.com" \ 
params subject="Pacemaker Change" \ 
op start interval="0" timeout="10" \ 
op stop interval="0" timeout="10" \ 
op monitor interval="10" timeout="10" 
primitive p_vm_myvm ocf:heartbeat:VirtualDomain \ 
params config="/mnt/storage/vmstore/config/myvm.xml" \ 
meta allow-migrate="false" target-role="Started" is-managed="true" \ 
op start interval="0" timeout="180" \ 
op stop interval="0" timeout="180" \ 
op monitor interval="10" timeout="30" 
primitive stonithquorumnode stonith:external/webpowerswitch \ 
params wps_ipaddr="192.168.3.100" wps_port="x" wps_username="xxx" 
wps_password="xxx" hostname_to_stonith="quorumnode" 
primitive stonithvmhost1 stonith:external/webpowerswitch \ 
params wps_ipaddr="192.168.3.100" wps_port="x" wps_username="xxx" 
wps_password="xxx" hostname_to_stonith="vmhost1" 
primitive stonithvmhost2 stonith:external/webpowerswitch \ 
params wps_ipaddr="192.168.3.100" wps_port="x" wps_username="xxx" 
wps_password="xxx" hostname_to_stonith="vmhost2" 
group g_vm p_fs_vmstore p_vm_myvm 
ms ms_drbd_mount2 p_drbd_mount2 \ 
meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" 
notify="true" 
ms ms_drbd_mount1 p_drbd_mount1 \ 
meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" 
notify="true" 
ms ms_drbd_vmstore p_drbd_vmstore \ 
meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" 
notify="true" 
clone cl_ping p_ping \ 
meta interleave="true" 
clone cl_sysadmin_notify p_sysadmin_notify 
location loc_run_on_most_connected g_vm \ 
rule $id="loc_run_on_most_connected-rule" p_ping: defined p_ping 
location loc_st_nodescan stonithquorumnode -inf: vmhost1 
location loc_st_vmhost1 stonithvmhost1 -inf: vmhost1 
location loc_st_vmhost2 stonithvmhost2 -inf: vmhost2 
colocation c_drbd_libvirt_vm inf: g_vm ms_drbd_vmstore:Master 
ms_drbd_tools:Master ms_drbd_crm:Master 
order o_drbd-fs-vm inf: ms_drbd_vmstore:promote ms_drbd_tools:promote 
ms_drbd_crm:promote g_vm:start 
property $id="cib-bootstrap-options" \ 
dc-version="1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c" \ 
cluster-infrastructure="Heartbeat" \ 
stonith-enabled="true" \ 
no-quorum-policy="freeze" \ 
last-lrm-refresh="1337746179" 


This has been working well, however last week Pacemaker all of a sudden stopped 
the p_vm_myvm resource and then started it up again. I have attached the 
relevant section of /var/log/daemon.log - I am unable to determine what caused 
Pacemaker to restart this resource. Based on the log, could you tell me what 
event triggered this? 


Thanks, 


Andrew Jun 14 15:25:00 vmhost1 lrmd: [3853]: info: rsc:p_sysadmin_notify:0 monitor[18] (pid 3661)
Jun 14 15:25:00 vmhost1 lrmd: [3853]: info: operation monitor[18] on p_sysadmin_notify:0 for client 3856: pid 3661 exited with return code 0
Jun 14 15:26:42 vmhost1 cib: [3852]: info: cib_stats: Processed 219 operations (182.00us average, 0% utilization) in the last 10min
Jun 14 15:32:43 vmhost1 lrmd: [3853]: in

Re: [Pacemaker] Starting of multiple booth in the same node

2012-06-19 Thread Jiaju Zhang
On Tue, 2012-06-19 at 18:29 +0900, Yuichi SEINO wrote:
> Hi Jiaju,
> 
> I updated the pull request after I read your comment.

Yes, I have seen it. However, it seems slightly different from what I
meant in that comment ;) So I just sent you a patch; could you check
whether it works? ;)

Thanks,
Jiaju





Re: [Pacemaker] Starting of multiple booth in the same node

2012-06-19 Thread Jiaju Zhang
Hi Yuichi,

On Thu, 2012-06-14 at 19:20 +0900, Yuichi SEINO wrote:
> Hi Jiaju,
> 
> I have a question about booth.
> Is it possible to start multiple booth daemons on the same node?
> 
> If it is restricted, then I think we need to change the error message.
> Currently, the following error message is output when multiple booth
> instances are started.
> 
> ERROR: failed to bind socket Address already in use
> or
> ERROR: bind error -1: Address already in use (98)
> 
> However, this error message is also output in other cases, so I want to be
> able to tell the start of multiple booth instances apart from the error
> message alone. I guess that we can determine this case from the pid file.

I'm attaching a patch which intends to fix this issue. Could you give it
a try to see if it works for you?

Thanks a lot;)
Jiaju


Reported-by: Yuich SEINO 
Signed-off-by: Jiaju Zhang 
---
 src/main.c |   26 ++
 1 files changed, 6 insertions(+), 20 deletions(-)

diff --git a/src/main.c b/src/main.c
index 8e51007..4736a93 100644
--- a/src/main.c
+++ b/src/main.c
@@ -447,14 +447,12 @@ static int setup_timer(void)
return timerlist_init();
 }
 
-static int setup(int type)
+static int loop(void)
 {
-   int rv;
-
-   rv = setup_config(type);
-   if (rv < 0)
-   goto fail;
-
+   void (*workfn) (int ci);
+   void (*deadfn) (int ci);
+   int rv, i;
+   
rv = setup_timer();
if (rv < 0)
goto fail;
@@ -472,18 +470,6 @@ static int setup(int type)
goto fail;
client_add(rv, process_listener, NULL);
 
-   return 0;
-
-fail:
-   return -1;
-}
-
-static int loop(void)
-{
-   void (*workfn) (int ci);
-   void (*deadfn) (int ci);
-   int rv, i;
-
 while (1) {
 rv = poll(pollfd, client_maxi + 1, poll_timeout);
 if (rv == -1 && errno == EINTR)
@@ -947,7 +933,7 @@ static int do_server(int type)
int fd = -1;
int rv = -1;
 
-   rv = setup(type);
+   rv = setup_config(type);
if (rv < 0)
goto out;
 
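
For illustration only (this is not what the patch above implements, and the
pidfile path is just an assumption), the pid-file check suggested earlier
could look roughly like this from the shell:

PIDFILE=/var/run/booth.pid
# if the recorded pid is still alive, another booth instance already owns the port
if [ -s "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null; then
        echo "booth is already running (pid $(cat "$PIDFILE"))" >&2
        exit 1
fi
# otherwise a failing bind really is an unrelated "Address already in use"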





Re: [Pacemaker] Starting of multiple booth in the same node

2012-06-19 Thread Yuichi SEINO
Hi Jiaju,

I updated the pull request after I read your comment.

Sincerely,
Yuichi

2012/6/15 Yuichi SEINO :
> Hi Jiaju,
>
> OK. I'm willing to write a patch.
>
> Sincerely,
> Yuichi
>
>
> 2012/6/14 Jiaju Zhang :
>> On Thu, 2012-06-14 at 19:20 +0900, Yuichi SEINO wrote:
>>> Hi Jiaju,
>>>
>>> I have a question about booth.
>>> Is it possible to start multiple booth daemons on the same node?
>>
>> Multiple booth daemons are not allowed on the same node.
>>
>>>
>>> If it is restricted, then I think we need to change the error message.
>>> Currently, the following error message is output when multiple booth
>>> instances are started.
>>>
>>> ERROR: failed to bind socket Address already in use
>>> or
>>> ERROR: bind error -1: Address already in use (98)
>>>
>>> However, this error message is also output in other cases, so I want to
>>> be able to tell the start of multiple booth instances apart from the
>>> error message alone. I guess that we can determine this case from the pid
>>> file.
>>
>> Good suggestion. Would you mind writing a patch to implement this?;)
>>
>> Thanks,
>> Jiaju
>>
>>
>>
>
>
>
> --
> Yuichi SEINO
> METROSYSTEMS CORPORATION
> E-mail:seino.clust...@gmail.com



-- 
Yuichi SEINO
METROSYSTEMS CORPORATION
E-mail:seino.clust...@gmail.com
