Re: [Pacemaker] Installing pacemaker on aws ec2 server

2012-12-17 Thread Yossi Nachum
I fixed this error using the LIBS environment variable.
I ran: export LIBS=/lib64/libtinfo.so.5
then ran ./configure again, and make completed successfully.
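
For anyone who hits the same linker error, the steps above might be sketched roughly as follows. This is only a sketch: the pick_tinfo_libs helper is a made-up name for illustration, and falling back to -ltinfo when the path from the ld note is missing is an assumption, not something that was tried on the EC2 AMI in this thread.

```shell
#!/bin/sh
# Hypothetical helper: choose a value for LIBS so that the final link of
# crm_mon can resolve 'cbreak' from libtinfo.
pick_tinfo_libs() {
    lib_path="${1:-/lib64/libtinfo.so.5}"   # default path taken from the ld note
    if [ -e "$lib_path" ]; then
        # What worked in this thread: hand the shared object itself to the linker.
        printf '%s\n' "$lib_path"
    else
        # Assumption: use the usual -l form when the library lives elsewhere.
        printf '%s\n' "-ltinfo"
    fi
}

# Usage, from the top of the pacemaker source tree (not run here):
#   LIBS="$(pick_tinfo_libs)" ./configure && make
```

Setting LIBS only for the ./configure invocation should have the same effect as exporting it first, since configure records the value it sees in the generated Makefiles.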

On Mon, Dec 17, 2012 at 9:02 AM, Yossi Nachum nachum...@gmail.com wrote:

 Hi,
 I am trying to install pacemaker on an Amazon EC2 AMI instance.
 I tried to install using the packages from the pacemaker repository but had
 many missing dependencies, so I tried to compile from source.
 I downloaded the source using git, ran ./autogen.sh and ./configure successfully,
 but when I ran make I got the following error:

 make[1]: Entering directory `/usr/local/src/pacemaker/tools'
   CCLD   crm_mon
 /usr/bin/ld: crm_mon.o: undefined reference to symbol 'cbreak'
 /usr/bin/ld: note: 'cbreak' is defined in DSO /lib64/libtinfo.so.5 so try
 adding it to the linker command line
 /lib64/libtinfo.so.5: could not read symbols: Invalid operation
 collect2: ld returned 1 exit status
 make[1]: *** [crm_mon] Error 1
 make[1]: Leaving directory `/usr/local/src/pacemaker/tools'
 make: *** [core] Error 1

 I searched online but didn't find a solution, and I don't know how to add
 /lib64/libtinfo.so.5 to the linker command line.

 Can anyone help?

 Yossi

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] HA FTP Server in aws vpc

2012-12-17 Thread Yossi Nachum
Hi,
I want to run an FTP server in active/passive mode in the Amazon AWS environment.
I use a VPC and two subnets: ftp-1 is on 192.168.10.x and ftp-2 is on
192.168.20.x
The two subnets are in different availability zones.
In this configuration I don't see how I can use a VIP, so I thought of
creating an init script that changes the DNS record when one server becomes
the active server.

What do you think? Does anyone have a more elegant solution for this?

Thanks
Yossi
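
To make the DNS idea above concrete, here is a rough sketch of what the record update could look like. It assumes the zone is hosted in Amazon Route 53 and that the aws command-line tool is installed, neither of which is stated in this thread; the make_change_batch helper, the host name, the IP address and the zone ID are all placeholders.

```shell
#!/bin/sh
# Hypothetical helper: print a Route 53 change batch that points NAME at IP.
make_change_batch() {
    name="$1"; ip="$2"; ttl="${3:-60}"   # short TTL so clients re-resolve soon
    cat <<EOF
{"Changes":[{"Action":"UPSERT","ResourceRecordSet":{"Name":"$name","Type":"A","TTL":$ttl,"ResourceRecords":[{"Value":"$ip"}]}}]}
EOF
}

# Usage on the node that becomes active (zone ID and names are placeholders):
#   make_change_batch ftp.example.com. 192.168.10.5 > /tmp/batch.json
#   aws route53 change-resource-record-sets \
#       --hosted-zone-id ZXXXXXXXXXXXXX --change-batch file:///tmp/batch.json
```

Clients keep using the old address until the record's TTL expires, so failover with this approach is only as fast as the TTL allows.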


[Pacemaker] Improvement for the communication failure of booth

2012-12-17 Thread yusuke iida
Hi, Jiaju

I would like to add a function to booth that displays the state of
communication.
In the current booth, no error is reported when communication between
sites stops.
When this happens, the user cannot notice the problem.
To solve this, I am thinking of defining a new variable that stores
the communication state of paxos.
I want to display that state in the client command.
Is this idea realistic?
Are there any other good ideas?

Regards,
Yusuke
--

METRO SYSTEMS CO., LTD

Yusuke Iida
Mail: yusk.i...@gmail.com




Re: [Pacemaker] HA FTP Server in aws vpc

2012-12-17 Thread Art Zemon

Have you thought about using a load balancer instead of a VIP? The ELB can span 
subnets.
 
-- Art Z.
 


Re: [Pacemaker] HA FTP Server in aws vpc

2012-12-17 Thread Yossi Nachum
I can't use ELB with ftp port.
The ports that ELB can listen to are: 25, 80, 443 or 1024-65535

On Mon, Dec 17, 2012 at 3:25 PM, Art Zemon a...@hens-teeth.net wrote:

 Have you thought about using a load balancer instead of a VIP? The ELB can
 span subnets.



 -- Art Z.






Re: [Pacemaker] Action from a different CRMD transition results in

2012-12-17 Thread Latrous, Youssef
Hi Andrew,

Thank you for following up.

I still don't see what went wrong. From the logs, RabbitMQ was working
just fine around that time until it was ordered to shut down by CRM (for
the failed monitor?).

Moreover, I assume that transitions are ordered monotonically, which
means that Transition ID 16048 happened before Transition ID 18014:
  16048 < 18014

According to the logs, Transition ID 16048 wasn't present in the logs
dating several days before Transition ID 18014 was generated. I'll then
assume that it was generated several days ago (if not true, please give
me a way of finding out when this transition happened - I still
believe that time is of the essence in this case). Our monitor command
timers are expressed in seconds.

In that case, how can we say:
   It hasn't only just acted now. Its been repeating over and over for
the last few weeks or so.

My understanding is that a transition happens once and only once: it
succeeds, fails or is aborted altogether. Corresponding events can
repeat over and over, but each time can only be part of a new transition.
Am I missing something fundamental here?

Sorry to insist, but I have to answer this very simple question:  What
did happen here?

I'm sure you can understand my situation here.

Thank you in advance for your help,

Regards,

Youssef

-Original Message-
From: pacemaker-requ...@oss.clusterlabs.org
[mailto:pacemaker-requ...@oss.clusterlabs.org] 
Sent: Friday, December 14, 2012 5:37 AM
To: pacemaker@oss.clusterlabs.org
Subject: Pacemaker Digest, Vol 61, Issue 37


Message: 1
Date: Fri, 14 Dec 2012 13:32:32 +1100
From: Andrew Beekhof and...@beekhof.net
To: The Pacemaker cluster resource manager
pacemaker@oss.clusterlabs.org
Subject: Re: [Pacemaker] Action from a different CRMD transition
results in restarting services
Message-ID:

CAEDLWG0gzrt0w__tsZKbeELXwdaOHi9KGj_Oxm0877kMxgP=b...@mail.gmail.com
Content-Type: text/plain; charset=ISO-8859-1

On Fri, Dec 14, 2012 at 1:33 AM, Latrous, Youssef
ylatr...@broadviewnet.com wrote:

 Andrew Beekhof and...@beekhof.net wrote:
 18014 is where we're up to now, 16048 is the (old) one that scheduled
 the recurring monitor operation.
 I suspect you'll find the action failed earlier in the logs and thats
 why it needed to be restarted.

 Not the best log message though :(

 Thanks Andrew for the quick answer. I still need more info if
possible.

 I searched everywhere for transaction 16048 and I couldn't find a 
 trace of it (looked for up to 5 days of logs prior to transaction
18014).
 It would have been good if we had timestamps for each transaction 
 involved in this situation :-)

 Is there a way to find about this old transaction in any other logs (I

 looked into /var/log/messages on both nodes involved in this cluster)?

It's not really relevant.
The only important thing is that it's not one we're currently executing.

What you should care about is any logs that hopefully show you why the
resource failed at around Dec  6 22:55:05.


 To give you an idea of how many transactions happened during this
 period:
TR_ID 18010 @ 21:52:16
...
TR_ID 18018 @ 22:55:25

 Over an hour between these two events.

 Given this, how come such a (very) old transaction (~2000 transactions

 before current one) only acts now? Could it be stale information in 
 pacemaker?

No. It hasn't only just acted now. It's been repeating over and over for
the last few weeks or so.
The difference is that this time it failed.


 Thanks in advance.

 Youssef

Re: [Pacemaker] wrong device in stonith_admin -l

2012-12-17 Thread David Vossel


- Original Message -
 From: laurent+pacema...@u-picardie.fr
 To: pacemaker@oss.clusterlabs.org
 Sent: Tuesday, December 11, 2012 6:51:20 PM
 Subject: [Pacemaker] wrong device in stonith_admin -l
 
 
 Hi,
 
 I've just observed something weird.
 A node is running a stonith resource for which gethosts gives an
 empty
 node list. The result of stonith_admin -l does include it in the
 device list !
 
 result of stonith_admin -l elasticsearch-05 run from
 elasticsearch-06 :
  stonith-xen-peatbull
  stonith-xen-eddu
 2 devices found
 
 stonith-xen-peatbull is a correct fencing device
 stonith-xen-eddu is a fencing device with an empty hostlist
 
 running my-xen0 gethosts with the stonith-xen-eddu params by hand
 doesn't return any host, and it does exit with 0 (is that correct to
 return 0 with an empty host list ?)

 
 logs :
 Dec 12 01:09:10 elasticsearch-06 stonith-ng[18181]:   notice:
 stonith_device_register: Added 'stonith-cluster-xen' to the device
 list (6 active devices)
 Dec 12 01:09:10 elasticsearch-06 attrd[18183]:   notice:
 attrd_trigger_update: Sending flush op to all hosts for:
 probe_complete (true)
 Dec 12 01:09:10 elasticsearch-06 attrd[18183]:   notice:
 attrd_perform_update: Sent update 5: probe_complete=true
 Dec 12 01:09:11 elasticsearch-06 stonith-ng[18181]:   notice:
 stonith_device_register: Added 'stonith-xen-eddu' to the device list
 (6 active devices)
 Dec 12 01:09:11 elasticsearch-06 stonith-ng[18181]:   notice:
 stonith_device_register: Added 'stonith-xen-peatbull' to the device
 list (6 active devices)
 Dec 12 01:09:12 elasticsearch-06 stonith: [18434]: info:
 external/my-xen0-ha device OK.
 Dec 12 01:09:12 elasticsearch-06 crmd[18185]:   notice:
 process_lrm_event: LRM operation stonith-cluster-xen_start_0
 (call=61,rc=0, cib-update=27, confirmed=true) ok
 Dec 12 01:09:14 elasticsearch-06 stonith: [18465]: info:
 external_run_cmd: '/usr/lib/stonith/plugins/external/my-xen0 status'
 output: elasticsearch-05
 Dec 12 01:09:14 elasticsearch-06 stonith: [18465]: info:
 external_run_cmd: '/usr/lib/stonith/plugins/external/my-xen0 status'
 output: elasticsearch-06
 Dec 12 01:09:15 elasticsearch-06 stonith: [18465]: info:
 external/my-xen0 device OK.
 Dec 12 01:09:15 elasticsearch-06 crmd[18185]:   notice:
 process_lrm_event: LRM operation stonith-xen-peatbull_start_0
 (call=68, rc=0, cib-update=28, confirmed=true) ok
 Dec 12 01:09:15 elasticsearch-06 stonith: [18458]: info:
 external/my-xen0 device OK.
 Dec 12 01:09:15 elasticsearch-06 crmd[18185]:   notice:
 process_lrm_event: LRM operation stonith-xen-eddu_start_0 (call=66,
 rc=0, cib-update=29, confirmed=true) ok
 Dec 12 01:12:34 elasticsearch-06 stonith-ng[18181]:   notice:
 dynamic_list_search_cb: Disabling port list queries for
 stonith-xen-kornog (1): (null)
 Dec 12 01:12:34 elasticsearch-06 stonith-ng[18181]:   notice:
 dynamic_list_search_cb: Disabling port list queries for
 stonith-xen-nikka (1): (null)
 Dec 12 01:12:34 elasticsearch-06 stonith-ng[18181]:   notice:
 dynamic_list_search_cb: Disabling port list queries for
 stonith-xen-yoichi (1): (null)
 Dec 12 01:12:34 elasticsearch-06 stonith: [19301]: CRIT:
 external_hostlist: 'my-xen0 gethosts' returned an empty hostlist
 Dec 12 01:12:34 elasticsearch-06 stonith: [19301]: ERROR: Could not
 list hosts for external/my-xen0.
 Dec 12 01:12:37 elasticsearch-06 stonith: [19332]: CRIT:
 external_hostlist: 'my-xen0 gethosts' returned an empty hostlist
 Dec 12 01:12:37 elasticsearch-06 stonith: [19332]: ERROR: Could not
 list hosts for external/my-xen0.
 Dec 12 01:12:37 elasticsearch-06 stonith-ng[18181]:   notice:
 dynamic_list_search_cb: Disabling port list queries for
 stonith-xen-eddu (1): failed:  255

We discover what hosts an agent can fence by running this command internally in 
stonith.

# agent -o list

From there we expect an exit code of 0 and the list of nodes to be in the output.
https://fedorahosted.org/cluster/wiki/FenceAgentAPI

Looking at your logs, stonith-xen-eddu is returning -1 (255) as the return code 
when we issue the 'list' action.  That means we don't try to get the dynamic 
list again; we assume the 'list' action isn't supported. From there we fall 
back to using the 'status' action to dynamically determine if the agent can fence a 
particular host.  I'm guessing the 'status' action is returning true (return 
codes 0 or 2) for hosts you wouldn't expect the agent to be able to fence, for 
some reason.
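
The fallback described above can be sketched roughly like this. It is an illustration only, not the stonith-ng code: can_fence is a made-up name, and "$agent" stands for any command that accepts 'list' and 'status HOST' as arguments, which is a simplification of how a real fence agent is invoked.

```shell
#!/bin/sh
# Sketch of the host-discovery fallback: try 'list' first, and only if it
# fails ask 'status' about the specific host.
can_fence() {
    agent="$1"; host="$2"
    if hosts="$("$agent" list)"; then
        # 'list' exited 0: trust its output, one host per line.
        printf '%s\n' "$hosts" | grep -qx -- "$host"
        return $?
    fi
    # 'list' failed (for example with 255): treat it as unsupported.
    "$agent" status "$host"
    rc=$?
    # Return codes 0 or 2 from 'status' both count as "can fence this host".
    [ "$rc" -eq 0 ] || [ "$rc" -eq 2 ]
}
```

In this sketch, an agent whose 'list' exits 0 with an empty host list would never be offered for any host, while one whose 'list' exits 255 is judged purely by its 'status' result.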

-- Vossel

 
 David, I mentioned a node being wrongly fenced in the
 stonith-timeout
 duration 0 is too low bug, could it be related ?
 
 
 --
 

[Pacemaker] Patrik Rapposch is out of the office

2012-12-17 Thread Patrik . Rapposch

I will be out of the office from 17.12.2012. I will return on
19.12.2012.

Please note that I am not available. Please always use
ksi.netw...@knapp.com, which ensures that one of our network
administrators takes care of your request.




[Pacemaker] Multi-state slave resource promoted when node was not quorate, expected?

2012-12-17 Thread Jesse Hathaway
We had a switch failure and all the nodes were partitioned. The slave node
promoted its resource while it did not have quorum. We have
no-quorum-policy set to freeze. Is it expected for resource promotion to
occur when a node does not have quorum?

-- 
Jesse Hathaway, Systems Engineer
Braintree http://getbraintree.com
917-418-8423


Re: [Pacemaker] Ordered resource is not restarting after migration if it's already started on new host

2012-12-17 Thread Neal Peters

On Dec 16, 2012, at 7:29 PM, pacemaker-requ...@oss.clusterlabs.org wrote:

 Message: 5
 Date: Mon, 17 Dec 2012 14:23:15 +1100
 From: Andrew Beekhof and...@beekhof.net
 To: The Pacemaker cluster resource manager
   pacemaker@oss.clusterlabs.org
 Subject: Re: [Pacemaker] Ordered resource is not restarting after
   migration if it's already started on new host
 Message-ID:
   caedlwg35tfnghmm_fussxedryamss5owfxrdlg5ytcmj7yx...@mail.gmail.com
 Content-Type: text/plain; charset=ISO-8859-1
 
 On Sat, Dec 15, 2012 at 10:58 AM, Neal Peters nealppet...@gmail.com wrote:
 Hello-
 
 I'm running Pacemaker v. 1.1 (pacemaker-1.1.7-6.el6.x86_64) on CentOS 6.3 
 and am observing behavior on my systems that differs from the behavior 
 described in the manual.
 
 Basically, the desired behavior (and the behavior described in Pacemaker 
 Explained Section 6.3.1) is that when a first resource in an ordered set 
 is moved to a host where the then resource is already running, the then 
 resource will be restarted.
 
 From Pacemaker Explained 6.3.1 Mandatory Ordering:
 -If the first resource is (re)started while the then resource is running, 
 the then resource will be stopped and restarted.
 
 I am not seeing this behavior however.  I am seeing that the then resource 
 is left running.
 
 
 I have 2 servers running a fairly basic setup that is fairly close to the 
 one described in the Clusters from Scratch document. Config follows:
 
 node host2
 node host1
 primitive ClusterIP ocf:heartbeat:IPaddr2 \
params ip=192.168.0.225 cidr_netmask=32 \
op monitor interval=1s \
meta target-role=Started
 primitive DNSserver lsb:named \
op monitor interval=1s
 colocation ip-with-DNSserver inf: DNSserver ClusterIP
 order DNS-server-after-ip inf: ClusterIP DNSserver
 property $id=cib-bootstrap-options \
dc-version=1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14 \
cluster-infrastructure=openais \
expected-quorum-votes=2 \
stonith-enabled=false \
no-quorum-policy=ignore \
last-lrm-refresh=1355268791
 rsc_defaults $id=rsc-options \
resource-stickiness=102
 
 When the DNSserver resource is migrated from one node to the other and named 
 is already started on the other node (for whatever reason), named is not 
 restarted
 
 1) Ordering constraints are behaving as expected, DNSserver is started
 after ClusterIP
 2) Starting something (DNSserver) that is already started is a no-op
 3) Don't start cluster services outside of the cluster
 
 3 is the root problem in your case

Thank you for your prompt reply.  It sounds as though Pacemaker is operating in 
the way that you expect in this situation.

Your description of Pacemaker behavior
 2) Starting something (DNSserver) that is already started is a no-op


differs from behavior described in the documentation
 -If the first resource is (re)started while the then resource is running, 
 the then resource will be stopped and restarted.
(Section 6.3.1, http://clusterlabs.org/doc/en-US/Pacemaker/1.1-crmsh/html/Pacemaker_Explained/s-resource-ordering.html#_mandatory_ordering)

Is there a place that I can/should report this discrepancy between actual 
behavior and behavior described in the documentation?

Thank you.


 
 
 Dec 14 15:32:28 host1 snmpd[5296]: Connection from UDP: 
 [192.168.0.129]:51000-[192.168.0.93]
 Dec 14 15:32:40 host1 lrmd: [8733]: info: rsc:ClusterIP:5: start
 Dec 14 15:32:40 host1 IPaddr2(ClusterIP)[9542]: INFO: ip -f inet addr add 
 192.168.0.225/32 brd 192.168.0.225 dev eth1
 Dec 14 15:32:40 host1 IPaddr2(ClusterIP)[9542]: INFO: ip link set eth1 up
 Dec 14 15:32:40 host1 IPaddr2(ClusterIP)[9542]: INFO: 
 /usr/lib64/heartbeat/send_arp -i 200 -r 5 -p /var/run/heartbeat/rsctmp/send_arp-192.168.0.225 eth1 192.168.0.225 auto not_used not_used
 Dec 14 15:32:41 host1 crmd[8736]: info: process_lrm_event: LRM operation 
 ClusterIP_start_0 (call=5, rc=0, cib-update=10, confirmed=true) ok
 Dec 14 15:32:41 host1 lrmd: [8733]: info: rsc:ClusterIP:6: monitor
 Dec 14 15:32:41 host1 lrmd: [8733]: info: rsc:DNSserver:7: start
 Dec 14 15:32:41 host1 lrmd: [9601]: WARN: For LSB init script, no additional 
 parameters are needed.
 Dec 14 15:32:41 host1 lrmd: [8733]: info: RA output: 
 (DNSserver:start:stdout) Starting named:
 Dec 14 15:32:41 host1 lrmd: [8733]: info: RA output: 
 (DNSserver:start:stdout) named: already running
 Dec 14 15:32:41 host1 lrmd: [8733]: info: RA output: 
 (DNSserver:start:stdout) [  OK
 Dec 14 15:32:41 host1 lrmd: [8733]: info: RA output: 
 (DNSserver:start:stdout) ]#015
 Dec 14 15:32:41 host1 lrmd: [8733]: info: RA output: (DNSserver:start:stdout)
 Dec 14 15:32:41 host1 crmd[8736]: info: process_lrm_event: LRM operation 
 DNSserver_start_0 (call=7, rc=0, cib-update=11, confirmed=true) ok
 Dec 14 15:32:41 host1 lrmd: [8733]: info: rsc:DNSserver:8: monitor
 Dec 14 15:32:41 host1 crmd[8736]: info: process_lrm_event: LRM operation 
 ClusterIP_monitor_1000 

[Pacemaker] reloading crm changes

2012-12-17 Thread Paul Shannon - NOAA Federal
I'm just getting our cluster set up and seem to be missing something about
changes made using the crm program. I added some resources and groups using
crm configure edit.  After saving and committing my changes I can see
the new resources in crm resource show, but they are stopped.  After running
start on a resource, they are still stopped.  Also, exiting and running
crm_mon does *not* show the new resources.  I tried a cleanup of the resource
just in case, but that did not change anything either.

I thought the whole idea of the live resources was they took effect
immediately. Am I missing a step?

Paul Shannon
-
Speak the truth, but leave immediately after. - Slovenian proverb

Paul Shannon paul.shan...@noaa.gov
ITO, WFO Juneau
NOAA, National Weather Service


Re: [Pacemaker] booth is the state of started on pacemaker before booth write ticket info in cib.

2012-12-17 Thread Jiaju Zhang
On Mon, 2012-12-17 at 10:40 +0900, Yuichi SEINO wrote:
 Hi Jiaju,
 
  
   Perhaps,  this problem didn't happen before the following commit.
   https://github.com/jjzhang/booth/commit/4b00d46480f45a205f2550ff0760c8b372009f7f
  
   Currently when all of the initialization (including loading the new
   ticket information) finished, booth should be regarded as ready. So if
   you encounter some problem here, I guess we should improve the RA to
   better reflect the booth startup status, but not moving the
   initialization order, since it may introduce other regression as we have
   encountered before;)
  
 
  I am still not sure whether we should fix the RA or booth.
 
  I suggest adding a new function to clear the old ticket info in the CIB,
  and calling that function when booth has just started but before it is
  daemonized. So, before booth_start in the RA returns, the stale data has
  been cleared. What do you think about this?;)
 
 
 In the case where booth uses the CIB info, can you implement it? For example,
 booth fails over locally. Then booth needs to get the ticket from the
 CIB. If this is not a problem, I can agree to it.

OK, I'll implement it;)

Thanks,
Jiaju





[Pacemaker] Mysql configuration

2012-12-17 Thread codey koble
I'm having a bit of trouble setting up the master/slave mysql
configuration with pacemaker. I am using Ubuntu 10.04 LTS with the most recent
resource agent package from
https://launchpad.net/~ubuntu-ha-maintainers/+archive/ppa. When I check the
status in pacemaker, my first node shows as master and the second as slave,
and checking mysql confirms this. However, the slave is not correctly set up
against the master: the log file and position are incorrect, so it is not
picking up any changes from the master. I noticed in the config that two
lines are added automatically to the node attributes for the slave,
specifying the file and position, but they are incorrect.
Where are these generated from, or how can I configure things to properly
detect the master log file and position?


Re: [Pacemaker] reloading crm changes

2012-12-17 Thread Andreas Kurz
On 12/17/2012 11:29 PM, Paul Shannon - NOAA Federal wrote:
 I'm just getting our cluster set up and seem to be missing something
 about changes made using the crm program. I added some resources and
 groups using crm = configure = edit.  After saving and committing my
 changes I can see the new resources in resource = show but they are
 stopped.  After running  start resource  they are still stopped. 
 Also, exiting and running crm_mon does *not* show the new resources.  I
 tried a  clean resource  just in case, but that did not change
 anything either. 

By default stonith is enabled. Have you configured a stonith resource?
If not, resource management is disabled until you do, or until you
disable stonith. You also need quorum unless you tell the cluster to ignore it.
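
As a quick way to check the two settings mentioned above, something like the following could be used. It is only a sketch: check_cluster_props is a made-up helper, and it assumes the properties appear in the crm configure show output as name=value pairs, with or without quotes around the value.

```shell
#!/bin/sh
# Hypothetical helper: read `crm configure show` output on stdin and report
# whether stonith is disabled and loss of quorum is ignored.
check_cluster_props() {
    cfg="$(cat)"
    rc=0
    printf '%s\n' "$cfg" | grep -Eq 'stonith-enabled="?false"?' \
        || { echo "stonith is still enabled (or unset)"; rc=1; }
    printf '%s\n' "$cfg" | grep -Eq 'no-quorum-policy="?ignore"?' \
        || { echo "no-quorum-policy is not set to ignore"; rc=1; }
    return $rc
}

# Usage on a cluster node (not run here):
#   crm configure show | check_cluster_props
```

A clean exit only says the properties are set this way; it does not say that running without stonith is a good idea for a production cluster.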

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

 
 I thought the whole idea of the live resources was they took effect
 immediately. Am I missing a step?
 







Re: [Pacemaker] reloading crm changes

2012-12-17 Thread Paul Shannon - NOAA Federal
Andreas,

I do have no-quorum-policy=ignore set and stonith-enabled=false.
Also, I do have some resources running.  It's just that when I tried to add
another one, I could not get it to take.

Paul

-
Speak the truth, but leave immediately after. - Slovenian proverb

Paul Shannon paul.shan...@noaa.gov
ITO, WFO Juneau
NOAA, National Weather Service




On Mon, Dec 17, 2012 at 11:22 PM, Andreas Kurz andr...@hastexo.com wrote:

 On 12/17/2012 11:29 PM, Paul Shannon - NOAA Federal wrote:
  I'm just getting our cluster set up and seem to be missing something
  about changes made using the crm program. I added some resources and
  groups using crm = configure = edit.  After saving and committing my
  changes I can see the new resources in resource = show but they are
  stopped.  After running  start resource  they are still stopped.
  Also, exiting and running crm_mon does *not* show the new resources.  I
  tried a  clean resource  just in case, but that did not change
  anything either.

 By default stonith is enabled  you have configured a
 stonith-resource? If not, resource management is disabled until you do
 ... or disable stonith ... and you need quorum if you don't ignore it 

 Regards,
 Andreas

 --
 Need help with Pacemaker?
 http://www.hastexo.com/now

 


Re: [Pacemaker] reloading crm changes

2012-12-17 Thread Andreas Kurz
On 12/18/2012 12:58 AM, Paul Shannon - NOAA Federal wrote:
 Andreas,
 
 I do have  no-quorum-policy=ignore  set and  stonith-enabled=false.
 Also, I do have some resources running.  Its just when I tried to add
 another one that I cannot get it to take.

What does crm_mon -1frA show? And of course the logs should give all the
information needed ...

Regards,
Andreas

 


-- 
Need help with Pacemaker?
http://www.hastexo.com/now






Re: [Pacemaker] Ordered resource is not restarting after migration if it's already started on new host

2012-12-17 Thread Andrew Beekhof
On Tue, Dec 18, 2012 at 6:28 AM, Neal Peters nealppet...@gmail.com wrote:

 On Dec 16, 2012, at 7:29 PM, pacemaker-requ...@oss.clusterlabs.org wrote:

 Message: 5
 Date: Mon, 17 Dec 2012 14:23:15 +1100
 From: Andrew Beekhof and...@beekhof.net
 To: The Pacemaker cluster resource manager
 pacemaker@oss.clusterlabs.org
 Subject: Re: [Pacemaker] Ordered resource is not restarting after
 migration if it's already started on new host
 Message-ID:
 caedlwg35tfnghmm_fussxedryamss5owfxrdlg5ytcmj7yx...@mail.gmail.com
 Content-Type: text/plain; charset=ISO-8859-1


 On Sat, Dec 15, 2012 at 10:58 AM, Neal Peters nealppet...@gmail.com wrote:

 Hello-


 I'm running Pacemaker v. 1.1 (pacemaker-1.1.7-6.el6.x86_64) on CentOS 6.3
 and am observing behavior on my systems that differs from the behavior
 described in the manual.


 Basically, the desired behavior (and the behavior described in Pacemaker
 Explained Section 6.3.1) is that when the "first" resource in an ordered set
 is moved to a host where the "then" resource is already running, the "then"
 resource will be restarted.


 From Pacemaker Explained 6.3.1 Mandatory Ordering:

 -If the "first" resource is (re)started while the "then" resource is running,
 the "then" resource will be stopped and restarted.


 I am not seeing this behavior, however.  I am seeing that the "then" resource
 is left running.



 I have 2 servers running a fairly basic setup that is fairly close to the
 one described in the Clusters from Scratch document. Config follows:


 node host2
 node host1
 primitive ClusterIP ocf:heartbeat:IPaddr2 \
     params ip=192.168.0.225 cidr_netmask=32 \
     op monitor interval=1s \
     meta target-role=Started
 primitive DNSserver lsb:named \
     op monitor interval=1s
 colocation ip-with-DNSserver inf: DNSserver ClusterIP
 order DNS-server-after-ip inf: ClusterIP DNSserver
 property $id=cib-bootstrap-options \
     dc-version=1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14 \
     cluster-infrastructure=openais \
     expected-quorum-votes=2 \
     stonith-enabled=false \
     no-quorum-policy=ignore \
     last-lrm-refresh=1355268791
 rsc_defaults $id=rsc-options \
     resource-stickiness=102


 When the DNSserver resource is migrated from one node to the other and named
 is already started on the other node (for whatever reason), named is not
 restarted


 1) Ordering constraints are behaving as expected, DNSserver is started
 after ClusterIP
 2) Starting something (DNSserver) that is already started is a no-op
 3) Don't start cluster services outside of the cluster

 3 is the root problem in your case


 Thank you for your prompt reply.  It sounds as though Pacemaker is operating
 in the way that you expect in this situation.

 Your description of Pacemaker behavior

 2) Starting something (DNSserver) that is already started is a no-op


 differs from behavior described in the documentation

No, it doesn't.
The cluster _is_ trying to start the resource (we stopped it on the
old host and are trying to start it on the new one), however the named
init script is simply ignoring the request because named is already
running.

This behaviour of the named script is also mandated by the LSB standard,
which is why I said #3 is the problem you need to fix.
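
To make the LSB point concrete, here is a minimal sketch of a start action
that treats "already running" as success. It is not the real named init
script; PIDFILE, service_is_running and service_start are hypothetical names
used only for illustration, and a background sleep stands in for the daemon.

```shell
#!/bin/sh
# Sketch of LSB-style start semantics; NOT the real /etc/init.d/named.
# PIDFILE and the service_* function names are hypothetical.
PIDFILE="${PIDFILE:-/tmp/demo-service.pid}"

service_is_running() {
    # Running if the pidfile exists and names a live process.
    [ -f "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null
}

service_start() {
    if service_is_running; then
        # LSB: "start" on an already-running service counts as success,
        # so the caller cannot tell that nothing was (re)started.
        echo "already running"
        return 0
    fi
    # Stand-in for launching the real daemon.
    sleep 30 >/dev/null 2>&1 &
    echo $! > "$PIDFILE"
    echo "started"
    return 0
}
```

Both calls return 0, which is why the cluster sees a successful start even
though the service was never restarted on the new node.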


 -If the first resource is (re)started while the then resource is running,
 the then resource will be stopped and restarted.

 (
 http://clusterlabs.org/doc/en-US/Pacemaker/1.1-crmsh/html/Pacemaker_Explained/s-resource-ordering.html#_mandatory_ordering
 Section 6.3.1)

 Is there a place that I can/should report this discrepancy between actual
 behavior and behavior described in the documentation?

 Thank you.




 Dec 14 15:32:28 host1 snmpd[5296]: Connection from UDP: [192.168.0.129]:51000-[192.168.0.93]
 Dec 14 15:32:40 host1 lrmd: [8733]: info: rsc:ClusterIP:5: start
 Dec 14 15:32:40 host1 IPaddr2(ClusterIP)[9542]: INFO: ip -f inet addr add 192.168.0.225/32 brd 192.168.0.225 dev eth1
 Dec 14 15:32:40 host1 IPaddr2(ClusterIP)[9542]: INFO: ip link set eth1 up
 Dec 14 15:32:40 host1 IPaddr2(ClusterIP)[9542]: INFO: /usr/lib64/heartbeat/send_arp -i 200 -r 5 -p /var/run/heartbeat/rsctmp/send_arp-192.168.0.225 eth1 192.168.0.225 auto not_used not_used
 Dec 14 15:32:41 host1 crmd[8736]: info: process_lrm_event: LRM operation ClusterIP_start_0 (call=5, rc=0, cib-update=10, confirmed=true) ok
 Dec 14 15:32:41 host1 lrmd: [8733]: info: rsc:ClusterIP:6: monitor
 Dec 14 15:32:41 host1 lrmd: [8733]: info: rsc:DNSserver:7: start
 Dec 14 15:32:41 host1 lrmd: [9601]: WARN: For LSB init script, no additional parameters are needed.
 Dec 14 15:32:41 host1 lrmd: [8733]: info: RA output: (DNSserver:start:stdout) Starting named:
 Dec 14 15:32:41 host1 lrmd: [8733]: info: RA output: (DNSserver:start:stdout) named: already running
 Dec 14 15:32:41 host1 lrmd: [8733]: info: RA output: (DNSserver:start:stdout) [  OK

 Dec 14 15:32:41 host1 lrmd: [8733]: info: 

Re: [Pacemaker] Multi-state slave resource promoted when node was not quorate, expected?

2012-12-17 Thread Andrew Beekhof
On Tue, Dec 18, 2012 at 5:55 AM, Jesse Hathaway
jesse.hatha...@getbraintree.com wrote:
 We had a switch failure and all the nodes were partitioned. The slave node
 promoted its resource while it did not have quorum. We have no-quorum-policy
 set to freeze. Is it expected for resource promotion to occur when a node
 does not have quorum?

No. That sounds like a bug.  Can you attach a crm_report tarball to a
bugzilla entry please?



Re: [Pacemaker] Action from a different CRMD transition results in

2012-12-17 Thread Andrew Beekhof
On Tue, Dec 18, 2012 at 1:39 AM, Latrous, Youssef
ylatr...@broadviewnet.com wrote:
 Hi Andrew,

 Thank you for following up.

 I still don't see what went wrong. From the logs, RabbitMQ was working
 just fine around that time until it was ordered to shut down by CRM (for
 the failed monitor?).

Apparently not, otherwise the monitor would not have reported a failure.
Something went wrong, either in the resource script or in RabbitMQ itself.


 Moreover, I assume that transitions are ordered monotonically, which
 means that Transition ID 16048 happened before Transition ID 18014:
   16048 < 18014

 According to the logs, Transition ID 16048 wasn't present in the logs
 dating several days before transition ID 18014 was generated. I'll then
 assume that it was generated several days ago (if not true, please give
 me a way of finding out when this transition happened - I still
 believe that time is of the essence in this case). Our monitor command
 timers are expressed in seconds.

 In that case, how can we say:
 "It hasn't only just acted now. It's been repeating over and over for
 the last few weeks or so."

Because that's how it's designed; that's what recurring monitors do. The
lrmd schedules them to run over and over every N seconds, and the lrmd
lets us know when something changes.
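
As a rough illustration of "scheduled once, repeats on its own, reports only
changes", here is a small sketch. It is not lrmd code; report_changes and its
arguments are hypothetical. The operation is set up a single time (in this
thread, by transition 16048), and a change in its result can surface much
later, while an unrelated transition is current.

```shell
#!/bin/sh
# Sketch only: mimics a recurring monitor that is scheduled once and then
# repeats on its own. This is NOT lrmd code; all names are hypothetical.

# report_changes INTERVAL COUNT CMD...
# Runs CMD every INTERVAL seconds, COUNT times, and prints a line for the
# first result and then only when the exit status differs from the last run.
report_changes() {
    interval=$1; count=$2; shift 2
    last=""
    i=1
    while [ "$i" -le "$count" ]; do
        "$@" >/dev/null 2>&1
        rc=$?
        if [ "$rc" != "$last" ]; then
            echo "run $i: rc=$rc"
            last=$rc
        fi
        i=$((i + 1))
        # Wait before the next run, but not after the final one.
        [ "$i" -le "$count" ] && sleep "$interval"
    done
    return 0
}
```

For example, report_changes 10 1000 some_check would stay silent for as long
as some_check keeps returning the same status, however many runs that takes.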


 My understanding is that a transition happens once and only once: it
 succeeds, fails or is aborted altogether.

No.

 Corresponding events can
 repeat over and over, but each time can only be part of a new transition.
 Am I missing something fundamental here?

Yes.  See above.


 Sorry to insist, but I have to answer this very simple question:  What
 did happen here?

Your resource or resource agent had a problem.
More than that I can't say because I don't have access to your logs.


 I'm sure you can understand my situation here.

 Thank you in advance for your help,

 Regards,

 Youssef

 -Original Message-
 From: pacemaker-requ...@oss.clusterlabs.org
 [mailto:pacemaker-requ...@oss.clusterlabs.org]
 Sent: Friday, December 14, 2012 5:37 AM
 To: pacemaker@oss.clusterlabs.org
 Subject: Pacemaker Digest, Vol 61, Issue 37


 Today's Topics:

1. Re: Action from a different CRMD transition results in
   restarting services (Andrew Beekhof)
2. Re: problem with float IP with pacemaker (Andrew Beekhof)
3. cman+qdisk+pacemaker - pacemaker qdisk node offline (Rob)
4. Re: booth is the state of started on pacemaker before booth
   write ticket info in cib. (Jiaju Zhang)
5. Pacemaker stop behaviour when underlying resource is
   unavailable (pavan tc)


 --

 Message: 1
 Date: Fri, 14 Dec 2012 13:32:32 +1100
 From: Andrew Beekhof and...@beekhof.net
 To: The Pacemaker cluster resource manager
 pacemaker@oss.clusterlabs.org
 Subject: Re: [Pacemaker] Action from a different CRMD transition
 results in restarting services
 Message-ID:

 CAEDLWG0gzrt0w__tsZKbeELXwdaOHi9KGj_Oxm0877kMxgP=b...@mail.gmail.com
 Content-Type: text/plain; charset=ISO-8859-1

 On Fri, Dec 14, 2012 at 1:33 AM, Latrous, Youssef
 ylatr...@broadviewnet.com wrote:

 Andrew Beekhof and...@beekhof.net wrote:
 18014 is where we're up to now, 16048 is the (old) one that scheduled
 the recurring monitor operation.
 I suspect you'll find the action failed earlier in the logs and that's
 why it needed to be restarted.

 Not the best log message though :(

 Thanks Andrew for the quick answer. I still need more info if
 possible.

 I searched everywhere for transaction 16048 and I couldn't find a
 trace of it (looked for up to 5 days of logs prior to transaction
 18014).
 It would have been good if we had timestamps for each transaction
 involved in this situation :-)

 Is there a way to find out about this old transaction in any other logs (I
 looked into /var/log/messages on both nodes involved in this cluster)?

 It's not really relevant.
 The only important thing is that it's not one we're currently executing.

 What you should care about is any logs that hopefully show you why the
 resource failed at around Dec  6 22:55:05.


 To give you an idea of how many transactions happened during this
 period:
TR_ID 18010 @ 21:52:16
...
TR_ID 18018 @ 22:55:25

 Over an hour between these two events.

 Given this, how come such a (very) old transaction (~2000 transactions

 before current one) only acts now? Could it be stale information in
 

Re: [Pacemaker] Pacemaker stop behaviour when underlying resource is unavailable

2012-12-17 Thread Andrew Beekhof
On Fri, Dec 14, 2012 at 9:32 PM, pavan tc pavan...@gmail.com wrote:
 Hi,

 I have structured my multi-state resource agent as below when the underlying
 resource becomes unavailable for some reason:

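 # OCF_* return codes are provided by the ocf-shellfuncs include.
 monitor()
 {
     state=$(get_primitive_resource_state)

     # ...
     if [ "$state" = "unavailable" ]; then
         # Report "not running" when the underlying resource is gone.
         return $OCF_NOT_RUNNING
     fi

     # ...
 }

 stop()
 {
     monitor
     ret=$?

     # Stopping something that is not running counts as success.
     if [ "$ret" -eq "$OCF_NOT_RUNNING" ]; then
         return $OCF_SUCCESS
     fi
 }

 start()
 {
     if ! start_primitive; then
         return $OCF_ERR_GENERIC
     fi
 }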

 The idea is to make sure that stop does not fail when the underlying
 resource goes away.
 (Otherwise I see that the resource gets to an unmanaged state)
 Also, the expectation is that when the resource comes back, it joins the
 cluster without much fuss.

 What I see is that pacemaker calls stop twice

That would not be expected. Bug?

 and if it finds that stop
 returns success,
 it does not continue with monitor any more. I also do not see an attempt to
 start.

Anywhere?  Or just on the same node?


 Is there a way to keep the monitor going in such circumstances?

Not really. You can define a recurring monitor for the Stopped role though.
But why would it come back?  You _really_ should not be starting
services outside of the cluster - not least of all because we've
probably started it somewhere else in the meantime.
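
For reference, a recurring monitor for the Stopped role could be declared
roughly as in the sketch below. The names p_example, ms_example and
ocf:vendor:example, as well as the intervals and meta attributes, are
placeholders and not taken from this thread. The helper only prints a crm
shell fragment so it can be reviewed before being loaded.

```shell
#!/bin/sh
# Sketch only: p_example, ms_example and ocf:vendor:example are hypothetical.
# Prints a crm shell fragment with one monitor per role, including Stopped.
# Monitor intervals must differ from one another within a resource.
stopped_role_monitor_cfg() {
    cat <<'EOF'
primitive p_example ocf:vendor:example \
    op monitor interval=10s role=Master \
    op monitor interval=20s role=Slave \
    op monitor interval=30s role=Stopped
ms ms_example p_example \
    meta master-max=1 clone-max=2 notify=true
EOF
}

# Review the fragment, then load it with the crm shell, for example:
#   stopped_role_monitor_cfg > /tmp/stopped-monitor.crm
#   crm configure load update /tmp/stopped-monitor.crm
```

With such a monitor in place, the cluster keeps probing nodes where the
resource is supposed to be stopped and notices if it shows up there.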

 Am I using incorrect resource agent return codes?

 Thanks,
 Pavan



Re: [Pacemaker] Pacemaker stop behaviour when underlying resource is unavailable

2012-12-17 Thread pavan tc
[..]

 The idea is to make sure that stop does not fail when the underlying

  resource goes away.
  (Otherwise I see that the resource gets to an unmanaged state)
  Also, the expectation is that when the resource comes back, it joins the
  cluster without much fuss.
 
  What I see is that pacemaker calls stop twice

 That would not be expected. Bug?


Are you referring to stop getting called twice? If so, I will confirm the
behaviour once more and raise a bug.



  and if it finds that stop
  returns success,
  it does not continue with monitor any more. I also do not see an attempt
 to
  start.

 Anywhere?  Or just on the same node?


On the same node. The resource does get promoted on the other node.
My expectation was that if I kept returning OCF_NOT_RUNNING in monitor,
then it should attempt a start-stop-monitor cycle till the resource came
back.
It seems this is not what the cluster manager does?


  Is there a way to keep the monitor going in such circumstances?

 Not really. You can define a recurring monitor for the Stopped role though.


I did not want to go there if I could achieve it via the usual mechanisms.
If that is not possible, I will explore this option in more detail.

But why would it come back?  You _really_ should not be starting
 services outside of the cluster - not least of all because we've
 probably started it somewhere else in the meantime.


Even if we started the resource elsewhere, we are running in degraded mode.
(My bad, I did not mention this is a _two-node_ multi-state resource).
We would like to come back to the available mode as early as possible and
with the least amount of manual intervention with the cluster.

Pavan


  Am I using incorrect resource agent return codes?
 
  Thanks,
  Pavan
 


Re: [Pacemaker] Pacemaker stop behaviour when underlying resource is unavailable

2012-12-17 Thread Andrew Beekhof
On Tue, Dec 18, 2012 at 4:24 PM, pavan tc pavan...@gmail.com wrote:
 [..]


 The idea is to make sure that stop does not fail when the underlying

  resource goes away.
  (Otherwise I see that the resource gets to an unmanaged state)
  Also, the expectation is that when the resource comes back, it joins the
  cluster without much fuss.
 
  What I see is that pacemaker calls stop twice

 That would not be expected. Bug?


 Are you pointing at stop getting called 'twice'?

Correct

 If yes, I will confirm once
 more about
 the behaviour and will raise a bug.



  and if it finds that stop
  returns success,
  it does not continue with monitor any more. I also do not see an attempt
  to
  start.

 Anywhere?  Or just on the same node?


 On the same node. The resource does get promoted on the other node.
 My expectation was that if I kept returning OCF_NOT_RUNNING in monitor,
 then it should attempt a start-stop-monitor cycle till the resource came
 back.
 It seems this is not what the cluster manager does?

Not always, it very much depends on the constraints you've defined and
things like migration-threshold.


 
  Is there a way to keep the monitor going in such circumstances?

 Not really. You can define a recurring monitor for the Stopped role
 though.


 I did not want to go there if I could achieve it via the usual mechanisms.

If you want to monitor a resource on a node that it's not running on,
that _is_ the usual mechanism.
The thing is that it's an unusual thing to want to do.

 If that is not possible, I will explore this option in more detail.

 But why would it come back?  You _really_ should not be starting
 services outside of the cluster - not least of all because we've
 probably started it somewhere else in the meantime.


 Even if we started the resource elsewhere, we are running in degraded mode.

Not on the node for which you returned stopped.
There you are just flat-out not running at all.

 (My bad, I did not mention this is a _two-node_ multi-state resource).
 We would like to come back to the available mode as early as possible and
 with the least amount of manual intervention with the cluster.

Normally I wouldn't expect any manual intervention either, but I
really can't comment further without seeing logs and configs.
