Re: [Pacemaker] Installing pacemaker on aws ec2 server
I fixed this error using the LIBS environment variable. I ran:

    export LIBS=/lib64/libtinfo.so.5

then ran ./configure again, and make completed successfully.

On Mon, Dec 17, 2012 at 9:02 AM, Yossi Nachum nachum...@gmail.com wrote:
> Hi,
> I am trying to install pacemaker on an Amazon EC2 AMI instance. I tried to install using the packages from the pacemaker repository but had many missing dependencies, so I tried to compile from source. I downloaded the source using git, ran ./autogen.sh and configure successfully, but when I tried to make I got the following error:
>
> make[1]: Entering directory `/usr/local/src/pacemaker/tools'
>   CCLD crm_mon
> /usr/bin/ld: crm_mon.o: undefined reference to symbol 'cbreak'
> /usr/bin/ld: note: 'cbreak' is defined in DSO /lib64/libtinfo.so.5 so try adding it to the linker command line
> /lib64/libtinfo.so.5: could not read symbols: Invalid operation
> collect2: ld returned 1 exit status
> make[1]: *** [crm_mon] Error 1
> make[1]: Leaving directory `/usr/local/src/pacemaker/tools'
> make: *** [core] Error 1
>
> I tried to google it but didn't find a solution, and I don't know how to add /lib64/libtinfo.so.5 to the linker command line. Can anyone help?
>
> Yossi

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
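The workaround described in this message can be sketched as a few shell steps. This is only a sketch: it assumes the library path printed in the linker note (/lib64/libtinfo.so.5) and that it is run from the top of the Pacemaker source tree. The guard around ./configure is an illustrative addition so the snippet does nothing when run elsewhere.

```shell
#!/bin/sh
# Library named in the linker note above; the path may differ on other systems.
LIBTINFO=/lib64/libtinfo.so.5

# Autoconf-generated configure scripts pass $LIBS on to the link commands,
# so the DSO that defines 'cbreak' ends up on the linker command line.
LIBS="$LIBTINFO"
export LIBS

# Re-run configure and make only when a configure script is present
# (this guard is an assumption for illustration, not part of the original fix).
if [ -x ./configure ]; then
    ./configure && make
fi
```

Setting LIBS in the environment before ./configure is a blunt instrument, since it adds the library to every link step, but it avoids editing the generated Makefiles by hand.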
[Pacemaker] HA FTP Server in aws vpc
Hi,

I want to run an FTP server in active/passive mode in an Amazon AWS environment. I use a VPC and two subnets: ftp-1 is on 192.168.10.x and ftp-2 is on 192.168.20.x. The two subnets are in different availability zones.

In this configuration I don't see how I can use a VIP, so I thought of creating an init script that changes the DNS record when one server becomes the active server.

What do you think? Does anyone have a more elegant solution for this?

Thanks,
Yossi
[Pacemaker] Improvement for the communication failure of booth
Hi Jiaju,

I would like to add a function to booth that displays the communication state. In the current booth, when communication between sites stops, no error is reported, so the user cannot notice the problem.

To solve this, I am thinking of defining a new variable that stores the communication state of paxos, and of displaying that state through the client command.

Is this idea realistic? Are there any other good ideas?

Regards,
Yusuke

--
METRO SYSTEMS CO., LTD
Yusuke Iida
Mail: yusk.i...@gmail.com
Re: [Pacemaker] HA FTP Server in aws vpc
Have you thought about using a load balancer instead of a VIP? The ELB can span subnets.

-- Art Z.

-Original Message-
From: Yossi Nachum nachum...@gmail.com
Sent: Monday, December 17, 2012 2:22am
To: pacemaker@oss.clusterlabs.org
Subject: [Pacemaker] HA FTP Server in aws vpc

> Hi,
> I want to run an FTP server in active/passive mode in an Amazon AWS environment. I use a VPC and two subnets: ftp-1 is on 192.168.10.x and ftp-2 is on 192.168.20.x. The two subnets are in different availability zones.
> In this configuration I don't see how I can use a VIP, so I thought of creating an init script that changes the DNS record when one server becomes the active server.
> What do you think? Does anyone have a more elegant solution for this?
> Thanks,
> Yossi
Re: [Pacemaker] HA FTP Server in aws vpc
I can't use ELB with the FTP port. The ports that ELB can listen on are: 25, 80, 443 or 1024-65535.

On Mon, Dec 17, 2012 at 3:25 PM, Art Zemon a...@hens-teeth.net wrote:
> Have you thought about using a load balancer instead of a VIP? The ELB can span subnets.
> -- Art Z.
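The DNS-based failover idea from the original question could look roughly like the sketch below: a helper that builds a Route 53 UPSERT change batch pointing an A record at the node that just became active. Everything specific here is an assumption for illustration and does not come from the thread: the record name ftp.example.com, the address 192.168.10.5, the TTL, the HOSTED_ZONE_ID placeholder, and the choice of the AWS CLI.

```shell
#!/bin/sh
# make_change_batch NAME IP [TTL]
# Prints a Route 53 change batch that creates or replaces an A record.
make_change_batch() {
    name=$1
    ip=$2
    ttl=${3:-60}
    printf '{"Changes":[{"Action":"UPSERT","ResourceRecordSet":{"Name":"%s","Type":"A","TTL":%s,"ResourceRecords":[{"Value":"%s"}]}}]}\n' \
        "$name" "$ttl" "$ip"
}

# Hypothetical values; replace with the real record and this node's address.
BATCH=$(make_change_batch ftp.example.com 192.168.10.5 60)
echo "$BATCH"

# Apply only when explicitly requested and the AWS CLI is installed.
# HOSTED_ZONE_ID is a placeholder that the caller must set.
if [ "${APPLY:-no}" = yes ] && command -v aws >/dev/null 2>&1; then
    aws route53 change-resource-record-sets \
        --hosted-zone-id "$HOSTED_ZONE_ID" \
        --change-batch "$BATCH"
fi
```

A short TTL matters with this approach, because clients keep using the old address until their cached record expires.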
Re: [Pacemaker] Action from a different CRMD transition results in
Hi Andrew,

Thank you for following up. I still don't see what went wrong. From the logs, RabbitMQ was working just fine around that time until it was ordered to shut down by CRM (for the failed monitor?).

Moreover, I assume that transitions are ordered monotonically, which means that Transition ID 16048 happened before Transition ID 18014:

    16048 < 18014

According to the logs, Transition ID 16048 wasn't present in the logs dating several days before Transition ID 18014 was generated. I'll then assume that it was generated several days earlier (if that is not true, please give me a way of finding out when this transition happened; I still believe that time is of the essence in this case). Our monitor command timers are expressed in seconds. In that case, how can we say:

    "It hasn't only just acted now. Its been repeating over and over for the last few weeks or so."

My understanding is that a transition happens once and only once: it succeeds, fails or is aborted altogether. Corresponding events can repeat over and over, but each time they can only be part of a new transition. Am I missing something fundamental here?

Sorry to insist, but I have to answer this very simple question: what happened here? I'm sure you can understand my situation.

Thank you in advance for your help,
Regards,
Youssef

-Original Message-
From: pacemaker-requ...@oss.clusterlabs.org
Sent: Friday, December 14, 2012 5:37 AM
To: pacemaker@oss.clusterlabs.org
Subject: Pacemaker Digest, Vol 61, Issue 37

Message: 1
Date: Fri, 14 Dec 2012 13:32:32 +1100
From: Andrew Beekhof and...@beekhof.net
To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org
Subject: Re: [Pacemaker] Action from a different CRMD transition results in restarting services

On Fri, Dec 14, 2012 at 1:33 AM, Latrous, Youssef ylatr...@broadviewnet.com wrote:
> Andrew Beekhof and...@beekhof.net wrote:
>> 18014 is where we're up to now, 16048 is the (old) one that scheduled the recurring monitor operation. I suspect you'll find the action failed earlier in the logs and thats why it needed to be restarted. Not the best log message though :(
>
> Thanks Andrew for the quick answer. I still need more info if possible. I searched everywhere for transaction 16048 and I couldn't find a trace of it (looked for up to 5 days of logs prior to transaction 18014). It would have been good if we had timestamps for each transaction involved in this situation :-) Is there a way to find about this old transaction in any other logs (I looked into /var/log/messages on both nodes involved in this cluster)?

Its not really relevant. The only important thing is that its not one we're currently executing. What you should care about is any logs that hopefully show you why the resource failed at around Dec 6 22:55:05.

> To give you an idea of how many transactions happened during this period:
>
> TR_ID 18010 @ 21:52:16
> ...
> TR_ID 18018 @ 22:55:25
>
> Over an hour between these two events. Given this, how come such a (very) old transaction (~2000 transactions before current one) only acts now? Could it be stale information in pacemaker?

No. It hasn't only just acted now. Its been repeating over and over for the last few weeks or so. The difference is that this time it failed.

> Thanks in advance.
> Youssef

End of Pacemaker Digest, Vol 61, Issue 37
Re: [Pacemaker] wrong device in stonith_admin -l
> - Original Message -
> From: laurent+pacema...@u-picardie.fr
> To: pacemaker@oss.clusterlabs.org
> Sent: Tuesday, December 11, 2012 6:51:20 PM
> Subject: [Pacemaker] wrong device in stonith_admin -l
>
>> Hi,
>> I've just observed something weird. A node is running a stonith resource for which gethosts gives an empty node list. The result of stonith_admin -l does include it in the device list!
>>
>> Result of stonith_admin -l elasticsearch-05, run from elasticsearch-06:
>>
>> stonith-xen-peatbull
>> stonith-xen-eddu
>> 2 devices found
>>
>> stonith-xen-peatbull is a correct fencing device.
>> stonith-xen-eddu is a fencing device with an empty hostlist.
>>
>> Running my-xen0 gethosts with the stonith-xen-eddu params by hand doesn't return any host, and it does exit with 0 (is it correct to return 0 with an empty host list?).
>>
>> logs:
>> Dec 12 01:09:10 elasticsearch-06 stonith-ng[18181]: notice: stonith_device_register: Added 'stonith-cluster-xen' to the device list (6 active devices)
>> Dec 12 01:09:10 elasticsearch-06 attrd[18183]: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
>> Dec 12 01:09:10 elasticsearch-06 attrd[18183]: notice: attrd_perform_update: Sent update 5: probe_complete=true
>> Dec 12 01:09:11 elasticsearch-06 stonith-ng[18181]: notice: stonith_device_register: Added 'stonith-xen-eddu' to the device list (6 active devices)
>> Dec 12 01:09:11 elasticsearch-06 stonith-ng[18181]: notice: stonith_device_register: Added 'stonith-xen-peatbull' to the device list (6 active devices)
>> Dec 12 01:09:12 elasticsearch-06 stonith: [18434]: info: external/my-xen0-ha device OK.
>> Dec 12 01:09:12 elasticsearch-06 crmd[18185]: notice: process_lrm_event: LRM operation stonith-cluster-xen_start_0 (call=61,rc=0, cib-update=27, confirmed=true) ok
>> Dec 12 01:09:14 elasticsearch-06 stonith: [18465]: info: external_run_cmd: '/usr/lib/stonith/plugins/external/my-xen0 status' output: elasticsearch-05
>> Dec 12 01:09:14 elasticsearch-06 stonith: [18465]: info: external_run_cmd: '/usr/lib/stonith/plugins/external/my-xen0 status' output: elasticsearch-06
>> Dec 12 01:09:15 elasticsearch-06 stonith: [18465]: info: external/my-xen0 device OK.
>> Dec 12 01:09:15 elasticsearch-06 crmd[18185]: notice: process_lrm_event: LRM operation stonith-xen-peatbull_start_0 (call=68, rc=0, cib-update=28, confirmed=true) ok
>> Dec 12 01:09:15 elasticsearch-06 stonith: [18458]: info: external/my-xen0 device OK.
>> Dec 12 01:09:15 elasticsearch-06 crmd[18185]: notice: process_lrm_event: LRM operation stonith-xen-eddu_start_0 (call=66, rc=0, cib-update=29, confirmed=true) ok
>> Dec 12 01:12:34 elasticsearch-06 stonith-ng[18181]: notice: dynamic_list_search_cb: Disabling port list queries for stonith-xen-kornog (1): (null)
>> Dec 12 01:12:34 elasticsearch-06 stonith-ng[18181]: notice: dynamic_list_search_cb: Disabling port list queries for stonith-xen-nikka (1): (null)
>> Dec 12 01:12:34 elasticsearch-06 stonith-ng[18181]: notice: dynamic_list_search_cb: Disabling port list queries for stonith-xen-yoichi (1): (null)
>> Dec 12 01:12:34 elasticsearch-06 stonith: [19301]: CRIT: external_hostlist: 'my-xen0 gethosts' returned an empty hostlist
>> Dec 12 01:12:34 elasticsearch-06 stonith: [19301]: ERROR: Could not list hosts for external/my-xen0.
>> Dec 12 01:12:37 elasticsearch-06 stonith: [19332]: CRIT: external_hostlist: 'my-xen0 gethosts' returned an empty hostlist
>> Dec 12 01:12:37 elasticsearch-06 stonith: [19332]: ERROR: Could not list hosts for external/my-xen0.
>> Dec 12 01:12:37 elasticsearch-06 stonith-ng[18181]: notice: dynamic_list_search_cb: Disabling port list queries for stonith-xen-eddu (1): failed: 255
>
> We discover which hosts an agent can fence by running this command internally in stonith:
>
> # agent -o list
>
> From there we expect an exit code of 0 and the list of nodes to be in the output.
> https://fedorahosted.org/cluster/wiki/FenceAgentAPI
>
> Looking at your logs, stonith-xen-eddu is returning -1 (255) as the return code when we issue the 'list' action. That means we don't try to get the dynamic list again; we assume the 'list' action isn't supported. From there we fall back to using the 'status' action to dynamically determine whether the agent can fence a particular host. I'm guessing the 'status' action is returning true (return codes 0 or 2) for hosts you wouldn't expect the agent to be able to fence for some reason.
>
> -- Vossel

David,

I mentioned a node being wrongly fenced in the "stonith-timeout duration 0 is too low" bug. Could it be related?
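The fallback described in the quoted reply can be summarised in a small shell sketch. This is not stonith-ng's actual code: the function name, its arguments, and the whitespace-separated host list are hypothetical and only encode the rules stated above. A 'list' action that exits 0 supplies the host list; any other exit code disables list queries, and the 'status' action is then treated as success when it exits 0 or 2.

```shell
#!/bin/sh
# can_fence LIST_RC LIST_OUTPUT STATUS_RC HOST
# Returns 0 if, under the rules described above, the device would be
# considered able to fence HOST. All names here are illustrative.
can_fence() {
    list_rc=$1
    list_output=$2
    status_rc=$3
    host=$4

    if [ "$list_rc" -eq 0 ]; then
        # 'list' worked: the host has to appear in its output.
        for h in $list_output; do
            [ "$h" = "$host" ] && return 0
        done
        return 1
    fi

    # 'list' failed (for example with 255): fall back to the 'status' result.
    case $status_rc in
        0|2) return 0 ;;
        *)   return 1 ;;
    esac
}

# Example: 'list' exits 255, 'status' exits 0, so the host counts as fenceable.
if can_fence 255 "" 0 elasticsearch-05; then
    echo "elasticsearch-05: eligible via status fallback"
fi
```

Under this reading, a device whose 'list' action fails but whose 'status' action always succeeds would show up as able to fence every host, which matches the guess in the reply.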
[Pacemaker] Patrik Rapposch is out of the office
I will be out of the office starting 17.12.2012 and will return on 19.12.2012.

Please note that I am not available. Please always use ksi.netw...@knapp.com, which ensures that one of our network administrators takes care of your request.
[Pacemaker] Multi-state slave resource promoted when node was not quorate, expected?
We had a switch failure and all the nodes were partitioned. The slave node promoted its resource while it did not have quorum. We have no-quorum-policy set to freeze. Is it expected for resource promotion to occur when a node does not have quorum?

--
Jesse Hathaway, Systems Engineer
Braintree
http://getbraintree.com
917-418-8423
Re: [Pacemaker] Ordered resource is not restarting after migration if it's already started on new host
On Dec 16, 2012, at 7:29 PM, pacemaker-requ...@oss.clusterlabs.org wrote:

> Message: 5
> Date: Mon, 17 Dec 2012 14:23:15 +1100
> From: Andrew Beekhof and...@beekhof.net
> To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org
> Subject: Re: [Pacemaker] Ordered resource is not restarting after migration if it's already started on new host
>
> On Sat, Dec 15, 2012 at 10:58 AM, Neal Peters nealppet...@gmail.com wrote:
>> Hello-
>>
>> I'm running Pacemaker v. 1.1 (pacemaker-1.1.7-6.el6.x86_64) on CentOS 6.3 and am observing behavior on my systems that differs from the behavior described in the manual.
>>
>> Basically, the desired behavior (and the behavior described in Pacemaker Explained Section 6.3.1) is that when a "first" resource in an ordered set is moved to a host where the "then" resource is already running, the "then" resource will be restarted.
>>
>> From Pacemaker Explained 6.3.1 Mandatory Ordering:
>> - If the first resource is (re)started while the then resource is running, the then resource will be stopped and restarted.
>>
>> I am not seeing this behavior however. I am seeing that the "then" resource is left running.
>>
>> I have 2 servers running a fairly basic setup that is fairly close to the one described in the Clusters from Scratch document. Config follows:
>>
>> node host2
>> node host1
>> primitive ClusterIP ocf:heartbeat:IPaddr2 \
>>     params ip=192.168.0.225 cidr_netmask=32 \
>>     op monitor interval=1s \
>>     meta target-role=Started
>> primitive DNSserver lsb:named \
>>     op monitor interval=1s
>> colocation ip-with-DNSserver inf: DNSserver ClusterIP
>> order DNS-server-after-ip inf: ClusterIP DNSserver
>> property $id=cib-bootstrap-options \
>>     dc-version=1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14 \
>>     cluster-infrastructure=openais \
>>     expected-quorum-votes=2 \
>>     stonith-enabled=false \
>>     no-quorum-policy=ignore \
>>     last-lrm-refresh=1355268791
>> rsc_defaults $id=rsc-options \
>>     resource-stickiness=102
>>
>> When the DNSserver resource is migrated from one node to the other and named is already started on the other node (for whatever reason), named is not restarted.
>
> 1) Ordering constraints are behaving as expected, DNSserver is started after ClusterIP
> 2) Starting something (DNSserver) that is already started is a no-op
> 3) Don't start cluster services outside of the cluster
>
> 3 is the root problem in your case

Thank you for your prompt reply. It sounds as though Pacemaker is operating in the way that you expect in this situation.

Your description of Pacemaker behavior

    2) Starting something (DNSserver) that is already started is a no-op

differs from the behavior described in the documentation:

    - If the first resource is (re)started while the then resource is running, the then resource will be stopped and restarted.
    ( http://clusterlabs.org/doc/en-US/Pacemaker/1.1-crmsh/html/Pacemaker_Explained/s-resource-ordering.html#_mandatory_ordering Section 6.3.1)

Is there a place that I can/should report this discrepancy between actual behavior and the behavior described in the documentation?

Thank you.

>> Dec 14 15:32:28 host1 snmpd[5296]: Connection from UDP: [192.168.0.129]:51000-[192.168.0.93]
>> Dec 14 15:32:40 host1 lrmd: [8733]: info: rsc:ClusterIP:5: start
>> Dec 14 15:32:40 host1 IPaddr2(ClusterIP)[9542]: INFO: ip -f inet addr add 192.168.0.225/32 brd 192.168.0.225 dev eth1
>> Dec 14 15:32:40 host1 IPaddr2(ClusterIP)[9542]: INFO: ip link set eth1 up
>> Dec 14 15:32:40 host1 IPaddr2(ClusterIP)[9542]: INFO: /usr/lib64/heartbeat/send_arp -i 200 -r 5 -p /var/run/heartbeat/rsctmp/send_arp-192.168.0.225 eth1 192.168.0.225 auto not_used not_used
>> Dec 14 15:32:41 host1 crmd[8736]: info: process_lrm_event: LRM operation ClusterIP_start_0 (call=5, rc=0, cib-update=10, confirmed=true) ok
>> Dec 14 15:32:41 host1 lrmd: [8733]: info: rsc:ClusterIP:6: monitor
>> Dec 14 15:32:41 host1 lrmd: [8733]: info: rsc:DNSserver:7: start
>> Dec 14 15:32:41 host1 lrmd: [9601]: WARN: For LSB init script, no additional parameters are needed.
>> Dec 14 15:32:41 host1 lrmd: [8733]: info: RA output: (DNSserver:start:stdout) Starting named:
>> Dec 14 15:32:41 host1 lrmd: [8733]: info: RA output: (DNSserver:start:stdout) named: already running
>> Dec 14 15:32:41 host1 lrmd: [8733]: info: RA output: (DNSserver:start:stdout) [ OK
>> Dec 14 15:32:41 host1 lrmd: [8733]: info: RA output: (DNSserver:start:stdout) ]#015
>> Dec 14 15:32:41 host1 lrmd: [8733]: info: RA output: (DNSserver:start:stdout)
>> Dec 14 15:32:41 host1 crmd[8736]: info: process_lrm_event: LRM operation DNSserver_start_0 (call=7, rc=0, cib-update=11, confirmed=true) ok
>> Dec 14 15:32:41 host1 lrmd: [8733]: info: rsc:DNSserver:8: monitor
>> Dec 14 15:32:41 host1 crmd[8736]: info: process_lrm_event: LRM operation ClusterIP_monitor_1000
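The log above shows the named init script answering the cluster's start request with "already running" and rc=0. A minimal sketch of that convention follows. It is not the real named init script: the is_running check, the SERVICE_STATE variable and the messages are hypothetical stand-ins. The LSB init script conventions treat "start" on a service that is already running as a success, so from the cluster's point of view the start completed and nothing was restarted.

```shell
#!/bin/sh
# Hypothetical state check; a real init script would test a PID file or process.
is_running() {
    [ "${SERVICE_STATE:-stopped}" = running ]
}

# start_service: idempotent start in the style of an LSB init script.
start_service() {
    if is_running; then
        echo "service: already running"
        return 0            # success: the start request is a no-op
    fi
    SERVICE_STATE=running   # stand-in for actually launching the daemon
    echo "service: started"
    return 0
}
```

Calling start_service twice in a row returns 0 both times and only launches the service once, which is the no-op behaviour described in point 2 of the quoted reply.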
[Pacemaker] reloading crm changes
I'm just getting our cluster set up and seem to be missing something about changes made using the crm program. I added some resources and groups using crm => configure => edit. After saving and committing my changes I can see the new resources in resource => show, but they are stopped. After running start resource they are still stopped. Also, exiting and running crm_mon does *not* show the new resources. I tried a clean resource just in case, but that did not change anything either.

I thought the whole idea of the live resources was that they took effect immediately. Am I missing a step?

Paul Shannon

- Speak the truth, but leave immediately after. - Slovenian proverb
Paul Shannon paul.shan...@noaa.gov
ITO, WFO Juneau
NOAA, National Weather Service
Re: [Pacemaker] booth is the state of started on pacemaker before booth write ticket info in cib.
On Mon, 2012-12-17 at 10:40 +0900, Yuichi SEINO wrote:
> Hi Jiaju,
>
> Perhaps this problem didn't happen before the following commit.
> https://github.com/jjzhang/booth/commit/4b00d46480f45a205f2550ff0760c8b372009f7f
>
>> Currently, when all of the initialization (including loading the new ticket information) has finished, booth should be regarded as ready. So if you encounter some problem here, I guess we should improve the RA to better reflect the booth startup status, rather than moving the initialization order, since that may introduce other regressions as we have encountered before ;)
>
> I am still not sure whether we should fix the RA or booth.
>
>> I suggest adding a new function to clear the old ticket info in the CIB, and calling that function when booth has just started but before it is daemonized. So, before booth_start in the RA returns, the stale data has been cleared. What do you think about this? ;)
>
> If the CIB info is used, can you implement it? For example, booth fails over locally. Then booth needs to get the ticket from the CIB. If that problem does not occur, I can agree to it.

OK, I'll implement it ;)

Thanks,
Jiaju
[Pacemaker] Mysql configuration
I'm having a bit of trouble setting up the master/slave MySQL configuration with pacemaker. I'm using Ubuntu 10.04 LTS with the most recent resource agent package from: https://launchpad.net/~ubuntu-ha-maintainers/+archive/ppa.

When I check the status in pacemaker, my first node is successfully showing as a master and the second as a slave, and checking MySQL confirms this. However, the slave is not correctly set up with the master: the log file and position are incorrect, so it is not picking up any changes from the master.

I noticed that two lines are being added automatically to the node attributes for the slave in the config, specifying the file and position, but they are incorrect. Where are these generated from, or how can I configure things to properly detect the master log file and position?
Re: [Pacemaker] reloading crm changes
On 12/17/2012 11:29 PM, Paul Shannon - NOAA Federal wrote:
> I'm just getting our cluster set up and seem to be missing something about changes made using the crm program. I added some resources and groups using crm => configure => edit. After saving and committing my changes I can see the new resources in resource => show, but they are stopped. After running start resource they are still stopped. Also, exiting and running crm_mon does *not* show the new resources. I tried a clean resource just in case, but that did not change anything either.

By default stonith is enabled; have you configured a stonith resource? If not, resource management is disabled until you do, or until you disable stonith. You also need quorum, unless you have told the cluster to ignore it.

Regards,
Andreas

--
Need help with Pacemaker?
http://www.hastexo.com/now

> I thought the whole idea of the live resources was that they took effect immediately. Am I missing a step?
>
> Paul Shannon
Re: [Pacemaker] reloading crm changes
Andreas,

I do have no-quorum-policy=ignore set and stonith-enabled=false. Also, I do have some resources running. It's just that when I tried to add another one, I could not get it to take.

Paul

- Speak the truth, but leave immediately after. - Slovenian proverb
Paul Shannon paul.shan...@noaa.gov
ITO, WFO Juneau
NOAA, National Weather Service

On Mon, Dec 17, 2012 at 11:22 PM, Andreas Kurz andr...@hastexo.com wrote:
> By default stonith is enabled; have you configured a stonith resource? If not, resource management is disabled until you do ... or disable stonith ... and you need quorum if you don't ignore it.
>
> Regards,
> Andreas
Re: [Pacemaker] reloading crm changes
On 12/18/2012 12:58 AM, Paul Shannon - NOAA Federal wrote:
> Andreas,
>
> I do have no-quorum-policy=ignore set and stonith-enabled=false. Also, I do have some resources running. It's just that when I tried to add another one, I could not get it to take.

What does crm_mon -1frA show? And of course the logs should give all the information needed.

Regards,
Andreas

--
Need help with Pacemaker?
http://www.hastexo.com/now
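The checks raised in this thread (stonith, quorum policy, and the one-shot crm_mon view) can be gathered into a small sketch. The get_property helper is hypothetical, and it assumes that `crm configure show` prints cluster properties as name=value or name="value", as in the configurations posted to this list. The cluster commands only run when the crm shell is installed.

```shell
#!/bin/sh
# get_property NAME
# Reads `crm configure show` style text on stdin and prints the value of NAME.
# Illustrative helper only; it is not part of crmsh or Pacemaker.
get_property() {
    grep -o "$1=[^ ]*" | head -n 1 | cut -d= -f2 | tr -d '"\\'
}

if command -v crm >/dev/null 2>&1; then
    # The two properties that keep new resources from starting.
    echo "stonith-enabled:  $(crm configure show | get_property stonith-enabled)"
    echo "no-quorum-policy: $(crm configure show | get_property no-quorum-policy)"
    # One-shot status including fail counts, inactive resources and node attributes.
    crm_mon -1frA
fi
```

If both properties already have the expected values, as reported in the thread, the fail counts and inactive resources in the crm_mon output and the cluster logs are the next place to look.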
Re: [Pacemaker] Ordered resource is not restarting after migration if it's already started on new host
On Tue, Dec 18, 2012 at 6:28 AM, Neal Peters nealppet...@gmail.com wrote:
> On Dec 16, 2012, at 7:29 PM, pacemaker-requ...@oss.clusterlabs.org wrote:
>> Message: 5
>> Date: Mon, 17 Dec 2012 14:23:15 +1100
>> From: Andrew Beekhof and...@beekhof.net
>> To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org
>> Subject: Re: [Pacemaker] Ordered resource is not restarting after migration if it's already started on new host
>>
>> On Sat, Dec 15, 2012 at 10:58 AM, Neal Peters nealppet...@gmail.com wrote:
>>> Hello-
>>>
>>> I'm running Pacemaker v. 1.1 (pacemaker-1.1.7-6.el6.x86_64) on CentOS 6.3 and am observing behavior on my systems that differs from the behavior described in the manual.
>>>
>>> Basically, the desired behavior (and the behavior described in Pacemaker Explained Section 6.3.1) is that when a "first" resource in an ordered set is moved to a host where the "then" resource is already running, the "then" resource will be restarted.
>>>
>>> From Pacemaker Explained 6.3.1 Mandatory Ordering:
>>> -If the first resource is (re)started while the then resource is running, the then resource will be stopped and restarted.
>>>
>>> I am not seeing this behavior, however. I am seeing that the "then" resource is left running.
>>>
>>> I have 2 servers running a fairly basic setup that is fairly close to the one described in the Clusters from Scratch document.
>>>
>>> Config follows:
>>>
>>> node host2
>>> node host1
>>> primitive ClusterIP ocf:heartbeat:IPaddr2 \
>>>     params ip=192.168.0.225 cidr_netmask=32 \
>>>     op monitor interval=1s \
>>>     meta target-role=Started
>>> primitive DNSserver lsb:named \
>>>     op monitor interval=1s
>>> colocation ip-with-DNSserver inf: DNSserver ClusterIP
>>> order DNS-server-after-ip inf: ClusterIP DNSserver
>>> property $id=cib-bootstrap-options \
>>>     dc-version=1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14 \
>>>     cluster-infrastructure=openais \
>>>     expected-quorum-votes=2 \
>>>     stonith-enabled=false \
>>>     no-quorum-policy=ignore \
>>>     last-lrm-refresh=1355268791
>>> rsc_defaults $id=rsc-options \
>>>     resource-stickiness=102
>>>
>>> When the DNSserver resource is migrated from one node to the other and named is already started on the other node (for whatever reason), named is not restarted.
>>
>> 1) Ordering constraints are behaving as expected, DNSserver is started after ClusterIP
>> 2) Starting something (DNSserver) that is already started is a no-op
>> 3) Don't start cluster services outside of the cluster
>>
>> 3 is the root problem in your case
>
> Thank you for your prompt reply. It sounds as though Pacemaker is operating in the way that you expect in this situation.
>
> Your description of Pacemaker behavior
>
>> 2) Starting something (DNSserver) that is already started is a no-op
>
> differs from the behavior described in the documentation

No, it doesn't.
The cluster _is_ trying to start the resource (we stopped it on the old host and are trying to start it on the new one); however, the named init script is simply ignoring the request because named is already running.

Also, this behaviour by the named script is mandated by the LSB standard.
Which is why I said #3 was the problem you need to fix.

> -If the first resource is (re)started while the then resource is running, the then resource will be stopped and restarted.
> ( http://clusterlabs.org/doc/en-US/Pacemaker/1.1-crmsh/html/Pacemaker_Explained/s-resource-ordering.html#_mandatory_ordering Section 6.3.1)
>
> Is there a place that I can/should report this discrepancy between actual behavior and behavior described in the documentation?
>
> Thank you.
>
>>> Dec 14 15:32:28 host1 snmpd[5296]: Connection from UDP: [192.168.0.129]:51000-[192.168.0.93]
>>> Dec 14 15:32:40 host1 lrmd: [8733]: info: rsc:ClusterIP:5: start
>>> Dec 14 15:32:40 host1 IPaddr2(ClusterIP)[9542]: INFO: ip -f inet addr add 192.168.0.225/32 brd 192.168.0.225 dev eth1
>>> Dec 14 15:32:40 host1 IPaddr2(ClusterIP)[9542]: INFO: ip link set eth1 up
>>> Dec 14 15:32:40 host1 IPaddr2(ClusterIP)[9542]: INFO: /usr/lib64/heartbeat/send_arp -i 200 -r 5 -p /var/run/heartbeat/rsctmp/send_arp-192.168.0.225 eth1 192.168.0.225 auto not_used not_used
>>> Dec 14 15:32:41 host1 crmd[8736]: info: process_lrm_event: LRM operation ClusterIP_start_0 (call=5, rc=0, cib-update=10, confirmed=true) ok
>>> Dec 14 15:32:41 host1 lrmd: [8733]: info: rsc:ClusterIP:6: monitor
>>> Dec 14 15:32:41 host1 lrmd: [8733]: info: rsc:DNSserver:7: start
>>> Dec 14 15:32:41 host1 lrmd: [9601]: WARN: For LSB init script, no additional parameters are needed.
>>> Dec 14 15:32:41 host1 lrmd: [8733]: info: RA output: (DNSserver:start:stdout) Starting named:
>>> Dec 14 15:32:41 host1 lrmd: [8733]: info: RA output: (DNSserver:start:stdout) named: already running
>>> Dec 14 15:32:41 host1 lrmd: [8733]: info: RA output: (DNSserver:start:stdout) [ OK
>>> Dec 14 15:32:41 host1 lrmd: [8733]: info:
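
The LSB point made in the reply above can be illustrated with a minimal sketch. This is not the real named init script: `svc_start`, `svc_status`, `svc_stop` and the PID-file path are hypothetical stand-ins, and a PID file in a temp directory plays the role of the running daemon. It only shows why a `start` request against an already-running service reports success without restarting anything, which is what the cluster sees when named was started outside of its control.

```shell
#!/bin/sh
# Hypothetical stand-in for an LSB init script; NOT the real /etc/init.d/named.
# The PID file below simulates "the daemon is running".
PIDFILE="${TMPDIR:-/tmp}/lsb-sketch.$$.pid"

svc_status() {
    # LSB status exit codes: 0 = running, 3 = not running.
    if [ -f "$PIDFILE" ]; then
        return 0
    fi
    return 3
}

svc_start() {
    if svc_status; then
        # LSB: "start" on an already-running service is a successful no-op.
        echo "already running"
        return 0
    fi
    echo "$$" > "$PIDFILE"
    echo "started"
    return 0
}

svc_stop() {
    # LSB: "stop" on an already-stopped service also succeeds.
    rm -f "$PIDFILE"
    return 0
}

first=$(svc_start)    # service was stopped, so it really starts
second=$(svc_start)   # service is running, so nothing is restarted
echo "first start:  $first"
echo "second start: $second"
svc_stop
```

Both calls exit with 0, so a cluster manager issuing the second `start` cannot tell that no restart took place. That is why the advice in the thread is to keep cluster-managed services from being started outside the cluster.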
Re: [Pacemaker] Multi-state slave resource promoted when node was not quorate, expected?
On Tue, Dec 18, 2012 at 5:55 AM, Jesse Hathaway jesse.hatha...@getbraintree.com wrote:
> We had a switch failure and all the nodes were partitioned. The slave node promoted its resource while it did not have quorum. We have no-quorum-policy set to freeze. Is it expected for resource promotion to occur when a node does not have quorum?

No. That sounds like a bug.
Can you attach a crm_report tarball to a bugzilla entry please?
Re: [Pacemaker] Action from a different CRMD transition results in
On Tue, Dec 18, 2012 at 1:39 AM, Latrous, Youssef ylatr...@broadviewnet.com wrote:
> Hi Andrew,
>
> Thank you for following up.
>
> I still don't see what went wrong. From the logs, RabbitMQ was working just fine around that time until it was ordered to shut down by CRM (for the failed monitor?).

Apparently not, otherwise the monitor would not have reported a failure.
Something went wrong, either in the resource script or in RabbitMQ itself.

> Moreover, I assume that transitions are ordered monotonically, which means that Transition ID 16048 happened before Transition ID 18014:
>
> 16048 < 18014
>
> According to the logs, Transition ID 16048 wasn't present in the logs dating several days before transition ID 18014 was generated. I'll then assume that it was generated several days ago (if not true, please give me a way of finding out when this transition happened; I still believe that time is of the essence in this case). Our monitor command timers are expressed in seconds. In that case, how can we say:
>
>> It hasn't only just acted now. It's been repeating over and over for the last few weeks or so.

Because that's how it's designed; that's what recurring monitors do. The lrmd schedules them to run over and over every N seconds, and the lrmd lets us know when something changes.

> My understanding is that a transition happens once and only once: it succeeds, fails or is aborted altogether.

No. Corresponding events can repeat over and over, but each time can only be part of a new transition.

> Am I missing something fundamental here?

Yes. See above.

> Sorry to insist, but I have to answer this very simple question: What did happen here?

Your resource or resource agent had a problem.
More than that I can't say because I don't have access to your logs.

> I'm sure you can understand my situation here.
>
> Thank you in advance for your help,
>
> Regards,
> Youssef
>
> -----Original Message-----
> From: pacemaker-requ...@oss.clusterlabs.org [mailto:pacemaker-requ...@oss.clusterlabs.org]
> Sent: Friday, December 14, 2012 5:37 AM
> To: pacemaker@oss.clusterlabs.org
> Subject: Pacemaker Digest, Vol 61, Issue 37
>
> Message: 1
> Date: Fri, 14 Dec 2012 13:32:32 +1100
> From: Andrew Beekhof and...@beekhof.net
> To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org
> Subject: Re: [Pacemaker] Action from a different CRMD transition results in restarting services
>
> On Fri, Dec 14, 2012 at 1:33 AM, Latrous, Youssef ylatr...@broadviewnet.com wrote:
>> Andrew Beekhof and...@beekhof.net wrote:
>>> 18014 is where we're up to now, 16048 is the (old) one that scheduled the recurring monitor operation.
>>> I suspect you'll find the action failed earlier in the logs and that's why it needed to be restarted.
>>> Not the best log message though :(
>>
>> Thanks Andrew for the quick answer. I still need more info if possible.
>>
>> I searched everywhere for transaction 16048 and I couldn't find a trace of it (looked for up to 5 days of logs prior to transaction 18014). It would have been good if we had timestamps for each transaction involved in this situation :-)
>>
>> Is there a way to find out about this old transaction in any other logs (I looked into /var/log/messages on both nodes involved in this cluster)?
>
> It's not really relevant.
> The only important thing is that it's not one we're currently executing.
>
> What you should care about is any logs that hopefully show you why the resource failed at around Dec 6 22:55:05.
>
>> To give you an idea of how many transactions happened during this period:
>>
>> TR_ID 18010 @ 21:52:16
>> ...
>> TR_ID 18018 @ 22:55:25
>>
>> Over an hour between these two events.
>>
>> Given this, how come such a (very) old transaction (~2000 transactions before current one) only acts now? Could it be stale information in
Re: [Pacemaker] Pacemaker stop behaviour when underlying resource is unavailable
On Fri, Dec 14, 2012 at 9:32 PM, pavan tc pavan...@gmail.com wrote:
> Hi,
>
> I have structured my multi-state resource agent as below when the underlying resource becomes unavailable for some reason:
>
> monitor() {
>     state=get_primitive_resource_state()
>     ...
>     ...
>     if ($state == unavailable)
>         return $OCF_NOT_RUNNING
>     ...
>     ...
> }
>
> stop() {
>     monitor()
>     ret=$?
>     if ($ret == $OCF_NOT_RUNNING)
>         return $OCF_SUCCESS
> }
>
> start() {
>     start_primitive()
>     if (start_primitive_failure)
>         return $OCF_ERR_GENERIC
> }
>
> The idea is to make sure that stop does not fail when the underlying resource goes away. (Otherwise I see that the resource gets to an unmanaged state.) Also, the expectation is that when the resource comes back, it joins the cluster without much fuss.
>
> What I see is that pacemaker calls stop twice

That would not be expected. Bug?

> and if it finds that stop returns success, it does not continue with monitor any more. I also do not see an attempt to start.

Anywhere? Or just on the same node?

> Is there a way to keep the monitor going in such circumstances?

Not really.
You can define a recurring monitor for the Stopped role though.

But why would it come back? You _really_ should not be starting services outside of the cluster, not least of all because we've probably started it somewhere else in the meantime.

> Am I using incorrect resource agent return codes?
>
> Thanks,
> Pavan
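
As a rough illustration of the pseudocode in the question, the same structure can be written as runnable shell. This is a sketch under stated assumptions, not the poster's agent: `get_primitive_resource_state`, the `STATE` and `START_OK` variables and the `agent_*` function names are hypothetical placeholders for whatever really probes and controls the underlying resource. Only the numeric return codes are taken from the OCF resource agent API.

```shell
#!/bin/sh
# OCF return codes as defined by the OCF resource agent API.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_NOT_RUNNING=7

# Hypothetical probe; a real agent would query the underlying resource here.
# STATE is an illustrative variable, not something Pacemaker provides.
get_primitive_resource_state() {
    echo "${STATE:-unavailable}"
}

agent_monitor() {
    state=$(get_primitive_resource_state)
    if [ "$state" = "unavailable" ]; then
        return $OCF_NOT_RUNNING
    fi
    return $OCF_SUCCESS
}

agent_stop() {
    ret=0
    agent_monitor || ret=$?
    if [ "$ret" -eq "$OCF_NOT_RUNNING" ]; then
        # Already stopped: stop must still report success.
        return $OCF_SUCCESS
    fi
    # Hypothetical: a real agent would stop the service here.
    STATE=unavailable
    return $OCF_SUCCESS
}

agent_start() {
    # Hypothetical start step; START_OK=0 simulates a failed start.
    if [ "${START_OK:-1}" -ne 1 ]; then
        return $OCF_ERR_GENERIC
    fi
    STATE=running
    return $OCF_SUCCESS
}

# Simulate the situation from the question: the resource has gone away.
STATE=unavailable
mon_rc=0;  agent_monitor || mon_rc=$?
stop_rc=0; agent_stop    || stop_rc=$?
echo "monitor=$mon_rc stop=$stop_rc"
```

With the resource unavailable, the monitor reports 7 (`OCF_NOT_RUNNING`) and the stop reports 0 (`OCF_SUCCESS`). A full multi-state agent would also need to distinguish the master role in its monitor (the OCF API reserves return code 8, `OCF_RUNNING_MASTER`, for that), which this sketch leaves out.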
Re: [Pacemaker] Pacemaker stop behaviour when underlying resource is unavailable
> [..]
>> The idea is to make sure that stop does not fail when the underlying resource goes away. (Otherwise I see that the resource gets to an unmanaged state.) Also, the expectation is that when the resource comes back, it joins the cluster without much fuss.
>>
>> What I see is that pacemaker calls stop twice
>
> That would not be expected. Bug?

Are you pointing at stop getting called 'twice'? If yes, I will confirm once more about the behaviour and will raise a bug.

>> and if it finds that stop returns success, it does not continue with monitor any more. I also do not see an attempt to start.
>
> Anywhere? Or just on the same node?

On the same node. The resource does get promoted on the other node. My expectation was that if I kept returning OCF_NOT_RUNNING in monitor, then it should attempt a start-stop-monitor cycle till the resource came back. It seems this is not what the cluster manager does?

>> Is there a way to keep the monitor going in such circumstances?
>
> Not really.
> You can define a recurring monitor for the Stopped role though.

I did not want to go there if I could achieve it via the usual mechanisms. If that is not possible, I will explore this option in more detail.

> But why would it come back? You _really_ should not be starting services outside of the cluster, not least of all because we've probably started it somewhere else in the meantime.

Even if we started the resource elsewhere, we are running in degraded mode. (My bad, I did not mention this is a _two-node_ multi-state resource.) We would like to come back to the available mode as early as possible and with the least amount of manual intervention with the cluster.

Pavan

>> Am I using incorrect resource agent return codes?
>>
>> Thanks,
>> Pavan
Re: [Pacemaker] Pacemaker stop behaviour when underlying resource is unavailable
On Tue, Dec 18, 2012 at 4:24 PM, pavan tc pavan...@gmail.com wrote:
> [..]
>>> The idea is to make sure that stop does not fail when the underlying resource goes away. (Otherwise I see that the resource gets to an unmanaged state.) Also, the expectation is that when the resource comes back, it joins the cluster without much fuss.
>>>
>>> What I see is that pacemaker calls stop twice
>>
>> That would not be expected. Bug?
>
> Are you pointing at stop getting called 'twice'?

Correct

> If yes, I will confirm once more about the behaviour and will raise a bug.
>
>>> and if it finds that stop returns success, it does not continue with monitor any more. I also do not see an attempt to start.
>>
>> Anywhere? Or just on the same node?
>
> On the same node. The resource does get promoted on the other node. My expectation was that if I kept returning OCF_NOT_RUNNING in monitor, then it should attempt a start-stop-monitor cycle till the resource came back. It seems this is not what the cluster manager does?

Not always, it very much depends on the constraints you've defined and things like migration-threshold.

>>> Is there a way to keep the monitor going in such circumstances?
>>
>> Not really.
>> You can define a recurring monitor for the Stopped role though.
>
> I did not want to go there if I could achieve it via the usual mechanisms.

If you want to monitor a resource on a node that it's not running on, that _is_ the usual mechanism.
The thing is that it's an unusual thing to want to do.

> If that is not possible, I will explore this option in more detail.
>
>> But why would it come back? You _really_ should not be starting services outside of the cluster, not least of all because we've probably started it somewhere else in the meantime.
>
> Even if we started the resource elsewhere, we are running in degraded mode.

Not on the node for which you returned stopped. There you are just flat-out not running at all.

> (My bad, I did not mention this is a _two-node_ multi-state resource.) We would like to come back to the available mode as early as possible and with the least amount of manual intervention with the cluster.

Normally I wouldn't expect any manual intervention either, but I really can't comment further without seeing logs and configs.
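
For readers wondering what "a recurring monitor for the Stopped role" might look like, here is a hedged sketch. The script only prints a crm shell configuration fragment and does not talk to a cluster. The resource name `myres`, the choice of the `ocf:pacemaker:Stateful` agent, the intervals and the `ms` meta attributes are all illustrative assumptions, not the poster's configuration. The fragment uses a different interval for each role because Pacemaker expects recurring monitors on the same resource to have distinct intervals.

```shell
#!/bin/sh
# Prints an illustrative crm shell fragment; nothing is applied to a cluster.
# Resource name, agent and intervals are assumptions made for this sketch.
print_stopped_monitor_config() {
    rsc="$1"
    cat <<EOF
primitive $rsc ocf:pacemaker:Stateful \\
    op monitor interval=10s role=Master \\
    op monitor interval=11s role=Slave \\
    op monitor interval=12s role=Stopped
ms ms-$rsc $rsc \\
    meta master-max=1 clone-max=2 notify=true
EOF
}

config=$(print_stopped_monitor_config myres)
echo "$config"
```

The `role=Stopped` operation asks the cluster to keep probing the resource on nodes where it is not supposed to be running. How the fragment would be loaded (for example through `crm configure`) depends on the installation and is left out here.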