[Pacemaker] rejoin failure
Hi, we have a 2-node corosync 1.4.2 / pacemaker 1.1.7 cluster running DRBD, NFS, Solr and Redis in master/slave configurations. Currently node 2 is unable to rejoin the cluster after being fenced by STONITH. The logs on node 2:

Dec 15 01:52:38 www2 cib: [6705]: info: ais_dispatch_message: Membership 0: quorum still lost
Dec 15 01:52:38 www2 cib: [6705]: info: crm_update_peer: Node www2: id=33610762 state=member (new) addr=(null) votes=1 (new) born=0 seen=0 proc=00111312 (new)
Dec 15 01:52:38 www2 stonith-ng: [6706]: info: get_ais_nodeid: Server details: id=33610762 uname=www2 cname=pcmk
Dec 15 01:52:38 www2 stonith-ng: [6706]: info: init_ais_connection_once: Connection to 'classic openais (with plugin)': established
Dec 15 01:52:38 www2 stonith-ng: [6706]: info: crm_new_peer: Node www2 now has id: 33610762
Dec 15 01:52:38 www2 stonith-ng: [6706]: info: crm_new_peer: Node 33610762 is now known as www2
Dec 15 01:52:38 www2 attrd: [6708]: notice: main: Starting mainloop...
Dec 15 01:52:38 www2 stonith-ng: [6706]: notice: setup_cib: Watching for stonith topology changes
Dec 15 01:52:38 www2 stonith-ng: [6706]: info: main: Starting stonith-ng mainloop
Dec 15 01:52:38 www2 corosync[6682]: [TOTEM ] Incrementing problem counter for seqid 1 iface 46.248.167.141 to [1 of 10]
Dec 15 01:52:38 www2 corosync[6682]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 11800: memb=0, new=0, lost=0
Dec 15 01:52:38 www2 corosync[6682]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 11800: memb=1, new=1, lost=0
Dec 15 01:52:38 www2 corosync[6682]: [pcmk ] info: pcmk_peer_update: NEW: www2 33610762
Dec 15 01:52:38 www2 corosync[6682]: [pcmk ] info: pcmk_peer_update: MEMB: www2 33610762
Dec 15 01:52:38 www2 corosync[6682]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Dec 15 01:52:38 www2 corosync[6682]: [CPG ] chosen downlist: sender r(0) ip(10.220.0.2) r(1) ip(46.248.167.141) ; members(old:0 left:0)
Dec 15 01:52:38 www2 corosync[6682]: [MAIN ] Completed service synchronization, ready to provide service.
Dec 15 01:52:38 www2 corosync[6682]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 11804: memb=1, new=0, lost=0
Dec 15 01:52:38 www2 corosync[6682]: [pcmk ] info: pcmk_peer_update: memb: www2 33610762
Dec 15 01:52:38 www2 corosync[6682]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 11804: memb=2, new=1, lost=0
Dec 15 01:52:38 www2 corosync[6682]: [pcmk ] info: update_member: Creating entry for node 16833546 born on 11804
Dec 15 01:52:38 www2 corosync[6682]: [pcmk ] info: update_member: Node 16833546/unknown is now: member
Dec 15 01:52:38 www2 corosync[6682]: [pcmk ] info: pcmk_peer_update: NEW: .pending. 16833546
Dec 15 01:52:38 www2 corosync[6682]: [pcmk ] info: pcmk_peer_update: MEMB: .pending. 16833546
Dec 15 01:52:38 www2 corosync[6682]: [pcmk ] info: pcmk_peer_update: MEMB: www2 33610762
Dec 15 01:52:38 www2 corosync[6682]: [pcmk ] info: send_member_notification: Sending membership update 11804 to 1 children
Dec 15 01:52:38 www2 corosync[6682]: [pcmk ] info: update_member: 0x200cdc0 Node 33610762 ((null)) born on: 11804
Dec 15 01:52:38 www2 corosync[6682]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Dec 15 01:52:38 www2 cib: [6705]: info: ais_dispatch_message: Membership 11804: quorum still lost
Dec 15 01:52:38 www2 cib: [6705]: info: crm_new_peer: Node now has id: 16833546
Dec 15 01:52:38 www2 cib: [6705]: info: crm_update_peer: Node (null): id=16833546 state=member (new) addr=r(0) ip(10.220.0.1) r(1) ip(46.248.167.140) votes=0 born=0 seen=11804 proc=
Dec 15 01:52:38 www2 cib: [6705]: info: crm_update_peer: Node www2: id=33610762 state=member addr=r(0) ip(10.220.0.2) r(1) ip(46.248.167.141) (new) votes=1 born=0 seen=11804 proc=00111312
Dec 15 01:52:38 www2 corosync[6682]: [pcmk ] info: update_member: 0x20157a0 Node 16833546 (www1) born on: 11708
Dec 15 01:52:38 www2 corosync[6682]: [pcmk ] info: update_member: 0x20157a0 Node 16833546 now known as www1 (was: (null))
Dec 15 01:52:38 www2 corosync[6682]: [pcmk ] info: update_member: Node www1 now has process list: 00111312 (1118994)
Dec 15 01:52:38 www2 corosync[6682]: [pcmk ] info: update_member: Node www1 now has 1 quorum votes (was 0)
Dec 15 01:52:38 www2 corosync[6682]: [pcmk ] info: send_member_notification: Sending membership update 11804 to 1 children
Dec 15 01:52:38 www2 cib: [6705]: notice: ais_dispatch_message: Membership 11804: quorum acquired
Dec 15 01:52:38 www2 cib: [6705]: info: crm_get_peer: Node 16833546 is now known as www1
Dec 15 01:52:38 www2 cib: [6705]: info: crm_update_peer: Node www1: id=16833546 state=member addr=r(0) ip(10.220.0.1) r(1) ip(46.248.167.140) votes=1 (new) born=11708 seen=11804 proc=000
[Pacemaker] Ordered resource is not restarting after migration if it's already started on new host
Hello-

I'm running Pacemaker 1.1 (pacemaker-1.1.7-6.el6.x86_64) on CentOS 6.3 and am observing behavior on my systems that differs from the behavior described in the manual. The desired behavior (and the behavior described in Pacemaker Explained, Section 6.3.1) is that when the "first" resource in an ordered set is moved to a host where the "then" resource is already running, the "then" resource will be restarted.

From Pacemaker Explained 6.3.1, Mandatory Ordering:

- If the first resource is (re)started while the then resource is running, the then resource will be stopped and restarted.

I am not seeing this behavior, however: the "then" resource is left running. I have 2 servers running a fairly basic setup that is quite close to the one described in the Clusters from Scratch document. Config follows:

node host2
node host1
primitive ClusterIP ocf:heartbeat:IPaddr2 \
	params ip="192.168.0.225" cidr_netmask="32" \
	op monitor interval="1s" \
	meta target-role="Started"
primitive DNSserver lsb:named \
	op monitor interval="1s"
colocation ip-with-DNSserver inf: DNSserver ClusterIP
order DNS-server-after-ip inf: ClusterIP DNSserver
property $id="cib-bootstrap-options" \
	dc-version="1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14" \
	cluster-infrastructure="openais" \
	expected-quorum-votes="2" \
	stonith-enabled="false" \
	no-quorum-policy="ignore" \
	last-lrm-refresh="1355268791"
rsc_defaults $id="rsc-options" \
	resource-stickiness="102"

When the DNSserver resource is migrated from one node to the other and named is already started on the other node (for whatever reason), named is not restarted:

Dec 14 15:32:28 host1 snmpd[5296]: Connection from UDP: [192.168.0.129]:51000->[192.168.0.93]
Dec 14 15:32:40 host1 lrmd: [8733]: info: rsc:ClusterIP:5: start
Dec 14 15:32:40 host1 IPaddr2(ClusterIP)[9542]: INFO: ip -f inet addr add 192.168.0.225/32 brd 192.168.0.225 dev eth1
Dec 14 15:32:40 host1 IPaddr2(ClusterIP)[9542]: INFO: ip link set eth1 up
Dec 14 15:32:40 host1 IPaddr2(ClusterIP)[9542]: INFO: /usr/lib64/heartbeat/send_arp -i 200 -r 5 -p /var/run/heartbeat/rsctmp/send_arp-192.168.0.225 eth1 192.168.0.225 auto not_used not_used
Dec 14 15:32:41 host1 crmd[8736]: info: process_lrm_event: LRM operation ClusterIP_start_0 (call=5, rc=0, cib-update=10, confirmed=true) ok
Dec 14 15:32:41 host1 lrmd: [8733]: info: rsc:ClusterIP:6: monitor
Dec 14 15:32:41 host1 lrmd: [8733]: info: rsc:DNSserver:7: start
Dec 14 15:32:41 host1 lrmd: [9601]: WARN: For LSB init script, no additional parameters are needed.
Dec 14 15:32:41 host1 lrmd: [8733]: info: RA output: (DNSserver:start:stdout) Starting named:
Dec 14 15:32:41 host1 lrmd: [8733]: info: RA output: (DNSserver:start:stdout) named: already running
Dec 14 15:32:41 host1 lrmd: [8733]: info: RA output: (DNSserver:start:stdout) [ OK ]
Dec 14 15:32:41 host1 crmd[8736]: info: process_lrm_event: LRM operation DNSserver_start_0 (call=7, rc=0, cib-update=11, confirmed=true) ok
Dec 14 15:32:41 host1 lrmd: [8733]: info: rsc:DNSserver:8: monitor
Dec 14 15:32:41 host1 crmd[8736]: info: process_lrm_event: LRM operation ClusterIP_monitor_1000 (call=6, rc=0, cib-update=12, confirmed=false) ok
Dec 14 15:32:41 host1 crmd[8736]: info: process_lrm_event: LRM operation DNSserver_monitor_1000 (call=8, rc=0, cib-update=13, confirmed=false) ok

Are there errors in my config that are keeping the restart from happening? Thanks in advance.

-Neal

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
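One detail worth noting in the log above: named reports "already running ... [ OK ]" for the start that Pacemaker issued, so as far as Pacemaker knows its own start succeeded and there is nothing to restart. Pacemaker's probes and restart logic depend on the init script being LSB-compliant, in particular on "status" returning 0 when running and 3 when stopped. A quick manual check can be sketched like this (a sketch only; the helper name and paths are not from the original post):

```shell
# Check whether an LSB init script reports compliant "status" exit codes.
# Pacemaker relies on these codes to learn the real state of a resource
# before deciding whether a stop/start cycle is needed.
check_lsb_status() {
    script="$1"
    "$script" status >/dev/null 2>&1
    rc=$?
    case $rc in
        0) echo "running" ;;
        3) echo "stopped" ;;
        *) echo "non-LSB status exit code: $rc" ;;
    esac
}
```

For example, `check_lsb_status /etc/init.d/named` run on a node where named is stopped should print "stopped"; any other answer means Pacemaker's probes cannot reliably tell whether the "then" resource is running.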
Re: [Pacemaker] wrong device in stonith_admin -l
Andrew Beekhof writes:
> On Wed, Dec 12, 2012 at 11:51 AM, wrote:
>>
>> Hi,
>>
>> I've just observed something weird.
>> A node is running a stonith resource for which gethosts gives an empty
>> node list. The result of stonith_admin -l does include it in the
>> device list!
>>
>> Result of "stonith_admin -l elasticsearch-05" run from elasticsearch-06:
>>
>> stonith-xen-peatbull
>> stonith-xen-eddu
>> 2 devices found
>>
>> stonith-xen-peatbull is a correct fencing device.
>> stonith-xen-eddu is a fencing device with an empty hostlist.
>>
>> Running "my-xen0 gethosts" with the stonith-xen-eddu params by hand
>> doesn't return any host, and it exits with 0 (is it correct to
>> return 0 with an empty host list?)
>>
>> Logs:
>>
>> Dec 12 01:09:10 elasticsearch-06 stonith-ng[18181]: notice: stonith_device_register: Added 'stonith-cluster-xen' to the device list (6 active devices)
>> Dec 12 01:09:10 elasticsearch-06 attrd[18183]: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
>> Dec 12 01:09:10 elasticsearch-06 attrd[18183]: notice: attrd_perform_update: Sent update 5: probe_complete=true
>> Dec 12 01:09:11 elasticsearch-06 stonith-ng[18181]: notice: stonith_device_register: Added 'stonith-xen-eddu' to the device list (6 active devices)
>> Dec 12 01:09:11 elasticsearch-06 stonith-ng[18181]: notice: stonith_device_register: Added 'stonith-xen-peatbull' to the device list (6 active devices)
>> Dec 12 01:09:12 elasticsearch-06 stonith: [18434]: info: external/my-xen0-ha device OK.
>> Dec 12 01:09:12 elasticsearch-06 crmd[18185]: notice: process_lrm_event: LRM operation stonith-cluster-xen_start_0 (call=61, rc=0, cib-update=27, confirmed=true) ok
>> Dec 12 01:09:14 elasticsearch-06 stonith: [18465]: info: external_run_cmd: '/usr/lib/stonith/plugins/external/my-xen0 status' output: elasticsearch-05
>> Dec 12 01:09:14 elasticsearch-06 stonith: [18465]: info: external_run_cmd: '/usr/lib/stonith/plugins/external/my-xen0 status' output: elasticsearch-06
>> Dec 12 01:09:15 elasticsearch-06 stonith: [18465]: info: external/my-xen0 device OK.
>> Dec 12 01:09:15 elasticsearch-06 crmd[18185]: notice: process_lrm_event: LRM operation stonith-xen-peatbull_start_0 (call=68, rc=0, cib-update=28, confirmed=true) ok
>> Dec 12 01:09:15 elasticsearch-06 stonith: [18458]: info: external/my-xen0 device OK.
>> Dec 12 01:09:15 elasticsearch-06 crmd[18185]: notice: process_lrm_event: LRM operation stonith-xen-eddu_start_0 (call=66, rc=0, cib-update=29, confirmed=true) ok
>> Dec 12 01:12:34 elasticsearch-06 stonith-ng[18181]: notice: dynamic_list_search_cb: Disabling port list queries for stonith-xen-kornog (1): (null)
>> Dec 12 01:12:34 elasticsearch-06 stonith-ng[18181]: notice: dynamic_list_search_cb: Disabling port list queries for stonith-xen-nikka (1): (null)
>> Dec 12 01:12:34 elasticsearch-06 stonith-ng[18181]: notice: dynamic_list_search_cb: Disabling port list queries for stonith-xen-yoichi (1): (null)
>> Dec 12 01:12:34 elasticsearch-06 stonith: [19301]: CRIT: external_hostlist: 'my-xen0 gethosts' returned an empty hostlist
>> Dec 12 01:12:34 elasticsearch-06 stonith: [19301]: ERROR: Could not list hosts for external/my-xen0.
>> Dec 12 01:12:37 elasticsearch-06 stonith: [19332]: CRIT: external_hostlist: 'my-xen0 gethosts' returned an empty hostlist
>> Dec 12 01:12:37 elasticsearch-06 stonith: [19332]: ERROR: Could not list hosts for external/my-xen0.
>> Dec 12 01:12:37 elasticsearch-06 stonith-ng[18181]: notice: dynamic_list_search_cb: Disabling port list queries for stonith-xen-eddu (1): failed: 255
>>
>> David, I mentioned a node being wrongly fenced in the "stonith-timeout
>> duration 0 is too low" bug, could it be related?

Hi,

> Doubtful, what does your config look like?

I've restarted from scratch with a simpler setup:

primitive dummy_01 ocf:heartbeat:Dummy \
	meta allow-migrate="true" \
	op monitor interval="180" timeout="20"
primitive stonith-xen-eddu stonith:external/my-xen0 \
	params hostlist="elasticsearch-01 elasticsearch-02 elasticsearch-03 elasticsearch-04" dom0="eddu"
clone clone-stonith-xen-eddu stonith-xen-eddu \
	meta clone-max="3" clone-node-max="1"
location clone-stonith-xen-eddu-location-01 clone-stonith-xen-eddu \
	rule $id="clone-stonith-xen-eddu-location-01-rule" inf: defined #uname
location dummy_01-location-01 dummy_01 \
	rule $id="dummy_01-location-01-rule" inf: defined #uname
property $id="cib-bootstrap-options" \
	dc-version="1.1.8-56429db" \
	cluster-infrastructure="corosync" \
	stonith-timeout="120" \
	symmetric-cluster="false" \
	no-quorum-policy="stop" \
	stonith-enabled="true"

There are 6 nodes: elasticsearch-01 ... 06.

AFAIK pcmk_host_check defaults to "dynamic-list". When the external stonith
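The empty-hostlist question raised above can be made concrete. A plugin whose gethosts handler exits 0 while printing nothing looks "successful" until a port-list query actually fails. A hedged sketch of a handler that refuses to do that (list_controlled_domains is a hypothetical helper standing in for however the real my-xen0 plugin enumerates fenceable guests; it is not part of that plugin):

```shell
# Sketch only: a gethosts handler for an external stonith plugin that
# treats an empty hostlist as an error instead of exiting 0.
list_controlled_domains() {
    # Hypothetical stand-in, driven by an env var so the logic can be tested.
    echo "${CONTROLLED_DOMAINS:-}"
}

gethosts() {
    hosts=$(list_controlled_domains)
    if [ -z "$hosts" ]; then
        echo "gethosts: empty hostlist" >&2
        return 1   # non-zero, so the failure is visible to the caller at once
    fi
    echo "$hosts"
    return 0
}
```

Whether stonithd should itself treat "exit 0 with empty output" as a failure is the open question in this thread; the sketch just shows how a plugin can avoid the ambiguity.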
Re: [Pacemaker] Suggestion to improve movement of booth
Hi Jiaju,

Thanks for the reply.

2012/12/12 Jiaju Zhang:
> This is what I wanted to do as well;) That is to say, the lease should
> keep renewing on the original site successfully unless it was down.
> The current implementation is to let the original site renew the ticket
> before the ticket lease expires (only when the lease expires will the
> ticket be revoked); hence, before other sites try to acquire the ticket,
> the original site has renewed the ticket already, so the result is that
> the ticket is still on that site.
>
> I don't quite understand your problem here. Is it that the lease is not
> being kept on the original site?

When the lease is re-acquired, the ticket is temporarily revoked, and that temporary revoke triggers the loss-policy. For example, when loss-policy is "fence", the node where the resource is started gets STONITHed by this temporary revoke. Service then stops until the resource is restarted on another node or site. I would like to prevent this behavior.

My modification is that, while the original site keeps renewing, the other sites are not allowed to proceed with the prepare/promise phase for the ticket. I think this change avoids the unnecessary revoke.

Regards,
Yusuke

--
METRO SYSTEMS CO., LTD
Yusuke Iida
Mail: yusk.i...@gmail.com
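For context, the loss-policy that reacts to the revoke is the one configured on the ticket constraint for the protected resource; a minimal crm shell sketch (ticketA and rsc1 are placeholder names, not from this thread):

```
rsc_ticket ticketA-req-rsc1 ticketA: rsc1 loss-policy="fence"
```

With loss-policy="fence", any revoke of ticketA, even a transient one during re-acquisition, fences the nodes running rsc1, which is exactly the behavior described above.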
[Pacemaker] Pacemaker stop behaviour when underlying resource is unavailable
Hi,

I have structured my multi-state resource agent as below, to handle the case where the underlying resource becomes unavailable for some reason:

monitor() {
    state=get_primitive_resource_state()
    ...
    if ($state == unavailable)
        return $OCF_NOT_RUNNING
    ...
}

stop() {
    monitor()
    ret=$?
    if (ret == $OCF_NOT_RUNNING)
        return $OCF_SUCCESS
}

start() {
    start_primitive()
    if (start_primitive_failure)
        return $OCF_ERR_GENERIC
}

The idea is to make sure that stop does not fail when the underlying resource goes away (otherwise I see that the resource ends up in an unmanaged state). The expectation is also that when the resource comes back, it rejoins the cluster without much fuss.

What I see is that Pacemaker calls stop twice, and once it finds that stop returns success, it does not run monitor any more. I also do not see any attempt to start. Is there a way to keep the monitor going in such circumstances? Am I using incorrect resource agent return codes?

Thanks,
Pavan
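The pseudocode above can be fleshed out into a runnable shell fragment (a sketch under assumptions: get_primitive_resource_state is a hypothetical stand-in, driven here by an environment variable, for however the real agent queries the underlying resource; the OCF exit-code values are the standard ones):

```shell
# Minimal sketch of the monitor/stop interaction described in the post.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_NOT_RUNNING=7

get_primitive_resource_state() {
    # Hypothetical stand-in so the logic can be exercised without a cluster.
    echo "${PRIMITIVE_STATE:-unavailable}"
}

ra_monitor() {
    if [ "$(get_primitive_resource_state)" = "unavailable" ]; then
        return $OCF_NOT_RUNNING
    fi
    return $OCF_SUCCESS
}

ra_stop() {
    # "Already stopped" counts as a successful stop, so the resource does
    # not go unmanaged when the underlying resource vanishes.
    ra_monitor
    if [ $? -eq $OCF_NOT_RUNNING ]; then
        return $OCF_SUCCESS
    fi
    # ... real stop logic would go here ...
    return $OCF_SUCCESS
}
```

On the question of keeping the monitor going: once Pacemaker believes a resource is stopped, it normally stops scheduling its recurring monitor; if the goal is to have it keep probing so the resource can be restarted when the primitive comes back, configuring an additional monitor operation with role="Stopped" is the usual approach (worth verifying against your Pacemaker version).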
Re: [Pacemaker] booth is the state of "started" on pacemaker before booth write ticket info in cib.
On Thu, 2012-12-13 at 12:01 +0900, Yuichi SEINO wrote:
> Hi Jiaju,
>
> 2012/12/12 Jiaju Zhang:
> > On Tue, 2012-12-11 at 20:15 +0900, Yuichi SEINO wrote:
> >> Hi Jiaju,
> >>
> >> Currently, booth reaches the "Started" state on Pacemaker before booth
> >> writes the ticket information into the CIB. So, if old ticket information
> >> is present in the CIB, a resource depending on the ticket may start before
> >> booth resets the ticket. I think the problem is the point at which booth
> >> daemonizes.
> >
> > The resource should not be started before the booth daemon is ready. We
> > suggest configuring an ordering constraint between the booth daemon and the
> > resources managed by that ticket. That being said, if the ticket is in
> > the CIB but the booth daemon has not been started, the resources will not
> > be started.
>
> The booth RA finishes booth_start once booth daemonizes from the
> foreground process (to be exact, after a "sleep 1"). The current booth
> daemonizes before catchup, whereas the previous booth daemonized after
> catchup, and catchup is what writes the ticket into the CIB.
> So even if an ordering constraint is set, as shown below, the related
> resource can start as soon as booth reaches the "Started" state on
> Pacemaker, while catchup may not have finished yet.

Oh, I think I now understand your problem, thanks!

> crm_mon paste:
> ...
> booth(ocf::pacemaker:booth-site):Started multi-site-a-1
> ...
>
> >> Perhaps this problem didn't happen before the following commit:
> >> https://github.com/jjzhang/booth/commit/4b00d46480f45a205f2550ff0760c8b372009f7f
> >
> > Currently, when all of the initialization (including loading the new
> > ticket information) has finished, booth should be regarded as ready. So if
> > you encounter some problem here, I guess we should improve the RA to
> > better reflect the booth startup status, rather than moving the
> > initialization order, since that may introduce other regressions as we have
> > encountered before;)
>
> I am still not sure which we should fix, the RA or booth.

I suggest adding a new function that clears the old ticket info in the CIB, and calling it right after booth starts but before it daemonizes. That way, by the time booth_start in the RA returns, the stale data has been cleared. What do you think about this? ;)

Thanks,
Jiaju

> --
> Yuichi SEINO
> METROSYSTEMS CORPORATION
> E-mail: seino.clust...@gmail.com