Re: [Pacemaker] Pacemaker/corosync freeze
Hello David,

-----Original Message-----
From: David Vossel [mailto:dvos...@redhat.com]
Sent: Thursday, March 13, 2014 9:22 PM
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Pacemaker/corosync freeze

----- Original Message -----
From: Jan Friesse jfrie...@redhat.com
To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org
Sent: Thursday, March 13, 2014 4:03:28 AM
Subject: Re: [Pacemaker] Pacemaker/corosync freeze

...

Also, can you please try to set debug: on in corosync.conf and paste the full corosync.log then?

I set debug to on and did a few restarts, but could not reproduce the issue yet - I will post the logs as soon as I manage to reproduce it.

Perfect. Another option you can try to set is netmtu (1200 is usually safe).

Finally I was able to reproduce the issue. I restarted node ctsip2 at 21:10:14, and CPU went to 100% immediately (not when the node was up again). The corosync log with debug on is available at: http://pastebin.com/kTpDqqtm
To be honest, I had to wait much longer for this reproduction than before, even though there was no change in the corosync configuration - just potentially some system updates. But anyway, the issue is unfortunately still there. Previously, when this issue occurred, CPU was at 100% on all nodes - this time only on ctmgr, which was the DC... I hope you can find some useful details in the log.

Attila, what seems to be interesting is "Configuration ERRORs found during PE processing. Please run crm_verify -L to identify issues." I'm unsure how much of a problem this is, but I'm really not a pacemaker expert.

Anyway, I have a theory about what may be happening, and it looks related to IPC (and probably not related to the network). But to make sure we will not be trying to fix an already-fixed bug, can you please build:
- new libqb (0.17.0) - there are plenty of fixes in IPC
- Corosync 2.3.3 (already plenty of IPC fixes)

Yes, there was a libqb/corosync interoperation problem that showed these same symptoms last year. Updating to the latest corosync and libqb will likely resolve this.

I have upgraded all nodes to these versions and we are testing. So far no issues. Thank you very much for your help.

Regards,
Attila

- and maybe also a newer pacemaker

I know you were not very happy using hand-compiled sources, but please give them at least a try.

Thanks,
Honza

Thanks,
Attila

Regards,
Honza

There are also a few things that might or might not be related: 1) Whenever I want to edit the configuration with crm configure edit, ...

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
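For reference, the two corosync.conf changes discussed above would look roughly like this - a minimal sketch assuming a corosync 2.x totem/logging layout; the logfile path and the netmtu value of 1200 are only the examples mentioned in the thread, not a verified fix:

    # /etc/corosync/corosync.conf (excerpt)
    totem {
        version: 2
        # smaller MTU, as suggested (1200 is usually safe)
        netmtu: 1200
    }

    logging {
        # enable debug logging so corosync.log captures the freeze
        debug: on
        to_logfile: yes
        logfile: /var/log/corosync/corosync.log
    }

Corosync needs a restart (or corosync-cfgtool reload where supported) for these to take effect.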
Re: [Pacemaker] Pacemaker/corosync freeze
Attila Megyeri wrote:

Hi Honza,

What I also found in the log related to the freeze at 12:22:26: "Corosync main process was not scheduled for ..." Can it be the general cause of the issue?

Mar 13 12:22:14 ctmgr snmpd[1247]: Connection from UDP: [10.9.1.5]:58597-[10.9.1.3]:161
Mar 13 12:22:14 ctmgr snmpd[1247]: Connection from UDP: [10.9.1.5]:47943-[10.9.1.3]:161
Mar 13 12:22:14 ctmgr snmpd[1247]: Connection from UDP: [10.9.1.5]:47943-[10.9.1.3]:161
Mar 13 12:22:14 ctmgr snmpd[1247]: Connection from UDP: [10.9.1.5]:59647-[10.9.1.3]:161
Mar 13 12:22:26 ctmgr corosync[3024]: [MAIN ] Corosync main process was not scheduled for 6327.5918 ms (threshold is 4000. ms). Consider token timeout increase.
Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] The token was lost in the OPERATIONAL state.
Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] A processor failed, forming new configuration.
Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] entering GATHER state from 2(The token was lost in the OPERATIONAL state.).
Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] Creating commit token because I am the rep.
Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] Saving state aru 6a8c high seq received 6a8c
Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] Storing new sequence id for ring 7dc
Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] entering COMMIT state.
Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] got commit token
Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] entering RECOVERY state.
Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] TRANS [0] member 10.9.1.3:
Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] TRANS [1] member 10.9.1.41:
Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] TRANS [2] member 10.9.1.42:
Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] TRANS [3] member 10.9.1.71:
Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] TRANS [4] member 10.9.1.72:
Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] TRANS [5] member 10.9.2.11:
Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] TRANS [6] member 10.9.2.12:

Regards,
Attila

-----Original Message-----
From: Attila Megyeri [mailto:amegy...@minerva-soft.com]
Sent: Thursday, March 13, 2014 2:27 PM
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Pacemaker/corosync freeze

-----Original Message-----
From: Attila Megyeri [mailto:amegy...@minerva-soft.com]
Sent: Thursday, March 13, 2014 1:45 PM
To: The Pacemaker cluster resource manager; Andrew Beekhof
Subject: Re: [Pacemaker] Pacemaker/corosync freeze

Hello,

-----Original Message-----
From: Jan Friesse [mailto:jfrie...@redhat.com]
Sent: Thursday, March 13, 2014 10:03 AM
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Pacemaker/corosync freeze

...

I hope you can find some useful details in the log.

Attila, what seems to be interesting is "Configuration ERRORs found during PE processing. Please run crm_verify -L to identify issues." I'm unsure how much of a problem this is, but I'm really not a pacemaker expert. Perhaps Andrew could comment on that.

Any idea?

Anyway, I have a theory about what may be happening, and it looks related to IPC (and probably not related to the network). But to make sure we will not be trying to fix an already-fixed bug, can you please build:
- new libqb (0.17.0) - there are plenty of fixes in IPC
- Corosync 2.3.3 (already plenty of IPC fixes)
- and maybe also a newer pacemaker

I already use Corosync 2.3.3, built from source, and libqb-dev 0.16 from the Ubuntu package. I am currently building libqb 0.17.0 and will update you on the results.

In the meantime we had another freeze, which did not seem to be related to any restarts, but brought all corosync processes to 100%. Please check out the corosync.log, perhaps it is a different cause: http://pastebin.com/WMwzv0Rr

In the meantime I will install the new libqb and send logs if we have further issues.

Thank you very much for your help!

Regards,
Attila

One more question: If I install libqb 0.17.0 from
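As a quick way to act on the crm_verify suggestion above - a minimal sketch, assuming the pacemaker command-line tools are installed on a cluster node:

    # check the live cluster configuration (CIB) and report errors verbosely
    crm_verify -L -V

Running it on the DC (ctmgr in this case) should show the same "Configuration ERRORs" the policy engine complained about.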
Re: [Pacemaker] Pacemaker/corosync freeze
Attila Megyeri wrote:

Hi Honza,

What I also found in the log related to the freeze at 12:22:26: "Corosync main process was not scheduled for ..." Can it be the general cause of the issue?

I don't think it will cause the issue you are hitting, BUT keep in mind that if corosync is not scheduled for a long time, the node will probably be fenced by another node. So increasing the timeout is vital.

Honza

...
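The log message itself points at the remedy ("Consider token timeout increase"). A minimal sketch of what that could look like in corosync.conf - the 10000 ms value is only an illustrative assumption, not a figure recommended anywhere in this thread:

    # /etc/corosync/corosync.conf (excerpt)
    totem {
        version: 2
        # raise the token timeout so a briefly unscheduled corosync
        # does not immediately lose the token and trigger fencing
        token: 10000
    }

The right value depends on how long the scheduling stalls actually are on the affected hosts.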
[Pacemaker] crmd was aborted at pacemaker 1.1.11
Hi,

When specifying the node name in UPPER case and performing crm_resource, crmd was aborted. (The real node name is in LOWER case.)

# crm_resource -C -r p1 -N X3650H
Cleaning up p1 on X3650H
Waiting for 1 replies from the CRMd
No messages received in 60 seconds.. aborting

Mar 14 18:33:10 x3650h crmd[10718]: error: crm_abort: do_lrm_invoke: Triggered fatal assert at lrm.c:1240 : lrm_state != NULL
...snip...
Mar 14 18:33:10 x3650h pacemakerd[10708]: error: child_waitpid: Managed process 10718 (crmd) dumped core

* The state before performing crm_resource:

Stack: corosync
Current DC: x3650g (3232261383) - partition with quorum
Version: 1.1.10-38c5972
2 Nodes configured
3 Resources configured

Online: [ x3650g x3650h ]

Full list of resources:
 f-g (stonith:external/ibmrsa-telnet): Started x3650h
 f-h (stonith:external/ibmrsa-telnet): Started x3650g
 p1 (ocf::pacemaker:Dummy): Stopped

Migration summary:
* Node x3650g:
* Node x3650h:
   p1: migration-threshold=1 fail-count=1 last-failure='Fri Mar 14 18:32:48 2014'

Failed actions:
    p1_monitor_1 on x3650h 'not running' (7): call=16, status=complete, last-rc-change='Fri Mar 14 18:32:48 2014', queued=0ms, exec=0ms

Just for reference, a similar phenomenon did not occur with crm_standby.

$ crm_standby -U X3650H -v on

Best Regards,
Kazunori INOUE

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
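Until the assert itself is fixed, a hedged workaround sketch is simply to pass crm_resource the node name exactly as the cluster knows it; the commands are standard pacemaker CLI, and the node/resource names are just the ones from this report:

    # list the node names exactly as the cluster knows them
    crm_node -l

    # clean up using the lower-case spelling shown there
    crm_resource -C -r p1 -N x3650h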
[Pacemaker] Node in pending state, resources duplicated and data corruption
Hi to all!

We have a 4-node cluster and recently experienced a weird issue with Pacemaker that resulted in three database instance resources being duplicated (running simultaneously on 2 nodes) and subsequent data corruption. I've been investigating the logs and could not reach a conclusion as to why that happened, so I'm writing to the list with details of the event to see if someone can help me pinpoint whether there was some problem with our operation or maybe we hit some bug.

OS: CentOS 6.4
Pacemaker version: 1.1.8
Stack: cman
Stonith enabled via DELL iDRAC on all 4 nodes
Nodes: gandalf, isildur, mordor, lorien

Timeline of events and logs:

- A resource monitor operation times out and resources on that node (gandalf) are being stopped

Mar 8 08:41:09 gandalf crmd[31561]: error: process_lrm_event: LRM operation vg_ifx_oltp_monitor_24 (594) Timed Out (timeout=12ms)

- Stopping resources on that node (gandalf) also times out and the node is killed by stonith from another node (mordor)

Mar 8 08:42:54 gandalf lrmd[31558]: warning: child_timeout_callback: vg_ifx_oltp_stop_0 process (PID 17816) timed out
Mar 8 08:42:55 gandalf pengine[31560]: warning: unpack_rsc_op: Processing failed op stop for vg_ifx_oltp on gandalf.san01.cooperativaobrera.coop: unknown error (1)
Mar 8 08:42:55 gandalf pengine[31560]: warning: pe_fence_node: Node gandalf.san01.cooperativaobrera.coop will be fenced to recover from resource failure(s)
Mar 8 08:43:09 mordor corosync[25977]: [TOTEM ] A processor failed, forming new configuration.
Mar 8 08:43:09 mordor stonith-ng[26212]: notice: log_operation: Operation 'reboot' [4612] (call 0 from crmd.31561) for host 'gandalf.san01.cooperativaobrera.coop' with device 'st-gandalf' returned: 0 (OK)
Mar 8 08:43:21 mordor corosync[25977]: [QUORUM] Members[3]: 2 3 4
Mar 8 08:43:21 mordor corosync[25977]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Mar 8 08:43:21 mordor crmd[26216]: notice: crm_update_peer_state: cman_event_callback: Node gandalf.san01.cooperativaobrera.coop[1] - state is now lost
Mar 8 08:43:21 mordor crmd[26216]: warning: check_dead_member: Our DC node (gandalf.san01.cooperativaobrera.coop) left the cluster
Mar 8 08:43:21 mordor kernel: dlm: closing connection to node 1
Mar 8 08:43:21 mordor corosync[25977]: [CPG ] chosen downlist: sender r(0) ip(172.16.1.1) r(1) ip(172.16.2.1) ; members(old:4 left:1)
Mar 8 08:43:21 mordor corosync[25977]: [MAIN ] Completed service synchronization, ready to provide service.
Mar 8 08:43:21 mordor fenced[26045]: fencing deferred to isildur.san01.cooperativaobrera.coop
Mar 8 08:43:22 mordor crmd[26216]: notice: do_state_transition: State transition S_NOT_DC - S_ELECTION [ input=I_ELECTION cause=C_FSA_INTERNAL origin=check_dead_member ]
Mar 8 08:43:22 mordor crmd[26216]: notice: do_state_transition: State transition S_ELECTION - S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_election_count_vote ]
Mar 8 08:43:22 mordor stonith-ng[26212]: notice: remote_op_done: Operation reboot of gandalf.san01.cooperativaobrera.coop by mordor.san01.cooperativaobrera.coop for crmd.31...@gandalf.san01.cooperativaobrera.coop.10d27664: OK
Mar 8 08:43:22 mordor crmd[26216]: notice: tengine_stonith_notify: Peer gandalf.san01.cooperativaobrera.coop was terminated (st_notify_fence) by mordor.san01.cooperativaobrera.coop for gandalf.san01.cooperativaobrera.coop: OK (ref=10d27664-33ed-43e0-a5bd-7d0ef850eb05) by client crmd.31561
Mar 8 08:43:22 mordor crmd[26216]: notice: tengine_stonith_notify: Notified CMAN that 'gandalf.san01.cooperativaobrera.coop' is now fenced
Mar 8 08:43:22 mordor crmd[26216]: notice: tengine_stonith_notify: Target may have been our leader gandalf.san01.cooperativaobrera.coop (recorded: unset)
Mar 8 08:43:22 mordor cib[26211]: warning: cib_process_diff: Diff 0.513.82 - 0.513.83 from lorien.san01.cooperativaobrera.coop not applied to 0.513.85: current num_updates is greater than required
Mar 8 08:43:22 mordor cib[26211]: warning: cib_process_diff: Diff 0.513.83 - 0.513.84 from lorien.san01.cooperativaobrera.coop not applied to 0.513.85: current num_updates is greater than required
Mar 8 08:43:22 mordor cib[26211]: warning: cib_process_diff: Diff 0.513.84 - 0.513.85 from lorien.san01.cooperativaobrera.coop not applied to 0.513.85: current num_updates is greater than required
Mar 8 08:43:22 mordor cib[26211]: notice: cib_process_diff: Diff 0.513.85 - 0.513.86 from lorien.san01.cooperativaobrera.coop not applied to 0.513.85: Failed application of an update diff
Mar 8 08:43:27 mordor attrd[26214]: notice: attrd_local_callback: Sending full refresh (origin=crmd)
Mar 8 08:43:27 mordor attrd[26214]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-srv_mysql_dss (1384966716)
Mar 8 08:43:27 mordor crmd[26216]: notice: do_state_transition: State transition S_PENDING -
[Pacemaker] Errors while compiling
Hey everyone!

I have been trying to compile pacemaker from source for some time, but I keep getting the same errors, despite using different versions. I did the following:

1. ./autogen.sh
2. ./configure --prefix=/opt/cluster/ --disable-fatal-warnings
3. make

After that step I always get this error: http://pastebin.com/eXFmhUUD

I get this on version 1.10 as well as on 1.11. Any ideas?

--
Stephan Buchner
buch...@linux-systeme.de
+49 201 - 29 88 319
+49 172 - 7 222 333

Linux-Systeme GmbH
Langenbergerstr. 179, 45277 Essen
www.linux-systeme.de
+49 201 - 29 88 30
Amtsgericht Essen, HRB 14729
Geschäftsführer Jörg Hinz

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] Errors while compiling
Maybe you are missing a crm dev library.

2014-03-14 13:39 GMT+01:00 Stephan Buchner buch...@linux-systeme.de:

Hey everyone! I am trying to compile pacemaker from source for some time - but I keep getting the same errors, despite using different versions. I did the following:
1. ./autogen.sh
2. ./configure --prefix=/opt/cluster/ --disable-fatal-warnings
3. make
After that step I always get this error: http://pastebin.com/eXFmhUUD
I get this on version 1.10 as well as on 1.11. Any ideas?

...

--
esta es mi vida e me la vivo hasta que dios quiera ("this is my life and I live it for as long as God wills")

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] Errors while compiling
Hm, I installed libcrmcluster1-dev and libcrmcommon2-dev on my Debian system - still the same error :/

On 14.03.2014 14:07, emmanuel segura wrote:

Maybe you are missing a crm dev library.

2014-03-14 13:39 GMT+01:00 Stephan Buchner buch...@linux-systeme.de:

Hey everyone! I am trying to compile pacemaker from source for some time - but I keep getting the same errors, despite using different versions. I did the following:
1. ./autogen.sh
2. ./configure --prefix=/opt/cluster/ --disable-fatal-warnings
3. make
After that step I always get this error: http://pastebin.com/eXFmhUUD
I get this on version 1.10 as well as on 1.11. Any ideas?

...

--
Stephan Buchner
buch...@linux-systeme.de
+49 201 - 29 88 319
+49 172 - 7 222 333

Linux-Systeme GmbH
Langenbergerstr. 179, 45277 Essen
www.linux-systeme.de
+49 201 - 29 88 30
Amtsgericht Essen, HRB 14729
Geschäftsführer Jörg Hinz

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
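One thing that often helps on Debian/Ubuntu when a from-source build fails on missing headers - a hedged suggestion, not a confirmed fix for the error in the pastebin - is to pull in the build dependencies of the distribution's own pacemaker package before rebuilding:

    # requires deb-src entries in /etc/apt/sources.list
    apt-get update
    apt-get build-dep pacemaker

    # then rebuild from the source tree
    ./autogen.sh
    ./configure --prefix=/opt/cluster/ --disable-fatal-warnings
    make

If the distribution's pacemaker is much older than the version being built, some extra -dev packages (for example a newer libqb-dev) may still be needed on top of that.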
Re: [Pacemaker] crmd was aborted at pacemaker 1.1.11
----- Original Message -----
From: Kazunori INOUE kazunori.ino...@gmail.com
To: pm pacemaker@oss.clusterlabs.org
Sent: Friday, March 14, 2014 5:52:38 AM
Subject: [Pacemaker] crmd was aborted at pacemaker 1.1.11

Hi,

When specifying the node name in UPPER case and performing crm_resource, crmd was aborted. (The real node name is in LOWER case.)

https://github.com/ClusterLabs/pacemaker/pull/462

Does that fix it?

...

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] drbd + lvm
----- Original Message -----
From: Infoomatic infooma...@gmx.at
To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org
Sent: Thursday, March 13, 2014 5:28:19 PM
Subject: Re: [Pacemaker] drbd + lvm

Has anyone had this issue and resolved it? Any ideas? Thanks in advance!

Yep, I've hit this as well. Use the latest LVM agent - I already fixed all of this.
https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/LVM

Keep your volume_list the way it is and use the 'exclusive=true' LVM option. This will allow the LVM agent to activate volumes that don't exist in the volume_list. Hope that helps.

Thanks for the fast response. I upgraded LVM to the backports (2.02.95-4ubuntu1.1~precise1) and used this script, but I am getting errors when one of the nodes tries to activate the VG. The log:

Mar 13 23:21:03 lxc02 LVM[7235]: INFO: 0 logical volume(s) in volume group replicated now active
Mar 13 23:21:03 lxc02 LVM[7235]: INFO: LVM Volume replicated is not available (stopped)

exclusive is true and the tag is pacemaker. Does anyone have hints? Thanks in advance!

Yeah, those aren't errors. It's just telling you that the LVM agent stopped successfully. I would expect to see these after you did a failover or resource recovery. Is the resource not starting and stopping correctly for you? If not, I'll need more logs.

-- Vossel

infoomatic

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
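For readers hitting the same thing, a minimal sketch of the kind of resource definition being discussed - the VG name "replicated" comes from the log above, but the resource name, timeouts, and crm shell syntax are illustrative assumptions, not the poster's actual configuration:

    # /etc/lvm/lvm.conf keeps its existing activation filter, e.g.:
    #   volume_list = [ "rootvg" ]    # the clustered VG is intentionally NOT listed

    # crm shell: let the LVM agent activate the VG exclusively on one node
    crm configure primitive p_vg_replicated ocf:heartbeat:LVM \
        params volgrpname="replicated" exclusive="true" \
        op start timeout="60s" op stop timeout="60s" \
        op monitor interval="30s" timeout="60s"

With exclusive="true" the agent tags and activates the VG even though it is excluded from volume_list, which matches the behaviour David describes above; the "tag is pacemaker" INFO lines in the log are part of that mechanism rather than errors.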