Re: [Pacemaker] Questions about reasonable cluster size...
On 10/20/2011 03:15 AM, Steven Dake wrote:

On 10/19/2011 01:50 PM, Alan Robertson wrote:

Hi, I have an application where having a 12-node cluster with about 250 resources would be desirable. Is this reasonable? Can Pacemaker+Corosync be expected to reliably handle a cluster of this size? If not, what is the current recommendation for maximum number of nodes and resources?

I start to have problems with 10+ nodes. It's heavily dependent on the corosync configuration, afaik. You should test it.

--
Best regards,
Proskurin Kirill

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
[Pacemaker] PM 1.1.5- make errors
Hi all,

here is the next problem I need help with ;-(

PM version: 1.1.5 (Pacemaker-1-1-c86cb93c5a57.tar.bz2), configured with:

configure --prefix=$PREFIX --localstatedir=/var --sysconfdir=/etc --with-heartbeat --with-stonith --with-pacemaker --with-daemon-user=$CLUSTER_USER --with-daemon-group=$CLUSTER_GROUP --enable-fatal-warnings=no --with-ras-set=linux-ha

After make I get the following error:

...
gcc -std=gnu99 -DHAVE_CONFIG_H -I. -I../include -I../include -I../include -I../libltdl -I../libltdl -I/usr/include/glib-2.0 -I/usr/lib64/glib-2.0/include -I/usr/include/libxml2 -g -O2 -I/usr/include -I/usr/include/heartbeat -ggdb3 -O0 -fgnu89-inline -fstack-protector-all -Wall -Waggregate-return -Wbad-function-cast -Wcast-align -Wdeclaration-after-statement -Wendif-labels -Wfloat-equal -Wformat=2 -Wformat-security -Wformat-nonliteral -Winline -Wmissing-prototypes -Wmissing-declarations -Wnested-externs -Wno-long-long -Wno-strict-aliasing -Wpointer-arith -Wstrict-prototypes -Wwrite-strings -MT te_callbacks.o -MD -MP -MF .deps/te_callbacks.Tpo -c -o te_callbacks.o te_callbacks.c

mv -f .deps/te_callbacks.Tpo .deps/te_callbacks.Po

/bin/sh ../libtool --tag=CC --tag=CC --mode=link gcc -std=gnu99 -g -O2 -I/usr/include -I/usr/include/heartbeat -ggdb3 -O0 -fgnu89-inline -fstack-protector-all -Wall -Waggregate-return -Wbad-function-cast -Wcast-align -Wdeclaration-after-statement -Wendif-labels -Wfloat-equal -Wformat=2 -Wformat-security -Wformat-nonliteral -Winline -Wmissing-prototypes -Wmissing-declarations -Wnested-externs -Wno-long-long -Wno-strict-aliasing -Wpointer-arith -Wstrict-prototypes -Wwrite-strings -o crmd main.o crmd.o corosync.o fsa.o control.o messages.o ccm.o callbacks.o election.o join_client.o join_dc.o subsystems.o cib.o pengine.o tengine.o lrm.o utils.o misc.o te_events.o te_actions.o te_utils.o te_callbacks.o -lhbclient -lccmclient -llrm ../lib/fencing/libstonithd.la ../lib/transition/libtransitioner.la ../lib/pengine/libpe_rules.la ../lib/cib/libcib.la ../lib/common/libcrmcluster.la ../lib/common/libcrmcommon.la -lplumb -lpils -lbz2 -lxslt -lxml2 -lc -lglib-2.0 -luuid -lrt -ldl -lglib-2.0 -lltdl

libtool: link: gcc -std=gnu99 -g -O2 -I/usr/include -I/usr/include/heartbeat -ggdb3 -O0 -fgnu89-inline -fstack-protector-all -Wall -Waggregate-return -Wbad-function-cast -Wcast-align -Wdeclaration-after-statement -Wendif-labels -Wfloat-equal -Wformat=2 -Wformat-security -Wformat-nonliteral -Winline -Wmissing-prototypes -Wmissing-declarations -Wnested-externs -Wno-long-long -Wno-strict-aliasing -Wpointer-arith -Wstrict-prototypes -Wwrite-strings -o .libs/crmd main.o crmd.o corosync.o fsa.o control.o messages.o ccm.o callbacks.o election.o join_client.o join_dc.o subsystems.o cib.o pengine.o tengine.o lrm.o utils.o misc.o te_events.o te_actions.o te_utils.o te_callbacks.o /usr/lib64/liblrm.so ../lib/fencing/.libs/libstonithd.so -L/usr/lib64 -L/lib64 /usr/lib64/libstonith.so ../lib/transition/.libs/libtransitioner.so ../lib/pengine/.libs/libpe_rules.so ../lib/cib/.libs/libcib.so /opt/HA/sourc/Pacemaker-1-1-c86cb93c5a57/lib/pengine/.libs/libpe_rules.so ../lib/common/.libs/libcrmcluster.so /usr/lib64/libhbclient.so /usr/lib64/libccmclient.so /opt/HA/sourc/Pacemaker-1-1-c86cb93c5a57/lib/common/.libs/libcrmcommon.so ../lib/common/.libs/libcrmcommon.so -lgnutls -lgcrypt -lgpg-error /usr/lib64/libplumb.so /usr/lib64/libpils.so -lbz2 /usr/lib64/libxslt.so /usr/lib64/libxml2.so -lz -lm -lc -luuid -lrt -lglib-2.0 /usr/lib64/libltdl.so -ldl

control.o: In function `do_ha_control':
/opt/HA/sourc/Pacemaker-1-1-c86cb93c5a57/crmd/control.c:69: undefined reference to `terminate_ais_connection'
collect2: ld returned 1 exit status
gmake[1]: *** [crmd] Fehler 1
gmake[1]: Leaving directory `/opt/HA/sourc/Pacemaker-1-1-c86cb93c5a57/crmd'
make: *** [all-recursive] Fehler 1

What is wrong with it?
Nikita Michalko
Re: [Pacemaker] When a resource starts all at once in environment using utilization, score may not work
Hi, Yan

(2011/09/26 17:46), Gao,Yan wrote:

A glance at the transition: after grpPostgreSQLDB3 was assigned to act1, grpPostgreSQLDB1 was chosen to be processed next, and it was assigned to act2 (because it had no preference between act2 and act3). Then grpPostgreSQLDB2 went to act3.

Thank you for the reply. I now understand the flow of the current placement.

So the solution might be: after grpPostgreSQLDB3, process grpPostgreSQLDB2 first rather than grpPostgreSQLDB1. The problem is: based on what policy could we choose to process grpPostgreSQLDB2 earlier than grpPostgreSQLDB1, given that the processing order was decided before assigning any of them, i.e. before assigning grpPostgreSQLDB3?

I considered various approaches but could not come up with a good one. For example, what about re-sorting the resources that have not yet been placed each time one resource is assigned? Would such a change be difficult?

Regards,
Yuusuke

--
METRO SYSTEMS CO., LTD
Yuusuke Iida
Mail: iiday...@intellilink.co.jp
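To make the re-sorting idea concrete, here is a minimal Python sketch. It is not Pacemaker's allocator: the scores, the one-unit capacities, and the ordering policy (process first the resource with the largest gap between its best and second-best node that still has room) are all assumptions chosen for illustration. Only the resource and node names come from the thread.

```python
# Hypothetical preference scores; only the names come from the thread.
SCORES = {
    "grpPostgreSQLDB1": {"act1": 200, "act2": 100, "act3": 100},
    "grpPostgreSQLDB2": {"act1": 200, "act2": 150, "act3": 50},
    "grpPostgreSQLDB3": {"act1": 300, "act2": 0, "act3": 0},
}
NEED = {res: 1 for res in SCORES}                # each group uses one unit
CAPACITY = {"act1": 1, "act2": 1, "act3": 1}     # each node offers one unit


def preference_gap(res, free, scores, need):
    """Best minus second-best score among nodes that can still host res."""
    feasible = sorted(
        (scores[res][node] for node in free if free[node] >= need[res]),
        reverse=True,
    )
    if not feasible:
        return float("-inf")   # nowhere to go, so handle it last
    if len(feasible) == 1:
        return float("inf")    # a single option left, so settle it now
    return feasible[0] - feasible[1]


def place(scores, need, capacity, resort):
    """Greedy placement; if resort is true, re-order pending resources
    after every assignment instead of fixing the order up front."""
    free = dict(capacity)
    pending = list(scores)

    def key(res):
        return preference_gap(res, free, scores, need)

    pending.sort(key=key, reverse=True)          # order decided before placing
    placement = {}
    while pending:
        if resort:
            pending.sort(key=key, reverse=True)  # order refreshed each round
        res = pending.pop(0)
        candidates = [n for n in free if free[n] >= need[res]]
        if not candidates:
            placement[res] = None                # cannot be placed anywhere
            continue
        node = max(candidates, key=lambda n: scores[res][n])
        placement[res] = node
        free[node] -= need[res]
    return placement
```

With the fixed order, grpPostgreSQLDB1 is processed second and takes act2 although it has no preference between act2 and act3, which pushes grpPostgreSQLDB2 to act3. With re-sorting, grpPostgreSQLDB2 moves ahead once act1 is full, because it is then the only pending resource with a real preference.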
Re: [Pacemaker] Questions about reasonable cluster size...
On 10/20/2011 03:11 AM, Proskurin Kirill wrote:

On 10/20/2011 03:15 AM, Steven Dake wrote:

On 10/19/2011 01:50 PM, Alan Robertson wrote:

Hi, I have an application where having a 12-node cluster with about 250 resources would be desirable. Is this reasonable? Can Pacemaker+Corosync be expected to reliably handle a cluster of this size? If not, what is the current recommendation for maximum number of nodes and resources?

Steven Dake wrote: We regularly test 16 nodes. As far as resources go, Andrew could answer that.

I start to have problems with 10+ nodes. It's heavily dependent on the corosync configuration, afaik. You should test it.

This is somewhat different from Steven's comment. Exactly what things did you have in mind for the corosync configuration that could either help or hurt with larger clusters?

Steven: Proskurin seems to think that there are some particular things to watch out for in the Corosync configuration for larger clusters. Does anything come to mind for you about this?

--
Alan Robertson al...@unix.sh

Openness is the foundation and preservative of friendship... Let me claim from you at all times your undisguised opinions. - William Wilberforce
Re: [Pacemaker] Questions about reasonable cluster size...
On 10/20/2011 07:42 AM, Alan Robertson wrote:

On 10/20/2011 03:11 AM, Proskurin Kirill wrote:

On 10/20/2011 03:15 AM, Steven Dake wrote:

On 10/19/2011 01:50 PM, Alan Robertson wrote:

Hi, I have an application where having a 12-node cluster with about 250 resources would be desirable. Is this reasonable? Can Pacemaker+Corosync be expected to reliably handle a cluster of this size? If not, what is the current recommendation for maximum number of nodes and resources?

Steven Dake wrote: We regularly test 16 nodes. As far as resources go, Andrew could answer that.

I start to have problems with 10+ nodes. It's heavily dependent on the corosync configuration, afaik. You should test it.

This is somewhat different from Steven's comment. Exactly what things did you have in mind for the corosync configuration that could either help or hurt with larger clusters?

Steven: Proskurin seems to think that there are some particular things to watch out for in the Corosync configuration for larger clusters. Does anything come to mind for you about this?

We do 16 node testing with token=10000 (10 seconds). The rest of the parameters autoconfigure.
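For readers who want to try the same setting, a 10-second token timeout would look roughly like the fragment below. This is a sketch, not Steven's actual test configuration: it assumes the value is given in milliseconds in the totem section of corosync.conf, as corosync expects, and it leaves out everything else a working file needs (interface, logging and so on).

```
totem {
    version: 2
    # Token timeout in milliseconds; 10000 ms = 10 seconds.
    token: 10000
}
```

Leaving the other totem timers unset matches the "rest of the parameters autoconfigure" remark, since corosync derives defaults for them.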
[Pacemaker] corosync mailing list address change
Sending one last reminder that the Corosync mailing list has moved off the Linux Foundation's servers. I have been unable to obtain the previous subscriber list, so please resubscribe.

http://lists.corosync.org/mailman/listinfo

The list is called discuss.

Regards
-steve
Re: [Pacemaker] PM 1.1.5- make errors
Looks like do_ha_control() is calling corosync-specific functions when only support for heartbeat is being built. They'd just need to be #ifdef'd out.

On Thu, Oct 20, 2011 at 9:54 PM, Nikita Michalko michalko.sys...@a-i-p.com wrote:

Hi all, the next problem I need help ;-( PM Version: 1.1.5 (Pacemaker-1-1-c86cb93c5a57.tar.bz2) - configured with:

configure --prefix=$PREFIX --localstatedir=/var --sysconfdir=/etc --with-heartbeat --with-stonith --with-pacemaker --with-daemon-user=$CLUSTER_USER --with-daemon-group=$CLUSTER_GROUP --enable-fatal-warnings=no --with-ras-set=linux-ha

After make I get the following error:

(snip)

control.o: In function `do_ha_control':
/opt/HA/sourc/Pacemaker-1-1-c86cb93c5a57/crmd/control.c:69: undefined reference to `terminate_ais_connection'
collect2: ld returned 1 exit status
gmake[1]: *** [crmd] Fehler 1
gmake[1]: Leaving directory `/opt/HA/sourc/Pacemaker-1-1-c86cb93c5a57/crmd'
make: *** [all-recursive] Fehler 1

What is wrong with it?

Nikita Michalko
Re: [Pacemaker] [Problem] The attrd does not sometimes stop.
Hi Alan,

Thank you for the comment. We are also trying to reproduce the problem and will send a report. However, it has not reappeared so far.

Best Regards,
Hideo Yamauchi.

--- On Thu, 2011/10/20, Alan Robertson al...@unix.sh wrote:

Hi, I've seen a very similar problem in a recent release. In fact, I'm in the process of reproducing it so that it can be properly logged and so on. When I get the right data for the bug report, I'll attach it to the bug.

FWIW: I'm pretty sure that the signal was properly received by attrd. I haven't looked at the attrd code, but my guess is that either it didn't issue the correct function call for exiting from mainloop - or that the mainloop code didn't actually exit.

FWIW - it probably doesn't matter at all what the priority for signal handling is - since attrd consumes nearly no CPU. Too bad it doesn't log receiving the signal or beginning the process of exiting...

Another random thought - I suppose attrd could be clobbering some memory which mainloop needs to properly process an exit. Doesn't seem likely - but neither of the above options seems very likely either.

A historical note on an early bug that had similar symptoms (but affected every process - not just attrd). First - what caused such a problem (a very long time ago): there is a window between checking for signals and going to sleep in the poll call during which a signal might be ignored for a while. The glib mainloop code has three entry points called each time a signal is received: prepare, check, dispatch. There is a poll call which occurs between the prepare and check steps. If a signal comes in after the prepare call returns, but before the code goes to sleep in the poll system call, it will be ignored until the poll system call returns. It will get caught on the next iteration of the loop.

The fix was fairly simple - the signal handling code instructs the mainloop infrastructure to call poll with an argument which prevents it from staying asleep longer than a second. Then the code processes the signal correctly.

On 10/17/2011 07:19 PM, renayama19661...@ybb.ne.jp wrote:

Hi, We sometimes fail to stop attrd.

Step1. Start a cluster with 2 nodes.
Step2. Stop the first node. (/etc/init.d/heartbeat stop)
Step3. Stop the second node after a little time has passed. (/etc/init.d/heartbeat stop)

The attrd catches the TERM signal, but does not stop.

(snip)
Oct 5 02:37:38 hpdb0201 crmd: [12238]: info: do_exit: [crmd] stopped (0)
Oct 5 02:37:38 hpdb0201 cib: [12234]: WARN: send_ipc_message: IPC Channel to 12238 is not connected
Oct 5 02:37:38 hpdb0201 cib: [12234]: WARN: send_via_callback_channel: Delivery of reply to client 12238/0dbc9e28-d90d-4335-b9c4-9dd3fcb38163 failed
Oct 5 02:37:38 hpdb0201 cib: [12234]: WARN: do_local_notify: A-Sync reply to crmd failed: reply failed
Oct 5 02:37:38 hpdb0201 heartbeat: [12223]: info: killing /usr/lib64/heartbeat/attrd process group 12237 with signal 15
Oct 5 02:47:03 hpdb0201 cib: [12234]: info: cib_stats: Processed 97 operations (4123.00us average, 0% utilization) in the last 10min
Oct 5 07:15:25 hpdb0201 ccm: [12233]: WARN: G_CH_check_int: working on IPC channel took 1010 ms ( 100 ms)
Oct 5 07:15:26 hpdb0201 ccm: [12233]: WARN: G_CH_check_int: working on IPC channel took 1010 ms ( 100 ms)
Oct 5 07:15:37 hpdb0201 heartbeat: [12223]: WARN: Gmain_timeout_dispatch: Dispatch function for check for signals was delayed 1030 ms ( 1010 ms) before being called (GSource: 0xd28010)
Oct 5 07:15:37 hpdb0201 heartbeat: [12223]: info: Gmain_timeout_dispatch: started at 431583547 should have started at 431583444
Oct 5 07:15:44 hpdb0201 heartbeat: [12223]: WARN: Gmain_timeout_dispatch: Dispatch function for send local status was delayed 1030 ms ( 1010 ms) before being called (GSource: 0xd27dd0)
Oct 5 07:15:44 hpdb0201 heartbeat: [12223]: info: Gmain_timeout_dispatch: started at 431584254 should have started at 431584151
Oct 5 07:15:44 hpdb0201 heartbeat: [12223]: WARN: Gmain_timeout_dispatch: Dispatch function for check for signals was delayed 1030 ms ( 1010 ms) before being called (GSource: 0xd28010)
Oct 5 07:15:44 hpdb0201 heartbeat: [12223]: info: Gmain_timeout_dispatch: started at 431584254 should have started at 431584151
Oct 5 07:16:59 hpdb0201 heartbeat: [12223]: WARN: G_CH_check_int: working on write child took 1010 ms ( 100 ms)
Oct 5 07:17:14 hpdb0201 stonithd: [12236]: WARN: G_CH_check_int: working on Heartbeat API channel took 1010 ms ( 100 ms)
Oct 5 07:19:41 hpdb0201 heartbeat: [12223]: WARN: Gmain_timeout_dispatch: Dispatch function for send local status was delayed 1030 ms ( 1010 ms) before being called (GSource: 0xd27dd0)
Oct 5 07:19:41 hpdb0201 heartbeat: [12223]: info:
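The poll-timeout fix from Alan's historical note can be illustrated with a small Python sketch. This is not the Heartbeat or glib code; the function name run_until_signalled and its parameters are hypothetical, and it assumes a POSIX system where select.poll is available. The handler only sets a flag, and the loop never sleeps longer than cap_ms, so a flag set while the loop is asleep (or just before it goes to sleep) is noticed within that bound rather than after the full requested timeout.

```python
import select
import signal


def run_until_signalled(fd, requested_timeout_ms, cap_ms=1000,
                        signum=signal.SIGTERM):
    """Poll fd until signum arrives; return how many polls were made.

    fd is assumed to stay quiet; real code would dispatch its events.
    """
    stop_requested = False

    def on_signal(_signum, _frame):
        # Only record the request; the loop acts on it at its next check.
        nonlocal stop_requested
        stop_requested = True

    previous = signal.signal(signum, on_signal)
    poller = select.poll()
    poller.register(fd, select.POLLIN)
    polls = 0
    try:
        while not stop_requested:            # the "check for signals" step
            # Python restarts poll() after the handler has run (PEP 475),
            # so without the cap the loop would sleep for the whole
            # requested timeout before looking at the flag again.
            poller.poll(min(requested_timeout_ms, cap_ms))
            polls += 1
    finally:
        if previous is not None:
            signal.signal(signum, previous)  # restore the old handler
    return polls
```

The cap trades a little idle wake-up work for a bounded shutdown delay, which is the same trade-off the one-second limit in the note makes.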
[Pacemaker] AUTO: Georg Dierkes is out of the office (returning 25.10.2011)
I am out of the office until 25.10.2011. Note: This is an automated response to your message Pacemaker Digest, Vol 47, Issue 61 sent on 10/20/2011 6:30:40 PM. This is the only notification you will receive while this person is away.