Re: [Pacemaker] stop problem and crm node delete nodename is bug?
Date: Tue, 28 Sep 2010 12:27:47 +0200
From: Andrew Beekhof
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] pacemaker stop problem

On Tue, Sep 28, 2010 at 10:00 AM, jiaju liu wrote:
> Hi guys,
> I use the command "service openais force-stop" to stop openais. It often
> takes a long time to stop, or the command never returns. Sometimes running
> "service openais force-stop" twice works; otherwise I have to kill the
> process. Does anyone have a better way to stop the service?

More than likely openais is waiting for pacemaker, and pacemaker is waiting for one of your cluster services to stop. Figure out why that's taking so long and you'll solve the issue.

Is there any command to turn off pacemaker?

I have also found some problems with "crm node delete nodename":
- If the deleted node is the DC, the cluster does not elect a new DC, so the DC is NONE.
- Although the cluster can no longer see the deleted node, that node can still use crm_mon to see the cluster.
- If a filesystem resource is running on the node, the filesystem is not unmounted after "crm node delete nodename", so the resource cannot migrate to another node in the cluster.

I think this is a bug.

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
Re: [Pacemaker] crm resource move doesn't move the resource
On 28 September 2010 15:09, Pavlos Parissis wrote:
> Hi,
>
> When I issue "crm resource move pbx_service_01 node-0N" it moves the
> resource group, but the fs_01 resource is not started because drbd_01 is
> still running on the other node and is not moved to node-0N as well, even
> though I have colocation constraints.
> I am pretty sure I had this working before, but I can't figure out why it
> doesn't work anymore.
> The resources pbx_service_01 and drbd_01 are moved to another node in case
> of failure, but for some reason not manually.
>
> Can you see in my conf where the problem could be? I have already spent
> some time on it and I think I can't see the obvious anymore :-(
>
> [...snip ...]

Just to add that this issue applies to only one of the resource groups, even though the conf is the same for both of them!

So, after hours of running the same test again and again, and reading the logs (BTW, it seems they don't say in a clear way why certain things happen), I decided to recreate the drbd_01 and ms-drbd_01 resources and adjust the order constraints.

Before, it was like this:

order fs_01-after-drbd_01 inf: ms-drbd_01:promote fs_01:start
order fs_02-after-drbd_02 inf: ms-drbd_02:promote fs_02:start
order pbx_01-after-fs_01 inf: fs_01 pbx_01
order pbx_01-after-ip_01 inf: ip_01 pbx_01
order pbx_02-after-fs_02 inf: fs_02 pbx_02
order pbx_02-after-ip_02 inf: ip_02 pbx_02

and now it is like this:

order fs_02-after-drbd_02 inf: ms-drbd_02:promote fs_02:start
order pbx_02-after-fs_02 inf: fs_02 pbx_02
order pbx_02-after-ip_02 inf: ip_02 pbx_02
order pbx_service_01-after-drbd_01 inf: ms-drbd_01:promote pbx_service_01:start

As you can see, no major changes. The end result is that now, every time I issue "crm resource move pbx_service_01 node-0N", drbd_01 is promoted on that node as well and the whole resource group is started!

So, the issue is solved, but I don't like it for the very simple reason that I don't know why it didn't work before, and that scares me!
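For reference, the pattern that makes a group "drag" the DRBD master along when it is moved is to constrain the whole group against the master role, rather than tying only fs_01 to it. A sketch in crm-shell syntax, reusing the resource names from the thread (the colocation id is made up, and ids/scores would need adapting to the real configuration):

```
colocation pbx_service_01-with-drbd-master inf: pbx_service_01 ms-drbd_01:Master
order pbx_service_01-after-drbd_01 inf: ms-drbd_01:promote pbx_service_01:start
```

With both constraints against the group itself, "crm resource move pbx_service_01 node-0N" creates a location preference for the group, and the colocation pulls the Master role of ms-drbd_01 to the same node before the group starts.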
Cheers,
Pavlos
[Pacemaker] [Problem]Lost fail-count.
Hi,

We tested a resource failure during cluster division and the subsequent recovery of the cluster. At the time of cluster recovery, the fail-count disappeared, although the corresponding Failed Actions entries did not. It occurred with the following procedure:

Step 1) We start Heartbeat.
Step 2) We isolate the cgl60 node using iptables.
Step 3) Once the sfex resource has started on the cgl63 node, we remove the isolation of the cgl60 node.
Step 4) On the cgl63 node, the start of VIPcheck and sfex fails. (VIPcheck and sfex are the resources that detect a double start.)
Step 5) The fail-count is lost.

Last updated: Thu Sep 16 17:26:10 2010
Stack: Heartbeat
Current DC: cgl63 (16349f88-0203-40d1-ba48-b7a5c4547a26) - partition with quorum
Version: 1.0.9-74392a28b7f3 stable-1.0 tip
4 Nodes configured, unknown expected votes
10 Resources configured.

Online: [ cgl60 cgl61 cgl62 cgl63 ]

Resource Group: UMgroup01
    UmVIPcheck (ocf::heartbeat:VIPcheck): Started cgl60
    UmIPaddr (ocf::heartbeat:IPaddr2): Started cgl60
    UmDummy01 (ocf::pacemaker:Dummy): Started cgl60
    UmDummy02 (ocf::pacemaker:Dummy): Started cgl60
Resource Group: OVDBgroup02-1
    prmExPostgreSQLDB1 (ocf::heartbeat:sfex): Started cgl60
    prmFsPostgreSQLDB1-1 (ocf::heartbeat:Filesystem): Started cgl60
    prmFsPostgreSQLDB1-2 (ocf::heartbeat:Filesystem): Started cgl60
    prmFsPostgreSQLDB1-3 (ocf::heartbeat:Filesystem): Started cgl60
    prmIpPostgreSQLDB1 (ocf::heartbeat:IPaddr2): Started cgl60
    prmApPostgreSQLDB1 (ocf::heartbeat:pgsql): Started cgl60
Resource Group: OVDBgroup02-2
    prmExPostgreSQLDB2 (ocf::heartbeat:sfex): Started cgl61
    prmFsPostgreSQLDB2-1 (ocf::heartbeat:Filesystem): Started cgl61
    prmFsPostgreSQLDB2-2 (ocf::heartbeat:Filesystem): Started cgl61
    prmFsPostgreSQLDB2-3 (ocf::heartbeat:Filesystem): Started cgl61
    prmIpPostgreSQLDB2 (ocf::heartbeat:IPaddr2): Started cgl61
    prmApPostgreSQLDB2 (ocf::heartbeat:pgsql): Started cgl61
Resource Group: OVDBgroup02-3
    prmExPostgreSQLDB3 (ocf::heartbeat:sfex): Started cgl62
    prmFsPostgreSQLDB3-1 (ocf::heartbeat:Filesystem): Started cgl62
    prmFsPostgreSQLDB3-2 (ocf::heartbeat:Filesystem): Started cgl62
    prmFsPostgreSQLDB3-3 (ocf::heartbeat:Filesystem): Started cgl62
    prmIpPostgreSQLDB3 (ocf::heartbeat:IPaddr2): Started cgl62
    prmApPostgreSQLDB3 (ocf::heartbeat:pgsql): Started cgl62
(snip)
Migration summary:
* Node cgl60:
* Node cgl61:
* Node cgl62:
* Node cgl63:   -> Lost fail-count.

Failed actions:
    prmExPostgreSQLDB1_start_0 (node=cgl63, call=46, rc=1, status=complete): unknown error
    UmVIPcheck_start_0 (node=cgl63, call=45, rc=1, status=complete): unknown error

Looking at the log, the start failure does seem to have been detected:

Sep 16 17:25:29 cgl63 crmd: [9757]: info: process_lrm_event: LRM operation prmExPostgreSQLDB1_start_0 (call=46, rc=1, cib-update=91, confirmed=true) unknown error

What is the cause of the disappearance of the fail-count? I have attached the log.

* http://developerbugs.linux-foundation.org/show_bug.cgi?id=2496

Best Regards,
Hideo Yamauchi.
Re: [Pacemaker] About behavior in "Action Lost".
Sorry, it probably got rebased before I pushed it.

http://hg.clusterlabs.org/pacemaker/1.1/rev/dd8e37df3e96 should be the right link.

On Wed, Sep 29, 2010 at 2:51 AM, wrote:
> Hi Andrew,
>
>> Pushed as:
>> http://hg.clusterlabs.org/pacemaker/1.1/rev/8433015faf18
>>
>> Not sure about applying to 1.0 though, it's a dramatic change in behavior.
>
> The changeset at this link is not found.
> Where did you update it?
>
> Best Regards,
> Hideo Yamauchi.
>
> --- Andrew Beekhof wrote:
>
>> Pushed as:
>> http://hg.clusterlabs.org/pacemaker/1.1/rev/8433015faf18
>>
>> Not sure about applying to 1.0 though, it's a dramatic change in behavior.
>>
>> On Wed, Sep 22, 2010 at 11:18 AM, wrote:
>> > Hi Andrew,
>> >
>> > Thank you for the comment.
>> >
>> >> A long time ago in a galaxy far away, some messaging layers used to
>> >> lose quite a few actions, including stops.
>> >> Around the same time, we decided that fencing because a stop action was
>> >> lost wasn't a good idea.
>> >>
>> >> The rationale was that if the operation eventually completed, it would
>> >> end up in the CIB anyway.
>> >> And even if it didn't, the PE would continue to try the operation
>> >> again until the whole node fell over, at which point it would get shot
>> >> anyway.
>> >
>> > Sorry...
>> > I did not know that there had been such a discussion in the past.
>> >
>> >> Now, having said that, things have improved since then and perhaps,
>> >> in the interest of speeding up recovery in these situations, it is time
>> >> to stop treating stop operations differently.
>> >> Would you agree?
>> >
>> > So this change means that, in the case of an "Action Lost" for a stop, it will now carry out stonith?
>> > If my understanding is right, I agree too.
>> >
>> > if(timer->action->type != action_type_rsc) {
>> >     send_update = FALSE;
>> > } else if(safe_str_eq(task, "cancel")) {
>> >     /* we dont need to update the CIB with these */
>> >     send_update = FALSE;
>> > }
>> > ---> delete the 'else if(safe_str_eq(task, "stop")){..}' branch?
>> >
>> > if(send_update) {
>> >     /* cib_action_update(timer->action, LRM_OP_PENDING, EXECRA_STATUS_UNKNOWN); */
>> >     cib_action_update(timer->action, LRM_OP_TIMEOUT, EXECRA_UNKNOWN_ERROR);
>> > }
>> >
>> > Best Regards,
>> > Hideo Yamauchi.
>> >
>> > --- Andrew Beekhof wrote:
>> >
>> >> On Tue, Sep 21, 2010 at 8:59 AM, wrote:
>> >> > Hi,
>> >> >
>> >> > A node was in a state of very high load, and we were checking Pacemaker's monitor behavior.
>> >> > An "Action Lost" occurred on the stop operation after the monitor error occurred.
>> >> >
>> >> > Sep 8 20:02:22 cgl54 crmd: [3507]: ERROR: print_elem: Aborting transition, action lost: [Action 9]:
>> >> > In-flight (id: prmApPostgreSQLDB1_stop_0, loc: cgl49, priority: 0)
>> >> > Sep 8 20:02:22 cgl54 crmd: [3507]: info: abort_transition_graph: action_timer_callback:486 -
>> >> > Triggered transition abort (complete=0) : Action lost
>> >> >
>> >> > We think the stop operation did not complete because of the load on the node.
>> >> > But why can't the node execute stonith?
>> >>
>> >> A long time ago in a galaxy far away, some messaging layers used to
>> >> lose quite a few actions, including stops.
>> >> Around the same time, we decided that fencing because a stop action was
>> >> lost wasn't a good idea.
>> >>
>> >> The rationale was that if the operation eventually completed, it would
>> >> end up in the CIB anyway.
>> >> And even if it didn't, the PE would continue to try the operation
>> >> again until the whole node fell over, at which point it would get shot
>> >> anyway.
>> >>
>> >> Now, having said that, things have improved since then and perhaps,
>> >> in the interest of speeding up recovery in these situations, it is time
>> >> to stop treating stop operations differently.
>> >> Would you agree?
[Pacemaker] Doc build issue
Hi!

This patch breaks the rpm build and seems to be unneeded (at least on F13): Italian docs are generated without it.

http://hg.clusterlabs.org/pacemaker/1.1/diff/ac25a4ecdbcb/doc/Clusters_from_Scratch/publican.cfg.in

Symptoms:

$ make Clusters_from_Scratch.txt
Building Clusters_from_Scratch
rm -rf Clusters_from_Scratch/publish/*
cd Clusters_from_Scratch && /usr/bin/publican build --publish --langs=all --formats=html-desktop,txt
Can't locate required file: ARRAY(0x3deccd8)/Book_Info.xml at /usr/bin/publican line 514
make: *** [Clusters_from_Scratch.txt] Error 2

Best,
Vladislav
Re: [Pacemaker] cib
Hi,

I did a bt on the core; this is what I found:

==
Core was generated by `/usr/lib64/heartbeat/cib'.
Program terminated with signal 11, Segmentation fault.
[New process 12340]
#0  0x7f23acc553fa in strncmp () from /lib64/libc.so.6
(gdb) bt
#0  0x7f23acc553fa in strncmp () from /lib64/libc.so.6
#1  0x7f23acf87c39 in __xmlParserInputBufferCreateFilename () from /usr/lib64/libxml2.so.2
#2  0x7f23acf6147b in xmlNewInputFromFile () from /usr/lib64/libxml2.so.2
#3  0x7f23acf641d4 in xmlCreateURLParserCtxt () from /usr/lib64/libxml2.so.2
#4  0x7f23acf78f3a in xmlReadFile () from /usr/lib64/libxml2.so.2
#5  0x7f23ad0167b1 in xmlRelaxNGParse () from /usr/lib64/libxml2.so.2
#6  0x7f23ae967321 in validate_with_relaxng (doc=0x626020, to_logs=1, relaxng_file=0x7f23ae97ba10 "/usr/share/pacemaker/pacemaker-1.2.rng") at xml.c:
#7  0x7f23ae967769 in validate_with (xml=0x6260d0, method=6, to_logs=1) at xml.c:2287
#8  0x7f23ae967b9f in validate_xml (xml_blob=0x6260d0, validation=0x626910 "pacemaker-1.2", to_logs=1) at xml.c:2373
#9  0x00405b23 in readCibXmlFile (dir=0x41b580 "/var/lib/heartbeat/crm", file=0x41c40a "cib.xml", discard_status=1) at io.c:396
#10 0x00412285 in startCib (filename=0x41c40a "cib.xml") at main.c:613
#11 0x00411309 in cib_init () at main.c:408
#12 0x0041064a in main (argc=1, argv=0x7fff942e0f58) at main.c:218
==

If it's a fresh install, cib.xml will not exist. Then why is it looking for this file on startup?

Sincerely,
Shravan

On Tue, Sep 28, 2010 at 10:24 AM, Shravan Mishra wrote:
> Sorry, forgot to attach my corosync.conf.
>
> =
> totem {
>     version: 2
>     # token: 3000
>     # token_retransmits_before_loss_const: 10
>     # join: 60
>     # consensus: 1500
>     # vsftype: none
>     # max_messages: 20
>     # clear_node_high_bit: yes
>     secauth: off
>     threads: 0
>     # rrp_mode: passive
>
>     interface {
>         ringnumber: 0
>         bindnetaddr: 192.168.2.0
>         # mcastaddr: 226.94.1.1
>         broadcast: yes
>         mcastport: 5405
>     }
>     # interface {
>     #     ringnumber: 1
>     #     bindnetaddr: 172.20.20.0
>     #     # mcastaddr: 226.94.1.1
>     #     broadcast: yes
>     #     mcastport: 5405
>     # }
> }
>
> logging {
>     fileline: off
>     to_stderr: yes
>     to_logfile: yes
>     to_syslog: yes
>     logfile: /tmp/corosync.log
>     debug: off
>     timestamp: on
>     logger_subsys {
>         subsys: AMF
>         debug: off
>     }
> }
>
> service {
>     name: pacemaker
>     ver: 0
> }
>
> aisexec {
>     user: root
>     group: root
> }
>
> amf {
>     mode: disabled
> }
> =
>
> On Tue, Sep 28, 2010 at 10:10 AM, Shravan Mishra wrote:
>> Hi Andrew,
>>
>> I'm attaching another log file, as I reflashed my machine and started
>> everything from scratch.
>> It looks like my old system got a little messed up while I was trying to
>> install the old HA libraries (corosync/pacemaker) that were initially
>> working for me.
>>
>> Here are the details:
>>
>> As of now I just want to see cib/attrd up, so I have only one machine
>> where I want to see things in a sane state.
>>
>> [r...@ha2 ~]# /usr/sbin/corosync -v
>> Corosync Cluster Engine, version '1.2.8' SVN revision '3035'
>> Copyright (c) 2006-2009 Red Hat, Inc.
>>
>> [r...@ha2 ~]# /usr/lib64/heartbeat/crmd version
>> CRM Version: 1.1.2 (e0d731c2b1be446b27a73327a53067bf6230fb6a)
>>
>> The Pacemaker version is 1.1; based on the above output, the release is
>> 1.1.2 if I understand correctly.
>>
>> This one is showing:
>>
>> Sep 27 12:30:45 corosync [pcmk ] ERROR: pcmk_wait_dispatch: Child
>> process cib terminated with signal 11 (pid=9216, core=false)
>>
>> Please find the corosync logs attached.
>>
>> Thanks,
>> Shravan
>>
>> On Tue, Sep 28, 2010 at 5:47 AM, Andrew Beekhof wrote:
>>> On Mon, Sep 27, 2010 at 6:26 AM, Shravan Mishra wrote:
>>>> Thanks Raoul for the response.
>>>>
>>>> Changing the permission to hacluster:haclient did stop that error.
>>>> Now I'm hitting another problem whereby cib is failing to start.
>>>
>>> Very strange logs.
>>> Which distribution is this?
>>> What does your corosync.conf look like?
>>>
>>>> =
>>>> Sep 27 00:16:29 corosync [pcmk ] info: update_member: Node ha2.itactics.com now has process list: 00110012 (1114130)
>>>> Sep 27 00:16:29 corosync [pcmk ] info: update_member: Node ha2.itactics.com now has 1 quorum votes (was 0)
>>>> Sep 27 00:16:29 corosync [pcmk ] info: send_member_notification: Sending membership update 100 to 0 children
>>>> Sep 27 00:16:29 corosync [MAIN ] Completed service synchronization, ready to provide service.
>>>> Sep 27 00:16:30 corosync [pcmk ] ERROR: pcmk_wait_disp
Re: [Pacemaker] Doc build issue
On Wed, Sep 29, 2010 at 3:58 PM, Vladislav Bogdanov wrote:
> Hi!
>
> This patch breaks the rpm build and seems to be unneeded (at least on F13):
> Italian docs are generated without it.

oh, is that why it keeps breaking. Thanks for investigating! :-)

> http://hg.clusterlabs.org/pacemaker/1.1/diff/ac25a4ecdbcb/doc/Clusters_from_Scratch/publican.cfg.in
>
> Symptoms:
> $ make Clusters_from_Scratch.txt
> Building Clusters_from_Scratch
> rm -rf Clusters_from_Scratch/publish/*
> cd Clusters_from_Scratch && /usr/bin/publican build --publish
> --langs=all --formats=html-desktop,txt
> Can't locate required file: ARRAY(0x3deccd8)/Book_Info.xml at
> /usr/bin/publican line 514
> make: *** [Clusters_from_Scratch.txt] Error 2
>
> Best,
> Vladislav
Re: [Pacemaker] cib
Some more info:

root   14170 14166  0 12:23 ?  00:00:00 /usr/lib64/heartbeat/stonithd
nobody 14172 14166  0 12:23 ?  00:00:00 /usr/lib64/heartbeat/lrmd
82     14173 14166  0 12:23 ?  00:00:00 /usr/lib64/heartbeat/attrd
82     14174 14166  0 12:23 ?  00:00:00 /usr/lib64/heartbeat/pengine
82     14175 14166  0 12:23 ?  00:00:00 /usr/lib64/heartbeat/crmd

lrmd is running as nobody when it should have been root. I'm not sure why that would happen.

Thanks,
Shravan

On Wed, Sep 29, 2010 at 10:29 AM, Shravan Mishra wrote:
> Hi,
>
> I did a bt on the core; this is what I found:
>
> [...snip: the gdb backtrace, corosync.conf and version details quoted from the previous messages in this thread...]
>
> If it's a fresh install, cib.xml will not exist.
> Then why is it looking for this file on startup?
>
> Sincerely,
> Shravan
[Pacemaker] Does bond0 network interface work with corosync/pacemaker
We have two nodes that have the IP address assigned to a bond0 network interface instead of the usual eth0 network interface. We are wondering if there are issues with trying to configure corosync/pacemaker with an IP assigned to a bond0 network interface. We are seeing that corosync/pacemaker starts on both nodes, but each node doesn't detect the other node in the cluster. We do have SELinux and the firewall shut off on both nodes. Any information would be helpful.

Thanks,
Mike

-
This e-mail message is intended only for the personal use of the recipient(s) named above. If you are not an intended recipient, you may not review, copy or distribute this message. If you have received this communication in error, please notify the CDS Global Help Desk (cdshelpd...@cds-global.com) immediately by e-mail and delete the original message.
-
Re: [Pacemaker] Does bond0 network interface work with corosync/pacemaker
Please paste your corosync conf; without the conf it is quite difficult to help you.

Cheers,
Pavlos
Re: [Pacemaker] Does bond0 network interface work with corosync/pacemaker
Here you go.

# Please read the corosync.conf.5 manual page
compatibility: whitetank

totem {
    version: 2
    secauth: off
    threads: 0
    interface {
        ringnumber: 0
        bindnetaddr: 172.26.2.167
        mcastaddr: 226.94.1.1
        mcastport: 5405
    }
}

logging {
    fileline: off
    to_stderr: no
    to_logfile: yes
    to_syslog: yes
    logfile: /var/log/cluster/corosync.log
    debug: off
    timestamp: on
    logger_subsys {
        subsys: AMF
        debug: off
    }
}

amf {
    mode: disabled
}

Mike
Re: [Pacemaker] Does bond0 network interface work with corosync/pacemaker
On 29.09.2010 19:59, Mike A Meyer wrote:
> We have two nodes that have the IP address assigned to a bond0 network
> interface instead of the usual eth0 network interface. [...] We are seeing
> that corosync/pacemaker will start on both nodes, but it doesn't detect
> other nodes in the cluster.

We run the cluster stuff on bonding devices (actually on a VLAN on top of a bond) and it works well. We use it in a two-node setup in round-robin mode; the nodes are connected back-to-back (i.e. no switch in between).

If you use bonding over a switch, check your bonding mode - round-robin just won't work. Try LACP if you have connected each node to a single switch, or if your switches support link aggregation across multiple devices (the cheaper ones won't). Try "active-backup" with multiple switches.

To check your configuration, use "ping" and check the "icmp_seq" in the replies. If some sequence number is missing, your setup is probably broken.

Ciao,
Andi
Re: [Pacemaker] Does bond0 network interface work with corosync/pacemaker
On 29 September 2010 21:01, Andreas Hofmeister wrote:
> On 29.09.2010 19:59, Mike A Meyer wrote:
> [...]
>
> If you use bonding over a switch, check your bonding mode - round-robin
> just won't work. Try LACP if you have connected each node to a single
> switch, or if your switches support link aggregation across multiple
> devices (the cheaper ones won't). Try "active-backup" with multiple
> switches.
>
> To check your configuration, use "ping" and check the "icmp_seq" in the
> replies. If some sequence number is missing, your setup is probably broken.

It is quite common to connect both interfaces of a bond to the same switch and then face issues. Mike, you need to tell us a bit more about the layer-2 connectivity and what it looks like.

We also use active-backup mode on our bond interfaces, but we use 2 switches and it works without any problem.

Cheers,
Pavlos
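The icmp_seq check suggested above can be automated. A small sketch that filters saved `ping` output and prints any sequence numbers that never got a reply (the capture below is fabricated for illustration, not from the thread):

```shell
# Print icmp_seq numbers that are missing between the first and last reply.
check_seq_gaps() {
    grep -o 'icmp_seq=[0-9]*' | cut -d= -f2 | awk '
        NR > 1 && $1 > prev + 1 { for (i = prev + 1; i < $1; i++) print i }
        { prev = $1 }'
}

# Example with a fabricated capture where reply 3 is missing:
printf '%s\n' \
    '64 bytes from 172.26.2.168: icmp_seq=1 ttl=64 time=0.2 ms' \
    '64 bytes from 172.26.2.168: icmp_seq=2 ttl=64 time=0.2 ms' \
    '64 bytes from 172.26.2.168: icmp_seq=4 ttl=64 time=0.3 ms' \
    | check_seq_gaps   # prints: 3
```

In practice you would run something like `ping -c 100 peer | check_seq_gaps`; per Andreas's advice, any output at all suggests the bonding setup is dropping frames.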
Re: [Pacemaker] stonith-ng message in /var/log/messages
Ron Kerry writes:
> I am seeing the following sequence of messages with every monitor interval for my stonith resource.
>
> Sep 28 10:44:01 genesis stonith-ng: [9493]: ERROR: run_stonith_agent: No timeout set for stonith
> operation monitor with device fence_legacy
> Sep 28 10:44:01 genesis stonith: l2network device OK.
>
> It is unclear to me what this ERROR means, as the resource itself says everything is fine. There is a
> monitor timeout set in the resource definition.
>
> Distribution is SLES11SP1 (SLE11SP1-HAE).
> cluster-glue 1.0.6-0.3.7

I'm seeing the same problem ever since the latest update rollup from Novell (the "sleshasp1-ha-update-201009" patch). Example:

Sep 29 16:28:35 imsxen3 stonith-ng: [5182]: ERROR: run_stonith_agent: No timeout set for stonith operation monitor with device fence_legacy
Sep 29 16:28:36 imsxen3 stonith: external/ipmi device OK.

I downgraded the cluster-glue package (and a couple of others, so RPM dependencies were still satisfied) on one machine, and the messages went away on that machine, while they're still there on the others.

To clarify: the "no timeout set" error is logged on the machine the stonith resource is currently running on, each time the monitor operation fires. On the machine I downgraded cluster-glue on, there are no such errors for any stonith resource running on that server.

My stonith definitions (in "crm configure" format) are like this:

primitive stonith-imsxen1 stonith:external/ipmi \
    meta target-role="Started" \
    operations $id="stonith-imsxen2-operations" \
    op monitor interval="300" timeout="15" start-delay="15" \
    params hostname="imsxen1" ipaddr="10.95.12.51" userid="stonith" passwd="" interface="lanplus"

and similarly for stonith-imsxen2 and stonith-imsxen3. (Node names are imsxen[123].) STONITH works properly, aside from the annoying messages with the latest version.
Here is the RPM version comparison (installed version | update version):

v | SLE11-HAE-SP1-Updates | cluster-glue   | 1.0.5-0.5.1  | 1.0.6-0.3.7  | x86_64
v | SLE11-HAE-SP1-Updates | libglue2       | 1.0.5-0.5.1  | 1.0.6-0.3.7  | x86_64
v | SLE11-HAE-SP1-Updates | libpacemaker3  | 1.1.2-0.2.1  | 1.1.2-0.6.1  | x86_64
v | SLE11-HAE-SP1-Updates | pacemaker      | 1.1.2-0.2.1  | 1.1.2-0.6.1  | x86_64
v | SLE11-HAE-SP1-Updates | pacemaker-mgmt | 2.0.0-0.2.19 | 2.0.0-0.3.10 | x86_64

I intentionally rolled back the cluster-glue package, and the others were rolled back to satisfy dependencies. According to the RPM changelog, the "good" version of cluster-glue (1.0.5-0.5.1) is from upstream cs: 6cf2e36df9f4, while the newer one is from cs: a146a145a3e. While it's possible this is a problem with Novell's builds, I don't think that's likely, since there are no local patches in the RPM spec file.
[Pacemaker] stop resource during promote
Is it ok to stop/start a resource during a promote? I'm setting up a master/slave set of resources. When a slave is promoted to master, I need to stop the resource, change a config file, then start it up in master mode.

Mark
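One way to structure such a promote is sketched below. This is purely illustrative: the `mode=` config key, the helper function names, and the throwaway config file are invented for the example, not any real agent's API. The promote action stops the daemon, rewrites its config for master mode, and starts it again.

```shell
#!/bin/sh
# Hypothetical promote step for a master/slave OCF agent: stop the
# daemon, rewrite its config for master mode, start it again.
# The "mode=" config key and the stop/start helpers are illustrative
# assumptions, not a real agent's API.

daemon_stop()  { :; }   # placeholder: stop the service here
daemon_start() { :; }   # placeholder: start the service here

promote() {
    config=$1
    daemon_stop                                 || return 1
    sed -i 's/^mode=.*/mode=master/' "$config"  || return 1
    daemon_start                                || return 1
    return 0    # maps to OCF_SUCCESS
}

# Demonstration against a throwaway config file:
CONFIG=$(mktemp)
echo 'mode=slave' > "$CONFIG"
promote "$CONFIG"
```

The main caveat is that the cluster considers the resource running throughout, so the monitor action must not fire (or must still succeed) during the brief stop window; keep the restart comfortably shorter than the promote timeout.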
Re: [Pacemaker] About behavior in "Action Lost".
Hi Andrew,

> Sorry, it probably got rebased before I pushed it.
>
> http://hg.clusterlabs.org/pacemaker/1.1/rev/dd8e37df3e96 should be the right link

Thanks!!

Hideo Yamauchi.

--- Andrew Beekhof wrote:
> Sorry, it probably got rebased before I pushed it.
>
> http://hg.clusterlabs.org/pacemaker/1.1/rev/dd8e37df3e96 should be the right link
>
> On Wed, Sep 29, 2010 at 2:51 AM, wrote:
> > Hi Andrew,
> >
> >> Pushed as:
> >>    http://hg.clusterlabs.org/pacemaker/1.1/rev/8433015faf18
> >>
> >> Not sure about applying to 1.0 though, it's a dramatic change in behavior.
> >
> > The change at this link is not found.
> > Where did you update it?
> >
> > Best Regards,
> > Hideo Yamauchi.
> >
> > --- Andrew Beekhof wrote:
> >
> >> Pushed as:
> >>    http://hg.clusterlabs.org/pacemaker/1.1/rev/8433015faf18
> >>
> >> Not sure about applying to 1.0 though, it's a dramatic change in behavior.
> >>
> >> On Wed, Sep 22, 2010 at 11:18 AM, wrote:
> >> > Hi Andrew,
> >> >
> >> > Thank you for the comment.
> >> >
> >> >> A long time ago in a galaxy far away, some messaging layers used to
> >> >> lose quite a few actions, including stops.
> >> >> About the same time, we decided that fencing because a stop action was
> >> >> lost wasn't a good idea.
> >> >>
> >> >> The rationale was that if the operation eventually completed, it would
> >> >> end up in the CIB anyway.
> >> >> And even if it didn't, the PE would continue to try the operation
> >> >> again until the whole node fell over, at which point it would get shot
> >> >> anyway.
> >> >
> >> > Sorry...
> >> > I did not know that there had been such a discussion in the old days.
> >> >
> >> >> Now, having said that, things have improved since then and perhaps,
> >> >> in the interest of speeding up recovery in these situations, it is time
> >> >> to stop treating stop operations differently.
> >> >> Would you agree?
> >> >
> >> > That means you would change it so that, in the case of an "Action Lost" for a stop operation, stonith is carried out this time?
> >> > If my understanding is right, I agree too.
> >> >
> >> >     if(timer->action->type != action_type_rsc) {
> >> >         send_update = FALSE;
> >> >     } else if(safe_str_eq(task, "cancel")) {
> >> >         /* we dont need to update the CIB with these */
> >> >         send_update = FALSE;
> >> >     }
> >> >     ---> delete "else if(safe_str_eq(task, "stop")){..}" ?
> >> >
> >> >     if(send_update) {
> >> >         /* cib_action_update(timer->action, LRM_OP_PENDING, EXECRA_STATUS_UNKNOWN); */
> >> >         cib_action_update(timer->action, LRM_OP_TIMEOUT, EXECRA_UNKNOWN_ERROR);
> >> >     }
> >> >
> >> > Best Regards,
> >> > Hideo Yamauchi.
> >> >
> >> > --- Andrew Beekhof wrote:
> >> >
> >> >> On Tue, Sep 21, 2010 at 8:59 AM, wrote:
> >> >> > Hi,
> >> >> >
> >> >> > A node was in a state where the load was very high, and we were observing Pacemaker's monitor behavior.
> >> >> > "Action Lost" occurred for the stop operation after the monitor error occurred.
> >> >> >
> >> >> > Sep  8 20:02:22 cgl54 crmd: [3507]: ERROR: print_elem: Aborting transition, action lost: [Action 9]: In-flight (id: prmApPostgreSQLDB1_stop_0, loc: cgl49, priority: 0)
> >> >> > Sep  8 20:02:22 cgl54 crmd: [3507]: info: abort_transition_graph: action_timer_callback:486 - Triggered transition abort (complete=0) : Action lost
> >> >> >
> >> >> > We think the stop operation did not go well because of the load on the node.
> >> >> > But the node cannot execute stonith.
> >> >>
> >> >> A long time ago in a galaxy far away, some messaging layers used to
> >> >> lose quite a few actions, including stops.
> >> >> About the same time, we decided that fencing because a stop action was
> >> >> lost wasn't a good idea.
> >> >>
> >> >> The rationale was that if the operation eventually completed, it would
> >> >> end up in the CIB anyway.
> >> >> And even if it didn't, the PE would continue to try the operation
> >> >> again until the whole node fell over, at which point it would get shot
> >> >> anyway.
> >> >>
> >> >> Now, having said that, things have improved since then and perhaps,
> >> >> in the interest of speeding up recovery in these situations, it is time
> >> >> to stop treating stop operations differently.
> >> >> Would you agree?
Re: [Pacemaker] /etc/hosts
Thanks for the help. We have a limited range of IP addresses. What I've decided to do is just add our whole range of IPs to the hosts file on each machine, and name each host based on its IP. Then, as we dynamically add nodes, they will already be in the hosts file.

Mark

On Tue, Sep 28, 2010 at 8:37 AM, Tim Serong wrote:
> On 9/28/2010 at 07:29 PM, Andrew Beekhof wrote:
>> On Tue, Sep 28, 2010 at 6:05 AM, Mark Horton wrote:
>> > Hello,
>> > I was wondering what side effects occur if you don't add all the cluster nodes to the /etc/hosts file on each node?
>> >
>> > I'd also be interested in hearing how others keep the hosts file in sync. For example, let's say you have 3 nodes, and 1 node is currently down. Then you add a 4th node, but you can't update the hosts file of the down node. So you must remember to do it when it comes back up. I was trying to see if there was an automated way to keep them in sync in case we forget to update the hosts file on the down node.
>>
>> Pacemaker doesn't care, but your messaging layer (corosync or heartbeat) might.
>> If the node that is down has no other way to find out the address of the new node, and the cluster is configured to start automatically when the machine boots, then you might have a problem.
>
> You might find csync2[1] useful. You can use this to synchronize config files across a cluster. Assuming you've configured it to sync /etc/hosts, any time you edit /etc/hosts on one node, run "csync2 -x" and it will magically sync the changes out to the other nodes in your cluster.
> It's a smart manual push mechanism, not something that runs continuously in the background, but it's a hell of a lot better than scp and having to remember where to copy what to, and when :)
>
> There's a little section on csync2 in the SLE HAE Guide under "Transferring the Configuration to All Nodes" at:
> http://www.novell.com/documentation/sle_ha/book_sleha/?page=/documentation/sle_ha/book_sleha/data/sec_ha_installation_setup.html
>
> HTH
>
> Tim
>
> [1] http://oss.linbit.com/csync2/
>
> --
> Tim Serong
> Senior Clustering Engineer, OPS Engineering, Novell Inc.
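For reference, a minimal csync2 configuration for keeping /etc/hosts in sync might look like the fragment below. The group and host names are invented for illustration; key generation and per-host setup are covered in the csync2 documentation linked above.

```
# /etc/csync2/csync2.cfg -- illustrative fragment; node names are examples
group ha_cluster {
    host node1 node2 node3 node4;
    key  /etc/csync2/csync2.key;
    include /etc/hosts;
}
```

After editing /etc/hosts on any node, running "csync2 -x" on that node pushes the change to the other members of the group.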
Re: [Pacemaker] Monitor ops do not get cancelled
On Tue, Sep 28, 2010 at 2:55 PM, Phil Armstrong wrote:
>> From Andrew Beekhof
>> 1.1.3 came out the other day.
>> which distro are you using?
>
> I'm not sure if this answers your question:
>
> novell/sles/updates/SLE11-HAE-SP1-Updates/sle-11-ia64

hmm, that doesn't tell me much about what's in that version of pacemaker. Could you show me the result of:

crm_report --version

>> Hmm, which version of cluster-glue do you have? This sounds like it might be related to
>>
>> dejan () High: LRM: lrmd: don't allow cancelled operations to get back to the repeating op list (lf#2417) CS: fc141b7e1e19 On: 2010-06-10
>>
>> which first appeared in cluster-glue 1.0.6 IIRC
>
> As luck would have it,
> su pry -> rpm -q cluster-glue
> cluster-glue-1.0.6-0.3.7

Dejan - do you know if that changeset is in that version? If so, we then need to make sure the relevant pacemaker changes are also there.
Re: [Pacemaker] stonith-ng message in /var/log/messages
On Wed, Sep 29, 2010 at 11:57 PM, Andrew Daugherity wrote:
> Ron Kerry writes:
>> I am seeing the following sequence of messages with every monitor interval for my stonith resource.
>>
>> Sep 28 10:44:01 genesis stonith-ng: [9493]: ERROR: run_stonith_agent: No timeout set for stonith operation monitor with device fence_legacy
>> Sep 28 10:44:01 genesis stonith: l2network device OK.
>>
>> It is unclear to me what this ERROR means as the resource itself says everything is fine. There is a monitor timeout set in the resource definition.
>>
>> Distribution is SLES11SP1 (SLE11SP1-HAE).
>> cluster-glue 1.0.6-0.3.7
>
> I'm seeing the same problem ever since the latest update rollup from Novell (the "sleshasp1-ha-update-201009" patch). Example:
> Sep 29 16:28:35 imsxen3 stonith-ng: [5182]: ERROR: run_stonith_agent: No timeout set for stonith operation monitor with device fence_legacy
> Sep 29 16:28:36 imsxen3 stonith: external/ipmi device OK.

I believe it's been fixed upstream; I guess Novell needs to apply the other half of the patch.

> I downgraded the cluster-glue package (and a couple others, so RPM dependencies were still satisfied) on one machine and the messages went away on that machine, while they're still there on the others.
>
> To clarify -- the "no timeout set" error is logged on the machine the stonith resource is currently running on, each time the monitor operation fires. On the machine I downgraded cluster-glue on, there are no such errors for any stonith resource running on that server.
>
> My stonith definitions (in "crm configure" format) are like this:
> primitive stonith-imsxen1 stonith:external/ipmi \
>     meta target-role="Started" \
>     operations $id="stonith-imsxen2-operations" \
>     op monitor interval="300" timeout="15" start-delay="15" \
>     params hostname="imsxen1" ipaddr="10.95.12.51" userid="stonith" passwd="" interface="lanplus"
> and similarly for stonith-imsxen2 and stonith-imsxen3.
> (Node names are imsxen[123].)
>
> STONITH works properly, aside from the annoying messages with the latest version.
>
> Here is the RPM version comparison:
> v | SLE11-HAE-SP1-Updates | cluster-glue   | 1.0.5-0.5.1  | 1.0.6-0.3.7  | x86_64
> v | SLE11-HAE-SP1-Updates | libglue2       | 1.0.5-0.5.1  | 1.0.6-0.3.7  | x86_64
> v | SLE11-HAE-SP1-Updates | libpacemaker3  | 1.1.2-0.2.1  | 1.1.2-0.6.1  | x86_64
> v | SLE11-HAE-SP1-Updates | pacemaker      | 1.1.2-0.2.1  | 1.1.2-0.6.1  | x86_64
> v | SLE11-HAE-SP1-Updates | pacemaker-mgmt | 2.0.0-0.2.19 | 2.0.0-0.3.10 | x86_64
>
> I intentionally rolled back the cluster-glue package, and the others were rolled back to satisfy dependencies. According to the RPM changelog, the "good" version of cluster-glue (1.0.5-0.5.1) is from upstream cs: 6cf2e36df9f4, while the newer one is from cs: a146a145a3e.
>
> While it's possible this is a problem with Novell's builds, I don't think that's likely, since there are no local patches in the RPM spec file.
Re: [Pacemaker] starting a xen-domU depending on available hardware-resources using SysInfo-RA
Hi Dejan,

it's working fine with the amount of free RAM as the score and a bigger default-resource-stickiness:

primitive v01 ocf:heartbeat:Xen \
	params xmfile="/etc/xen/conf.d/v01.cfg" \
	op monitor interval="30s" timeout="30s" \
	op start interval="0" timeout="60s" \
	op stop interval="0" timeout="40s" \
	meta allow-migrate="true" target-role="Started"
primitive v02 ocf:heartbeat:Xen \
	params xmfile="/etc/xen/conf.d/v02.cfg" \
	op monitor interval="30s" timeout="30s" \
	op start interval="0" timeout="60s" \
	op stop interval="0" timeout="40s" \
	meta allow-migrate="true" target-role="Started"
primitive v03 ocf:heartbeat:Xen \
	params xmfile="/etc/xen/conf.d/v03.cfg" \
	op monitor interval="30s" timeout="30s" \
	op start interval="0" timeout="60s" \
	op stop interval="0" timeout="40s" \
	meta allow-migrate="true" target-role="Started"
location RAM01-v01 v01 \
	rule $id="loc-resv01-rule" ram_free: ram_free gt 6000
location RAM01-v02 v02 \
	rule $id="loc-resv02-rule" ram_free: ram_free gt 3000
location RAM01-v03 v03 \
	rule $id="RAM01-v03-rule" ram_free: ram_free gt 1000
property $id="cib-bootstrap-options" \
	dc-version="1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b" \
	cluster-infrastructure="openais" \
	expected-quorum-votes="4" \
	stonith-enabled="false" \
	default-resource-stickiness="16000" \
	last-lrm-refresh="1285761587"

thanks!

On 09/28/2010 12:18 PM, Dejan Muhamedagic wrote: Hi, On Tue, Sep 28, 2010 at 11:00:18AM +0200, Sascha Reimann wrote: howdy!
I'm trying to configure a resource (xen domU) that could start on 2 nodes (preferred on node server01):

primitive v01 ocf:heartbeat:Xen \
	params xmfile="/etc/xen/conf.d/v01.cfg" allow-migrate="true"
location loc-v01p v01 200: server01
location loc-v01s v01 100: server02

That's working fine so far, but I want to ensure that there are enough hardware resources available on server01, so I've set up a modified SysInfo RA to put the ram_total and ram_free values from xen (xm info|awk '/free_memory/ {print $3}') into the status information of the CIB:

server01:~$ cibadmin -Q -o status|grep status-server01-ram

This is working fine, too. BUT: when I create a rule like the one below, the xen domU keeps restarting (or moving to server02, where the same happens), which is correct, since the SysInfo RA updates the status information to value="0" after a start and back to value="2000" after a stop in this example.

location loc-resv01 v01 \
	rule $id="loc-resv01-rule" -inf: ram_free lt 2000

An interesting issue :-) Well, you can introduce resource stickiness and use that to outweigh the negative score coming from the lack of memory (use something less than inf). You may also consider using the amount of free memory as a score.

HTH,
Dejan

Can anybody help?

--
Sascha Reimann
Hostway Deutschland GmbH
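The modified SysInfo RA in this thread essentially boils down to parsing `xm info` and publishing the result as a transient node attribute that the ram_free location rules can score against. The sketch below isolates the parsing half (the awk expression is taken from the message above; the attrd_updater line in the comment is shown only as the usual way such attributes are set, not as part of the stock SysInfo agent):

```shell
#!/bin/sh
# Extract free_memory (in MiB) from `xm info` output, as the
# modified SysInfo RA described above does.
ram_free_from_xm() {
    awk '/free_memory/ {print $3}'
}

# In the real agent this would run against live output:
#   value=$(xm info | ram_free_from_xm)
# and then be pushed into the CIB status section with e.g.:
#   attrd_updater -n ram_free -v "$value"

# Demonstration with canned `xm info` output:
printf 'total_memory           : 16384\nfree_memory            : 6231\n' | ram_free_from_xm
```

With the attribute in place, a rule such as `rule $id="loc-resv01-rule" ram_free: ram_free gt 6000` turns the free-memory figure directly into a placement score, which is the scheme Dejan suggested.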