Re: [ClusterLabs] Pacemaker documentation license clarification
Ken Gaillot writes:

> Currently, the brand is specified in each book's publican.cfg (which is
> generated by configure, and can be edited by "make www-cli"). It works,
> so realistically it's a low priority to improve it, given everything
> else on the plate.

Well, it's not pretty to say the least, but I don't think I have to touch that part.

> You're welcome to submit a pull request to change it to use the local
> brand directory.

Done, it's part of https://github.com/ClusterLabs/pacemaker/pull/876. That pull request contains three independent patches; feel free to cherry-pick only part of it if you find anything objectionable.

> Be sure to consider that each book comes in multiple formats (and
> potentially translations, though they're out of date at this point,
> which is a whole separate discussion worth raising at some point), and
> add anything generated to .gitignore.

I think this minimal change won't cause problems with other formats or translations. I forgot about gitignoring the xsl symlink, though; I can add that after the initial review.

--
Regards,
Feri.

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Re: [ClusterLabs] Anyone successfully install Pacemaker/Corosync on Freebsd?
Christine Caulfield wrote:

On 21/12/15 16:12, Ken Gaillot wrote:
On 12/19/2015 04:56 PM, mike wrote:

Hi All, just curious if anyone has had any luck at one point installing Pacemaker and Corosync on FreeBSD. I have to install from source, of course, and I've run into an issue when running ./configure while trying to install Corosync. The process craps out at nss with this error:

FYI, Ruben Kerkhof has done some recent work to get the FreeBSD build working. It will go into the next 1.1.14 release candidate. In the meantime, make sure you have the very latest code from upstream's 1.1 branch.

I also strongly recommend using the latest (from git) version of libqb, as it has some FreeBSD bugs fixed in it. We plan to do a proper release of this in the new year. The same also applies to corosync. Use git and it should work (even with clang).

Honza

Chrissie

checking for nss... configure: error: in `/root/heartbeat/corosync-2.3.3':
configure: error: The pkg-config script could not be found or is too old. Make sure it is in your PATH or set the PKG_CONFIG environment variable to the full path to pkg-config. Alternatively, you may set the environment variables nss_CFLAGS and nss_LIBS to avoid the need to call pkg-config. See the pkg-config man page for more details.

I've looked unsuccessfully for a package called pkg-config, and nss appears to be installed, as you can see from this output:

root@wellesley:~/heartbeat/corosync-2.3.3 # pkg install nss
Updating FreeBSD repository catalogue...
FreeBSD repository is up-to-date.
All repositories are up-to-date.
Checking integrity... done (0 conflicting)
The most recent version of packages are already installed

Anyway - just looking for any suggestions. Hoping that perhaps someone has successfully done this.
thanks in advance -mgb
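For later readers hitting the same configure error: the error message itself names two ways out — install pkg-config, or feed configure the nss flags directly. A sketch follows; the package name, include path, and library list are assumptions for a typical FreeBSD /usr/local layout, so verify them against your own tree before use.

```shell
# Option 1 (preferred): install pkgconf, which provides pkg-config on FreeBSD:
#   pkg install pkgconf
#
# Option 2: bypass pkg-config for nss by exporting the flags configure asks
# for. These paths are typical FreeBSD locations, not verified on your host:
export nss_CFLAGS="-I/usr/local/include/nss/nss"
export nss_LIBS="-L/usr/local/lib -lnss3 -lnssutil3 -lsmime3 -lssl3"
# then re-run: ./configure
```

With either option, re-running ./configure should get past the nss check; if it still fails, the config.log will show which header or library was not found.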
Re: [ClusterLabs] Asking for a new DLM release
Ferenc Wagner writes:

> DLM 4.0.2 was released on 2013-07-31. The Git repo accumulated some
> fixes since then, which would be nice to have in a proper release.

By the way, I offer https://github.com/wferi/dlm/commits/upstream-patches for merging or cherry-picking into upstream. And if I'm hitting the wrong forum with this DLM topic, please advise me.

--
Thanks,
Feri.
Re: [ClusterLabs] [OCF] Pacemaker reports a multi-state clone resource instance as running while it is not in fact
On 01/04/2016 09:25 AM, Bogdan Dobrelya wrote:
> On 04.01.2016 15:50, Bogdan Dobrelya wrote:
>> So far so bad.
>> I made a dummy OCF script [0] to simulate an example
>> promote/demote/notify failure mode for a multi-state clone resource,
>> very similar to the one I reported originally. The test to reproduce
>> my case with the dummy is:
>> - install the dummy OCF RA and create the dummy resource as the
>>   README [0] says
>> - watch a) the OCF logs from the dummy and b) the output of the
>>   recurring commands:
>>
>> # while true; do date; ls /var/lib/heartbeat/trace_ra/dummy/ | tail -1;
>> sleep 20; done&
>> # crm_resource --resource p_dummy --list-operations
>>
>> At some point I noticed:
>> - there are no more "OK" messages logged from the monitor actions,
>> although according to the trace_ra dumps' timestamps, all monitors are
>> still being invoked!
>>
>> - at some point I noticed very strange results reported by:
>> # crm_resource --resource p_dummy --list-operations
>> p_dummy (ocf::dummy:dummy): FAILED : p_dummy_monitor_103000
>> (node=node-1.test.domain.local, call=579, rc=1, last-rc-change=Mon Jan
>> 4 14:33:07 2016, exec=62107ms): Timed Out
>> or
>> p_dummy (ocf::dummy:dummy): Started : p_dummy_monitor_103000
>> (node=node-3.test.domain.local, call=-1, rc=1, last-rc-change=Mon Jan 4
>> 14:43:58 2016, exec=0ms): Timed Out
>>
>> - according to the trace_ra dumps, recurring monitors are being invoked
>> at intervals *much longer* than configured. For example, 7 minutes of
>> "monitoring silence":
>> Mon Jan 4 14:47:46 UTC 2016
>> p_dummy.monitor.2016-01-04.14:40:52
>> Mon Jan 4 14:48:06 UTC 2016
>> p_dummy.monitor.2016-01-04.14:47:58
>>
>> Given all that, it is very likely that some bug exists in the
>> monitoring of multi-state clones in pacemaker!
>>
>> [0] https://github.com/bogdando/dummy-ocf-ra
>>

> Also note that lrmd spawns *many* monitors like:
> root  6495  0.0  0.0  70268  1456 ?  Ss   2015   4:56  \_ /usr/lib/pacemaker/lrmd
> root 31815  0.0  0.0   4440   780 ?  S    15:08  0:00  |  \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
> root 31908  0.0  0.0   4440   388 ?  S    15:08  0:00  |      \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
> root 31910  0.0  0.0   4440   384 ?  S    15:08  0:00  |      \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
> root 31915  0.0  0.0   4440   392 ?  S    15:08  0:00  |      \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
> ...

At first glance, that looks like your monitor action is calling itself recursively, but I don't see how in your code.

> At some point, there was already. Then I unmanaged the p_dummy, but it
> grew to 2403 after that. The number of running monitors may grow or
> decrease as well.
> Also, /var/lib/heartbeat/trace_ra/dummy/ is still being populated with
> new p_dummy.monitor* files with recent timestamps. Why?..
>
> If I pkill -9 all dummy monitors, lrmd spawns another ~2000 almost
> instantly :) Unless the node became unresponsive at some point. And
> after a restart by power-off:
> # crm_resource --resource p_dummy --list-operations
> p_dummy (ocf::dummy:dummy): Started (unmanaged) :
> p_dummy_monitor_3 (node=node-1.test.domain.local, call=679, rc=1,
> last-rc-change=Mon Jan 4 15:04:25 2016, exec=66747ms): Timed Out
> or
> p_dummy (ocf::dummy:dummy): Stopped (unmanaged) :
> p_dummy_monitor_103000 (node=node-3.test.domain.local, call=142, rc=1,
> last-rc-change=Mon Jan 4 15:14:59 2016, exec=65237ms): Timed Out
>
> And then lrmd repeats all of the fun again.
Re: [ClusterLabs] [OCF] Pacemaker reports a multi-state clone resource instance as running while it is not in fact
On 04.01.2016 16:36, Ken Gaillot wrote:
> On 01/04/2016 09:25 AM, Bogdan Dobrelya wrote:
>> On 04.01.2016 15:50, Bogdan Dobrelya wrote:
>>> So far so bad.
>>> I made a dummy OCF script [0] to simulate an example
>>> promote/demote/notify failure mode for a multi-state clone resource,
>>> very similar to the one I reported originally. The test to reproduce
>>> my case with the dummy is:
>>> - install the dummy OCF RA and create the dummy resource as the
>>>   README [0] says
>>> - watch a) the OCF logs from the dummy and b) the output of the
>>>   recurring commands:
>>>
>>> # while true; do date; ls /var/lib/heartbeat/trace_ra/dummy/ | tail -1;
>>> sleep 20; done&
>>> # crm_resource --resource p_dummy --list-operations
>>>
>>> At some point I noticed:
>>> - there are no more "OK" messages logged from the monitor actions,
>>> although according to the trace_ra dumps' timestamps, all monitors are
>>> still being invoked!
>>>
>>> - at some point I noticed very strange results reported by:
>>> # crm_resource --resource p_dummy --list-operations
>>> p_dummy (ocf::dummy:dummy): FAILED : p_dummy_monitor_103000
>>> (node=node-1.test.domain.local, call=579, rc=1, last-rc-change=Mon Jan
>>> 4 14:33:07 2016, exec=62107ms): Timed Out
>>> or
>>> p_dummy (ocf::dummy:dummy): Started : p_dummy_monitor_103000
>>> (node=node-3.test.domain.local, call=-1, rc=1, last-rc-change=Mon Jan 4
>>> 14:43:58 2016, exec=0ms): Timed Out
>>>
>>> - according to the trace_ra dumps, recurring monitors are being invoked
>>> at intervals *much longer* than configured. For example, 7 minutes of
>>> "monitoring silence":
>>> Mon Jan 4 14:47:46 UTC 2016
>>> p_dummy.monitor.2016-01-04.14:40:52
>>> Mon Jan 4 14:48:06 UTC 2016
>>> p_dummy.monitor.2016-01-04.14:47:58
>>>
>>> Given all that, it is very likely that some bug exists in the
>>> monitoring of multi-state clones in pacemaker!
>>>
>>> [0] https://github.com/bogdando/dummy-ocf-ra
>>>
>>
>> Also note that lrmd spawns *many* monitors like:
>> root  6495  0.0  0.0  70268  1456 ?  Ss   2015   4:56  \_ /usr/lib/pacemaker/lrmd
>> root 31815  0.0  0.0   4440   780 ?  S    15:08  0:00  |  \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
>> root 31908  0.0  0.0   4440   388 ?  S    15:08  0:00  |      \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
>> root 31910  0.0  0.0   4440   384 ?  S    15:08  0:00  |      \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
>> root 31915  0.0  0.0   4440   392 ?  S    15:08  0:00  |      \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
>> ...
>
> At first glance, that looks like your monitor action is calling itself
> recursively, but I don't see how in your code.

Yes, it should be a bug in ocf-shellfuncs' ocf_log(). If I replace it in the dummy RA with:

#. ${OCF_FUNCTIONS_DIR}/ocf-shellfuncs
ocf_log() {
    logger $HA_LOGFACILITY -t $HA_LOGTAG "$@"
}

there is no such issue anymore, and I see the "It's OK" log messages as expected.

Note: I used resource-agents 3.9.5+git+a626847-1 from [0].

[0] http://ftp.de.debian.org/debian/ experimental/main amd64 Packages

>> At some point, there was already. Then I unmanaged the p_dummy, but it
>> grew to 2403 after that. The number of running monitors may grow or
>> decrease as well.
>> Also, /var/lib/heartbeat/trace_ra/dummy/ is still being populated with
>> new p_dummy.monitor* files with recent timestamps. Why?..
>>
>> If I pkill -9 all dummy monitors, lrmd spawns another ~2000 almost
>> instantly :) Unless the node became unresponsive at some point. And
>> after a restart by power-off:
>> # crm_resource --resource p_dummy --list-operations
>> p_dummy (ocf::dummy:dummy): Started (unmanaged) :
>> p_dummy_monitor_3 (node=node-1.test.domain.local, call=679, rc=1,
>> last-rc-change=Mon Jan 4 15:04:25 2016, exec=66747ms): Timed Out
>> or
>> p_dummy (ocf::dummy:dummy): Stopped (unmanaged) :
>> p_dummy_monitor_103000 (node=node-3.test.domain.local, call=142, rc=1,
>> last-rc-change=Mon Jan 4 15:14:59 2016, exec=65237ms): Timed Out
>>
>> And then lrmd repeats all of the fun again.

--
Best regards,
Bogdan Dobrelya,
Irc #bogdando
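For anyone who wants to try the workaround above in isolation, here is a self-contained sketch of a non-recursive ocf_log stand-in. It echoes instead of calling logger(1) so it can run anywhere; the severity-first calling convention mirrors ocf-shellfuncs, but the HA_LOGTAG default and the output format are illustrative assumptions, not the real lrmd environment.

```shell
# Non-recursive ocf_log stand-in (sketch). On a real node you would
# replace the echo with:  logger -t "$HA_LOGTAG" -- "$@"
HA_LOGTAG=${HA_LOGTAG:-dummy-ra}

ocf_log() {
    # First argument is the severity (info/warn/err), mirroring the
    # ocf-shellfuncs calling convention; the rest is the message.
    sev=$1; shift
    echo "$HA_LOGTAG [$sev]: $*"
}

ocf_log info "It's OK"
# prints: dummy-ra [info]: It's OK
```

The point of the exercise is that this function never calls back into ocf-shellfuncs, so whatever recursion exists in the packaged ocf_log cannot occur here.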
[ClusterLabs] [Q] Cluster failovers too soon
<I am resending this mail because of the clusterlabs outage during the weekend, a received error message, and my time limit until next week>

Hello guys, happy new year to all of you!

I have a little (understanding?) problem regarding Heartbeat/Pacemaker and deadtime/timeout. I know that corosync is the way to go, but at the moment I have a heartbeat cluster and need to adjust the time before a failover is initiated.

My cluster and resources completely ignore the raised heartbeat deadtime and the timeouts in the pacemaker resource definitions. When I shut a node off, it is shown as offline and the services are failed over after 4-9 seconds. But I want 20 seconds.

What do I have to adjust to make the cluster fail over after roughly 20 seconds instead of 9? Am I missing a parameter apart from deadtime (deadping) and timeout?

Every hint would be a great help! Thank you very much

Sebish

Config:
--
/etc/heartbeat/ha.cf:
...
keepalive 2
warntime 6
deadtime 20
initdead 60
...

crm (pacemaker):
node $id="6acc2585-b49b-4b0f-8b2a-8561cceb8b83" nodec
node $id="891a8209-5e1a-40b6-8d72-8458a851bb9a" kamailioopenhab2
node $id="fd898711-4c76-4d00-941c-4528e174533c" kamailioopenhab1
primitive ClusterMon ocf:pacemaker:ClusterMon \
    params user="root" update="30" extra_options="-E /usr/lib/ocf/resource.d/*myname*/*script*.sh" \
    op monitor interval="10" timeout="40" on-fail="restart"
primitive FailoverIP ocf:heartbeat:IPaddr2 \
    params ip="*ClusterIP*" cidr_netmask="18" \
    op monitor interval="2s" timeout="20"
primitive Openhab lsb:openhab \
    meta target-role="Started" \
    op monitor interval="2s" timeout="20"
primitive Ping ocf:pacemaker:ping \
    params host_list="*ClusterIP*" multiplier="100" \
    op monitor interval="2s" timeout="20"
location ClusterMon_LocationA ClusterMon -inf: kamailioopenhab1
location ClusterMon_LocationB ClusterMon 10: kamailioopenhab2
location ClusterMon_LocationC ClusterMon inf: nodec
location FailoverIP_LocationA FailoverIP 20: kamailioopenhab1
location FailoverIP_LocationB FailoverIP 10: kamailioopenhab2
location FailoverIP_LocationC FailoverIP -inf: nodec
colocation Services_Colocation inf: FailoverIP Kamailio Openhab
property $id="cib-bootstrap-options" \
    dc-version="1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff" \
    cluster-infrastructure="Heartbeat" \
    expected-quorum-votes="2" \
    last-lrm-refresh="1451669632" \
    stonith-enabled="false" \
    no-quorum-policy="ignore"
rsc_defaults $id="rsc-options" \
    resource-stickiness="100"
--
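For reference, the heartbeat timing knobs in ha.cf interact roughly like this (the annotations are a sketch based on common heartbeat usage, with the values taken from the config above). Note two things that often trip people up: the resource-level `op ... timeout` values govern how long an operation may run, not how quickly a dead node is detected; and ha.cf is read at daemon start, so a raised deadtime only takes effect once heartbeat has been restarted (or reloaded) with an identical ha.cf on every node — a stale copy on one node is a common reason the change appears to be ignored.

```
# /etc/heartbeat/ha.cf -- timing sketch
keepalive 2    # send a heartbeat packet every 2 seconds
warntime  6    # log a "late heartbeat" warning after 6 seconds of silence
deadtime  20   # declare the peer dead after 20 seconds of silence
initdead  60   # extra allowance for the first heartbeat after boot
```

If failover still happens after 4-9 seconds with this file in place everywhere, it is worth checking the logs for which node actually declared the peer dead and what deadtime it reports at startup.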
Re: [ClusterLabs] [OCF] Pacemaker reports a multi-state clone resource instance as running while it is not in fact
On 04.01.2016 15:50, Bogdan Dobrelya wrote:
> So far so bad.
> I made a dummy OCF script [0] to simulate an example
> promote/demote/notify failure mode for a multi-state clone resource,
> very similar to the one I reported originally. The test to reproduce
> my case with the dummy is:
> - install the dummy OCF RA and create the dummy resource as the
>   README [0] says
> - watch a) the OCF logs from the dummy and b) the output of the
>   recurring commands:
>
> # while true; do date; ls /var/lib/heartbeat/trace_ra/dummy/ | tail -1;
> sleep 20; done&
> # crm_resource --resource p_dummy --list-operations
>
> At some point I noticed:
> - there are no more "OK" messages logged from the monitor actions,
> although according to the trace_ra dumps' timestamps, all monitors are
> still being invoked!
>
> - at some point I noticed very strange results reported by:
> # crm_resource --resource p_dummy --list-operations
> p_dummy (ocf::dummy:dummy): FAILED : p_dummy_monitor_103000
> (node=node-1.test.domain.local, call=579, rc=1, last-rc-change=Mon Jan
> 4 14:33:07 2016, exec=62107ms): Timed Out
> or
> p_dummy (ocf::dummy:dummy): Started : p_dummy_monitor_103000
> (node=node-3.test.domain.local, call=-1, rc=1, last-rc-change=Mon Jan 4
> 14:43:58 2016, exec=0ms): Timed Out
>
> - according to the trace_ra dumps, recurring monitors are being invoked
> at intervals *much longer* than configured. For example, 7 minutes of
> "monitoring silence":
> Mon Jan 4 14:47:46 UTC 2016
> p_dummy.monitor.2016-01-04.14:40:52
> Mon Jan 4 14:48:06 UTC 2016
> p_dummy.monitor.2016-01-04.14:47:58
>
> Given all that, it is very likely that some bug exists in the
> monitoring of multi-state clones in pacemaker!
>
> [0] https://github.com/bogdando/dummy-ocf-ra
>

Also note that lrmd spawns *many* monitors like:

root  6495  0.0  0.0  70268  1456 ?  Ss   2015   4:56  \_ /usr/lib/pacemaker/lrmd
root 31815  0.0  0.0   4440   780 ?  S    15:08  0:00  |  \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
root 31908  0.0  0.0   4440   388 ?  S    15:08  0:00  |      \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
root 31910  0.0  0.0   4440   384 ?  S    15:08  0:00  |      \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
root 31915  0.0  0.0   4440   392 ?  S    15:08  0:00  |      \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
...

At some point, there was already. Then I unmanaged the p_dummy, but it grew to 2403 after that. The number of running monitors may grow or decrease as well. Also, /var/lib/heartbeat/trace_ra/dummy/ is still being populated with new p_dummy.monitor* files with recent timestamps. Why?..

If I pkill -9 all dummy monitors, lrmd spawns another ~2000 almost instantly :) Unless the node became unresponsive at some point. And after a restart by power-off:

# crm_resource --resource p_dummy --list-operations
p_dummy (ocf::dummy:dummy): Started (unmanaged) :
p_dummy_monitor_3 (node=node-1.test.domain.local, call=679, rc=1,
last-rc-change=Mon Jan 4 15:04:25 2016, exec=66747ms): Timed Out
or
p_dummy (ocf::dummy:dummy): Stopped (unmanaged) :
p_dummy_monitor_103000 (node=node-3.test.domain.local, call=142, rc=1,
last-rc-change=Mon Jan 4 15:14:59 2016, exec=65237ms): Timed Out

And then lrmd repeats all of the fun again.

--
Best regards,
Bogdan Dobrelya,
Irc #bogdando
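When chasing a runaway-monitor situation like the one described above, it helps to track the process count over time rather than eyeball the ps tree. A small sketch follows; the grep pattern matches the dummy RA path from this thread, and the sample input stands in for live ps output so the snippet runs anywhere.

```shell
# Count RA monitor processes in a process listing. The [r] bracket trick
# keeps the grep process itself from matching its own command line.
count_monitors() {
    grep -c '[r]esource.d/dummy/dummy monitor'
}

# Sample input standing in for `ps ax -o command=` output:
printf '%s\n' \
    '/usr/lib/pacemaker/lrmd' \
    '/bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor' \
    '/bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor' \
    | count_monitors
# prints: 2
```

On a live node, something like `while true; do date; ps ax -o command= | count_monitors; sleep 20; done` makes the growth (or shrinkage) of the monitor population visible alongside the trace_ra timestamps.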
Re: [ClusterLabs] [OCF] Pacemaker reports a multi-state clone resource instance as running while it is not in fact
Hi,

On Mon, Jan 04, 2016 at 04:52:43PM +0100, Bogdan Dobrelya wrote:
> On 04.01.2016 16:36, Ken Gaillot wrote:
> > On 01/04/2016 09:25 AM, Bogdan Dobrelya wrote:
> >> On 04.01.2016 15:50, Bogdan Dobrelya wrote:

[...]

> >> Also note, that lrmd spawns *many* monitors like:
> >> root  6495  0.0  0.0  70268  1456 ?  Ss   2015   4:56  \_ /usr/lib/pacemaker/lrmd
> >> root 31815  0.0  0.0   4440   780 ?  S    15:08  0:00  |  \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
> >> root 31908  0.0  0.0   4440   388 ?  S    15:08  0:00  |      \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
> >> root 31910  0.0  0.0   4440   384 ?  S    15:08  0:00  |      \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
> >> root 31915  0.0  0.0   4440   392 ?  S    15:08  0:00  |      \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
> >> ...
> >
> > At first glance, that looks like your monitor action is calling itself
> > recursively, but I don't see how in your code.
>
> Yes, it should be a bug in the ocf-shellfuncs's ocf_log().

If you're sure about that, please open an issue at
https://github.com/ClusterLabs/resource-agents/issues

Thanks,

Dejan