Re: [ClusterLabs] Pacemaker documentation license clarification

2016-01-04 Thread Ferenc Wagner
Ken Gaillot  writes:

> Currently, the brand is specified in each book's publican.cfg (which is
> generated by configure, and can be edited by "make www-cli"). It works,
> so realistically it's a low priority to improve it, given everything
> else on the plate.

Well, it's not pretty to say the least, but I don't think I have to
touch that part.

> You're welcome to submit a pull request to change it to use the local
> brand directory.

Done, it's part of https://github.com/ClusterLabs/pacemaker/pull/876.
That pull request contains three independent patches; feel free to
cherry-pick only part of it if you find anything objectionable.

> Be sure to consider that each book comes in multiple formats (and
> potentially translations, though they're out of date at this point,
> which is a whole separate discussion worth raising at some point), and
> add anything generated to .gitignore.

I think this minimal change won't cause problems with other formats or
translations.  I forgot about gitignoring the xsl symlink though; I can
add that after the initial review.
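For what it's worth, the entry meant here would be a single line in the
relevant .gitignore; the pattern below is only a placeholder for whatever
symlink the brand setup in the pull request actually creates:

# generated symlink to the brand's stylesheets (pattern is illustrative)
doc/*/xsl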
-- 
Regards,
Feri.

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Anyone successfully install Pacemaker/Corosync on Freebsd?

2016-01-04 Thread Jan Friesse

Christine Caulfield wrote:

On 21/12/15 16:12, Ken Gaillot wrote:

On 12/19/2015 04:56 PM, mike wrote:

Hi All,

Just curious whether anyone has had any luck installing
Pacemaker and Corosync on FreeBSD. I have to install from source, of
course, and I've run into an issue when running ./configure while trying
to build Corosync. The process craps out at the nss check with this error:


FYI, Ruben Kerkhof has done some recent work to get the FreeBSD build
working. It will go into the next 1.1.14 release candidate. In the
meantime, make sure you have the very latest code from upstream's 1.1
branch.
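For reference, a sketch of what "the very latest code from upstream's 1.1
branch" amounts to (the URL is the project's usual GitHub location; adjust
if you track a different mirror):

git clone https://github.com/ClusterLabs/pacemaker.git
cd pacemaker
git checkout 1.1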



I also strongly recommend using the latest (from git) version of libqb
as it has some FreeBSD bugs fixed in it. We plan to do a proper release
of this in the new year.


The same also applies to corosync. Use git and it should work (even with
clang).


Honza



Chrissie


checking for nss... configure: error: in `/root/heartbeat/corosync-2.3.3':
configure: error: The pkg-config script could not be found or is too old.
Make sure it is in your PATH or set the PKG_CONFIG environment variable
to the full path to pkg-config.
Alternatively, you may set the environment variables nss_CFLAGS
and nss_LIBS to avoid the need to call pkg-config.
See the pkg-config man page for more details.

I've looked unsuccessfully for a package called pkg-config and nss
appears to be installed as you can see from this output:

root@wellesley:~/heartbeat/corosync-2.3.3 # pkg install nss
Updating FreeBSD repository catalogue...
FreeBSD repository is up-to-date.
All repositories are up-to-date.
Checking integrity... done (0 conflicting)
The most recent version of packages are already installed

Anyway - just looking for any suggestions. Hoping that perhaps someone
has successfully done this.

thanks in advance
-mgb
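For reference, a rough sketch of how this particular failure is usually
resolved on FreeBSD, following the replies above (package names are from
the stock FreeBSD repository; the build steps are generic autotools ones
and may need adjusting):

# pkg-config on FreeBSD is provided by the pkgconf package,
# not by a package literally named pkg-config
pkg install pkgconf

# per the replies above, build libqb and corosync from current git
# rather than from the 2.3.3 tarball; both ship an autogen.sh, and
# GNU make (gmake) is needed on FreeBSD
git clone https://github.com/ClusterLabs/libqb.git
git clone https://github.com/corosync/corosync.git
(cd libqb && ./autogen.sh && ./configure && gmake && gmake install)
(cd corosync && ./autogen.sh && ./configure && gmake && gmake install)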





Re: [ClusterLabs] Asking for a new DLM release

2016-01-04 Thread Ferenc Wagner
Ferenc Wagner  writes:

> DLM 4.0.2 was released on 2013-07-31.  The Git repo accumulated some
> fixes since then, which would be nice to have in a proper release.

By the way I offer https://github.com/wferi/dlm/commits/upstream-patches
for merging or cherry-picking into upstream.
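A rough sketch of how such a branch is typically reviewed and picked up on
the upstream side (the remote name is arbitrary, and nothing here presumes
what upstream will actually take):

git remote add wferi https://github.com/wferi/dlm.git
git fetch wferi
git log --oneline master..wferi/upstream-patches   # review the offered fixes
git cherry-pick <commit>   # or: git merge wferi/upstream-patches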

And if I'm hitting the wrong forum with this DLM topic, please advise me.
-- 
Thanks,
Feri.



Re: [ClusterLabs] [OCF] Pacemaker reports a multi-state clone resource instance as running while it is not in fact

2016-01-04 Thread Ken Gaillot
On 01/04/2016 09:25 AM, Bogdan Dobrelya wrote:
> On 04.01.2016 15:50, Bogdan Dobrelya wrote:
>> So far so bad.
>> I made a dummy OCF script [0] to simulate an example
>> promote/demote/notify failure mode for a multi-state clone resource which
>> is very similar to the one I reported originally. The test to
>> reproduce my case with the dummy is:
>> - install the dummy OCF RA and create the dummy resource as the README
>> [0] says
>> - just watch a) the OCF logs from the dummy and b) the output of the
>> recurring commands:
>>
>> # while true; do date; ls /var/lib/heartbeat/trace_ra/dummy/ | tail -1;
>> sleep 20; done&
>> # crm_resource --resource p_dummy --list-operations
>>
>> At some point I noticed:
>> - there are no more "OK" messages logged from the monitor actions,
>> although according to the trace_ra dumps' timestamps, all monitors are
>> still being invoked!
>>
>> - at some point I noticed very strange results reported by:
>> # crm_resource --resource p_dummy --list-operations
>> p_dummy (ocf::dummy:dummy): FAILED : p_dummy_monitor_103000
>> (node=node-1.test.domain.local, call=579, rc=1, last-rc-change=Mon Jan
>> 4 14:33:07 2016, exec=62107ms): Timed Out
>>   or
>> p_dummy (ocf::dummy:dummy): Started : p_dummy_monitor_103000
>> (node=node-3.test.domain.local, call=-1, rc=1, last-rc-change=Mon Jan  4
>> 14:43:58 2016, exec=0ms): Timed Out
>>
>> - according to the trace_ra dumps, recurring monitors are being invoked
>> at intervals *much longer* than configured. For example, 7 minutes
>> of "monitoring silence":
>> Mon Jan  4 14:47:46 UTC 2016
>> p_dummy.monitor.2016-01-04.14:40:52
>> Mon Jan  4 14:48:06 UTC 2016
>> p_dummy.monitor.2016-01-04.14:47:58
>>
>> Given all that, it is very likely that some bug exists in the
>> monitoring of multi-state clones in Pacemaker!
>>
>> [0] https://github.com/bogdando/dummy-ocf-ra
>>
> 
> Also note, that lrmd spawns *many* monitors like:
> root  6495  0.0  0.0  70268  1456 ?Ss2015   4:56  \_
> /usr/lib/pacemaker/lrmd
> root 31815  0.0  0.0   4440   780 ?S15:08   0:00  |   \_
> /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
> root 31908  0.0  0.0   4440   388 ?S15:08   0:00  |
>   \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
> root 31910  0.0  0.0   4440   384 ?S15:08   0:00  |
>   \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
> root 31915  0.0  0.0   4440   392 ?S15:08   0:00  |
>   \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
> ...

At first glance, that looks like your monitor action is calling itself
recursively, but I don't see how in your code.
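One way to test that hypothesis, sketched here on the assumption that the
agent can be run by hand with a minimal OCF environment (the variable
values below are illustrative):

# run a single monitor outside lrmd and trace it
OCF_ROOT=/usr/lib/ocf OCF_RESOURCE_INSTANCE=p_dummy \
    sh -x /usr/lib/ocf/resource.d/dummy/dummy monitor
# in another terminal: if the action is recursive, new "dummy monitor"
# children keep appearing under that shell
ps axf | grep 'dummy monitor'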

> At some point, there was  already. Then I unmanaged p_dummy, but
> the count grew to 2403 after that. The number of running monitors may
> grow or decrease as well.
> Also, /var/lib/heartbeat/trace_ra/dummy/ is still being populated
> with new p_dummy.monitor* files with recent timestamps. Why?..
> 
> If I pkill -9 all the dummy monitors, lrmd spawns another ~2000 almost
> instantly :) unless the node becomes unresponsive at some point. And
> after a restart by power-off:
> # crm_resource --resource p_dummy --list-operations
> p_dummy (ocf::dummy:dummy): Started (unmanaged) :
> p_dummy_monitor_3 (node=node-1.test.domain.local, call=679, rc=1,
> last-rc-change=Mon Jan  4 15:04:25 2016, exec=66747ms): Timed Out
> or
> p_dummy (ocf::dummy:dummy): Stopped (unmanaged) :
> p_dummy_monitor_103000 (node=node-3.test.domain.local, call=142, rc=1,
> last-rc-change=Mon Jan  4 15:14:59 2016, exec=65237ms): Timed Out
> 
> And then lrmd repeats all of the fun again.
> 
> 




Re: [ClusterLabs] [OCF] Pacemaker reports a multi-state clone resource instance as running while it is not in fact

2016-01-04 Thread Bogdan Dobrelya
On 04.01.2016 16:36, Ken Gaillot wrote:
> On 01/04/2016 09:25 AM, Bogdan Dobrelya wrote:
>> On 04.01.2016 15:50, Bogdan Dobrelya wrote:
>>> So far so bad.
>>> I made a dummy OCF script [0] to simulate an example
>>> promote/demote/notify failure mode for a multi-state clone resource which
>>> is very similar to the one I reported originally. The test to
>>> reproduce my case with the dummy is:
>>> - install the dummy OCF RA and create the dummy resource as the README
>>> [0] says
>>> - just watch a) the OCF logs from the dummy and b) the output of the
>>> recurring commands:
>>>
>>> # while true; do date; ls /var/lib/heartbeat/trace_ra/dummy/ | tail -1;
>>> sleep 20; done&
>>> # crm_resource --resource p_dummy --list-operations
>>>
>>> At some point I noticed:
>>> - there are no more "OK" messages logged from the monitor actions,
>>> although according to the trace_ra dumps' timestamps, all monitors are
>>> still being invoked!
>>>
>>> - at some point I noticed very strange results reported by:
>>> # crm_resource --resource p_dummy --list-operations
>>> p_dummy (ocf::dummy:dummy): FAILED : p_dummy_monitor_103000
>>> (node=node-1.test.domain.local, call=579, rc=1, last-rc-change=Mon Jan
>>> 4 14:33:07 2016, exec=62107ms): Timed Out
>>>   or
>>> p_dummy (ocf::dummy:dummy): Started : p_dummy_monitor_103000
>>> (node=node-3.test.domain.local, call=-1, rc=1, last-rc-change=Mon Jan  4
>>> 14:43:58 2016, exec=0ms): Timed Out
>>>
>>> - according to the trace_ra dumps, recurring monitors are being invoked
>>> at intervals *much longer* than configured. For example, 7 minutes
>>> of "monitoring silence":
>>> Mon Jan  4 14:47:46 UTC 2016
>>> p_dummy.monitor.2016-01-04.14:40:52
>>> Mon Jan  4 14:48:06 UTC 2016
>>> p_dummy.monitor.2016-01-04.14:47:58
>>>
>>> Given all that, it is very likely that some bug exists in the
>>> monitoring of multi-state clones in Pacemaker!
>>>
>>> [0] https://github.com/bogdando/dummy-ocf-ra
>>>
>>
>> Also note, that lrmd spawns *many* monitors like:
>> root  6495  0.0  0.0  70268  1456 ?Ss2015   4:56  \_
>> /usr/lib/pacemaker/lrmd
>> root 31815  0.0  0.0   4440   780 ?S15:08   0:00  |   \_
>> /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
>> root 31908  0.0  0.0   4440   388 ?S15:08   0:00  |
>>   \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
>> root 31910  0.0  0.0   4440   384 ?S15:08   0:00  |
>>   \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
>> root 31915  0.0  0.0   4440   392 ?S15:08   0:00  |
>>   \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
>> ...
> 
> At first glance, that looks like your monitor action is calling itself
> recursively, but I don't see how in your code.

Yes, it should be a bug in the ocf-shellfuncs's ocf_log().

If I replace it in the dummy RA with:
#. ${OCF_FUNCTIONS_DIR}/ocf-shellfuncs
ocf_log() {
  logger $HA_LOGFACILITY -t $HA_LOGTAG "$@"
}

then there is no such issue anymore, and I see the "It's OK" log messages
as expected.
Note that I used resource-agents 3.9.5+git+a626847-1 from [0].

[0] http://ftp.de.debian.org/debian/ experimental/main amd64 Packages
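A minimal way to exercise ocf_log() in isolation, assuming the stock
resource-agents layout under /usr/lib/ocf (additional HA_* variables may be
needed depending on the logging setup):

# source the shipped shell functions and call ocf_log the way the RA does;
# if this alone hangs or forks repeatedly, the problem is in ocf-shellfuncs
# rather than in the dummy agent
export OCF_ROOT=/usr/lib/ocf
. "$OCF_ROOT/lib/heartbeat/ocf-shellfuncs"
HA_LOGTAG=p_dummy ocf_log info "It's OK"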

> 
>> At some point, there was  already. Then I unmanaged p_dummy, but
>> the count grew to 2403 after that. The number of running monitors may
>> grow or decrease as well.
>> Also, /var/lib/heartbeat/trace_ra/dummy/ is still being populated
>> with new p_dummy.monitor* files with recent timestamps. Why?..
>>
>> If I pkill -9 all the dummy monitors, lrmd spawns another ~2000 almost
>> instantly :) unless the node becomes unresponsive at some point. And
>> after a restart by power-off:
>> # crm_resource --resource p_dummy --list-operations
>> p_dummy (ocf::dummy:dummy): Started (unmanaged) :
>> p_dummy_monitor_3 (node=node-1.test.domain.local, call=679, rc=1,
>> last-rc-change=Mon Jan  4 15:04:25 2016, exec=66747ms): Timed Out
>> or
>> p_dummy (ocf::dummy:dummy): Stopped (unmanaged) :
>> p_dummy_monitor_103000 (node=node-3.test.domain.local, call=142, rc=1,
>> last-rc-change=Mon Jan  4 15:14:59 2016, exec=65237ms): Timed Out
>>
>> And then lrmd repeats all of the fun again.
>>
>>
> 
> 


-- 
Best regards,
Bogdan Dobrelya,
Irc #bogdando



[ClusterLabs] [Q] Cluster failovers too soon

2016-01-04 Thread Sebish
<I am resending this mail because of the clusterlabs outage during the
weekend, a received error message, and my time limit until next week>



Hello guys,

happy new year to all of you!

I have a little (understanding?) problem regarding Heartbeat/Pacemaker
and deadtime/timeout.
I know that corosync is the way to go, but at the moment I have a heartbeat
cluster and need to adjust the time it waits before a failover is initiated.


My cluster and resources completely ignore the raised heartbeat deadtime
and the timeout in the pacemaker resource agent definitions.
When I shut a node off, it is shown as offline and the services fail
over after 4-9 seconds, but I want 20 seconds.


What do I have to adjust to make the cluster fail over after roughly 20
seconds instead of 9? Am I missing a parameter apart from
deadtime (deadping) and timeout?

Every hint would be a great help!


Thank you very much
Sebish


Config:
--

/etc/heartbeat/ha.cf:

...
keepalive 2
warntime 6
deadtime 20
initdead 60
...

crm (pacemaker):

node $id="6acc2585-b49b-4b0f-8b2a-8561cceb8b83" nodec
node $id="891a8209-5e1a-40b6-8d72-8458a851bb9a" kamailioopenhab2
node $id="fd898711-4c76-4d00-941c-4528e174533c" kamailioopenhab1
primitive ClusterMon ocf:pacemaker:ClusterMon \
params user="root" update="30" \
extra_options="-E /usr/lib/ocf/resource.d/*myname*/*script*.sh" \
op monitor interval="10" timeout="40" on-fail="restart"
primitive FailoverIP ocf:heartbeat:IPaddr2 \
params ip="*ClusterIP*" cidr_netmask="18" \
op monitor interval="2s" timeout="20"
primitive Openhab lsb:openhab \
meta target-role="Started" \
op monitor interval="2s" timeout="20"
primitive Ping ocf:pacemaker:ping \
params host_list="*ClusterIP*" multiplier="100" \
op monitor interval="2s" timeout="20"
location ClusterMon_LocationA ClusterMon -inf: kamailioopenhab1
location ClusterMon_LocationB ClusterMon 10: kamailioopenhab2
location ClusterMon_LocationC ClusterMon inf: nodec
location FailoverIP_LocationA FailoverIP 20: kamailioopenhab1
location FailoverIP_LocationB FailoverIP 10: kamailioopenhab2
location FailoverIP_LocationC FailoverIP -inf: nodec
colocation Services_Colocation inf: FailoverIP Kamailio Openhab
property $id="cib-bootstrap-options" \
dc-version="1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff" \
cluster-infrastructure="Heartbeat" \
expected-quorum-votes="2" \
last-lrm-refresh="1451669632" \
stonith-enabled="false" \
no-quorum-policy="ignore"
rsc_defaults $id="rsc-options" \
resource-stickiness="100"
--
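One thing worth double-checking before hunting for more parameters:
heartbeat only applies ha.cf changes on a restart/reload, so it helps to
confirm that the running layer really uses the 20-second deadtime. A rough
sketch, assuming the cl_status tool from the heartbeat package is available
(option syntax may differ between versions):

# ask the live heartbeat layer which values it is actually using
cl_status hbparameter -p deadtime
cl_status hbparameter -p keepalive
# after editing /etc/heartbeat/ha.cf, make heartbeat re-read it
/etc/init.d/heartbeat reload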



Re: [ClusterLabs] [OCF] Pacemaker reports a multi-state clone resource instance as running while it is not in fact

2016-01-04 Thread Bogdan Dobrelya
On 04.01.2016 15:50, Bogdan Dobrelya wrote:
> So far so bad.
> I made a dummy OCF script [0] to simulate an example
> promote/demote/notify failure mode for a multi-state clone resource which
> is very similar to the one I reported originally. The test to
> reproduce my case with the dummy is:
> - install the dummy OCF RA and create the dummy resource as the README
> [0] says
> - just watch a) the OCF logs from the dummy and b) the output of the
> recurring commands:
> 
> # while true; do date; ls /var/lib/heartbeat/trace_ra/dummy/ | tail -1;
> sleep 20; done&
> # crm_resource --resource p_dummy --list-operations
> 
> At some point I noticed:
> - there are no more "OK" messages logged from the monitor actions,
> although according to the trace_ra dumps' timestamps, all monitors are
> still being invoked!
> 
> - at some point I noticed very strange results reported by:
> # crm_resource --resource p_dummy --list-operations
> p_dummy (ocf::dummy:dummy): FAILED : p_dummy_monitor_103000
> (node=node-1.test.domain.local, call=579, rc=1, last-rc-change=Mon Jan
> 4 14:33:07 2016, exec=62107ms): Timed Out
>   or
> p_dummy (ocf::dummy:dummy): Started : p_dummy_monitor_103000
> (node=node-3.test.domain.local, call=-1, rc=1, last-rc-change=Mon Jan  4
> 14:43:58 2016, exec=0ms): Timed Out
> 
> - according to the trace_ra dumps, recurring monitors are being invoked
> at intervals *much longer* than configured. For example, 7 minutes
> of "monitoring silence":
> Mon Jan  4 14:47:46 UTC 2016
> p_dummy.monitor.2016-01-04.14:40:52
> Mon Jan  4 14:48:06 UTC 2016
> p_dummy.monitor.2016-01-04.14:47:58
> 
> Given all that, it is very likely that some bug exists in the
> monitoring of multi-state clones in Pacemaker!
> 
> [0] https://github.com/bogdando/dummy-ocf-ra
> 

Also note, that lrmd spawns *many* monitors like:
root  6495  0.0  0.0  70268  1456 ?Ss2015   4:56  \_
/usr/lib/pacemaker/lrmd
root 31815  0.0  0.0   4440   780 ?S15:08   0:00  |   \_
/bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
root 31908  0.0  0.0   4440   388 ?S15:08   0:00  |
  \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
root 31910  0.0  0.0   4440   384 ?S15:08   0:00  |
  \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
root 31915  0.0  0.0   4440   392 ?S15:08   0:00  |
  \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
...

At some point, there was  already. Then I unmanaged p_dummy, but
the count grew to 2403 after that. The number of running monitors may
grow or decrease as well.
Also, /var/lib/heartbeat/trace_ra/dummy/ is still being populated
with new p_dummy.monitor* files with recent timestamps. Why?..

If I pkill -9 all the dummy monitors, lrmd spawns another ~2000 almost
instantly :) unless the node becomes unresponsive at some point. And
after a restart by power-off:
# crm_resource --resource p_dummy --list-operations
p_dummy (ocf::dummy:dummy): Started (unmanaged) :
p_dummy_monitor_3 (node=node-1.test.domain.local, call=679, rc=1,
last-rc-change=Mon Jan  4 15:04:25 2016, exec=66747ms): Timed Out
or
p_dummy (ocf::dummy:dummy): Stopped (unmanaged) :
p_dummy_monitor_103000 (node=node-3.test.domain.local, call=142, rc=1,
last-rc-change=Mon Jan  4 15:14:59 2016, exec=65237ms): Timed Out

And then lrmd repeats all of the fun again.
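For the record, a quick way to cross-check the configured monitor intervals
against what the traces show (the resource name is taken from above; the
crm shell is assumed to be installed):

# the operations as configured for the resource
crm configure show p_dummy
# the resource definition as the cluster currently sees it
crm_resource --resource p_dummy --query-xml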


-- 
Best regards,
Bogdan Dobrelya,
Irc #bogdando



Re: [ClusterLabs] [OCF] Pacemaker reports a multi-state clone resource instance as running while it is not in fact

2016-01-04 Thread Dejan Muhamedagic
Hi,

On Mon, Jan 04, 2016 at 04:52:43PM +0100, Bogdan Dobrelya wrote:
> On 04.01.2016 16:36, Ken Gaillot wrote:
> > On 01/04/2016 09:25 AM, Bogdan Dobrelya wrote:
> >> On 04.01.2016 15:50, Bogdan Dobrelya wrote:
[...]
> >> Also note, that lrmd spawns *many* monitors like:
> >> root  6495  0.0  0.0  70268  1456 ?Ss2015   4:56  \_
> >> /usr/lib/pacemaker/lrmd
> >> root 31815  0.0  0.0   4440   780 ?S15:08   0:00  |   \_
> >> /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
> >> root 31908  0.0  0.0   4440   388 ?S15:08   0:00  |
> >>   \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
> >> root 31910  0.0  0.0   4440   384 ?S15:08   0:00  |
> >>   \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
> >> root 31915  0.0  0.0   4440   392 ?S15:08   0:00  |
> >>   \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
> >> ...
> > 
> > At first glance, that looks like your monitor action is calling itself
> > recursively, but I don't see how in your code.
> 
> Yes, it should be a bug in the ocf-shellfuncs's ocf_log().

If you're sure about that, please open an issue at
https://github.com/ClusterLabs/resource-agents/issues

Thanks,

Dejan
